Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Michal Hocko
On Tue 05-06-18 13:03:29, Andrew Morton wrote:
[...]
> > As for why we would do something silly as dropping the caches every
> > hour (in a cronjob), we started doing this recently because after
> > kernel 4.4, we got frequent OOM kills despite having gigabytes of
> > available memory (e.g. 12GB in use, 20GB page cache and 16GB empty
> > swap and bang, mysql gets killed). We found that the debian 4.9
> > kernel is unusable, and 4.14 works, *iff* we use the above as an
> > hourly cron job, so we did that, and afterwards ran into
> > rtorrent/journald corruption issues. Without the echo in place,
> > mysql usually gets oom-killed after a few days of uptime.

Do you have any oom reports to share?
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [External] Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-28 Thread Michal Hocko
On Fri 25-05-18 09:43:09, Huaisheng HS1 Ye wrote:
> From: Michal Hocko [mailto:mho...@kernel.org]
> Sent: Thursday, May 24, 2018 8:19 PM
> > > Let me try to reply to your questions.
> > > Exactly, GFP_ZONE_TABLE is too complicated. I think there are two
> > > advantages from the series of patches.
> > >
> > > 1. XOR operation is simple and efficient. GFP_ZONE_TABLE/BAD need to
> > > do two shift operations, the first for getting a zone_type and the
> > > second for checking whether the returned type is correct or not. But
> > > with these patches the XOR operation is needed only once. Because the
> > > bottom 3 bits of the GFP bitmask are used to represent the encoded
> > > zone number, we can say there is no bad zone number if all callers
> > > use it in a non-buggy way. Of course, the returned zone type in
> > > gfp_zone needs to be no more than ZONE_MOVABLE.
> > 
> > But you are losing the ability to check for wrong usage. And it seems
> > that the sad reality is that the existing code does screw up.
> 
> In my opinion, originally there shouldn't be so many wrong
> combinations of these bottom 3 bits. Any user, whether a driver or a
> fs, should decide which zone it prefers. Matthew's idea is great,
> because with it the user must offer an unambiguous flag for the gfp
> zone bits.

Well, I would argue that those shouldn't really care about any zones at
all. All they should care about is whether they really need a lowmem
zone (aka directly accessible to the kernel), highmem, or whether the
allocation is generally movable. Mixing zones into the picture just
makes the whole thing more complicated and error prone.
[...]
> > That being said, I am not saying that I am in love with GFP_ZONE_TABLE.
> > It always makes my head explode when I look there but it seems to work
> > with the current code and it is optimized for it. If you want to change
> > this then you should make sure you describe reasons _why_ this is an
> > improvement. And I would argue that "we can have more zones" is not a
> > relevant one.
> 
> Yes, GFP_ZONE_TABLE is too complicated. The patches have 4 advantages,
> as below.
> 
> * The address zone modifiers have a new operation method: the user
> decides which zone is preferred first, then puts the encoded zone
> number into the bottom 3 bits of the GFP mask. That is much more
> direct and clear than before.
> 
> * No bad zone combinations, because the user always chooses just one
> address zone modifier.
> * Better performance and efficiency: the current gfp_zone has to do
> the shifting operation twice, for GFP_ZONE_TABLE and GFP_ZONE_BAD.
> With these patches, gfp_zone() just needs one XOR.
> * Up to 8 zones can be used. At least it isn't a disadvantage, right?

This should be a part of the changelog. Please note that you should
provide some numbers if you claim performance benefits. The complexity
will always be subjective.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-28 Thread Michal Hocko
On Fri 25-05-18 05:00:44, Matthew Wilcox wrote:
> On Thu, May 24, 2018 at 05:29:43PM +0200, Michal Hocko wrote:
> > > ie if we had more,
> > > could we solve our pain by making them more generic?
> > 
> > Well, if you have more you will consume more bits in the struct pages,
> > right?
> 
> Not necessarily ... the zone number is stored in the struct page
> currently, so either two or three bits are used right now.  In my
> proposal, one can infer the zone of a page from its PFN, except for
> ZONE_MOVABLE.  So we could trim down to just one bit per struct page
> for 32-bit machines while using 3 bits on 64-bit machines, where there
> is plenty of space.

Just be warned that page_zone is called from many hot paths. I am not
sure adding something more complex there is going to fly.

> > > it more-or-less sucks that the devices with 28-bit DMA limits are forced
> > > to allocate from the low 16MB when they're perfectly capable of using the
> > > low 256MB.
> > 
> > Do we actually care all that much about those? If yes then we should
> > probably follow the ZONE_DMA (x86) path and use a CMA region for them.
> > I mean most devices should be good with very limited addressability or
> > below 4G, no?
> 
> Sure.  One other thing I meant to mention was the media devices
> (TV capture cards and so on) which want a vmalloc_32() allocation.
> On 32-bit machines right now, we allocate from LOWMEM, when we really
> should be allocating from the 1GB-4GB region.  32-bit machines generally
> don't have a ZONE_DMA32 today.

Well, _I_ think that vmalloc on 32b is just a lost cause...

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-24 Thread Michal Hocko
On Thu 24-05-18 08:18:18, Matthew Wilcox wrote:
> On Thu, May 24, 2018 at 02:23:23PM +0200, Michal Hocko wrote:
> > > If we had eight ZONEs, we could offer:
> > 
> > No, please no more zones. What we have is quite a maint. burden on
> > its own. Ideally we should only have a lowmem zone (directly kernel
> > accessible memory), a highmem zone (the one that the kernel cannot
> > or must not use) and special/device zones for completely special
> > memory managed out of the page allocator. All the remaining
> > constraints should better be implemented on top.
> 
> I believe you when you say that they're a maintenance pain.  Is that
> maintenance pain because they're so specialised?

Well, it used to be LRU balancing which is gone with the node reclaim
but that brings new challenges. Now as you say their meaning is not
really clear to users and that leads to bugs left and right.

> ie if we had more,
> could we solve our pain by making them more generic?

Well, if you have more you will consume more bits in the struct pages,
right?

[...]

> > But those already do have a proper API, IIUC. So do we really need to
> > make our GFP_*/Zone API more complicated than it already is?
> 
> I don't want to change the driver API (setting the DMA mask, etc),
> but we don't actually have a good API to the page allocator for the
> implementation of dma_alloc_foo() to request pages.  More or less,
> architectures do:
> 
>   if (mask < 4GB)
>   alloc_page(GFP_DMA)
>   else if (mask < 64EB)
>   alloc_page(GFP_DMA32)
>   else
>   alloc_page(GFP_HIGHMEM)
> 
> it more-or-less sucks that the devices with 28-bit DMA limits are forced
> to allocate from the low 16MB when they're perfectly capable of using the
> low 256MB.

Do we actually care all that much about those? If yes then we should
probably follow the ZONE_DMA (x86) path and use a CMA region for them.
I mean most devices should be good with very limited addressability or
below 4G, no?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-24 Thread Michal Hocko
On Wed 23-05-18 22:19:19, Matthew Wilcox wrote:
> On Tue, May 22, 2018 at 08:37:28PM +0200, Michal Hocko wrote:
> > So why is this any better than the current code? Sure, I am not a great
> > fan of GFP_ZONE_TABLE because of how incomprehensible it is, but this
> > doesn't look too much better, yet we are losing a check for incompatible
> > gfp flags. The diffstat looks really sound, but then you just look and
> > see that the large part is the comment that at least explained the gfp
> > zone modifiers somehow, and the debugging code. So what is the selling
> > point?
> 
> I have a plan, but it's not exactly fully-formed yet.
> 
> One of the big problems we have today is that we have a lot of users
> who have constraints on the physical memory they want to allocate,
> but we have very limited abilities to provide them with what they're
> asking for.  The various different ZONEs have different meanings on
> different architectures and are generally a mess.

Agreed.

> If we had eight ZONEs, we could offer:

No, please no more zones. What we have is quite a maint. burden on its
own. Ideally we should only have a lowmem zone (directly kernel
accessible memory), a highmem zone (the one that the kernel cannot or
must not use) and special/device zones for completely special memory
managed out of the page allocator. All the remaining constraints should
better be implemented on top.

> ZONE_16M      // 24 bit
> ZONE_256M     // 28 bit
> ZONE_LOWMEM   // CONFIG_32BIT only
> ZONE_4G       // 32 bit
> ZONE_64G      // 36 bit
> ZONE_1T       // 40 bit
> ZONE_ALL      // everything larger
> ZONE_MOVABLE  // movable allocations; no physical address guarantees
> 
> #ifdef CONFIG_64BIT
> #define ZONE_NORMAL   ZONE_ALL
> #else
> #define ZONE_NORMAL   ZONE_LOWMEM
> #endif
> 
> This would cover most driver DMA mask allocations; we could tweak the
> offered zones based on analysis of what people need.

But those already do have a proper API, IIUC. So do we really need to
make our GFP_*/Zone API more complicated than it already is?

> #define GFP_HIGHUSER  (GFP_USER | ZONE_ALL)
> #define GFP_HIGHUSER_MOVABLE  (GFP_USER | ZONE_MOVABLE)
> 
> One other thing I want to see is that fallback from zones happens from
> highest to lowest normally (ie if you fail to allocate in 1T, then you
> try to allocate from 64G), but movable allocations happen from lowest
> to highest.  So ZONE_16M ends up full of page cache pages which are
> readily evictable for the rare occasions when we need to allocate memory
> below 16MB.
> 
> I'm sure there are lots of good reasons why this won't work, which is
> why I've been hesitant to propose it before now.

I am worried you are playing with a can of worms...
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-24 Thread Michal Hocko
On Wed 23-05-18 16:07:16, Huaisheng HS1 Ye wrote:
> From: Michal Hocko [mailto:mho...@kernel.org]
> Sent: Wednesday, May 23, 2018 2:37 AM
> > 
> > On Mon 21-05-18 23:20:21, Huaisheng Ye wrote:
> > > From: Huaisheng Ye <ye...@lenovo.com>
> > >
> > > Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.
> > >
> > > Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from GFP bitmasks;
> > > the bottom three bits of the GFP mask are reserved for storing the
> > > encoded zone number.
> > >
> > > The encoding method is XOR. Get zone number from enum zone_type,
> > > then encode the number with ZONE_NORMAL by XOR operation.
> > > The goal is to make sure ZONE_NORMAL can be encoded to zero. So,
> > > the compatibility can be guaranteed, such as GFP_KERNEL and GFP_ATOMIC
> > > can be used as before.
> > >
> > > Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
> > > a flag. Same as before, __GFP_MOVABLE represents the movable migrate type
> > > for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled with
> > > __GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of ZONE_HIGHMEM.
> > > __GFP_ZONE_MOVABLE is created to realize it.
> > >
> > > With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
> > > enough to get ZONE_MOVABLE from gfp_zone. All callers should use
> > > GFP_HIGHUSER_MOVABLE or __GFP_ZONE_MOVABLE directly to achieve that.
> > >
> > > Decode zone number directly from bottom three bits of flags in gfp_zone.
> > > The theory of encoding and decoding is,
> > > A ^ B ^ B = A
> > 
> > So why is this any better than the current code? Sure, I am not a great
> > fan of GFP_ZONE_TABLE because of how incomprehensible it is, but this
> > doesn't look too much better, yet we are losing a check for incompatible
> > gfp flags. The diffstat looks really sound, but then you just look and
> > see that the large part is the comment that at least explained the gfp
> > zone modifiers somehow, and the debugging code. So what is the selling
> > point?
> 
> Dear Michal,
> 
> Let me try to reply to your questions.
> Exactly, GFP_ZONE_TABLE is too complicated. I think there are two advantages
> from the series of patches.
> 
> 1. XOR operation is simple and efficient. GFP_ZONE_TABLE/BAD need to do two
> shift operations, the first for getting a zone_type and the second for
> checking whether the returned type is correct or not. But with these patches
> the XOR operation is needed only once. Because the bottom 3 bits of the GFP
> bitmask are used to represent the encoded zone number, we can say there is
> no bad zone number if all callers use it in a non-buggy way. Of course, the
> returned zone type in gfp_zone needs to be no more than ZONE_MOVABLE.

But you are losing the ability to check for wrong usage. And it seems
that the sad reality is that the existing code does screw up.

> 2. GFP_ZONE_TABLE has a limit on the number of zone types. The current
> GFP_ZONE_TABLE is 32 bits; in general, there are 4 zone types on most
> X86_64 platforms: ZONE_DMA, ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE.
> If we want to expand the number of zone types to more than 4, the zone
> shift should be 3.

But we do not want to expand the number of zones IMHO. The existing zoo
is quite a maint. pain.
 
That being said, I am not saying that I am in love with GFP_ZONE_TABLE.
It always makes my head explode when I look there but it seems to work
with the current code and it is optimized for it. If you want to change
this then you should make sure you describe reasons _why_ this is an
improvement. And I would argue that "we can have more zones" is not a
relevant one.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-22 Thread Michal Hocko
On Mon 21-05-18 23:20:21, Huaisheng Ye wrote:
> From: Huaisheng Ye <ye...@lenovo.com>
> 
> Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.
> 
> Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from GFP bitmasks;
> the bottom three bits of the GFP mask are reserved for storing the
> encoded zone number.
> 
> The encoding method is XOR. Get zone number from enum zone_type,
> then encode the number with ZONE_NORMAL by XOR operation.
> The goal is to make sure ZONE_NORMAL can be encoded to zero. So,
> the compatibility can be guaranteed, such as GFP_KERNEL and GFP_ATOMIC
> can be used as before.
> 
> Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
> a flag. Same as before, __GFP_MOVABLE represents the movable migrate type
> for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled with
> __GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of ZONE_HIGHMEM.
> __GFP_ZONE_MOVABLE is created to realize it.
> 
> With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
> enough to get ZONE_MOVABLE from gfp_zone. All callers should use
> GFP_HIGHUSER_MOVABLE or __GFP_ZONE_MOVABLE directly to achieve that.
> 
> Decode zone number directly from bottom three bits of flags in gfp_zone.
> The theory of encoding and decoding is,
> A ^ B ^ B = A

So why is this any better than the current code? Sure, I am not a great
fan of GFP_ZONE_TABLE because of how incomprehensible it is, but this
doesn't look too much better, yet we are losing a check for incompatible
gfp flags. The diffstat looks really sound, but then you just look and
see that the large part is the comment that at least explained the gfp
zone modifiers somehow, and the debugging code. So what is the selling
point?

> Changes since v1,
> 
> v2: Add __GFP_ZONE_MOVABLE and modify GFP_HIGHUSER_MOVABLE to help
> callers to get ZONE_MOVABLE. Add __GFP_ZONE_MASK to mask lowest 3
> bits of GFP bitmasks.
> Modify some callers' gfp flag to update usage of address zone
> modifiers.
> Modify inline function gfp_zone to get better performance according
> to Matthew's suggestion.
> 
> Link: https://marc.info/?l=linux-mm&m=152596791931266&w=2
> 
> Huaisheng Ye (12):
>   include/linux/gfp.h: get rid of GFP_ZONE_TABLE/BAD
>   arch/x86/kernel/amd_gart_64: update usage of address zone modifiers
>   arch/x86/kernel/pci-calgary_64: update usage of address zone modifiers
>   drivers/iommu/amd_iommu: update usage of address zone modifiers
>   include/linux/dma-mapping: update usage of address zone modifiers
>   drivers/xen/swiotlb-xen: update usage of address zone modifiers
>   fs/btrfs/extent_io: update usage of address zone modifiers
>   drivers/block/zram/zram_drv: update usage of address zone modifiers
>   mm/vmpressure: update usage of address zone modifiers
>   mm/zsmalloc: update usage of address zone modifiers
>   include/linux/highmem: update usage of movableflags
>   arch/x86/include/asm/page.h: update usage of movableflags
> 
>  arch/x86/include/asm/page.h  |  3 +-
>  arch/x86/kernel/amd_gart_64.c|  2 +-
>  arch/x86/kernel/pci-calgary_64.c |  2 +-
>  drivers/block/zram/zram_drv.c|  6 +--
>  drivers/iommu/amd_iommu.c|  2 +-
>  drivers/xen/swiotlb-xen.c|  2 +-
>  fs/btrfs/extent_io.c |  2 +-
>  include/linux/dma-mapping.h  |  2 +-
>  include/linux/gfp.h  | 98 +---
>  include/linux/highmem.h  |  4 +-
>  mm/vmpressure.c  |  2 +-
>  mm/zsmalloc.c|  4 +-
>  12 files changed, 26 insertions(+), 103 deletions(-)
> 
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Michal Hocko
On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
> We recently got an Oops report:
> 
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: jbd2__journal_start+0x38/0x1a2
> [...]
> Call Trace:
>   ext4_page_mkwrite+0x307/0x52b
>   _ext4_get_block+0xd8/0xd8
>   do_page_mkwrite+0x6e/0xd8
>   handle_mm_fault+0x686/0xf9b
>   mntput_no_expire+0x1f/0x21e
>   __do_page_fault+0x21d/0x465
>   dput+0x4a/0x2f7
>   page_fault+0x22/0x30
>   copy_user_generic_string+0x2c/0x40
>   copy_page_to_iter+0x8c/0x2b8
>   generic_file_read_iter+0x26e/0x845
>   timerqueue_del+0x31/0x90
>   ceph_read_iter+0x697/0xa33 [ceph]
>   hrtimer_cancel+0x23/0x41
>   futex_wait+0x1c8/0x24d
>   get_futex_key+0x32c/0x39a
>   __vfs_read+0xe0/0x130
>   vfs_read.part.1+0x6c/0x123
>   handle_mm_fault+0x831/0xf9b
>   __fget+0x7e/0xbf
>   SyS_read+0x4d/0xb5
> 
> ceph_read_iter() uses current->journal_info to pass context info to
> ceph_readpages(), because ceph_readpages() needs to know whether its
> caller has already gotten the capability of using the page cache (to
> distinguish read from readahead/fadvise). ceph_read_iter() sets
> current->journal_info, then calls generic_file_read_iter().
> 
> In the above Oops, the page fault happened while copying data to
> userspace. The page fault handler called ext4_page_mkwrite(). The ext4
> code read current->journal_info and assumed it was a journal handle.
> 
> I checked other filesystems; btrfs probably suffers a similar problem
> for its readpage. (A page fault can happen when write() copies data
> from userspace memory and the memory is mapped to a file in btrfs;
> verify_parent_transid() can be called during readpage.)
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: "Yan, Zheng" <z...@redhat.com>

I am not an FS expert so (ab)using journal_info for unrelated purposes
might be acceptable in general, but hooking into the generic PF path like
this is just too ugly to live. Can this be limited to FS code so that
not everybody has to pay additional cycles? With a big fat warning that
(ab)users might want to find a better way to communicate their internal
stuff.

> ---
>  mm/memory.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index a728bed16c20..db2a50233c49 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>   unsigned int flags)
>  {
>   int ret;
> + void *old_journal_info;
>  
>   __set_current_state(TASK_RUNNING);
>  
> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>   if (flags & FAULT_FLAG_USER)
>   mem_cgroup_oom_enable();
>  
> + /*
> +  * Fault can happen when filesystem A's read_iter()/write_iter()
> +  * copies data to/from userspace. Filesystem A may have set
> +  * current->journal_info. If the userspace memory is MAP_SHARED
> +  * mapped to a file in filesystem B, we later may call filesystem
> +  * B's vm operation. Filesystem B may also want to read/set
> +  * current->journal_info.
> +  */
> + old_journal_info = current->journal_info;
> + current->journal_info = NULL;
> +
>   if (unlikely(is_vm_hugetlb_page(vma)))
>   ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>   else
>   ret = __handle_mm_fault(vma, address, flags);
>  
> + current->journal_info = old_journal_info;
> +
>   if (flags & FAULT_FLAG_USER) {
>   mem_cgroup_oom_disable();
>   /*
> -- 
> 2.13.6
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/7] mm: introduce memalloc_nofs_{save,restore} API

2017-03-07 Thread Michal Hocko
On Mon 06-03-17 13:22:14, Andrew Morton wrote:
> On Mon,  6 Mar 2017 14:14:05 +0100 Michal Hocko <mho...@kernel.org> wrote:
[...]
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -210,8 +210,16 @@ struct vm_area_struct;
> >   *
> >   * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
> >   *   that do not require the starting of any physical IO.
> > + *   Please try to avoid using this flag directly and instead use
> > + *   memalloc_noio_{save,restore} to mark the whole scope which cannot
> > + *   perform any IO with a short explanation why. All allocation requests
> > + *   will inherit GFP_NOIO implicitly.
> >   *
> >   * GFP_NOFS will use direct reclaim but will not use any filesystem 
> > interfaces.
> > + *   Please try to avoid using this flag directly and instead use
> > + *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
> > + *   recurse into the FS layer with a short explanation why. All allocation
> > + *   requests will inherit GFP_NOFS implicitly.
> 
> I wonder if these are worth a checkpatch rule.

I am not really sure, to be honest. This may easily end up with people
replacing

do_alloc(GFP_NOFS)

with

memalloc_nofs_save()
do_alloc(GFP_KERNEL)
memalloc_nofs_restore()

which doesn't make any sense of course. From my experience, people tend
to do stupid things just to silence checkpatch warnings very often.
Moreover I believe we need to do the transition to the new api first
before we can push back on the explicit GFP_NOFS usage. Maybe then we
can think about a checkpatch warning.

-- 
Michal Hocko
SUSE Labs


[PATCH 2/7] lockdep: allow to disable reclaim lockup detection

2017-03-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives, and those do happen and usually lead to tweaking the
code to silence lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G   O
-
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(_nondir_ilock_class){-+}, at: [] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [] mark_held_locks+0x79/0xa0
  [] lockdep_trace_alloc+0xb3/0x100
  [] kmem_cache_alloc+0x33/0x230
  [] kmem_zone_alloc+0x81/0x120 [xfs]
  [] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [] xfs_getbmap+0x608/0x8c0 [xfs]
  [] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [] do_vfs_ioctl+0x498/0x670
  [] SyS_ioctl+0x79/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(_nondir_ilock_class);
  <Interrupt>
    lock(_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G   O4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 82a34f10 88003aa078d0 813a14f9 88003d8551c0
 88003aa07920 8110ec65  0001
 8801 000b 0008 88003d855aa0
Call Trace:
 [] dump_stack+0x4b/0x72
 [] print_usage_bug+0x215/0x240
 [] mark_lock+0x1f5/0x660
 [] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [] __lock_acquire+0xa80/0x1e50
 [] ? kmem_cache_alloc+0x15e/0x230
 [] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [] lock_acquire+0xd8/0x1e0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] down_write_nested+0x5e/0xc0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] xfs_ilock+0x177/0x200 [xfs]
 [] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [] evict+0xc5/0x190
 [] dispose_list+0x39/0x60
 [] prune_icache_sb+0x4b/0x60
 [] super_cache_scan+0x14f/0x1a0
 [] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [] shrink_zone+0x15e/0x170
 [] kswapd+0x4f1/0xa80
 [] ? zone_reclaim+0x230/0x230
 [] kthread+0xf2/0x110
 [] ? kthread_create_on_node+0x220/0x220
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add an "I know what I am doing" flag
and mark problematic places. This would prevent abusing the GFP_NOFS
flag, which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Acked-by: Vlastimil Babka <vba...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 include/linux/gfp.h  | 10 +-
 kernel/locking/lockdep.c |  4 
 lib/radix-tree.c |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM  0x40u
 #define ___GFP_WRITE   0x80u
 #define ___GFP_KSWAPD_RECLAIM  0x100u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP   0x400u
+#else
+#define ___GFP_NOLOCKDEP   0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK  ((__force g

[PATCH 7/7] jbd2: make the whole kjournald2 kthread NOFS safe

2017-03-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS with memalloc_nofs_save.

Suggested-by: Jan Kara <j...@suse.cz>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/jbd2/journal.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a1a359bfcc9c..78433ce1db40 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/mm.h>
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -206,6 +207,13 @@ static int kjournald2(void *arg)
	wake_up(&journal->j_wait_done_commit);
 
+	/*
+	 * Make sure that no allocations from this kernel thread will ever
+	 * recurse to the fs layer because we are responsible for the
+	 * transaction commit and any fs involvement might get stuck
+	 * waiting for the trans. commit.
+	 */
+	memalloc_nofs_save();
+
 	/*
 * And now, wait forever for commit wakeup events.
 */
	write_lock(&journal->j_state_lock);
-- 
2.11.0



[PATCH 6/7] jbd2: mark the transaction context with the scope GFP_NOFS context

2017-03-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

Now that we have the memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/jbd2/transaction.c | 12 
 include/linux/jbd2.h  |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5e659ee08d6a..d8f09f34285f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -388,6 +389,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
	rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
jbd2_journal_free_transaction(new_transaction);
+	/*
+	 * Make sure that no allocations done while the transaction is
+	 * open are going to recurse back to the fs layer.
+	 */
+   handle->saved_alloc_context = memalloc_nofs_save();
return 0;
 }
 
@@ -466,6 +472,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
handle->h_transaction->t_tid, type,
line_no, nblocks);
+
return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1767,11 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_rsv_handle)
jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+   /*
+* scope of the GFP_NOFS context is over here and so we can
+* restore the original alloc context.
+*/
+   memalloc_nofs_restore(handle->saved_alloc_context);
jbd2_free_handle(handle);
return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
unsigned long   h_start_jiffies;
unsigned inth_requested_credits;
+
+   unsigned intsaved_alloc_context;
 };
 
 
-- 
2.11.0



[PATCH 3/7] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2017-03-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name PF_MEMALLOC_NOFS which is in line with an existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka <vba...@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
Reviewed-by: Brian Foster <bfos...@redhat.com>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.c |  4 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/sched.h |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 2dfdc62f795e..e14da724a0b5 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -81,13 +81,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
noio_flag = memalloc_noio_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
memalloc_noio_restore(noio_flag);
 
return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
lflags &= ~__GFP_FS;
}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index c3decedc9455..3059a3ec7ecb 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2886,7 +2886,7 @@ xfs_btree_split_worker(
	struct xfs_btree_split_args	*args = container_of(work,
						struct xfs_btree_split_args, work);
unsigned long   pflags;
-   unsigned long   new_pflags = PF_FSTRANS;
+   unsigned long   new_pflags = PF_MEMALLOC_NOFS;
 
/*
 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index bf65a9ea8642..330c6019120e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 * We hand off the transaction to the completion thread now, so
 * clear the flag here.
 */
-   current_restore_flags_nested(>t_pflags, PF_FSTRANS);
+   current_restore_flags_nested(>t_pflags, PF_MEMALLOC_NOFS);
return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 * thus we need to mark ourselves as being in a transaction manually.
 * Similarly for freeze protection.
 */
-   current_set_flags_nested(>t_pflags, PF_FSTRANS);
+   current_set_flags_nested(>t_pflags, PF_MEMALLOC_NOFS);
__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
/* we abort the update if there was an IO error */
@@ -1021,7 +1021,7 @@ xfs_do_writepage(
 * Given that we do not allow direct reclaim to call us, we should
 * never be called while in a filesystem transaction.
 */
-   if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+   if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
goto redirty;
 
/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
boolrsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
/* Mark this thread as being in a transaction */
-   current_set_flags_nested(>t_pflags, PF_FSTRANS);
+   current_set_flags_nested(>t_pflags, PF_MEMALLOC_NOFS);
 
/*
 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserv

[PATCH 1/7] lockdep: teach lockdep about memalloc_noio_save

2017-03-06 Thread Michal Hocko
From: Nikolay Borisov <nbori...@suse.com>

Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
during memory allocation") added the memalloc_noio_(save|restore) functions
to enable people to modify the MM behavior by disabling I/O during memory
allocation. This was further extended by commit 934f3072c17c ("mm: clear
__GFP_FS when PF_MEMALLOC_NOIO is set"). memalloc_noio_* functions prevent
allocation paths recursing back into the filesystem without explicitly
changing the flags for every allocation site. However, lockdep hasn't been
keeping up with the changes and it entirely misses handling the memalloc_noio
adjustments. Instead, it is left to the callers of __lockdep_trace_alloc to
call the function after they have cleared the respective GFP flags, which
can lead to false positives:

[  644.173373] =
[  644.174012] [ INFO: inconsistent lock state ]
[  644.174012] 4.10.0-nbor #134 Not tainted
[  644.174012] -
[  644.174012] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[  644.174012] fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  644.174012]  (_nondir_ilock_class){?.}, at: [] 
xfs_ilock+0x141/0x230
[  644.174012] {IN-RECLAIM_FS-W} state was registered at:
[  644.174012]   __lock_acquire+0x62a/0x17c0
[  644.174012]   lock_acquire+0xc5/0x220
[  644.174012]   down_write_nested+0x4f/0x90
[  644.174012]   xfs_ilock+0x141/0x230
[  644.174012]   xfs_reclaim_inode+0x12a/0x320
[  644.174012]   xfs_reclaim_inodes_ag+0x2c8/0x4e0
[  644.174012]   xfs_reclaim_inodes_nr+0x33/0x40
[  644.174012]   xfs_fs_free_cached_objects+0x19/0x20
[  644.174012]   super_cache_scan+0x191/0x1a0
[  644.174012]   shrink_slab+0x26f/0x5f0
[  644.174012]   shrink_node+0xf9/0x2f0
[  644.174012]   kswapd+0x356/0x920
[  644.174012]   kthread+0x10c/0x140
[  644.174012]   ret_from_fork+0x31/0x40
[  644.174012] irq event stamp: 173777
[  644.174012] hardirqs last  enabled at (173777): [] 
__local_bh_enable_ip+0x70/0xc0
[  644.174012] hardirqs last disabled at (173775): [] 
__local_bh_enable_ip+0x37/0xc0
[  644.174012] softirqs last  enabled at (173776): [] 
_xfs_buf_find+0x67a/0xb70
[  644.174012] softirqs last disabled at (173774): [] 
_xfs_buf_find+0x5db/0xb70
[  644.174012]
[  644.174012] other info that might help us debug this:
[  644.174012]  Possible unsafe locking scenario:
[  644.174012]
[  644.174012]CPU0
[  644.174012]
[  644.174012]   lock(_nondir_ilock_class);
[  644.174012]   <Interrupt>
[  644.174012] lock(_nondir_ilock_class);
[  644.174012]
[  644.174012]  *** DEADLOCK ***
[  644.174012]
[  644.174012] 4 locks held by fsstress/3365:
[  644.174012]  #0:  (sb_writers#10){++}, at: [] 
mnt_want_write+0x24/0x50
[  644.174012]  #1:  (>s_type->i_mutex_key#12){++}, at: 
[] vfs_setxattr+0x6f/0xb0
[  644.174012]  #2:  (sb_internal#2){++}, at: [] 
xfs_trans_alloc+0xfc/0x140
[  644.174012]  #3:  (_nondir_ilock_class){?.}, at: 
[] xfs_ilock+0x141/0x230
[  644.174012]
[  644.174012] stack backtrace:
[  644.174012] CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
[  644.174012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  644.174012] Call Trace:
[  644.174012]  dump_stack+0x85/0xc9
[  644.174012]  print_usage_bug.part.37+0x284/0x293
[  644.174012]  ? print_shortest_lock_dependencies+0x1b0/0x1b0
[  644.174012]  mark_lock+0x27e/0x660
[  644.174012]  mark_held_locks+0x66/0x90
[  644.174012]  lockdep_trace_alloc+0x6f/0xd0
[  644.174012]  kmem_cache_alloc_node_trace+0x3a/0x2c0
[  644.174012]  ? vm_map_ram+0x2a1/0x510
[  644.174012]  vm_map_ram+0x2a1/0x510
[  644.174012]  ? vm_map_ram+0x46/0x510
[  644.174012]  _xfs_buf_map_pages+0x77/0x140
[  644.174012]  xfs_buf_get_map+0x185/0x2a0
[  644.174012]  xfs_attr_rmtval_set+0x233/0x430
[  644.174012]  xfs_attr_leaf_addname+0x2d2/0x500
[  644.174012]  xfs_attr_set+0x214/0x420
[  644.174012]  xfs_xattr_set+0x59/0xb0
[  644.174012]  __vfs_setxattr+0x76/0xa0
[  644.174012]  __vfs_setxattr_noperm+0x5e/0xf0
[  644.174012]  vfs_setxattr+0xae/0xb0
[  644.174012]  ? __might_fault+0x43/0xa0
[  644.174012]  setxattr+0x15e/0x1a0
[  644.174012]  ? __lock_is_held+0x53/0x90
[  644.174012]  ? rcu_read_lock_sched_held+0x93/0xa0
[  644.174012]  ? rcu_sync_lockdep_assert+0x2f/0x60
[  644.174012]  ? __sb_start_write+0x130/0x1d0
[  644.174012]  ? mnt_want_write+0x24/0x50
[  644.174012]  path_setxattr+0x8f/0xc0
[  644.174012]  SyS_lsetxattr+0x11/0x20
[  644.174012]  entry_SYSCALL_64_fastpath+0x23/0xc6

Let's fix this by making lockdep explicitly clear the respective GFP
flags itself.

Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
Acked-by: Michal Hocko <mho...@suse.cz>
Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Signed-off-by: Nikolay Borisov <nbori...@suse.com>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 kernel/loc

[PATCH 0/7 v5] scope GFP_NOFS api

2017-03-06 Thread Michal Hocko
Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. I've just added "lockdep: teach
lockdep about memalloc_noio_save" from Nikolay which is a lockdep bugfix
developed independently but "mm: introduce memalloc_nofs_{save,restore}
API" depends on it so I added it here. Then I've rebased the series on
top of 4.11-rc1 which contains sched.h split up which required to add
sched/mm.h include.

There didn't seem to be any real objections and so I think we should go
and finally merge this - ideally in this release cycle as it doesn't
really introduce any functional changes. Those were separated out and
will be posted later. The risk of regressions should really be small
because we do not remove any real GFP_NOFS users yet.

Diffstat says
 fs/jbd2/journal.c |  8 ++++++++
 fs/jbd2/transaction.c | 12 ++++++++++++
 fs/xfs/kmem.c | 12 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_buf.c  |  8 
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/gfp.h   | 18 +-
 include/linux/jbd2.h  |  2 ++
 include/linux/sched.h |  6 +++---
 include/linux/sched/mm.h  | 26 +++---
 kernel/locking/lockdep.c  | 11 +--
 lib/radix-tree.c  |  2 ++
 mm/page_alloc.c   | 10 ++
 mm/vmscan.c   |  6 +++---
 16 files changed, 106 insertions(+), 37 deletions(-)

Shortlog:
Michal Hocko (6):
  lockdep: allow to disable reclaim lockup detection
  xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  mm: introduce memalloc_nofs_{save,restore} API
  xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  jbd2: mark the transaction context with the scope GFP_NOFS context
  jbd2: make the whole kjournald2 kthread NOFS safe

Nikolay Borisov (1):
  lockdep: teach lockdep about memalloc_noio_save


[1] http://lkml.kernel.org/r/20170206140718.16222-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojz...@thunk.org


[PATCH 4/7] mm: introduce memalloc_nofs_{save,restore} API

2017-03-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

GFP_NOFS context is used for the following 5 reasons currently
- to prevent from deadlocks when the lock held by the allocation
  context would be needed during the memory reclaim
- to prevent from stack overflows during the reclaim because
  the allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on
  other reclaimers to make a forward progress indirectly
- just in case because this would be safe from the fs POV
- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Acked-by: Vlastimil Babka <vba...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.h|  2 +-
 include/linux/gfp.h  |  8 
 include/linux/sched.h|  8 +++-
 include/linux/sched/mm.h | 26 +++---
 kernel/locking/lockdep.c |  6 +++---
 mm/page_alloc.c  | 10 ++
 mm/vmscan.c  |  6 +++---
 7 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+   if (flags & KM_NOFS)
lflags &= ~__GFP_FS;
}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which 
cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4528f7c9789f..9c3ee2281a56 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1211,9 +1211,9 @@ extern struct pid *cad_pid;
 #define PF_USED_ASYNC  0x4000  /* Used async_schedule*(), used 
b

[PATCH 5/7] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-03-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka <vba...@suse.cz>
Reviewed-by: Brian Foster <bfos...@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.c| 12 ++--
 fs/xfs/xfs_buf.c |  8 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index e14da724a0b5..6b7b04468aa8 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -66,7 +66,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-   unsigned noio_flag = 0;
+   unsigned nofs_flag = 0;
void*ptr;
gfp_t   lflags;
 
@@ -78,17 +78,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 * here. Hence we need to tell memory reclaim that we are in such a
-* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+* context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   noio_flag = memalloc_noio_save();
+   if (flags & KM_NOFS)
+   nofs_flag = memalloc_nofs_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   memalloc_noio_restore(noio_flag);
+   if (flags & KM_NOFS)
+   memalloc_nofs_restore(nofs_flag);
 
return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b6208728ba39..ca09061369cb 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -443,17 +443,17 @@ _xfs_buf_map_pages(
bp->b_addr = NULL;
} else {
int retried = 0;
-   unsigned noio_flag;
+   unsigned nofs_flag;
 
/*
 * vm_map_ram() will allocate auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we are likely to be under
 * GFP_NOFS context here. Hence we need to tell memory reclaim
-* that we are in such a context via PF_MEMALLOC_NOIO to prevent
+* that we are in such a context via PF_MEMALLOC_NOFS to prevent
 * memory reclaim re-entering the filesystem here and
 * potentially deadlocking.
 */
-   noio_flag = memalloc_noio_save();
+   nofs_flag = memalloc_nofs_save();
do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-1, PAGE_KERNEL);
@@ -461,7 +461,7 @@ _xfs_buf_map_pages(
break;
vm_unmap_aliases();
} while (retried++ <= 1);
-   memalloc_noio_restore(noio_flag);
+   memalloc_nofs_restore(nofs_flag);
 
if (!bp->b_addr)
return -ENOMEM;
-- 
2.11.0



Re: [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

2017-03-06 Thread Michal Hocko
On Tue 17-01-17 08:54:50, Michal Hocko wrote:
> On Mon 16-01-17 22:01:18, Theodore Ts'o wrote:
> > On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mho...@suse.com>
> > > 
> > > This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> > > sb_getblk_gfp is not really needed as
> > > sb_getblk
> > >   __getblk_gfp
> > >     __getblk_slow
> > >       grow_buffers
> > >         grow_dev_page
> > >           gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
> > > 
> > > so __GFP_FS is cleared unconditionally and therefore the above commit
> > > didn't have any real effect in fact.
> > > 
> > > This patch should not introduce any functional change. The main point
> > > of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> > > make the review of the remaining usage easier.
> > > 
> > > Signed-off-by: Michal Hocko <mho...@suse.com>
> > > Reviewed-by: Jan Kara <j...@suse.cz>
> > 
> > If I'm not mistaken, this patch is not dependent on any of the other
> > patches in this series (and the other patches are not dependent on
> > this one).  Hence, I could take this patch via the ext4 tree, correct?
> 
> Yes, that is correct

Hi Ted,
this doesn't seem to be in any of the branches [1]. I plan to resend the
whole scope nofs series, should I add this to the pile or you are going
to route it via your tree?

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-02-06 Thread Michal Hocko
On Tue 07-02-17 09:51:50, Dave Chinner wrote:
> On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
[...]
> > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > > wording of the old comment because it captures the uncertainty of
> > > whether or not we actually are already under NOFS.  If someone actually
> > > has audited this code well enough to know for sure then yes let's change
> > > the comment, but I haven't gone that far.
> > 
> > I believe we can drop the memalloc_nofs_save then as well because either
> > we are called from a potentially dangerous context and thus we are in
> > the nofs scope, or we do not need the protection at all.
> 
> No, absolutely not. "Belief" is not a sufficient justification for
> removing low level deadlock avoidance infrastructure. This code
> needs to remain in _xfs_buf_map_pages() until a full audit of the
> caller paths is done and we're 100% certain that there are no
> lurking deadlocks.

Exactly. I was actually referring to "If someone actually has audited
this code" above... So I definitely do not want to justify anything
based on the belief.

> For example, I'm pretty sure we can call into _xfs_buf_map_pages()
> outside of a transaction context but with an inode ILOCK held
> exclusively. If we then recurse into memory reclaim and try to run a
> transaction during reclaim, we have an inverted ILOCK vs transaction
> locking order. i.e. we are not allowed to call xfs_trans_reserve()
> with an ILOCK held as that can deadlock the log:  log full, locked
> inode pins tail of log, inode cannot be flushed because ILOCK is
> held by caller waiting for log space to become available
> 
> i.e. there are certain situations where holding a ILOCK is a
> deadlock vector. See xfs_lock_inodes() for an example of the lengths
> we go to avoid ILOCK based log deadlocks like this...

Thanks for the reference. This is really helpful!

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-02-06 Thread Michal Hocko
On Mon 06-02-17 11:51:11, Darrick J. Wong wrote:
> On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > > > bp->b_addr = NULL;
> > > > > > } else {
> > > > > > int retried = 0;
> > > > > > -   unsigned noio_flag;
> > > > > > +   unsigned nofs_flag;
> > > > > >  
> > > > > > /*
> > > > > >  * vm_map_ram() will allocate auxillary structures (e.g.
> > > > > >  * pagetables) with GFP_KERNEL, yet we are likely to be 
> > > > > > under
> > > > > >  * GFP_NOFS context here. Hence we need to tell memory 
> > > > > > reclaim
> > > > > > -* that we are in such a context via PF_MEMALLOC_NOIO 
> > > > > > to prevent
> > > > > > +* that we are in such a context via PF_MEMALLOC_NOFS 
> > > > > > to prevent
> > > > > >  * memory reclaim re-entering the filesystem here and
> > > > > >  * potentially deadlocking.
> > > > > >  */
> > > > > 
> > > > > This comment feels out of date ... how about:
> > > > 
> > > > which part is out of date?
> > > > 
> > > > > 
> > > > >   /*
> > > > >* vm_map_ram will allocate auxiliary structures (eg 
> > > > > page
> > > > >* tables) with GFP_KERNEL.  If that tries to reclaim 
> > > > > memory
> > > > >* by calling back into this filesystem, we may 
> > > > > deadlock.
> > > > >* Prevent that by setting the NOFS flag.
> > > > >*/
> > > > 
> > > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > > more chatty than yours but I am not sure this is worth changing.
> > > 
> > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > > wording of the old comment because it captures the uncertainty of
> > > whether or not we actually are already under NOFS.  If someone actually
> > > has audited this code well enough to know for sure then yes let's change
> > > the comment, but I haven't gone that far.
> 
> Ugh, /me hands himself another cup of coffee...
> 
> Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this
> discussion.  Probably because they have similar code snippets with very
> similar comments to two totally different parts of xfs.
> 
> The _xfs_buf_map_pages can be called inside or outside of
> transaction context, so I think we still have to memalloc_nofs_save for
> that to avoid the lockdep complaints and deadlocks referenced in the
> commit that added all that (to _xfs_buf_map_pages) in the first place.
> ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram")

Yes, and that memalloc_nofs_save would start with the transaction
context so this (_xfs_buf_map_pages) call would be already covered so
additional memalloc_nofs_save would be unnecessary. Right now I am not
sure whether this is always the case so I have kept this "just to be
sure" measure. Checking that would be in the next step when I would like
to remove other KM_NOFS usage so that we would always rely on the scope
inside the transaction or other potentially dangerous (e.g. from the
stack usage POV and who knows what else) contexts.

Does that make more sense now?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-02-06 Thread Michal Hocko
On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > bp->b_addr = NULL;
> > > > } else {
> > > > int retried = 0;
> > > > -   unsigned noio_flag;
> > > > +   unsigned nofs_flag;
> > > >  
> > > > /*
> > > >  * vm_map_ram() will allocate auxillary structures (e.g.
> > > >  * pagetables) with GFP_KERNEL, yet we are likely to be 
> > > > under
> > > >  * GFP_NOFS context here. Hence we need to tell memory 
> > > > reclaim
> > > > -* that we are in such a context via PF_MEMALLOC_NOIO 
> > > > to prevent
> > > > +* that we are in such a context via PF_MEMALLOC_NOFS 
> > > > to prevent
> > > >  * memory reclaim re-entering the filesystem here and
> > > >  * potentially deadlocking.
> > > >  */
> > > 
> > > This comment feels out of date ... how about:
> > 
> > which part is out of date?
> > 
> > > 
> > >   /*
> > >* vm_map_ram will allocate auxiliary structures (eg page
> > >* tables) with GFP_KERNEL.  If that tries to reclaim memory
> > >* by calling back into this filesystem, we may deadlock.
> > >* Prevent that by setting the NOFS flag.
> > >*/
> > 
> > dunno, the previous wording seems clear enough to me. Maybe little bit
> > more chatty than yours but I am not sure this is worth changing.
> 
> I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> wording of the old comment because it captures the uncertainty of
> whether or not we actually are already under NOFS.  If someone actually
> has audited this code well enough to know for sure then yes let's change
> the comment, but I haven't gone that far.

I believe we can drop the memalloc_nofs_save then as well because either
we are called from a potentially dangerous context and thus we are in
the nofs scope, or we do not need the protection at all.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-02-06 Thread Michal Hocko
On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > bp->b_addr = NULL;
> > } else {
> > int retried = 0;
> > -   unsigned noio_flag;
> > +   unsigned nofs_flag;
> >  
> > /*
> >  * vm_map_ram() will allocate auxillary structures (e.g.
> >  * pagetables) with GFP_KERNEL, yet we are likely to be under
> >  * GFP_NOFS context here. Hence we need to tell memory reclaim
> > -* that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > +* that we are in such a context via PF_MEMALLOC_NOFS to prevent
> >  * memory reclaim re-entering the filesystem here and
> >  * potentially deadlocking.
> >  */
> 
> This comment feels out of date ... how about:

which part is out of date?

> 
>   /*
>* vm_map_ram will allocate auxiliary structures (eg page
>* tables) with GFP_KERNEL.  If that tries to reclaim memory
>* by calling back into this filesystem, we may deadlock.
>* Prevent that by setting the NOFS flag.
>*/

dunno, the previous wording seems clear enough to me. Maybe little bit
more chatty than yours but I am not sure this is worth changing.

> 
> > -   noio_flag = memalloc_noio_save();
> > +   nofs_flag = memalloc_nofs_save();
> > do {
> > bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > -1, PAGE_KERNEL);
> 
> Also, I think it shows that this is the wrong place in XFS to be calling
> memalloc_nofs_save().  I'm not arguing against including this patch;
> it's a step towards where we want to be.  I also don't know XFS well
> enough to know where to set that flag ;-)  Presumably when we start a
> transaction ... ?

Yes that is what I would like to achieve longterm. And the reason why I
didn't want to mimic this pattern in kvmalloc as some have suggested.
It just takes much more time to get there from the past experience and
we should really start somewhere.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection

2017-02-06 Thread Michal Hocko
On Mon 06-02-17 07:24:00, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:34:50PM +0100, Michal Hocko wrote:
> > This part is not needed for the patch, strictly speaking but I wanted to
> > make the code more future proof.
> 
> Understood.  I took an extra bit myself for marking the radix tree as
> being used for an IDR (so the radix tree now uses 4 bits).  I see you
> already split out the address space GFP mask from the other flags :-)
> I would prefer not to do that with the radix tree, but I understand
> your desire for more GFP bits.  I'm not entirely sure that an implicit
> gfpmask makes a lot of sense for the radix tree, but it'd be a big effort
> to change all the callers.  Anyway, I'm going to update your line here
> for my current tree and add the build bug so we'll know if we ever hit
> any problems.

OK, do I get it right that the patch can stay as is and go to Andrew?
I would really like to not rebase the patch again for something that is
not merged yet. I really hope for getting this merged finally...

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection

2017-02-06 Thread Michal Hocko
On Mon 06-02-17 06:26:41, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:07:13PM +0100, Michal Hocko wrote:
> > While we are at it also make sure that the radix tree doesn't
> > accidentally override tags stored in the upper part of the gfp_mask.
> 
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 9dc093d5ef39..7550be09f9d6 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
> >  void __init radix_tree_init(void)
> >  {
> > int ret;
> > +
> > +   BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
> > radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
> > sizeof(struct radix_tree_node), 0,
> > SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
> 
> That's going to have a conceptual conflict with some patches I have
> in flight.  I'll take this part through my radix tree patch collection.

This part is not needed for the patch, strictly speaking but I wanted to
make the code more future proof.

-- 
Michal Hocko
SUSE Labs


[PATCH 1/6] lockdep: allow to disable reclaim lockup detection

2017-02-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives, and those do happen and usually lead to tweaking the
code to silence lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G   O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){-+}, at: [] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [] mark_held_locks+0x79/0xa0
  [] lockdep_trace_alloc+0xb3/0x100
  [] kmem_cache_alloc+0x33/0x230
  [] kmem_zone_alloc+0x81/0x120 [xfs]
  [] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [] xfs_getbmap+0x608/0x8c0 [xfs]
  [] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [] do_vfs_ioctl+0x498/0x670
  [] SyS_ioctl+0x79/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G   O   4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 82a34f10 88003aa078d0 813a14f9 88003d8551c0
 88003aa07920 8110ec65  0001
 8801 000b 0008 88003d855aa0
Call Trace:
 [] dump_stack+0x4b/0x72
 [] print_usage_bug+0x215/0x240
 [] mark_lock+0x1f5/0x660
 [] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [] __lock_acquire+0xa80/0x1e50
 [] ? kmem_cache_alloc+0x15e/0x230
 [] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [] lock_acquire+0xd8/0x1e0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] down_write_nested+0x5e/0xc0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] xfs_ilock+0x177/0x200 [xfs]
 [] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [] evict+0xc5/0x190
 [] dispose_list+0x39/0x60
 [] prune_icache_sb+0x4b/0x60
 [] super_cache_scan+0x14f/0x1a0
 [] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [] shrink_zone+0x15e/0x170
 [] kswapd+0x4f1/0xa80
 [] ? zone_reclaim+0x230/0x230
 [] kthread+0xf2/0x110
 [] ? kthread_create_on_node+0x220/0x220
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add "I know what I am doing flag"
and mark problematic places. This would prevent from abusing GFP_NOFS
flag which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Acked-by: Vlastimil Babka <vba...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 include/linux/gfp.h  | 10 +-
 kernel/locking/lockdep.c |  4 
 lib/radix-tree.c |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM  0x40u
 #define ___GFP_WRITE   0x80u
 #define ___GFP_KSWAPD_RECLAIM  0x100u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP   0x400u
+#else
+#define ___GFP_NOLOCKDEP   0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK  ((__force g

[PATCH 2/6] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2017-02-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka <vba...@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
Reviewed-by: Brian Foster <bfos...@redhat.com>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.c |  4 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/sched.h |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
noio_flag = memalloc_noio_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
memalloc_noio_restore(noio_flag);
 
return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
lflags &= ~__GFP_FS;
}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
	struct xfs_btree_split_args	*args = container_of(work,
					struct xfs_btree_split_args, work);
unsigned long   pflags;
-   unsigned long   new_pflags = PF_FSTRANS;
+   unsigned long   new_pflags = PF_MEMALLOC_NOFS;
 
/*
 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a4434297697..b3d41c1d67ab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 * We hand off the transaction to the completion thread now, so
 * clear the flag here.
 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 * thus we need to mark ourselves as being in a transaction manually.
 * Similarly for freeze protection.
 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 * Given that we do not allow direct reclaim to call us, we should
 * never be called while in a filesystem transaction.
 */
-   if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+   if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
goto redirty;
 
/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
	bool	rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
/*
 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserv

[PATCH 0/6 v4] scope GFP_NOFS api

2017-02-06 Thread Michal Hocko
Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. Few acks added and one new user of
memalloc_noio_flags (in alloc_contig_range) converted. I have decided
to drop the last two ext4 related patches. One of them will be picked up
by Ted [2] and the other one will probably need more time to settle down.
I believe it is OK as is but let's not block the whole thing just because
of it.

There didn't seem to be any real objections and so I think we should
go and merge this to mmotm tree and target the next merge window. The
risk of regressions is really small because we do not remove any real
GFP_NOFS users yet.

I hope to get ext4 parts resolved in the follow up patches as well as
pull other filesystems in. There is still a lot work to do but having
the infrastructure in place should be very useful already.

The patchset is based on next-20170206

Diffstat says
 fs/jbd2/journal.c |  7 +++
 fs/jbd2/transaction.c | 11 +++
 fs/xfs/kmem.c | 12 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_buf.c  |  8 
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/gfp.h   | 18 +-
 include/linux/jbd2.h  |  2 ++
 include/linux/sched.h | 32 ++--
 kernel/locking/lockdep.c  |  6 +-
 lib/radix-tree.c  |  2 ++
 mm/page_alloc.c   | 10 ++
 mm/vmscan.c   |  6 +++---
 15 files changed, 100 insertions(+), 36 deletions(-)

Shortlog:
Michal Hocko (6):
  lockdep: allow to disable reclaim lockup detection
  xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  mm: introduce memalloc_nofs_{save,restore} API
  xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  jbd2: mark the transaction context with the scope GFP_NOFS context
  jbd2: make the whole kjournald2 kthread NOFS safe

[1] http://lkml.kernel.org/r/20170106141107.23953-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojz...@thunk.org



[PATCH 5/6] jbd2: mark the transaction context with the scope GFP_NOFS context

2017-02-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

Now that we have the memalloc_nofs_{save,restore} API we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/jbd2/transaction.c | 11 +++
 include/linux/jbd2.h  |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
	rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
jbd2_journal_free_transaction(new_transaction);
+   /*
+	 * Make sure that no allocations done while the transaction is
+	 * open are going to recurse back to the fs layer.
+*/
+   handle->saved_alloc_context = memalloc_nofs_save();
return 0;
 }
 
@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
handle->h_transaction->t_tid, type,
line_no, nblocks);
+
return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_rsv_handle)
jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+   /*
+	 * scope of the GFP_NOFS context is over here and so we can
+* restore the original alloc context.
+*/
+   memalloc_nofs_restore(handle->saved_alloc_context);
jbd2_free_handle(handle);
return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
unsigned long   h_start_jiffies;
	unsigned int	h_requested_credits;
+
+	unsigned int	saved_alloc_context;
 };
 
 
-- 
2.11.0



[PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-02-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka <vba...@suse.cz>
Reviewed-by: Brian Foster <bfos...@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.c| 12 ++--
 fs/xfs/xfs_buf.c |  8 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-   unsigned noio_flag = 0;
+   unsigned nofs_flag = 0;
	void	*ptr;
gfp_t   lflags;
 
@@ -77,17 +77,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 * here. Hence we need to tell memory reclaim that we are in such a
-* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+* context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   noio_flag = memalloc_noio_save();
+   if (flags & KM_NOFS)
+   nofs_flag = memalloc_nofs_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   memalloc_noio_restore(noio_flag);
+   if (flags & KM_NOFS)
+   memalloc_nofs_restore(nofs_flag);
 
return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8c7d01b75922..676a9ae75b9a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -442,17 +442,17 @@ _xfs_buf_map_pages(
bp->b_addr = NULL;
} else {
int retried = 0;
-   unsigned noio_flag;
+   unsigned nofs_flag;
 
/*
 * vm_map_ram() will allocate auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we are likely to be under
 * GFP_NOFS context here. Hence we need to tell memory reclaim
-* that we are in such a context via PF_MEMALLOC_NOIO to prevent
+* that we are in such a context via PF_MEMALLOC_NOFS to prevent
 * memory reclaim re-entering the filesystem here and
 * potentially deadlocking.
 */
-   noio_flag = memalloc_noio_save();
+   nofs_flag = memalloc_nofs_save();
do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-1, PAGE_KERNEL);
@@ -460,7 +460,7 @@ _xfs_buf_map_pages(
break;
vm_unmap_aliases();
} while (retried++ <= 1);
-   memalloc_noio_restore(noio_flag);
+   memalloc_nofs_restore(nofs_flag);
 
if (!bp->b_addr)
return -ENOMEM;
-- 
2.11.0



[PATCH 6/6] jbd2: make the whole kjournald2 kthread NOFS safe

2017-02-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

Suggested-by: Jan Kara <j...@suse.cz>
Reviewed-by: Jan Kara <j...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/jbd2/journal.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 704139625fbe..662531a70ce1 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
	wake_up(&journal->j_wait_done_commit);
 
/*
+	 * Make sure that no allocations from this kernel thread will ever recurse
+	 * to the fs layer because we are responsible for the transaction commit
+	 * and any fs involvement might get stuck waiting for the trans. commit.
+*/
+   memalloc_nofs_save();
+
+   /*
 * And now, wait forever for commit wakeup events.
 */
	write_lock(&journal->j_state_lock);
-- 
2.11.0



[PATCH 3/6] mm: introduce memalloc_nofs_{save,restore} API

2017-02-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

GFP_NOFS context is used for the following 5 reasons currently
- to prevent from deadlocks when the lock held by the allocation
  context would be needed during the memory reclaim
- to prevent from stack overflows during the reclaim because
  the allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on
  other reclaimers to make a forward progress indirectly
- just in case because this would be safe from the fs POV
- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Acked-by: Vlastimil Babka <vba...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.h|  2 +-
 include/linux/gfp.h  |  8 
 include/linux/sched.h| 34 ++
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c  | 10 ++
 mm/vmscan.c  |  6 +++---
 6 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+   if (flags & KM_NOFS)
lflags &= ~__GFP_FS;
}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5be9818e9bd9..6573e9f04aed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2299,9 +2299,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC  0x4000  /* used async_schedule

Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-02-03 Thread Michal Hocko
On Mon 30-01-17 09:12:10, Michal Hocko wrote:
> On Fri 27-01-17 11:40:42, Theodore Ts'o wrote:
> > On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> > > If this ever turns out to be a problem and with the vmapped stacks we
> > > have good chances to get proper stack traces on a potential overflow
> > > we can add the scope API around the problematic code path with the
> > > explanation why it is needed.
> > 
> > Yeah, or maybe we can automate it?  Can the reclaim code check how
> > much stack space is left and do the right thing automatically?
> 
> I am not sure how to do that. Checking for some magic value sounds quite
> fragile to me. It also sounds a bit strange to focus only on the reclaim
> while other code paths might suffer from the same problem.
> 
> What is actually the deepest possible call chain from the slab reclaim
> where I stopped? I have tried to follow that path but hit the callback
> wall quite early.
>  
> > The reason why I'm nervous is that nojournal mode is not a common
> > configuration, and "wait until production systems start failing" is
> > not a strategy that I or many SRE-types find comforting.
> 
> I understand that but I would be much more happier if we did the
> decision based on the actual data rather than a fear something would
> break down.

ping on this. I would really like to move forward here and target 4.11
merge window. Is your concern so serious to block this patch?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-30 Thread Michal Hocko
On Fri 27-01-17 11:40:42, Theodore Ts'o wrote:
> On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> > If this ever turns out to be a problem and with the vmapped stacks we
> > have good chances to get proper stack traces on a potential overflow
> > we can add the scope API around the problematic code path with the
> > explanation why it is needed.
> 
> Yeah, or maybe we can automate it?  Can the reclaim code check how
> much stack space is left and do the right thing automatically?

I am not sure how to do that. Checking for some magic value sounds quite
fragile to me. It also sounds a bit strange to focus only on the reclaim
while other code paths might suffer from the same problem.

What is actually the deepest possible call chain from the slab reclaim
where I stopped? I have tried to follow that path but hit the callback
wall quite early.
 
> The reason why I'm nervous is that nojournal mode is not a common
> configuration, and "wait until production systems start failing" is
> not a strategy that I or many SRE-types find comforting.

I understand that but I would be much more happier if we did the
decision based on the actual data rather than a fear something would
break down.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-27 Thread Michal Hocko
On Fri 27-01-17 01:13:18, Theodore Ts'o wrote:
> On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > > I'm convinced the current series is OK, only real life will tell us
> > > > whether we missed something or not ;)
> > > 
> > > I would like to extend the changelog of "jbd2: mark the transaction
> > > context with the scope GFP_NOFS context".
> > > 
> > > "
> > > Please note that setups without journal do not suffer from potential
> > > recursion problems and so they do not need the scope protection because
> > > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > > points from the direct reclaim) can reenter a locked context which is
> > > doing the allocation currently.
> > > "
> > 
> > Could you comment on this Ted, please?
> 
> I guess so; there still is one way this could screw us, and it's this
> reason for GFP_NOFS:
> 
> - to prevent from stack overflows during the reclaim because
> the allocation is performed from a deep context already
> 
> The writepages call stack can be pretty deep.  (Especially if we're
> using ext4 in no journal mode over, say, iSCSI.)
> 
> How much stack space can get consumed by a reclaim?

./scripts/stackusage with allyesconfig says:

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264     static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520     static
./mm/vmscan.c:2946      try_to_free_pages       216     static
./mm/vmscan.c:2753      do_try_to_free_pages    304     static
./mm/vmscan.c:2517      shrink_node             352     static
./mm/vmscan.c:2317      shrink_node_memcg       560     static
./mm/vmscan.c:1692      shrink_inactive_list    688     static
./mm/vmscan.c:908       shrink_page_list        608     static

So this would be 3512 for the standard LRU reclaim whether we have
GFP_FS or not. shrink_page_list can recurse to releasepage but there is
no NOFS protection there so it doesn't make much sense to check this
path. So we are left with the slab shrinkers path

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264     static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520     static
./mm/vmscan.c:2946      try_to_free_pages       216     static
./mm/vmscan.c:2753      do_try_to_free_pages    304     static
./mm/vmscan.c:2517      shrink_node             352     static
./mm/vmscan.c:427       shrink_slab             336     static
./fs/super.c:56         super_cache_scan        104     static  << here we have the NOFS protection
./fs/dcache.c:1089      prune_dcache_sb         152     static
./fs/dcache.c:939       shrink_dentry_list      96      static
./fs/dcache.c:509       __dentry_kill           72      static
./fs/dcache.c:323       dentry_unlink_inode     64      static
./fs/inode.c:1527       iput                    80      static
./fs/inode.c:532        evict                   72      static

This is where the fs specific callbacks play role and I am not sure
which paths can pass through for ext4 in the nojournal mode and how much
of the stack this can eat. But currently we are at +536 wrt. NOFS
context. This is quite a lot but still much less (2632 vs. 3512) than
the regular reclaim. So there is quite some stack space to eat... I am
wondering whether we really have to treat the nojournal mode specially
just because of the stack usage?

If this ever turns out to be a problem and with the vmapped stacks we
have good chances to get proper stack traces on a potential overflow
we can add the scope API around the problematic code path with the
explanation why it is needed.

Does that make sense to you?

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-25 Thread Michal Hocko
On Thu 19-01-17 10:44:05, Michal Hocko wrote:
> On Thu 19-01-17 10:22:36, Jan Kara wrote:
> > On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> > > On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > > > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > > > But before going to play with that I am really wondering whether 
> > > > > > > we need
> > > > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > > > journal lock(s) which is the biggest problem from the reclaim 
> > > > > > > recursion
> > > > > > > point of view. What would cause a deadlock in no journal mode?
> > > > > > 
> > > > > > We still have the original problem for why we need GFP_NOFS even in
> > > > > > ext2.  If we are in a writeback path, and we need to allocate 
> > > > > > memory,
> > > > > > we don't want to recurse back into the file system's writeback path.
> > > > > 
> > > > > But we do not enter the writeback path from the direct reclaim. Or do
> > > > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > > > There is only try_to_release_page where we get back to the filesystems
> > > > > but I do not see any NOFS protection in ext4_releasepage.
> > > > 
> > > > Maybe to expand a bit: These days, direct reclaim can call 
> > > > ->releasepage()
> > > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 
> > > > 0),
> > > > shrinkers. That's it. So the recursion possibilities are rather more 
> > > > limited
> > > > than they used to be several years ago and we likely do not need as much
> > > > GFP_NOFS protection as we used to.
> > > 
> > > Thanks for making my remark more clear Jack! I would just want to add
> > > that I was playing with the patch below (it is basically
> > > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> > > debugging patch which means they are called from within transaction) and
> > > it didn't hit the lockdep when running xfstests both with or without the
> > > enabled journal.
> > > 
> > > So am I still missing something or the nojournal mode is safe and the
> > > current series is OK wrt. ext*?
> > 
> > I'm convinced the current series is OK, only real life will tell us whether
> > we missed something or not ;)
> 
> I would like to extend the changelog of "jbd2: mark the transaction
> context with the scope GFP_NOFS context".
> 
> "
> Please note that setups without journal do not suffer from potential
> recursion problems and so they do not need the scope protection because
> neither ->releasepage nor ->evict_inode (which are the only fs entry
> points from the direct reclaim) can reenter a locked context which is
> doing the allocation currently.
> "

Could you comment on this Ted, please?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-19 Thread Michal Hocko
On Thu 19-01-17 10:22:36, Jan Kara wrote:
> On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> > On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > > But before going to play with that I am really wondering whether we 
> > > > > > need
> > > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > > journal lock(s) which is the biggest problem from the reclaim 
> > > > > > recursion
> > > > > > point of view. What would cause a deadlock in no journal mode?
> > > > > 
> > > > > We still have the original problem for why we need GFP_NOFS even in
> > > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > > we don't want to recurse back into the file system's writeback path.
> > > > 
> > > > But we do not enter the writeback path from the direct reclaim. Or do
> > > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > > There is only try_to_release_page where we get back to the filesystems
> > > > but I do not see any NOFS protection in ext4_releasepage.
> > > 
> > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > > shrinkers. That's it. So the recursion possibilities are rather more 
> > > limited
> > > than they used to be several years ago and we likely do not need as much
> > > GFP_NOFS protection as we used to.
> > 
> > Thanks for making my remark more clear Jack! I would just want to add
> > that I was playing with the patch below (it is basically
> > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> > debugging patch which means they are called from within transaction) and
> > it didn't hit the lockdep when running xfstests both with or without the
> > enabled journal.
> > 
> > So am I still missing something or the nojournal mode is safe and the
> > current series is OK wrt. ext*?
> 
> I'm convinced the current series is OK, only real life will tell us whether
> we missed something or not ;)

I would like to extend the changelog of "jbd2: mark the transaction
context with the scope GFP_NOFS context".

"
Please note that setups without journal do not suffer from potential
recursion problems and so they do not need the scope protection because
neither ->releasepage nor ->evict_inode (which are the only fs entry
points from the direct reclaim) can reenter a locked context which is
doing the allocation currently.
"
 
> > The following patch in its current form is WIP and needs a proper review
> > before I post it.
> 
> So jbd2 changes look confusing (although technically correct) to me - we
> *always* should run in NOFS context in those place so having GFP_KERNEL
> there looks like it is unnecessarily hiding what is going on. So in those
> places I'd prefer to keep GFP_NOFS or somehow else make it very clear these
> allocations are expected to be GFP_NOFS (and assert that). Otherwise the
> changes look good to me.

I would really like to get rid of most of the NOFS direct usage and only
dictate it via the scope API, otherwise I suspect we will just grow more
users and end up in the same situation we are in now. In principle only
the context which changes the reclaim reentrancy policy should care
about NOFS and everybody else should just pretend nothing like that
exists. There might be a few exceptions of course, I am not yet sure
whether jbd2 is that case. But I am not proposing this change yet
(thanks for checking anyway)...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-19 Thread Michal Hocko
On Tue 17-01-17 18:29:25, Jan Kara wrote:
> On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > But before going to play with that I am really wondering whether we need
> > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > point of view. What would cause a deadlock in no journal mode?
> > > 
> > > We still have the original problem for why we need GFP_NOFS even in
> > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > we don't want to recurse back into the file system's writeback path.
> > 
> > But we do not enter the writeback path from the direct reclaim. Or do
> > you mean something other than pageout()'s mapping->a_ops->writepage?
> > There is only try_to_release_page where we get back to the filesystems
> > but I do not see any NOFS protection in ext4_releasepage.
> 
> Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> shrinkers. That's it. So the recursion possibilities are rather more limited
> than they used to be several years ago and we likely do not need as much
> GFP_NOFS protection as we used to.

Thanks for making my remark more clear Jack! I would just want to add
that I was playing with the patch below (it is basically
GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
debugging patch which means they are called from within transaction) and
it didn't hit the lockdep when running xfstests both with or without the
enabled journal.

So am I still missing something or the nojournal mode is safe and the
current series is OK wrt. ext*?

The following patch in its current form is WIP and needs a proper review
before I post it.
---
 fs/ext4/inode.c   |  4 ++--
 fs/ext4/mballoc.c | 14 +++---
 fs/ext4/xattr.c   |  2 +-
 fs/jbd2/journal.c |  4 ++--
 fs/jbd2/revoke.c  |  2 +-
 fs/jbd2/transaction.c |  2 +-
 6 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b7d141c3b810..841cb8c4cb5e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page,
return __ext4_journalled_writepage(page, len);
 
	ext4_io_submit_init(&io_submit, wbc);
-   io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
+   io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
if (!io_submit.io_end) {
redirty_page_for_writepage(wbc, page);
unlock_page(page);
@@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
int err = 0;
 
page = find_or_create_page(mapping, from >> PAGE_SHIFT,
-  mapping_gfp_constraint(mapping, ~__GFP_FS));
+  mapping_gfp_mask(mapping));
if (!page)
return -ENOMEM;
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d9fd184b049e..67b97cd6e3d6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
  struct ext4_buddy *e4b)
 {
-   return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
+   return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
 }
 
 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
@@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
/* We only do this if the grp has never been initialized */
if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-   int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+   int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
if (ret)
return ret;
}
@@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
BUG_ON(ac->ac_status != AC_STATUS_FOUND);
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
-   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
if (pa == NULL)
return -ENOMEM;
 
@@ -3694,7 +3694,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
BUG_ON(ext4_pspace_cachep == NULL);
-   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+   pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
if (pa == NULL)
return -ENOMEM;
 
@@ -4479,7 +4479,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
}
}
 
-  

Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-18 Thread Michal Hocko
On Tue 17-01-17 14:04:03, Andreas Dilger wrote:
> On Jan 17, 2017, at 8:59 AM, Theodore Ts'o <ty...@mit.edu> wrote:
> > 
> > On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> >> 
> >> OK, so I've been staring into the code and AFAIU current->journal_info
> >> can contain my stored information. I could either hijack part of the
> >> word as the ref counting is only consuming low 12b. But that looks too
> >> ugly to live. Or I can allocate some placeholder.
> > 
> > Yeah, I was looking at something similar.  Can you guarantee that the
> > context will only take one or two bits?  (Looks like it only needs one
> > bit ATM, even though at the moment you're storing the whole GFP mask,
> > correct?)
> > 
> >> But before going to play with that I am really wondering whether we need
> >> all this with no journal at all. AFAIU what Jack told me it is the
> >> journal lock(s) which is the biggest problem from the reclaim recursion
> >> point of view. What would cause a deadlock in no journal mode?
> > 
> > We still have the original problem for why we need GFP_NOFS even in
> > ext2.  If we are in a writeback path, and we need to allocate memory,
> > we don't want to recurse back into the file system's writeback path.
> > Certainly not for the same inode, and while we could make it work if
> > the mm was writing back another inode, or another superblock, there
> > are also stack depth considerations that would make this be a bad
> > idea.  So we do need to be able to assert GFP_NOFS even in no journal
> > mode, and for any file system including ext2, for that matter.
> > 
> > Because of the fact that we're going to have to play games with
> > current->journal_info, maybe this is something that I should take
> > responsibility for, and to go through the the ext4 tree after the main
> > patch series go through?  Maybe you could use xfs and ext2 as sample
> > (simple) implementations?
> > 
> > My only ask is that the memalloc nofs context be a well defined N
> > bits, where N < 16, and I'll find some place to put them (probably
> > journal_info).
> 
> I think Dave was suggesting that the NOFS context allow a pointer to
> an arbitrary struct, so that it is possible to dereference this in
> the filesystem itself to determine if the recursion is safe or not.

Yes, but can we start with a simpler approach first? Even this approach
takes quite some time to be used.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-17 Thread Michal Hocko
On Tue 17-01-17 10:59:16, Theodore Ts'o wrote:
> On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> > 
> > OK, so I've been staring into the code and AFAIU current->journal_info
> > can contain my stored information. I could either hijack part of the
> > word as the ref counting is only consuming low 12b. But that looks too
> > ugly to live. Or I can allocate some placeholder.
> 
> Yeah, I was looking at something similar.  Can you guarantee that the
> context will only take one or two bits?  (Looks like it only needs one
> bit ATM, even though at the moment you're storing the whole GFP mask,
> correct?)

No, I am just storing PF_MEMALLOC_NO{FS,IO} but I assume further changes
might want to pull more state into the scope context.

> > But before going to play with that I am really wondering whether we need
> > all this with no journal at all. AFAIU what Jack told me it is the
> > journal lock(s) which is the biggest problem from the reclaim recursion
> > point of view. What would cause a deadlock in no journal mode?
> 
> We still have the original problem for why we need GFP_NOFS even in
> ext2.  If we are in a writeback path, and we need to allocate memory,
> we don't want to recurse back into the file system's writeback path.

But we do not enter the writeback path from the direct reclaim. Or do
you mean something other than pageout()'s mapping->a_ops->writepage?
There is only try_to_release_page where we get back to the filesystems
but I do not see any NOFS protection in ext4_releasepage.

> Certainly not for the same inode, and while we could make it work if
> the mm was writing back another inode, or another superblock, there
> are also stack depth considerations that would make this be a bad
> idea.  So we do need to be able to assert GFP_NOFS even in no journal
> mode, and for any file system including ext2, for that matter.
> 
> Because of the fact that we're going to have to play games with
> current->journal_info, maybe this is something that I should take
> responsibility for, and to go through the the ext4 tree after the main
> patch series go through?

Do you see a possibility that we could handle the nojournal mode on top
of "[PATCH 5/8] jbd2: mark the transaction context with the scope
GFP_NOFS context" in a separate patch?

But anyway, I agree that we should go with the API sooner rather than
later.

>   Maybe you could use xfs and ext2 as sample
> (simple) implementations?
> 
> My only ask is that the memalloc nofs context be a well defined N
> bits, where N < 16, and I'll find some place to put them (probably
> journal_info).

I am pretty sure that we won't need more than a bit or two in a
foreseeable future (I can think of GFP_NOWAIT being one candidate).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-17 Thread Michal Hocko
On Tue 17-01-17 09:24:25, Michal Hocko wrote:
> On Mon 16-01-17 21:56:07, Theodore Ts'o wrote:
> > On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mho...@suse.com>
> > > 
> > > This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> > > the transaction context uses memalloc_nofs_save and all allocations
> > > within this context inherit GFP_NOFS automatically, there is no
> > > reason to mark specific allocations explicitly.
> > > 
> > > This patch should not introduce any functional change. The main point
> > > of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> > > to make the review of the remaining usage easier.
> > > 
> > > Signed-off-by: Michal Hocko <mho...@suse.com>
> > > Reviewed-by: Jan Kara <j...@suse.cz>
> > 
> > Changes in the jbd2 layer aren't going to guarantee that
> > memalloc_nofs_save() will be executed if we are running ext4 without a
> > journal (aka in no journal mode).  And this is a *very* common
> > configuration; it's how ext4 is used inside Google in our production
> > servers.
> 
> OK, I wasn't aware of that.
> 
> > So that means the earlier patches will probably need to be changed so
> > the NOFS scope is done in the ext4_journal_{start,stop} functions in
> > fs/ext4/ext4_jbd2.c.
> 
> I could definitely appreciate some help here. The call paths are rather
> complex and I am not familiar with the code enough. One of the biggest
> problems I have currently is that there doesn't seem to be an easy place
> to store the old allocation context. 

OK, so I've been staring into the code and AFAIU current->journal_info
can contain my stored information. I could either hijack part of the
word as the ref counting is only consuming low 12b. But that looks too
ugly to live. Or I can allocate some placeholder.

But before going to play with that I am really wondering whether we need
all this with no journal at all. AFAIU what Jack told me it is the
journal lock(s) which is the biggest problem from the reclaim recursion
point of view. What would cause a deadlock in no journal mode?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-17 Thread Michal Hocko
On Mon 16-01-17 21:56:07, Theodore Ts'o wrote:
> On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mho...@suse.com>
> > 
> > This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> > the transaction context uses memalloc_nofs_save and all allocations
> > within this context inherit GFP_NOFS automatically, there is no
> > reason to mark specific allocations explicitly.
> > 
> > This patch should not introduce any functional change. The main point
> > of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> > to make the review of the remaining usage easier.
> > 
> > Signed-off-by: Michal Hocko <mho...@suse.com>
> > Reviewed-by: Jan Kara <j...@suse.cz>
> 
> Changes in the jbd2 layer aren't going to guarantee that
> memalloc_nofs_save() will be executed if we are running ext4 without a
> journal (aka in no journal mode).  And this is a *very* common
> configuration; it's how ext4 is used inside Google in our production
> servers.

OK, I wasn't aware of that.

> So that means the earlier patches will probably need to be changed so
> the NOFS scope is done in the ext4_journal_{start,stop} functions in
> fs/ext4/ext4_jbd2.c.

I could definitely appreciate some help here. The call paths are rather
complex and I am not familiar with the code enough. One of the biggest
problems I have currently is that there doesn't seem to be an easy place
to store the old allocation context. The original patch had it inside
the journal handle. I was thinking about putting it into superblock but
ext4_journal_stop doesn't seem to have access to the sb if there is no
handle. Now, if ext4_journal_start is never called from a nested context
then this is not a big deal but there are just too many callers to
check...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

2017-01-16 Thread Michal Hocko
On Mon 16-01-17 22:01:18, Theodore Ts'o wrote:
> On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mho...@suse.com>
> > 
> > This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> > sb_getblk_gfp is not really needed as
> > sb_getblk
> >   __getblk_gfp
> >     __getblk_slow
> >       grow_buffers
> >         grow_dev_page
> >           gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
> > 
> > so __GFP_FS is cleared unconditionally and therefore the above commit
> > didn't have any real effect in fact.
> > 
> > This patch should not introduce any functional change. The main point
> > of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> > make the review of the remaining usage easier.
> > 
> > Signed-off-by: Michal Hocko <mho...@suse.com>
> > Reviewed-by: Jan Kara <j...@suse.cz>
> 
> If I'm not mistaken, this patch is not dependent on any of the other
> patches in this series (and the other patches are not dependent on
> this one).  Hence, I could take this patch via the ext4 tree, correct?

Yes, that is correct

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] etrfs: fix up misleading GFP_NOFS usage in btrfs_releasepage

2017-01-11 Thread Michal Hocko
On Wed 11-01-17 14:55:50, David Sterba wrote:
> On Mon, Jan 09, 2017 at 03:39:02PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mho...@suse.com>
> > 
> > b335b0034e25 ("Btrfs: Avoid using __GFP_HIGHMEM with slab allocator")
> > has reduced the allocation mask in btrfs_releasepage to GFP_NOFS just
> > to prevent from giving an inappropriate gfp mask to the slab allocator
> > deeper down the callchain (in alloc_extent_state). This is wrong for
> > two reasons: a) GFP_NOFS might be just too restrictive for the calling
> > context b) it is better to tweak the gfp mask down when it needs that.
> > 
> > So just remove the mask tweaking from btrfs_releasepage and move it
> > down to alloc_extent_state where it is needed.
> > 
> > Signed-off-by: Michal Hocko <mho...@suse.com>
> > ---
> >  fs/btrfs/extent_io.c | 5 +++++
> >  fs/btrfs/inode.c     | 2 +-
> >  2 files changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index b38150eec6b4..f6ae94a4acad 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -226,6 +226,11 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
> >  {
> > struct extent_state *state;
> >  
> > +   /*
> > +* The given mask might be not appropriate for the slab allocator,
> > +* drop the unsupported bits
> > +*/
> > +   mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
> 
> Is this future proof enough? As it's enumerating some gfp flags, what if
> more are necessary in the future? I'm interested about some synthetic
> gfp flags that would not require knowledge about what is or is not
> acceptable for slab allocator.

Well, I agree, that something like slab_restrict_gfp_mask(gfp_t gfp_mask)
would be much better. And in fact that sounds like a nice future
cleanup. I haven't checked how many users would find it useful yet but I
am putting that on my todo list.

> But otherwise looks ok to me, I'm going to merge the patch. Thanks.

Thanks!

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] etrfs: fix up misleading GFP_NOFS usage in btrfs_releasepage

2017-01-11 Thread Michal Hocko
On Wed 11-01-17 14:55:50, David Sterba wrote:
[...]
> But otherwise looks ok to me, I'm going to merge the patch. Thanks.

I have only now noticed typo in the subject. s@etrfs:@btrfs:@

-- 
Michal Hocko
SUSE Labs


[PATCH 1/2] etrfs: fix up misleading GFP_NOFS usage in btrfs_releasepage

2017-01-09 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

b335b0034e25 ("Btrfs: Avoid using __GFP_HIGHMEM with slab allocator")
has reduced the allocation mask in btrfs_releasepage to GFP_NOFS just
to prevent from giving an inappropriate gfp mask to the slab allocator
deeper down the callchain (in alloc_extent_state). This is wrong for
two reasons: a) GFP_NOFS might be just too restrictive for the calling
context b) it is better to tweak the gfp mask down when it needs that.

So just remove the mask tweaking from btrfs_releasepage and move it
down to alloc_extent_state where it is needed.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/btrfs/extent_io.c | 5 +++++
 fs/btrfs/inode.c     | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b38150eec6b4..f6ae94a4acad 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -226,6 +226,11 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
 {
struct extent_state *state;
 
+   /*
+* The given mask might be not appropriate for the slab allocator,
+* drop the unsupported bits
+*/
+   mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
state = kmem_cache_alloc(extent_state_cache, mask);
if (!state)
return state;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index baa40d34d2c9..d118d4659c28 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8994,7 +8994,7 @@ static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
if (PageWriteback(page) || PageDirty(page))
return 0;
-   return __btrfs_releasepage(page, gfp_flags & GFP_NOFS);
+   return __btrfs_releasepage(page, gfp_flags);
 }
 
 static void btrfs_invalidatepage(struct page *page, unsigned int offset,
-- 
2.11.0



[PATCH 2/2] btrfs: drop gfp mask tweaking in try_release_extent_state

2017-01-09 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

try_release_extent_state reduces the gfp mask to GFP_NOFS if it is
compatible. This is true for GFP_KERNEL as well. There is no real
reason to do that though. There is no new lock taken down the path of
the only consumer of the gfp mask, which is
try_release_extent_state
  clear_extent_bit
    __clear_extent_bit
      alloc_extent_state

So this seems just unnecessary and confusing.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/btrfs/extent_io.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f6ae94a4acad..8158930c8d4a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4326,8 +4326,6 @@ static int try_release_extent_state(struct extent_map_tree *map,
   EXTENT_IOBITS, 0, NULL))
ret = 0;
else {
-   if ((mask & GFP_NOFS) == GFP_NOFS)
-   mask = GFP_NOFS;
/*
 * at this point we can safely clear everything except the
 * locked bit and the nodatasum bit
-- 
2.11.0



Re: [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2017-01-09 Thread Michal Hocko
On Mon 09-01-17 13:59:05, Vlastimil Babka wrote:
> On 01/06/2017 03:11 PM, Michal Hocko wrote:
> > From: Michal Hocko <mho...@suse.com>
> > 
> > xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> > some time ago. We would like to make this concept more generic and use
> > it for other filesystems as well. Let's start by giving the flag a
> > more generic name PF_MEMALLOC_NOFS which is in line with an existing
> > PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> > contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> > step before we introduce a full API for it as xfs uses the flag directly
> > anyway.
> > 
> > This patch doesn't introduce any functional change.
> > 
> > Signed-off-by: Michal Hocko <mho...@suse.com>
> > Reviewed-by: Brian Foster <bfos...@redhat.com>
> 
> Acked-by: Vlastimil Babka <vba...@suse.cz>

Thanks!

> 
> A nit:
> 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
> >  #define PF_FREEZER_SKIP 0x4000  /* Freezer should not count it as freezable */
> >  #define PF_SUSPEND_TASK 0x8000  /* this thread called freeze_processes and should not be frozen */
> >  
> > +#define PF_MEMALLOC_NOFS PF_FSTRANS  /* Transition to a more generic GFP_NOFS scope semantic */
> 
> I don't see why this transition is needed, as there are already no users
> of PF_FSTRANS after this patch. The next patch doesn't remove any more,
> so this is just extra churn IMHO. But not a strong objection.

I just wanted to have this transparent for the xfs in this patch.
AFAIR Dave wanted to have xfs and generic parts separated. So it was the
easiest to simply keep the flag and then remove it in a separate patch.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-01-09 Thread Michal Hocko
On Mon 09-01-17 15:08:27, Vlastimil Babka wrote:
> On 01/06/2017 03:11 PM, Michal Hocko wrote:
> > From: Michal Hocko <mho...@suse.com>
> > 
> > kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> > API to prevent from reclaim recursion into the fs because vmalloc can
> > invoke unconditional GFP_KERNEL allocations and these functions might be
> > called from the NOFS contexts. The memalloc_noio_save will enforce
> > GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> > unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> > provide exactly what we need here - implicit GFP_NOFS context.
> > 
> > Changes since v1
> > - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
> >   as per Brian Foster
> > 
> > Signed-off-by: Michal Hocko <mho...@suse.com>
> 
> Not a xfs expert, but seems correct.
> 
> Acked-by: Vlastimil Babka <vba...@suse.cz>

Thanks!

> 
> Nit below:
> 
> > ---
> >  fs/xfs/kmem.c| 10 +-
> >  fs/xfs/xfs_buf.c |  8 
> >  2 files changed, 9 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a76a05dae96b..d69ed5e76621 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  void *
> >  kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> >  {
> > -   unsigned noio_flag = 0;
> > +   unsigned nofs_flag = 0;
> > void*ptr;
> > gfp_t   lflags;
> >  
> > @@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> >  * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> >  * the filesystem here and potentially deadlocking.
> 
> The comment above is now largely obsolete, or minimally should be
> changed to PF_MEMALLOC_NOFS?
---
diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index d69ed5e76621..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -77,7 +77,7 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 * here. Hence we need to tell memory reclaim that we are in such a
-* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+* context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
if (flags & KM_NOFS)

I will fold it into the original patch.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API

2017-01-09 Thread Michal Hocko
On Mon 09-01-17 14:42:10, Michal Hocko wrote:
> On Mon 09-01-17 14:04:21, Vlastimil Babka wrote:
[...]
> Now that you have opened this I have noticed that the code is wrong
> here because GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK would overwrite
> the removed GFP_FS.

Blee, it wouldn't, because ~GFP_RECLAIM_MASK contains neither
GFP_FS nor GFP_IO. So all is good here.

> I guess it would be better and less error prone
> to move the current_gfp_context part into the direct reclaim entry -
> do_try_to_free_pages - and put the comment like this

well, after thinking about it some more, we should probably keep it where it is.
If for nothing else try_to_free_mem_cgroup_pages has a tracepoint which
prints the gfp mask so we should use the filtered one. So let's just
scratch this follow up fix.

> ---
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4ea6b610f20e..df7975185f11 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2756,6 +2756,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>   int initial_priority = sc->priority;
>   unsigned long total_scanned = 0;
>   unsigned long writeback_threshold;
> +
> + /*
> +  * Make sure that the gfp context properly handles scope gfp mask.
> +  * This might weaken the reclaim context (e.g. make it GFP_NOFS or
> +  * GFP_NOIO).
> +  */
> + sc->gfp_mask = current_gfp_context(sc->gfp_mask);
>  retry:
>   delayacct_freepages_start();
>  
> @@ -2949,7 +2956,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   unsigned long nr_reclaimed;
>   struct scan_control sc = {
>   .nr_to_reclaim = SWAP_CLUSTER_MAX,
> - .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
> + .gfp_mask = gfp_mask,
>   .reclaim_idx = gfp_zone(gfp_mask),
>   .order = order,
>   .nodemask = nodemask,
> @@ -3029,8 +3036,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>   int nid;
>   struct scan_control sc = {
>   .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> - .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
> - (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> + .gfp_mask = GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK,
>   .reclaim_idx = MAX_NR_ZONES - 1,
>   .target_mem_cgroup = memcg,
>   .priority = DEF_PRIORITY,
> @@ -3723,7 +3729,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>   int classzone_idx = gfp_zone(gfp_mask);
>   struct scan_control sc = {
>   .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> - .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
> + .gfp_mask = gfp_mask,
>   .order = order,
>   .priority = NODE_RECLAIM_PRIORITY,
>   .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API

2017-01-09 Thread Michal Hocko
On Mon 09-01-17 14:04:21, Vlastimil Babka wrote:
[...]
> > +static inline unsigned int memalloc_nofs_save(void)
> > +{
> > +   unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> > +   current->flags |= PF_MEMALLOC_NOFS;
> 
> So this is not new, as same goes for memalloc_noio_save, but I've
> noticed that e.g. exit_signal() does tsk->flags |= PF_EXITING;
> So is it possible that there's a r-m-w hazard here?

exit_signals operates on current, and task_struct::flags should only
ever be manipulated by the current task, so there is no r-m-w hazard.
[...]

> > @@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> > int nid;
> > struct scan_control sc = {
> > .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> > -   .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> > +   .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
> 
> So this function didn't do memalloc_noio_flags() before? Is it a bug
> that should be fixed separately or at least mentioned? Because that
> looks like a functional change...

We didn't need it. Kmem charges are opt-in and currently all of them
support GFP_IO. The LRU pages are not charged in NOIO context either.
We need it now because there will be callers to charge GFP_KERNEL while
being inside the NOFS scope.

Now that you have opened this I have noticed that the code is wrong
here because GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK would overwrite
the removed GFP_FS. I guess it would be better and less error prone
to move the current_gfp_context part into the direct reclaim entry -
do_try_to_free_pages - and put the comment like this
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4ea6b610f20e..df7975185f11 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2756,6 +2756,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
int initial_priority = sc->priority;
unsigned long total_scanned = 0;
unsigned long writeback_threshold;
+
+   /*
+* Make sure that the gfp context properly handles scope gfp mask.
+* This might weaken the reclaim context (e.g. make it GFP_NOFS or
+* GFP_NOIO).
+*/
+   sc->gfp_mask = current_gfp_context(sc->gfp_mask);
 retry:
delayacct_freepages_start();
 
@@ -2949,7 +2956,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
unsigned long nr_reclaimed;
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
-   .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
+   .gfp_mask = gfp_mask,
.reclaim_idx = gfp_zone(gfp_mask),
.order = order,
.nodemask = nodemask,
@@ -3029,8 +3036,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
int nid;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-   .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
-   (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+   .gfp_mask = GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK,
.reclaim_idx = MAX_NR_ZONES - 1,
.target_mem_cgroup = memcg,
.priority = DEF_PRIORITY,
@@ -3723,7 +3729,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
int classzone_idx = gfp_zone(gfp_mask);
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-   .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
+   .gfp_mask = gfp_mask,
.order = order,
    .priority = NODE_RECLAIM_PRIORITY,
.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
Michal Hocko
SUSE Labs


[DEBUG PATCH 2/2] silence warnings which we cannot do anything about

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

There are some code paths used by all the filesystems which we cannot
change to drop the GFP_NOFS, yet they generate a lot of warnings.
Provide {disable,enable}_scope_gfp_check to silence those.
alloc_page_buffers and grow_dev_page are silenced right away.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/buffer.c   |  4 
 include/linux/sched.h | 11 +++
 mm/page_alloc.c   |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 28484b3ebc98..dbe529e7881b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -873,7 +873,9 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
head = NULL;
offset = PAGE_SIZE;
while ((offset -= size) >= 0) {
+   disable_scope_gfp_check();
bh = alloc_buffer_head(GFP_NOFS);
+   enable_scope_gfp_check();
if (!bh)
goto no_grow;
 
@@ -1003,7 +1005,9 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 */
gfp_mask |= __GFP_NOFAIL;
 
+   disable_scope_gfp_check();
page = find_or_create_page(inode->i_mapping, index, gfp_mask);
+   enable_scope_gfp_check();
if (!page)
return ret;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 59428926e989..f60294732ed5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,7 @@ struct task_struct {
/* A live task holds one reference. */
atomic_t stack_refcount;
 #endif
+   bool disable_scope_gfp_warn;
unsigned long nofs_caller;
unsigned long noio_caller;
 /* CPU-specific state of this task */
@@ -2390,6 +2391,16 @@ static inline unsigned int __memalloc_nofs_save(unsigned long caller)
return flags;
 }
 
+static inline void disable_scope_gfp_check(void)
+{
+   current->disable_scope_gfp_warn = true;
+}
+
+static inline void enable_scope_gfp_check(void)
+{
+   current->disable_scope_gfp_warn = false;
+}
+
 #define memalloc_nofs_save()   __memalloc_nofs_save(_RET_IP_)
 
 static inline void memalloc_nofs_restore(unsigned int flags)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 87a2bb5262b2..5405278bd733 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3762,6 +3762,9 @@ void debug_scope_gfp_context(gfp_t gfp_mask)
if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
return;
 
+   if (current->disable_scope_gfp_warn)
+   return;
+
if (current->flags & PF_MEMALLOC_NOIO)
restrict_mask = __GFP_IO;
else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
-- 
2.11.0



[DEBUG PATCH 1/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.

Let's help this process and add a debugging tool to catch when an
explicit allocation request for GFP_NO{FS,IO} is done from the scope
context. The printed stacktrace should help to identify the caller
and evaluate whether it can be changed to use a wider context or whether
it is called from another potentially dangerous context which needs
a scope protection as well.

The checks have to be enabled explicitly by debug_scope_gfp kernel
command line parameter.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 include/linux/sched.h | 14 +++--
 include/linux/slab.h  |  3 +++
 mm/page_alloc.c   | 58 +++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2032fc642a26..59428926e989 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,8 @@ struct task_struct {
/* A live task holds one reference. */
atomic_t stack_refcount;
 #endif
+   unsigned long nofs_caller;
+   unsigned long noio_caller;
 /* CPU-specific state of this task */
struct thread_struct thread;
 /*
@@ -2345,6 +2347,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+extern void debug_scope_gfp_context(gfp_t gfp_mask);
+
 /*
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
@@ -2363,25 +2367,31 @@ static inline gfp_t current_gfp_context(gfp_t flags)
return flags;
 }
 
-static inline unsigned int memalloc_noio_save(void)
+static inline unsigned int __memalloc_noio_save(unsigned long caller)
 {
unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
current->flags |= PF_MEMALLOC_NOIO;
+   current->noio_caller = caller;
return flags;
 }
 
+#define memalloc_noio_save()   __memalloc_noio_save(_RET_IP_)
+
 static inline void memalloc_noio_restore(unsigned int flags)
 {
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
-static inline unsigned int memalloc_nofs_save(void)
+static inline unsigned int __memalloc_nofs_save(unsigned long caller)
 {
unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
current->flags |= PF_MEMALLOC_NOFS;
+   current->nofs_caller = caller;
return flags;
 }
 
+#define memalloc_nofs_save()   __memalloc_nofs_save(_RET_IP_)
+
 static inline void memalloc_nofs_restore(unsigned int flags)
 {
current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..6559668e29db 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -477,6 +477,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  */
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
+   debug_scope_gfp_context(flags);
if (__builtin_constant_p(size)) {
if (size > KMALLOC_MAX_CACHE_SIZE)
return kmalloc_large(size, flags);
@@ -517,6 +518,7 @@ static __always_inline int kmalloc_size(int n)
 
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
+   debug_scope_gfp_context(flags);
 #ifndef CONFIG_SLOB
if (__builtin_constant_p(size) &&
size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
@@ -575,6 +577,7 @@ int memcg_update_all_caches(int num_memcgs);
  */
 static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
 {
+   debug_scope_gfp_context(flags);
if (size != 0 && n > SIZE_MAX / size)
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5138b46a4295..87a2bb5262b2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3738,6 +3738,63 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
return page;
 }
 
+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+   debug_scope_gfp = true;
+   return 0;
+}
+
+/*
+ * spit the stack trace if the given gfp_mask clears flags which are context
+ * wide cleared. Such a caller can remove special flags clearing and rely on
+ * the context wide mask.
+ */
+void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+   gfp_t restrict_mask;
+
+   if (likely(!debug_scope_gfp))
+   return;
+
+   /* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+   if (!(gfp_mask & __GFP_DIRECT_RECLAIM))

[DEBUG PATCH 0/2] debug explicit GFP_NO{FS,IO} usage from the scope context

2017-01-06 Thread Michal Hocko
These two patches should help to identify explicit GFP_NO{FS,IO} usage
from within a scope context and reduce such usage as a result. Such
usage can be changed to the full GFP_KERNEL because all the calls
from within the NO{FS,IO} scope will drop __GFP_FS resp. __GFP_IO
automatically, and if the function is called outside of the scope then
we do not need to restrict it to NOFS/NOIO as long as all the reclaim
recursion unsafe contexts are marked properly. This means that each such
reported allocation site has to be checked before being converted.

The debugging has to be enabled explicitly by a kernel command line
parameter and then it reports the stack trace of the allocation and
also the function which has started the current scope.

These two patches are _not_ intended to be merged and they are only
aimed at debugging.



[PATCH 1/8] lockdep: allow to disable reclaim lockup detection

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the
code to silence the lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G   O
-
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(_nondir_ilock_class){-+}, at: [] 
xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [] mark_held_locks+0x79/0xa0
  [] lockdep_trace_alloc+0xb3/0x100
  [] kmem_cache_alloc+0x33/0x230
  [] kmem_zone_alloc+0x81/0x120 [xfs]
  [] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [] xfs_getbmap+0x608/0x8c0 [xfs]
  [] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [] do_vfs_ioctl+0x498/0x670
  [] SyS_ioctl+0x79/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x6f

   CPU0
   
  lock(_nondir_ilock_class);
  
lock(_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G   O4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 82a34f10 88003aa078d0 813a14f9 88003d8551c0
 88003aa07920 8110ec65  0001
 8801 000b 0008 88003d855aa0
Call Trace:
 [] dump_stack+0x4b/0x72
 [] print_usage_bug+0x215/0x240
 [] mark_lock+0x1f5/0x660
 [] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [] __lock_acquire+0xa80/0x1e50
 [] ? kmem_cache_alloc+0x15e/0x230
 [] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [] lock_acquire+0xd8/0x1e0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] down_write_nested+0x5e/0xc0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] xfs_ilock+0x177/0x200 [xfs]
 [] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [] evict+0xc5/0x190
 [] dispose_list+0x39/0x60
 [] prune_icache_sb+0x4b/0x60
 [] super_cache_scan+0x14f/0x1a0
 [] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [] shrink_zone+0x15e/0x170
 [] kswapd+0x4f1/0xa80
 [] ? zone_reclaim+0x230/0x230
 [] kthread+0xf2/0x110
 [] ? kthread_create_on_node+0x220/0x220
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special use case IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add "I know what I am doing flag"
and mark problematic places. This would prevent from abusing GFP_NOFS
flag which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 include/linux/gfp.h  | 10 +-
 kernel/locking/lockdep.c |  4 
 lib/radix-tree.c |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..1a934383cc20 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -41,6 +41,11 @@ struct vm_area_struct;
 #define ___GFP_OTHER_NODE  0x80u
 #define ___GFP_WRITE   0x100u
 #define ___GFP_KSWAPD_RECLAIM  0x200u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP   0x400u
+#else
+#define ___GFP_NOLOCKDEP   0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -186,8 +191,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 #define __GFP_OTHER_NODE ((__f

[PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
sb_getblk_gfp is not really needed as
sb_getblk
  __getblk_gfp
__getblk_slow
  grow_buffers
grow_dev_page
  gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp

so __GFP_FS is cleared unconditionally and therefore the above commit
didn't have any real effect in fact.

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <mho...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/extents.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3e295d3350a9..9867b9e5ad8f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -518,7 +518,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
struct buffer_head  *bh;
int err;
 
-   bh = sb_getblk_gfp(inode->i_sb, pblk, __GFP_MOVABLE | GFP_NOFS);
+   bh = sb_getblk(inode->i_sb, pblk);
if (unlikely(!bh))
return ERR_PTR(-ENOMEM);
 
@@ -1096,7 +1096,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
err = -EFSCORRUPTED;
goto cleanup;
}
-   bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+   bh = sb_getblk(inode->i_sb, newblock);
if (unlikely(!bh)) {
err = -ENOMEM;
goto cleanup;
@@ -1290,7 +1290,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
if (newblock == 0)
return err;
 
-   bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+   bh = sb_getblk(inode->i_sb, newblock);
if (unlikely(!bh))
return -ENOMEM;
lock_buffer(bh);
-- 
2.11.0



[PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

GFP_NOFS context is used for the following 5 reasons currently
- to prevent from deadlocks when the lock held by the allocation
  context would be needed during the memory reclaim
- to prevent from stack overflows during the reclaim because
  the allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on
  other reclaimers to make a forward progress indirectly
- just in case because this would be safe from the fs POV
- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantics. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.h|  2 +-
 include/linux/gfp.h  |  8 
 include/linux/sched.h| 34 ++
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c  |  8 +---
 mm/vmscan.c  |  6 +++---
 6 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+   if (flags & KM_NOFS)
lflags &= ~__GFP_FS;
}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a934383cc20..bfe53d95c25b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -217,8 +217,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index abeb84604d32..2032fc642a26 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2307,9 +2307,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC  0x4000  /* used async_schedule*(), used by module init */
 #define PF_NOFREEZE0x

[PATCH 0/8 v3] scope GFP_NOFS api

2017-01-06 Thread Michal Hocko
Hi,
I have posted the previous version here [1]. Since then I've collected some
Reviewed-bys and fixed some minor issues. I've dropped patch 2 [2] based
on Dave's request [3]. I agree that this can be done later rather than
all at once. I still think that __GFP_NOLOCKDEP should be added by this
series to make further development easier.

There didn't seem to be any real objections and so I think we should go
and merge this and build further on top. I would like to get rid of all
explicit GFP_NOFS usage in ext4 code. I have something half-baked already
and will send it later on. I also hope we can get further with xfs
as well.

I haven't heard anything from the btrfs or other filesystem folks, which
is a bit unfortunate, but I do not want to wait for them too much longer;
they can join the effort later on.
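As a quick illustration of the scope API this series builds on, here is a
small user-space model of the save/restore pattern. The names mirror the
kernel API, but the flag value and the code itself are illustrative stand-ins,
not the kernel implementation:

```c
#include <assert.h>

/* Illustrative stand-ins for current->flags and the task flag bit. */
#define PF_MEMALLOC_NOFS 0x40000u

static unsigned int current_flags;

/* Enter a NOFS scope: set the flag, return its previous state. */
static unsigned int memalloc_nofs_save(void)
{
	unsigned int old = current_flags & PF_MEMALLOC_NOFS;

	current_flags |= PF_MEMALLOC_NOFS;
	return old;
}

/* Leave the scope: put the previous state back, so scopes nest safely. */
static void memalloc_nofs_restore(unsigned int old)
{
	current_flags = (current_flags & ~PF_MEMALLOC_NOFS) | old;
}
```

Because restore() re-installs whatever save() returned, an inner scope ending
does not clear the flag for a still-open outer scope.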

The patchset is based on next-20170106

Diffstat says
 fs/ext4/acl.c |  6 +++---
 fs/ext4/extents.c |  8 
 fs/ext4/resize.c  |  4 ++--
 fs/ext4/xattr.c   |  4 ++--
 fs/jbd2/journal.c |  7 +++
 fs/jbd2/transaction.c | 11 +++
 fs/xfs/kmem.c | 10 +-
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_buf.c  |  8 
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/gfp.h   | 18 +-
 include/linux/jbd2.h  |  2 ++
 include/linux/sched.h | 32 ++--
 kernel/locking/lockdep.c  |  6 +-
 lib/radix-tree.c  |  2 ++
 mm/page_alloc.c   |  8 +---
 mm/vmscan.c   |  6 +++---
 19 files changed, 109 insertions(+), 45 deletions(-)

Shortlog:
Michal Hocko (8):
  lockdep: allow to disable reclaim lockup detection
  xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  mm: introduce memalloc_nofs_{save,restore} API
  xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  jbd2: mark the transaction context with the scope GFP_NOFS context
  jbd2: make the whole kjournald2 kthread NOFS safe
  Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
  Revert "ext4: fix wrong gfp type under transaction"

[1] http://lkml.kernel.org/r/20161215140715.12732-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/20161215140715.12732-3-mho...@kernel.org
[3] http://lkml.kernel.org/r/20161219212413.GN4326@dastard


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/8] jbd2: make the whole kjournald2 kthread NOFS safe

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

Suggested-by: Jan Kara <j...@suse.cz>
Signed-off-by: Michal Hocko <mho...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/jbd2/journal.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a097048ed1a3..3a449150f834 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
wake_up(&journal->j_wait_done_commit);
 
/*
+* Make sure that no allocations from this kernel thread will ever recurse
+* to the fs layer because we are responsible for the transaction commit
+* and any fs involvement might get stuck waiting for the transaction commit.
+*/
+   memalloc_nofs_save();
+
+   /*
 * And now, wait forever for commit wakeup events.
 */
write_lock(&journal->j_state_lock);
-- 
2.11.0



[PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name PF_MEMALLOC_NOFS which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <mho...@suse.com>
Reviewed-by: Brian Foster <bfos...@redhat.com>
---
 fs/xfs/kmem.c |  4 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/sched.h |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
noio_flag = memalloc_noio_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
memalloc_noio_restore(noio_flag);
 
return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
lflags &= ~__GFP_FS;
}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
struct xfs_btree_split_args *args = container_of(work,
struct xfs_btree_split_args, 
work);
unsigned long   pflags;
-   unsigned long   new_pflags = PF_FSTRANS;
+   unsigned long   new_pflags = PF_MEMALLOC_NOFS;
 
/*
 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index ef382bfb402b..d4094bb55033 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 * We hand off the transaction to the completion thread now, so
 * clear the flag here.
 */
-   current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 * thus we need to mark ourselves as being in a transaction manually.
 * Similarly for freeze protection.
 */
-   current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 * Given that we do not allow direct reclaim to call us, we should
 * never be called while in a filesystem transaction.
 */
-   if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+   if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
goto redirty;
 
/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
boolrsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
/* Mark this thread as being in a transaction */
-   current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
/*
 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
if (blocks > 0) {
error = xfs_mod_fdblocks(tp->t_mountp, 

[PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.
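The relation described above — GFP_NOIO being the stricter allocation
context, GFP_NOFS the milder one — can be sketched with illustrative flag
bits (the values are made up; only the masking relation matters): GFP_NOIO
forbids both IO and FS recursion from reclaim, GFP_NOFS only FS recursion.

```c
#include <assert.h>

/* Illustrative bit values; the kernel's actual values differ. */
#define __GFP_IO      0x40u
#define __GFP_FS      0x80u
#define __GFP_RECLAIM 0x10u

#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOFS   (__GFP_RECLAIM | __GFP_IO)  /* may start IO, not FS */
#define GFP_NOIO   (__GFP_RECLAIM)             /* neither IO nor FS */
```

So switching kmem_zalloc_large from a NOIO scope to a NOFS scope keeps the
deadlock protection while still letting reclaim start plain IO.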

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.c| 10 +-
 fs/xfs/xfs_buf.c |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..d69ed5e76621 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-   unsigned noio_flag = 0;
+   unsigned nofs_flag = 0;
void*ptr;
gfp_t   lflags;
 
@@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   noio_flag = memalloc_noio_save();
+   if (flags & KM_NOFS)
+   nofs_flag = memalloc_nofs_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   memalloc_noio_restore(noio_flag);
+   if (flags & KM_NOFS)
+   memalloc_nofs_restore(nofs_flag);
 
return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7f0a01f7b592..8cb8dd4cdfd8 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -441,17 +441,17 @@ _xfs_buf_map_pages(
bp->b_addr = NULL;
} else {
int retried = 0;
-   unsigned noio_flag;
+   unsigned nofs_flag;
 
/*
 * vm_map_ram() will allocate auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we are likely to be under
 * GFP_NOFS context here. Hence we need to tell memory reclaim
-* that we are in such a context via PF_MEMALLOC_NOIO to prevent
+* that we are in such a context via PF_MEMALLOC_NOFS to prevent
 * memory reclaim re-entering the filesystem here and
 * potentially deadlocking.
 */
-   noio_flag = memalloc_noio_save();
+   nofs_flag = memalloc_nofs_save();
do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-1, PAGE_KERNEL);
@@ -459,7 +459,7 @@ _xfs_buf_map_pages(
break;
vm_unmap_aliases();
} while (retried++ <= 1);
-   memalloc_noio_restore(noio_flag);
+   memalloc_nofs_restore(nofs_flag);
 
if (!bp->b_addr)
return -ENOMEM;
-- 
2.11.0



[PATCH 5/8] jbd2: mark the transaction context with the scope GFP_NOFS context

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

Now that we have the memalloc_nofs_{save,restore} API we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.
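How a hardcoded GFP_KERNEL allocation deep inside vmalloc becomes
effectively NOFS can be modeled by the allocator masking each request
against the calling task's scope flags. This is a sketch of the idea, not
the kernel's actual helper:

```c
#include <assert.h>

/* Illustrative values; not the kernel's. */
#define __GFP_FS         0x80u
#define PF_MEMALLOC_NOFS 0x40000u

static unsigned int current_flags;  /* stands in for current->flags */

/* Mask an allocation request against the calling task's scope flags. */
static unsigned int effective_gfp(unsigned int gfp)
{
	if (current_flags & PF_MEMALLOC_NOFS)
		gfp &= ~__GFP_FS;
	return gfp;
}
```

With this shape, code inside the transaction scope never has to spell out
GFP_NOFS: the __GFP_FS bit is dropped for it at allocation time.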

Signed-off-by: Michal Hocko <mho...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/jbd2/transaction.c | 11 +++
 include/linux/jbd2.h  |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
jbd2_journal_free_transaction(new_transaction);
+   /*
+* Make sure that no allocations done while the transaction is
+* open are going to recurse back to the fs layer.
+*/
+   handle->saved_alloc_context = memalloc_nofs_save();
return 0;
 }
 
@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
handle->h_transaction->t_tid, type,
line_no, nblocks);
+
return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_rsv_handle)
jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+   /*
+* scope of the GFP_NOFS context is over here and so we can
+* restore the original alloc context.
+*/
+   memalloc_nofs_restore(handle->saved_alloc_context);
jbd2_free_handle(handle);
return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
unsigned long   h_start_jiffies;
unsigned inth_requested_credits;
+
+   unsigned intsaved_alloc_context;
 };
 
 
-- 
2.11.0



[PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-06 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
the transaction context uses memalloc_nofs_save and all allocations
within this context inherit GFP_NOFS automatically, there is no
reason to mark specific allocations explicitly.

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code
to make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <mho...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/ext4/acl.c | 6 +++---
 fs/ext4/extents.c | 2 +-
 fs/ext4/resize.c  | 4 ++--
 fs/ext4/xattr.c   | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index fd389935ecd1..9e98092c2a4b 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -32,7 +32,7 @@ ext4_acl_from_disk(const void *value, size_t size)
return ERR_PTR(-EINVAL);
if (count == 0)
return NULL;
-   acl = posix_acl_alloc(count, GFP_NOFS);
+   acl = posix_acl_alloc(count, GFP_KERNEL);
if (!acl)
return ERR_PTR(-ENOMEM);
for (n = 0; n < count; n++) {
@@ -94,7 +94,7 @@ ext4_acl_to_disk(const struct posix_acl *acl, size_t *size)
 
*size = ext4_acl_size(acl->a_count);
ext_acl = kmalloc(sizeof(ext4_acl_header) + acl->a_count *
-   sizeof(ext4_acl_entry), GFP_NOFS);
+   sizeof(ext4_acl_entry), GFP_KERNEL);
if (!ext_acl)
return ERR_PTR(-ENOMEM);
ext_acl->a_version = cpu_to_le32(EXT4_ACL_VERSION);
@@ -159,7 +159,7 @@ ext4_get_acl(struct inode *inode, int type)
}
retval = ext4_xattr_get(inode, name_index, "", NULL, 0);
if (retval > 0) {
-   value = kmalloc(retval, GFP_NOFS);
+   value = kmalloc(retval, GFP_KERNEL);
if (!value)
return ERR_PTR(-ENOMEM);
retval = ext4_xattr_get(inode, name_index, "", value, retval);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9867b9e5ad8f..0371e7aa7bea 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2933,7 +2933,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
le16_to_cpu(path[k].p_hdr->eh_entries)+1;
} else {
path = kzalloc(sizeof(struct ext4_ext_path) * (depth + 1),
-  GFP_NOFS);
+  GFP_KERNEL);
if (path == NULL) {
ext4_journal_stop(handle);
return -ENOMEM;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index cf681004b196..e121f4e048b8 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -816,7 +816,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,
 
n_group_desc = ext4_kvmalloc((gdb_num + 1) *
 sizeof(struct buffer_head *),
-GFP_NOFS);
+GFP_KERNEL);
if (!n_group_desc) {
err = -ENOMEM;
ext4_warning(sb, "not enough memory for %lu groups",
@@ -943,7 +943,7 @@ static int reserve_backup_gdb(handle_t *handle, struct inode *inode,
int res, i;
int err;
 
-   primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_NOFS);
+   primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_KERNEL);
if (!primary)
return -ENOMEM;
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 5a94fa52b74f..172317462238 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -875,7 +875,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
unlock_buffer(bs->bh);
ea_bdebug(bs->bh, "cloning");
-   s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
+   s->base = kmalloc(bs->bh->b_size, GFP_KERNEL);
error = -ENOMEM;
if (s->base == NULL)
goto cleanup;
@@ -887,7 +887,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
}
} else {
/* Allocate a buffer where we construct the new block. */
-   s->base = kzalloc(sb->s_blocksize, GFP_NOFS);
+   s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
/* assert(header == s->base) */
error = -ENOMEM;
if (s->base == NULL)
-- 
2.11.0



Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-30 Thread Michal Hocko
On Fri 30-12-16 10:19:26, Mel Gorman wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > 
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > > 
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > > 
> > > [1.568174] [ cut here ]
> > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > mem_cgroup_update_lru_size+0x118/0x130
> > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > > not empty
> > 
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per node zone will not match per
> > zone statistics. The updated patch is below. So I hope my brain already
> > works after it's been mostly off last few days...
> > ---
> > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mho...@suse.com>
> > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests 
> > when
> >  memcg is enabled
> > 
> > Nils Holland has reported unexpected OOM killer invocations with 32b
> > kernel starting with 4.8 kernels
> > 
> 
> I think it's unfortunate that per-zone stats are reintroduced to the
> memcg structure.

The original patch I had didn't add per-zone stats but rather added a
nr_highmem counter to mem_cgroup_per_node (inside #ifdef CONFIG_HIGHMEM).
This would help for this particular case but it wouldn't work for other
lowmem requests (e.g. GFP_DMA32) and with the kmem accounting this might
be a problem in future. So I've decided to go with a more generic
approach which requires per-zone tracking. I cannot say I would be
overly happy about this at all.

> I can't help but think that it would have also worked
> to always rotate a small number of pages if !inactive_list_is_low and
> reclaiming for memcg even if it distorted page aging.

I am not really sure how that would work. Do you mean something like the
following?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa30010a5277..563ada3c02ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2044,6 +2044,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
 
+   if (!mem_cgroup_disabled())
+   goto out;
+
/*
 * For zone-constrained allocations, it is necessary to check if
 * deactivations are required for lowmem to be reclaimed. This
@@ -2063,6 +2066,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
active -= min(active, active_zone);
}
 
+out:
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
inactive_ratio = int_sqrt(10 * gb);

The problem I see with such an approach is that chances are that this
would reintroduce what f8d1a31163fc ("mm: consider whether to decivate
based on eligible zones inactive ratio") tried to fix. But maybe I have
missed your point.
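For reference, the inactive/active ratio computed at the end of
inactive_list_is_low follows the shape of the hunk quoted above. The
following is a worked model of that heuristic (PAGE_SHIFT assumed to be 12;
int_sqrt_model is a naive stand-in for the kernel's int_sqrt):

```c
#include <assert.h>

static unsigned long int_sqrt_model(unsigned long x)
{
	unsigned long r = 0;

	/* Naive integer square root; fine for small illustrative inputs. */
	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* inactive_ratio = int_sqrt(10 * gb) for lists over 1GB, else 1. */
static unsigned long inactive_ratio(unsigned long inactive,
				    unsigned long active, int page_shift)
{
	unsigned long gb = (inactive + active) >> (30 - page_shift);

	return gb ? int_sqrt_model(10 * gb) : 1;
}
```

So a 4GB file LRU starts deactivating when inactive falls below active / 6,
while lists under 1GB fall back to a plain 1:1 ratio.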

> However, given that such an approach would be less robust and this has
> been heavily tested;
> 
> Acked-by: Mel Gorman <mgor...@suse.de>

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-30 Thread Michal Hocko
On Fri 30-12-16 11:05:22, Minchan Kim wrote:
> On Thu, Dec 29, 2016 at 10:04:32AM +0100, Michal Hocko wrote:
> > On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> > > On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
[...]
> > > > + * given zone_idx
> > > > + */
> > > > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > > > +   enum lru_list lru, int zone_idx)
> > > 
> > > Nit:
> > > 
> > > Although there is a comment, function name is rather confusing when I 
> > > compared
> > > it with lruvec_zone_lru_size.
> > 
> > I am all for a better name.
> > 
> > > lruvec_eligible_zones_lru_size is better?
> > 
> > this would be too easy to confuse with lruvec_eligible_zone_lru_size.
> > What about lruvec_lru_size_eligible_zones?
> 
> Don't mind.

I will go with lruvec_lru_size_eligible_zones then.

> > > Nit:
> > > 
> > > With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx 
> > > rather than
> > > own custom calculation to filter out non-eligible pages. 
> > 
> > Yes, that would be possible and I was considering that. But then I found
> > useful to see total and reduced numbers in the tracepoint
> > http://lkml.kernel.org/r/20161228153032.10821-8-mho...@kernel.org
> > and didn't want to call lruvec_lru_size 2 times. But if you insist then
> > I can just do that.
> 
> I don't mind either but I think we need to describe the reason if you want to
> go with your open-coded version. Otherwise, someone will try to fix it.

OK, I will go with a follow-up patch on top of the tracepoints series.
I was hoping that the way tracing is built out of macros would allow us
to evaluate arguments only when the tracepoint is enabled, but this
doesn't seem to be the case. Let's CC Steven. Would it be possible to
define a tracepoint in such a way that all given arguments are evaluated
only when the tracepoint is enabled?
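One way to get that effect without changing the tracing macros themselves is
to compute the expensive arguments only behind an "enabled" check. The sketch
below models the idea in user space; the plain flag merely stands in for a
per-event enabled test, and the names are illustrative:

```c
#include <assert.h>

static int trace_event_enabled;   /* stand-in for a per-event enabled check */
static int expensive_evaluations;

static unsigned long total_lru_size(void)
{
	expensive_evaluations++;  /* models an expensive LRU walk */
	return 12345;
}

/* Only evaluate the costly argument when someone is actually tracing. */
static void trace_reclaim_stats(void)
{
	if (trace_event_enabled) {
		unsigned long total = total_lru_size();
		(void)total;  /* would be written to the trace buffer */
	}
}
```

The guard keeps the common, not-traced path from paying for the argument
computation at all.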
---
From 9a561d652f91f3557db22161600f10ca2462c74f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Fri, 30 Dec 2016 11:28:20 +0100
Subject: [PATCH] mm, vmscan: clean up inactive_list_is_low

inactive_list_is_low is effectively duplicating logic implemented by
lruvec_lru_size_eligible_zones. Let's use the dedicated function to
get the number of eligible pages on the lru list and use
lruvec_lru_size to get the total LRU size only when the tracing is
really requested. We are still iterating over all LRUs two times in that
case but a) inactive_list_is_low is not a hot path and b) this can be
addressed at the tracing layer, which could evaluate arguments only when
the tracing is enabled, in the future if that ever matters.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 mm/vmscan.c | 38 ++
 1 file changed, 10 insertions(+), 28 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 137bc85067d3..a9c881f06c0e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2054,11 +2054,10 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
struct scan_control *sc, bool trace)
 {
unsigned long inactive_ratio;
-   unsigned long total_inactive, inactive;
-   unsigned long total_active, active;
+   unsigned long inactive, active;
+   enum lru_list inactive_lru = file * LRU_FILE;
+   enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
unsigned long gb;
-   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-   int zid;
 
/*
 * If we don't have swap space, anonymous page deactivation
@@ -2067,27 +2066,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
if (!file && !total_swap_pages)
return false;
 
-   total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
-   total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
-
-   /*
-* For zone-constrained allocations, it is necessary to check if
-* deactivations are required for lowmem to be reclaimed. This
-* calculates the inactive/active pages available in eligible zones.
-*/
-   for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
-   struct zone *zone = &pgdat->node_zones[zid];
-   unsigned long inactive_zone, active_zone;
-
-   if (!managed_zone(zone))
-   continue;
-
-   inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, zid);
-   active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) + LRU_ACTIVE, zid);
-
-   inactive -= min(inactive, inactive_zone);
-   active -= min(active, active_zone);
-

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-29 Thread Michal Hocko
On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > Hi,
> > could you try to run with the following patch on top of the previous
> > one? I do not think it will make a large change in your workload but
> > I think we need something like that so some testing under which is known
> > to make a high lowmem pressure would be really appreciated. If you have
> > more time to play with it then running with and without the patch with
> > mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> > whether it make any difference at all.
> > 
> > I would also appreciate if Mel and Johannes had a look at it. I am not
> > yet sure whether we need the same thing for anon/file balancing in
> > get_scan_count. I suspect we need but need to think more about that.
> > 
> > Thanks a lot again!
> > ---
> > From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mho...@suse.com>
> > Date: Tue, 27 Dec 2016 16:28:44 +0100
> > Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> > 
> > get_scan_count considers the whole node LRU size when
> > - doing SCAN_FILE due to many page cache inactive pages
> > - calculating the number of pages to scan
> > 
> > in both cases this might lead to unexpected behavior especially on 32b
> > systems where we can expect lowmem memory pressure very often.
> > 
> > A large highmem zone can easily distort SCAN_FILE heuristic because
> > there might be only few file pages from the eligible zones on the node
> > lru and we would still enforce file lru scanning which can lead to
> > trashing while we could still scan anonymous pages.
> 
> Nit:
> It doesn't make thrashing because isolate_lru_pages filter out them
> but I agree it makes pointless CPU burning to find eligible pages.

This is not about isolate_lru_pages. The thrashing could happen if we had
a lowmem pagecache user which would constantly reclaim recently faulted-in
pages while there is anonymous memory in the lowmem which could be
reclaimed instead.
 
[...]
> >  /*
> > + * Return the number of pages on the given lru which are eligibne for the
> eligible

fixed

> > + * given zone_idx
> > + */
> > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > +   enum lru_list lru, int zone_idx)
> 
> Nit:
> 
> Although there is a comment, function name is rather confusing when I compared
> it with lruvec_zone_lru_size.

I am all for a better name.

> lruvec_eligible_zones_lru_size is better?

this would be too easy to confuse with lruvec_eligible_zone_lru_size.
What about lruvec_lru_size_eligible_zones?
 
> Nit:
> 
> With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx rather 
> than
> own custom calculation to filter out non-eligible pages. 

Yes, that would be possible and I was considering that. But then I found
useful to see total and reduced numbers in the tracepoint
http://lkml.kernel.org/r/20161228153032.10821-8-mho...@kernel.org
and didn't want to call lruvec_lru_size 2 times. But if you insist then
I can just do that.

> Anyway, I think this patch does right things so I suppose this.
> 
> Acked-by: Minchan Kim <minc...@kernel.org>

Thanks for the review!

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-29 Thread Michal Hocko
On Thu 29-12-16 09:48:24, Minchan Kim wrote:
> On Thu, Dec 29, 2016 at 09:31:54AM +0900, Minchan Kim wrote:
[...]
> > Acked-by: Minchan Kim <minc...@kernel.org>

Thanks!
 
> Nit:
> 
> WARNING: line over 80 characters
> #53: FILE: include/linux/memcontrol.h:689:
> +unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum 
> lru_list lru,
> 
> WARNING: line over 80 characters
> #147: FILE: mm/vmscan.c:248:
> +unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, 
> int zone_idx)
> 
> WARNING: line over 80 characters
> #177: FILE: mm/vmscan.c:1446:
> +   mem_cgroup_update_lru_size(lruvec, lru, zid, 
> -nr_zone_taken[zid]);

fixed

> WARNING: line over 80 characters
> #201: FILE: mm/vmscan.c:2099:
> +   inactive_zone = lruvec_zone_lru_size(lruvec, file * LRU_FILE, 
> zid);
> 
> WARNING: line over 80 characters
> #202: FILE: mm/vmscan.c:2100:
> +   active_zone = lruvec_zone_lru_size(lruvec, (file * LRU_FILE) 
> + LRU_ACTIVE, zid);

I would prefer to have those on the same line though. It will make them
easier to follow.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Michal Hocko
On Tue 27-12-16 20:33:09, Nils Holland wrote:
> On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > Hi,
> > could you try to run with the following patch on top of the previous
> > one? I do not think it will make a large change in your workload but
> > I think we need something like that so some testing under which is known
> > to make a high lowmem pressure would be really appreciated. If you have
> > more time to play with it then running with and without the patch with
> > mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> > whether it make any difference at all.
> 
> Of course, no problem!
> 
> First, about the events to trace: mm_vmscan_direct_reclaim_start
> doesn't seem to exist, but mm_vmscan_direct_reclaim_begin does. I'm
> sure that's what you meant and so I took that one instead.

yes, sorry about the confusion

> Then I have to admit in both cases (once without the latest patch,
> once with) very little trace data was actually produced. In the case
> without the patch, the reclaim was started more often and reclaimed a
> smaller number of pages each time, in the case with the patch it was
> invoked less often, and with the last time it was invoked it reclaimed
> a rather big number of pages. I have no clue, however, if that
> happened "by chance" or if it was actually caused by the patch and
> thus an expected change.

yes that seems to be a variation of the workload I would say because if
anything the patch should reduce the number of scanned pages.

> In both cases, my test case was: Reboot, setup logging, do "emerge
> firefox" (which unpacks and builds the firefox sources), then, when
> the emerge had come so far that the unpacking was done and the
> building had started, switch to another console and untar the latest
> kernel, libreoffice and (once more) firefox sources there. After that
> had completed, I aborted the emerge build process and stopped tracing.
> 
> Here's the trace data captured without the latest patch applied:
> 
> khugepaged-22[000]    566.123383: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000] .N..   566.165520: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1100
> khugepaged-22[001]    587.515424: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    587.596035: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1029
> khugepaged-22[001]    599.879536: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    601.000812: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1100
> khugepaged-22[001]    601.228137: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    601.309952: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1081
> khugepaged-22[001]    694.935267: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001] .N..   695.081943: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1071
> khugepaged-22[001]    701.370707: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    701.372798: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1089
> khugepaged-22[001]    764.752036: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    771.047905: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1039
> khugepaged-22[000]    781.760515: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    781.826543: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    782.595575: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    782.638591: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    782.930455: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    782.993608: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    783.330378: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    783.369653: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> 
> And this is the same with the patch applied:
> 

Re: [PATCH] mm, vmscan: consider eligible zones in get_scan_count

2016-12-28 Thread Michal Hocko
On Wed 28-12-16 00:28:38, kbuild test robot wrote:
> Hi Michal,
> 
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.10-rc1 next-20161224]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-vmscan-consider-eligible-zones-in-get_scan_count/20161228-000917
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: i386-tinyconfig (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>mm/vmscan.c: In function 'lruvec_lru_size_zone_idx':
> >> mm/vmscan.c:264:10: error: implicit declaration of function 
> >> 'lruvec_zone_lru_size' [-Werror=implicit-function-declaration]
>   size = lruvec_zone_lru_size(lruvec, lru, zid);

this patch depends on the previous one
http://lkml.kernel.org/r/20161226124839.gb20...@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-27 Thread Michal Hocko
Hi,
could you try to run with the following patch on top of the previous
one? I do not think it will make a large change in your workload but
I think we need something like that, so some testing under a workload which
is known to generate high lowmem pressure would be really appreciated. If you
have more time to play with it then running with and without the patch with
mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
whether it makes any difference at all.

I would also appreciate it if Mel and Johannes had a look at it. I am not
yet sure whether we need the same thing for anon/file balancing in
get_scan_count. I suspect we do, but I need to think more about that.

Thanks a lot again!
---
From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Tue, 27 Dec 2016 16:28:44 +0100
Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count

get_scan_count considers the whole node LRU size when
- doing SCAN_FILE due to many page cache inactive pages
- calculating the number of pages to scan

in both cases this might lead to unexpected behavior especially on 32b
systems where we can expect lowmem memory pressure very often.

A large highmem zone can easily distort SCAN_FILE heuristic because
there might be only few file pages from the eligible zones on the node
lru and we would still enforce file lru scanning which can lead to
trashing while we could still scan anonymous pages.

The latter use of lruvec_lru_size can be problematic as well, especially
when there are not many pages from the eligible zones. We would have to
skip over many pages to find anything to reclaim but shrink_node_memcg
would only reduce the remaining number to scan by SWAP_CLUSTER_MAX
at maximum. Therefore we can end up going over a large LRU many times
without actually having a chance to reclaim much if anything at all. The
closer we get to being out of memory on the lowmem zone, the worse the
problem becomes.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 mm/vmscan.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c98b1a585992..785b4d7fb8a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -252,6 +252,32 @@ unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
 }
 
 /*
+ * Return the number of pages on the given lru which are eligible for the
+ * given zone_idx
+ */
+static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
+		enum lru_list lru, int zone_idx)
+{
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	unsigned long lru_size;
+	int zid;
+
+	lru_size = lruvec_lru_size(lruvec, lru);
+	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
+		struct zone *zone = &pgdat->node_zones[zid];
+		unsigned long size;
+
+		if (!managed_zone(zone))
+			continue;
+
+		size = lruvec_zone_lru_size(lruvec, lru, zid);
+		lru_size -= min(size, lru_size);
+	}
+
+	return lru_size;
+}
+
+/*
  * Add a shrinker callback to be called from the vm.
  */
 int register_shrinker(struct shrinker *shrinker)
@@ -2207,7 +2233,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_list_is_low(lruvec, true, sc) &&
-	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+	    lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2274,7 +2300,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 		unsigned long size;
 		unsigned long scan;
 
-		size = lruvec_lru_size(lruvec, lru);
+		size = lruvec_lru_size_zone_idx(lruvec, lru, sc->reclaim_idx);
 		scan = size >> sc->priority;
 
 		if (!scan && pass && force_scan)
-- 
2.10.2

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-27 Thread Michal Hocko
On Tue 27-12-16 12:23:13, Nils Holland wrote:
> On Tue, Dec 27, 2016 at 09:08:38AM +0100, Michal Hocko wrote:
> > On Mon 26-12-16 19:57:03, Nils Holland wrote:
> > > On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > > > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > > > 
> > > > > > Nils, even though this is still highly experimental, could you give 
> > > > > > it a
> > > > > > try please?
> > > > > 
> > > > > Yes, no problem! So I kept the very first patch you sent but had to
> > > > > revert the latest version of the debugging patch (the one in
> > > > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > > > memory cgroups enabled again, and the first thing that strikes the eye
> > > > > is that I get this during boot:
> > > > > 
> > > > > [1.568174] [ cut here ]
> > > > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > > > mem_cgroup_update_lru_size+0x118/0x130
> > > > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 
> > > > > but not empty
> > > > 
> > > > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > > > my patch (I double account) and b) the detection for the empty list
> > > > cannot work after my change because per node zone will not match per
> > > > zone statistics. The updated patch is below. So I hope my brain already
> > > > works after it's been mostly off last few days...
> > > 
> > > I tried the updated patch, and I can confirm that the warning during
> > > boot is gone. Also, I've tried my ordinary procedure to reproduce my
> > > testcase, and I can say that a kernel with this new patch also works
> > > fine and doesn't produce OOMs or similar issues.
> > > 
> > > I had the previous version of the patch in use on a machine non-stop
> > > for the last few days during normal day-to-day workloads and didn't
> > > notice any issues. Now I'll keep a machine running during the next few
> > > days with this patch, and in case I notice something that doesn't look
> > > normal, I'll of course report back!
> > 
> > Thanks for your testing! Can I add your
> > Tested-by: Nils Holland <nholl...@tisys.org>
> 
> Yes, I think so! The patch has now been running for 16 hours on my two
> machines, and that's an uptime that was hard to achieve since 4.8 for
> me. ;-) So my tests clearly suggest that the patch is good! :-)

OK, thanks a lot for your testing! I will wait few more days before I
send it to Andrew.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-27 Thread Michal Hocko
On Mon 26-12-16 19:57:03, Nils Holland wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > 
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > > 
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > > 
> > > [1.568174] [ cut here ]
> > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > mem_cgroup_update_lru_size+0x118/0x130
> > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > > not empty
> > 
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per node zone will not match per
> > zone statistics. The updated patch is below. So I hope my brain already
> > works after it's been mostly off last few days...
> 
> I tried the updated patch, and I can confirm that the warning during
> boot is gone. Also, I've tried my ordinary procedure to reproduce my
> testcase, and I can say that a kernel with this new patch also works
> fine and doesn't produce OOMs or similar issues.
> 
> I had the previous version of the patch in use on a machine non-stop
> for the last few days during normal day-to-day workloads and didn't
> notice any issues. Now I'll keep a machine running during the next few
> days with this patch, and in case I notice something that doesn't look
> normal, I'll of course report back!

Thanks for your testing! Can I add your
Tested-by: Nils Holland <nholl...@tisys.org>
?
-- 
Michal Hocko
SUSE Labs


Re: [lkp-developer] [mm, memcg] d18e2b2aca: WARNING:at_mm/memcontrol.c:#mem_cgroup_update_lru_size

2016-12-26 Thread Michal Hocko
On Mon 26-12-16 13:26:51, Michal Hocko wrote:
> On Mon 26-12-16 06:25:56, kernel test robot wrote:
[...]
> > [   95.226364] init: tty6 main process (990) killed by TERM signal
> > [   95.314020] init: plymouth-upstart-bridge main process (1039) terminated 
> > with status 1
> > [   97.588568] [ cut here ]
> > [   97.594364] WARNING: CPU: 0 PID: 1055 at mm/memcontrol.c:1032 
> > mem_cgroup_update_lru_size+0xdd/0x12b
> > [   97.606654] mem_cgroup_update_lru_size(40297f00, 0, -1): lru_size 1 but 
> > empty
> > [   97.615140] Modules linked in:
> > [   97.618834] CPU: 0 PID: 1055 Comm: killall5 Not tainted 
> > 4.9.0-mm1-00095-gd18e2b2 #82
> > [   97.628008] Call Trace:
> > [   97.631025]  dump_stack+0x16/0x18
> > [   97.635107]  __warn+0xaf/0xc6
> > [   97.638729]  ? mem_cgroup_update_lru_size+0xdd/0x12b
> 
> Do you have the full backtrace?

It's not needed. I found the bug in my patch and it should be fixed by
the updated patch http://lkml.kernel.org/r/20161226124839.gb20...@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-26 Thread Michal Hocko
On Fri 23-12-16 23:26:00, Nils Holland wrote:
> On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > 
> > Nils, even though this is still highly experimental, could you give it a
> > try please?
> 
> Yes, no problem! So I kept the very first patch you sent but had to
> revert the latest version of the debugging patch (the one in
> which you added the "mm_vmscan_inactive_list_is_low" event) because
> otherwise the patch you just sent wouldn't apply. Then I rebooted with
> memory cgroups enabled again, and the first thing that strikes the eye
> is that I get this during boot:
> 
> [1.568174] [ cut here ]
> [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> mem_cgroup_update_lru_size+0x118/0x130
> [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but not 
> empty

Ohh, I can see what is wrong! a) there is a bug in the accounting in
my patch (I double account) and b) the detection for the empty list
cannot work after my change because per node zone will not match per
zone statistics. The updated patch is below. So I hope my brain already
works after it's been mostly off last few days...
---
From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Fri, 23 Dec 2016 15:11:54 +0100
Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
 memcg is enabled

Nils Holland has reported unexpected OOM killer invocations with 32b
kernel starting with 4.8 kernels

kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
 active_file:274324 inactive_file:281962 isolated_file:0
 unevictable:0 dirty:649 writeback:0 unstable:0
 slab_reclaimable:40662 slab_unreclaimable:17754
 mapped:7382 shmem:202 pagetables:351 bounce:0
 free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB 
shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB 
pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB 
active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB 
slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB 
pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB 
active_anon:234740kB inactive_anon:360kB active_file:557232kB 
inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
free_cma:0kB

the oom killer is clearly premature because there is still a
lot of page cache in the zone Normal which should satisfy this lowmem
request. Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list
which doesn't get rotated because inactive_list_is_low is not memcg
aware.
It simply subtracts per-zone highmem counters from the respective
memcg's lru sizes which doesn't make any sense. We can simply end up
always seeing the resulting active and inactive counts as 0 and return
false. This issue is not limited to 32b kernels but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocation requests
targeting < ZONE_NORMAL.

Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled. Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.

We are losing the empty-LRU-but-nonzero-lru_size detection introduced by
ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs. node discrepancy.

Fixes: f8d1a31163fc (

Re: [lkp-developer] [mm, memcg] d18e2b2aca: WARNING:at_mm/memcontrol.c:#mem_cgroup_update_lru_size

2016-12-26 Thread Michal Hocko
On Mon 26-12-16 06:25:56, kernel test robot wrote:
> 
> FYI, we noticed the following commit:
> 
> commit: d18e2b2aca0396849f588241e134787a829c707d ("mm, memcg: fix (Re: OOM: 
> Better, but still there on)")
> url: 
> https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-Re-OOM-Better-but-still-there-on/20161223-225057
> base: git://git.cmpxchg.org/linux-mmotm.git master
> 
> in testcase: boot
> 
> on test machine: qemu-system-i386 -enable-kvm -m 360M
> 
> caused below changes:
> 
> 
> +--------------------------------------------------------+------------+------------+
> |                                                        | c7d85b880b | d18e2b2aca |
> +--------------------------------------------------------+------------+------------+
> | boot_successes                                         | 8          | 0          |
> | boot_failures                                          | 0          | 2          |
> | WARNING:at_mm/memcontrol.c:#mem_cgroup_update_lru_size | 0          | 2          |
> | kernel_BUG_at_mm/memcontrol.c                          | 0          | 2          |
> | invalid_opcode:#[##]DEBUG_PAGEALLOC                    | 0          | 2          |
> | Kernel_panic-not_syncing:Fatal_exception               | 0          | 2          |
> +--------------------------------------------------------+------------+------------+
> 
> 
> 
> [   95.226364] init: tty6 main process (990) killed by TERM signal
> [   95.314020] init: plymouth-upstart-bridge main process (1039) terminated 
> with status 1
> [   97.588568] [ cut here ]
> [   97.594364] WARNING: CPU: 0 PID: 1055 at mm/memcontrol.c:1032 
> mem_cgroup_update_lru_size+0xdd/0x12b
> [   97.606654] mem_cgroup_update_lru_size(40297f00, 0, -1): lru_size 1 but 
> empty
> [   97.615140] Modules linked in:
> [   97.618834] CPU: 0 PID: 1055 Comm: killall5 Not tainted 
> 4.9.0-mm1-00095-gd18e2b2 #82
> [   97.628008] Call Trace:
> [   97.631025]  dump_stack+0x16/0x18
> [   97.635107]  __warn+0xaf/0xc6
> [   97.638729]  ? mem_cgroup_update_lru_size+0xdd/0x12b

Do you have the full backtrace?
-- 
Michal Hocko
SUSE Labs


[RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-23 Thread Michal Hocko
[Add Mel, Johannes and Vladimir - the email thread started here
http://lkml.kernel.org/r/20161215225702.ga27...@boerne.fritz.box
The long story short, the zone->node reclaim change has broken active
list aging for lowmem requests when memory cgroups are enabled. More
details below.

On Fri 23-12-16 13:57:28, Michal Hocko wrote:
> On Fri 23-12-16 13:18:51, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 11:51:57AM +0100, Michal Hocko wrote:
> > > TL;DR
> > > drop the last patch, check whether memory cgroup is enabled and retest
> > > with cgroup_disable=memory to see whether this is memcg related and if
> > > it is _not_ then try to test with the patch below
> > 
> > Right, it seems we might be looking in the right direction! So I
> > removed the previous patch from my kernel and verified if memory
> > cgroup was enabled, and indeed, it was. So I booted with
> > cgroup_disable=memory and ran my ordinary test again ... and in fact,
> > no ooms!
> 
> OK, thanks for confirmation. I could have figured that earlier. The
> pagecache differences in such a short time should have raised the red
> flag and point towards memcgs...
> 
> [...]
> > I would appreciate sticking with your setup so as not to pull new unknowns
> > into the picture.
> > 
> > No problem! It's just likely that I won't be able to test during the
> > following days until Dec 27th, but after that I should be back to
> > normal and thus be able to run further tests in a timely fashion. :-)
> 
> no problem at all. I will try to cook up a patch in the mean time.

So here is my attempt. Only compile tested so be careful, it might eat
your kittens or do more harm. I would appreciate other guys to have a
look to see whether this is sane. There are probably other places which
would need some tweaks. I think that get_scan_count needs some tweaks
as well because we should only consider eligible zones when counting the
number of pages to scan. This would be for a separate patch which I will
send later. I just want to fix this one first.

Nils, even though this is still highly experimental, could you give it a
try please?
---
From a66fd89d43e9fd8ca9afa7e6c7252ab73d22b686 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Fri, 23 Dec 2016 15:11:54 +0100
Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
 memcg is enabled

Nils Holland has reported unexpected OOM killer invocations with 32b
kernel starting with 4.8 kernels

kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
 active_file:274324 inactive_file:281962 isolated_file:0
 unevictable:0 dirty:649 writeback:0 unstable:0
 slab_reclaimable:40662 slab_unreclaimable:17754
 mapped:7382 shmem:202 pagetables:351 bounce:0
 free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB 
shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB 
pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB 
active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB 
slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB 
pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB 
active_anon:234740kB inactive_anon:360kB active_file:557232kB 
inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
free_cma:0kB

the oom killer is clearly premature because there is still a
lot of page cache in the zone Normal which should satisfy this lowmem
request. Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list
which doesn't get rotated because inactive_list_is_low is not memcg
aware.

Re: OOM: Better, but still there on

2016-12-23 Thread Michal Hocko
On Fri 23-12-16 13:18:51, Nils Holland wrote:
> On Fri, Dec 23, 2016 at 11:51:57AM +0100, Michal Hocko wrote:
> > TL;DR
> > drop the last patch, check whether memory cgroup is enabled and retest
> > with cgroup_disable=memory to see whether this is memcg related and if
> > it is _not_ then try to test with the patch below
> 
> Right, it seems we might be looking in the right direction! So I
> removed the previous patch from my kernel and verified if memory
> cgroup was enabled, and indeed, it was. So I booted with
> cgroup_disable=memory and ran my ordinary test again ... and in fact,
> no ooms!

OK, thanks for confirmation. I could have figured that earlier. The
pagecache differences in such a short time should have raised the red
flag and point towards memcgs...

[...]
> > I would appreciate sticking with your setup so as not to pull new unknowns
> > into the picture.
> 
> No problem! It's just likely that I won't be able to test during the
> following days until Dec 27th, but after that I should be back to
> normal and thus be able to run further tests in a timely fashion. :-)

no problem at all. I will try to cook up a patch in the mean time.
-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-23 Thread Michal Hocko
TL;DR
drop the last patch, check whether memory cgroup is enabled and retest
with cgroup_disable=memory to see whether this is memcg related and if
it is _not_ then try to test with the patch below

On Thu 22-12-16 22:46:11, Nils Holland wrote:
> On Thu, Dec 22, 2016 at 08:17:19PM +0100, Michal Hocko wrote:
> > TL;DR I still do not see what is going on here and it still smells like
> > multiple issues. Please apply the patch below on _top_ of what you had.
> 
> I've run the usual procedure again with the new patch on top and the
> log is now up at:
> 
> http://ftp.tisys.org/pub/misc/boerne_2016-12-22_2.log.xz

OK, so there are still large page cache fluctuations even with the
locking applied:
472.042409 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=450451 inactive=0 total_active=210056 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042442 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042451 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 inactive=0 total_active=12 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
472.042484 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=11944 inactive=0 total_active=117286 active=0 ratio=1 flags=RECLAIM_WB_FILE|RECLAIM_WB

One thing that didn't occur to me previously was that this might be an
effect of the memory cgroups. Do you have memory cgroups enabled? If
yes then rerunning with cgroup_disable=memory would be interesting
as well.

Anyway, now I am looking at get_scan_count which determines how many pages
we should scan on each LRU list. The problem I can see there is that
it doesn't reflect eligible zones (or at least it doesn't do that
consistently). So it might happen we simply decide to scan the whole LRU
list (when we get down to prio 0 because we cannot make any progress)
and then _slowly_ scan through it in SWAP_CLUSTER_MAX chunks each
time. This can take a lot of time and who knows what might have happened
if there are many such reclaimers in parallel.

[...]

> This might suggest - although I have to admit, again, that this is
> inconclusive, as I've not used a final 4.9 kernel - that you could
> very easily reproduce the issue yourself by just setting up a 32 bit
> system with a btrfs filesystem and then unpacking a few huge tarballs.
> Of course, I'm more than happy to continue giving any patches sent to
> me a spin, but I thought I'd still mention this in case it makes
> things easier for you. :-)

I would appreciate sticking with your setup so as not to pull new unknowns
into the picture.
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb82913b62bb..533bb591b0be 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -243,6 +243,35 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum 
lru_list lru)
 }
 
 /*
+ * Return the number of pages on the given lru which are eligibne for the
+ * given zone_idx
+ */
+static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
+   enum lru_list lru, int zone_idx)
+{
+   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+   unsigned long lru_size;
+   int zid;
+
+   if (!mem_cgroup_disabled())
+   return mem_cgroup_get_lru_size(lruvec, lru);
+
+   lru_size = lruvec_lru_size(lruvec, lru);
+   for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
+		struct zone *zone = &pgdat->node_zones[zid];
+   unsigned long size;
+
+   if (!managed_zone(zone))
+   continue;
+
+   size = zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
+   lru_size -= min(size, lru_size);
+   }
+
+   return lru_size;
+}
+
+/*
  * Add a shrinker callback to be called from the vm.
  */
 int register_shrinker(struct shrinker *shrinker)
@@ -2228,7 +2257,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 * system is under heavy pressure.
 */
if (!inactive_list_is_low(lruvec, true, sc) &&
-	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+	    lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2295,7 +2324,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
unsigned long size;
unsigned long scan;
 
-		size = lruvec_lru_size(lruvec, lru);
+		size = lruvec_lru_size_zone_idx(lruvec, lru, sc->reclaim_idx);
scan = size >> sc->priority;
 
if (!scan && pass && force_scan)
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linu

Re: OOM: Better, but still there on

2016-12-22 Thread Michal Hocko
8kB 
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
484.147319 lowmem_reserve[]: 0 808 3849 3849
484.147387 Normal free:41016kB min:41100kB low:51372kB high:61644kB 
active_anon:0kB inactive_anon:0kB active_file:464688kB inactive_file:48kB 
unevictable:0kB writepending:2684kB present:897016kB managed:831472kB 
mlocked:0kB slab_reclaimable:215812kB slab_unreclaimable:90092kB 
kernel_stack:1336kB pagetables:1436kB bounce:0kB free_pcp:372kB local_pcp:176kB 
free_cma:0kB
484.149971 lowmem_reserve[]: 0 0 24330 24330
484.152390 HighMem free:332648kB min:512kB low:39184kB high:77856kB 
active_anon:100688kB inactive_anon:380kB active_file:823856kB 
inactive_file:1847984kB unevictable:0kB writepending:18144kB present:3114256kB 
managed:3114256kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:836kB local_pcp:156kB 
free_cma:0kB

Unfortunately LOST EVENT are not logged with the timestamp but there are
many lost events between 10:55:31-33 which corresponds to above time
range in timestamps:
$ xzgrep "10:55:3[1-3].*LOST" boerne_2016-12-22.log.xz | awk '{sum+=$6}END{print sum}'
5616415

so we do not have a good picture again :/ One thing is highly suspicious
though. I really doubt the _whole_ pagecache went down to zero and then up
in such a short time:
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=89 
inactive=0 total_active=1301 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722397 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=1 
inactive=0 total_active=21 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722401 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=450730 
inactive=0 total_active=206026 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

File inactive 450730 resp. active 206026 roughly match the global
counters in the oom report so I would trust this to be more realistic. I
simply do not see any large source of the LRU isolation. Maybe those
pages have been truncated and new ones allocated. The time window is
really short though but who knows...

Another possibility would be misaccounting, but I do not see anything
that would use __mod_zone_page_state and __mod_node_page_state
inconsistently for the node vs. zone LRU counters. Everything seems to
go via __update_lru_size.

Another thing to check would be the per-cpu counters usage. The
following patch should use the more precise numbers. I am also not
sure about the lockless nature of inactive_list_is_low so the patch
below adds the lru_lock there.

The only clear thing is that mm_vmscan_lru_isolate indeed skipped
through the whole list without finding a single suitable page
when it couldn't isolate any pages. So the failure is not due to
get_page_unless_zero.
$ xzgrep "mm_vmscan_lru_isolate.*nr_taken=0" boerne_2016-12-22.log.xz | sed 's@.*nr_scanned=\([0-9]*\).*@\1@' | sort | uniq -c
   7941 0

I am not able to draw any conclusion now. I am suspecting get_scan_count
as well. Let's see whether the patch below makes any difference and if
not I will dig into g_s_c some more. I will think about it some more,
maybe somebody else will notice something, so I am sending this
half-baked analysis.

---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb82913b62bb..8727b68a8e70 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -239,7 +239,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
 
-	return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
+	return node_page_state_snapshot(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
 }
 
 /*
@@ -2056,6 +2056,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	if (!file && !total_swap_pages)
 		return false;
 
+	spin_lock_irq(&pgdat->lru_lock);
 	total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
 	total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
 
@@ -2071,14 +2072,15 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 		if (!managed_zone(zone))
 			continue;
 
-		inactive_zone = zone_page_state(zone,
+		inactive_zone = zone_page_state_snapshot(zone,
 				NR_ZONE_LRU_BASE + (file * LRU_FILE));
-		active_zone = zone_page_state(zone,
+		active_zone = zone_page_state_snapshot(zone,
 				NR_ZONE_LRU_BASE + (file * LRU_FILE) + LRU_ACTIVE);
 
 		inactive -= min(inactive, inactive_zone);
 		active -= min(active, active_zone);
 	}
+	spin_unlock_irq(&pgdat->lru_lock);
 
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
-- 
Michal Hocko
SUSE Labs

Re: OOM: Better, but still there on

2016-12-22 Thread Michal Hocko
On Thu 22-12-16 11:10:29, Nils Holland wrote:
> On Wed, Dec 21, 2016 at 08:36:59AM +0100, Michal Hocko wrote:
> > TL;DR
> > there is another version of the debugging patch. Just revert the
> > previous one and apply this one instead. It's still not clear what
> > is going on but I suspect either some misaccounting or unexpected
> > pages on the LRU lists. I have added one more tracepoint, so please
> > enable also mm_vmscan_inactive_list_is_low.
> 
> Right, I did just that and can provide a new log. I was also able, in
> this case, to reproduce the OOM issues again and not just the "page
> allocation stalls" that were the only thing visible in the previous
> log.

Thanks a lot for testing! I will have a look later today.

> However, the log comes from machine #2 again today, as I'm
> unfortunately forced to try this via VPN from work to home today, so I
> have exactly one attempt per machine before it goes down and locks up
> (and I can only restart it later tonight).

This is really surprising to me. Are you sure that you have sysrq
configured properly? At least sysrq+b shouldn't depend on any memory
allocations and should allow you to reboot immediately. A sysrq+m right
before the reboot might turn out being helpful as well.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/9 v2] scope GFP_NOFS api

2016-12-22 Thread Michal Hocko
Are there any objections to the approach and can we have this merged to
the mm tree?

Dave has expressed the patch2 should be dropped for now. I will do that
in a next submission but I do not want to resubmit until there is a
consensus on this.

What do other than xfs/ext4 developers think about this API. Can we find
a way to use it?
-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-21 Thread Michal Hocko
On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > TL;DR
> > there is another version of the debugging patch. Just revert the
> > previous one and apply this one instead. It's still not clear what
> > is going on but I suspect either some misaccounting or unexpected
> > pages on the LRU lists. I have added one more tracepoint, so please
> > enable also mm_vmscan_inactive_list_is_low.
> > 
> > Hopefully the additional data will tell us more.
> > 
> > On Tue 20-12-16 03:08:29, Nils Holland wrote:
[...]
> > > http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz
> > 
> > This is the stall report:
> > [ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, 
> > order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> > [ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 
> > 4.9.0-gentoo #4
> > 
> > pid 1950 is trying to allocate for a _long_ time. Considering that this
> > is the only stall report, this means that reclaim took really long so we
> > didn't get to the page allocator for that long. It sounds really crazy!
> 
> warn_alloc() reports only if !__GFP_NOWARN.

yes, and the above allocation clearly is a !__GFP_NOWARN allocation which is
reported after 611s! If there are no prior/lost warn_alloc() reports then it
implies we have spent _that_ much time in the reclaim. Considering the
tracing data we cannot really rule that out. All the reclaimers would
fight over the lru_lock and considering we are scanning the whole LRU
this will take some time.

[...]

> By the way, Michal, I'm feeling strange because it seems to me that your
> analysis does not refer to the implications of "x86_32 kernel". Maybe
> you already referred x86_32 by "they are from the highmem zone" though.

yes, Highmem as well as all those scanning anomalies are 32b-kernel
specific things. I believe I have already mentioned that the 32b kernel
suffers from some inherent issues but I would like to understand what is
going on here before blaming the 32b.

One thing to note here, when we are talking about 32b kernel, things
have changed in 4.8 when we moved from the zone based to node based
reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
per-node basis") and associated patches). It is possible that the
reporter is hitting some pathological path which needs fixing but it
might be also related to something else. So I am rather not trying to
blame 32b yet...

-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-20 Thread Michal Hocko
TL;DR
there is another version of the debugging patch. Just revert the
previous one and apply this one instead. It's still not clear what
is going on but I suspect either some misaccounting or unexpected
pages on the LRU lists. I have added one more tracepoint, so please
enable also mm_vmscan_inactive_list_is_low.

Hopefully the additional data will tell us more.

On Tue 20-12-16 03:08:29, Nils Holland wrote:
> On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:
> 
> > Unfortunatelly shrink_active_list doesn't have any tracepoint so we do
> > not know whether we managed to rotate those pages. If they are referenced
> > quickly enough we might just keep refaulting them... Could you try to apply
> > the followin diff on top what you have currently. It should add some more
> > tracepoint data which might tell us more. We can reduce the amount of
> > tracing data by enabling only mm_vmscan_lru_isolate,
> > mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.
> 
> So, the results are in! I applied your patch and rebuilt the kernel,
> then I rebooted the machine, set up tracing so that only the three
> events you mentioned were being traced, and captured the output over
> the network.
> 
> Things went a bit different this time: The trace events started to
> appear after a while and a whole lot of them were generated, but
> suddenly they stopped. A short while later, we get

It is possible that you are hitting multiple issues so it would be
great to focus at one at the time. The underlying problem might be
same/similar in the end but this is hard to tell now. Could you try to
reproduce and provide data for the OOM killer situation as well?
 
> [ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, order:0, 
> mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> 
> along with a backtrace and memory information, and then there was
> silence.

> When I walked up to the machine, it had completely died; it
> wouldn't turn on its screen on key press any more, blindly trying to
> reboot via SysRequest had no effect, but the caps lock LED also wasn't
> blinking, like it normally does when a kernel panic occurs. Good
> question what state it was in. The OOM reaper didn't really seem to
> kick in and kill processes this time, it seems.
> 
> The complete capture is up at:
> 
> http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz

This is the stall report:
[ 1661.485568] btrfs-transacti: page allocation stalls for 611058ms, order:0, 
mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
[ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 4.9.0-gentoo 
#4

pid 1950 is trying to allocate for a _long_ time. Considering that this
is the only stall report, this means that reclaim took really long so we
didn't get to the page allocator for that long. It sounds really crazy!

$ xzgrep -w 1950 teela_2016-12-20.log.xz | grep mm_vmscan_lru_shrink_inactive | sed 's@.*nr_reclaimed=\([0-9]*\).*@\1@' | sort | uniq -c
509 0
  1 1
  1 10
  5 11
  1 12
  1 14
  1 16
  2 19
  5 2
  1 22
  2 23
  1 25
  3 28
  2 3
  1 4
  4 5

It barely managed to reclaim anything, even though it has tried a lot. It
had a hard time actually isolating anything:

$ xzgrep -w 1950 teela_2016-12-20.log.xz | grep mm_vmscan_lru_isolate: | sed 's@.*nr_taken=@@' | sort | uniq -c
   8284 0 file=1
  8 11 file=1
  4 14 file=1
  1 1 file=1
  7 23 file=1
  1 25 file=1
  9 2 file=1
501 32 file=1
  1 3 file=1
  7 5 file=1
  1 6 file=1

a typical mm_vmscan_lru_isolate looks as follows

btrfs-transacti-1950  [001] d...  1368.508008: mm_vmscan_lru_isolate: 
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=266727 nr_taken=0 
file=1

so the whole inactive lru has been scanned it seems. But we couldn't
isolate a single page. There are two possibilities here. Either we skip
them all because they are from the highmem zone or we fail to
__isolate_lru_page them. Counters will not tell us because nr_scanned
includes skipped pages. I have updated the debugging patch to make this
distinction. I suspect we are skipping all of them...
The latter option would be really surprising because the only way to fail
__isolate_lru_page with the 0 isolate_mode is if get_page_unless_zero(page)
fails which would mean we would have pages with 0 reference count on the
LRU list.

The stall message is from a later time so the situation might have
changed but
[ 1661.490170] Node 0   active_anon:139296kB inactive_anon:432kB active_file:1088996kB inactive_file:1114524kB
[ 1661.490745] DMA      active_anon:0kB inactive_anon:0kB active_file:9540kB inactive_file:0kB
[ 1661.491528] Normal   active_anon:0kB inactive_anon:0kB active_file:530560kB inactive_file:452kB
[ 1661.513077] HighMem 

Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

2016-12-20 Thread Michal Hocko
On Tue 20-12-16 08:24:13, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mho...@suse.com>
> > 
> > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> > KM_NOFS tags to keep lockdep happy") and use the new flag for them
> > instead. There is really no reason to make these allocations contexts
> > weaker just because of the lockdep which even might not be enabled
> > in most cases.
> > 
> > Signed-off-by: Michal Hocko <mho...@suse.com>
> 
> I'd suggest that it might be better to drop this patch for now -
> it's not necessary for the context flag changeover but does
> introduce a risk of regressions if the conversion is wrong.
> 
> Hence I think this is better as a completely separate series
> which audits and changes all the unnecessary KM_NOFS allocations
> in one go. I've never liked whack-a-mole style changes like this -
> do it once, do it properly

OK, fair enough. I thought it might be better to have an example user so
that others can follow but as you say, the risk of regression is really
there and these kinds of changes definitely need a thorough review.

I am not sure I will be able to post more of those changes because that
requires an intimate knowledge of the fs so I hope somebody can take
over there and follow up.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-19 Thread Michal Hocko
sted will stall.
 */
-   if (nr_dirty && nr_dirty == nr_congested)
+   if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
 			set_bit(PGDAT_CONGESTED, &pgdat->flags);
 
/*
@@ -1802,7 +1813,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
 	 * reclaim context.
 	 */
-	if (nr_unqueued_dirty == nr_taken)
+	if (stat.nr_unqueued_dirty == nr_taken)
 		set_bit(PGDAT_DIRTY, &pgdat->flags);
 
/*
@@ -1811,7 +1822,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 * that pages are cycling through the LRU faster than
 * they are written so also forcibly stall.
 */
-   if (nr_immediate && current_may_throttle())
+   if (stat.nr_immediate && current_may_throttle())
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
 
@@ -1826,6 +1837,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
+   stat.nr_dirty,  stat.nr_writeback,
+   stat.nr_congested, stat.nr_immediate,
+   stat.nr_activate, stat.nr_ref_keep, stat.nr_unmap_fail,
sc->priority, file);
return nr_reclaimed;
 }
@@ -1846,9 +1860,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
  *
  * The downside is that we have to touch page->_refcount against each page.
  * But we had to alter page->flags anyway.
+ *
+ * Returns the number of pages moved to the given lru.
  */
 
-static void move_active_pages_to_lru(struct lruvec *lruvec,
+static int move_active_pages_to_lru(struct lruvec *lruvec,
 struct list_head *list,
 struct list_head *pages_to_free,
 enum lru_list lru)
@@ -1857,6 +1873,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
unsigned long pgmoved = 0;
struct page *page;
int nr_pages;
+   int nr_moved = 0;
 
while (!list_empty(list)) {
page = lru_to_page(list);
@@ -1882,11 +1899,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			spin_lock_irq(&pgdat->lru_lock);
 		} else
 			list_add(&page->lru, pages_to_free);
+   } else {
+   nr_moved++;
}
}
 
if (!is_active_lru(lru))
__count_vm_events(PGDEACTIVATE, pgmoved);
+
+   return nr_moved;
 }
 
 static void shrink_active_list(unsigned long nr_to_scan,
@@ -1902,7 +1923,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
LIST_HEAD(l_inactive);
struct page *page;
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-   unsigned long nr_rotated = 0;
+   unsigned long nr_rotated = 0, nr_unevictable = 0;
+   unsigned long nr_freed, nr_deactivate, nr_activate;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -1935,6 +1957,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
if (unlikely(!page_evictable(page))) {
putback_lru_page(page);
+   nr_unevictable++;
continue;
}
 
@@ -1980,13 +2003,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
 */
reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
-	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
+	nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
+	nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_hold);
-	free_hot_cold_page_list(&l_hold, true);
+	nr_freed = free_hot_cold_page_list(&l_hold, true);
+   trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_scanned, nr_freed,
+   nr_unevictable, nr_deactivate, nr_rotated,
+   sc->priority, file);
 }
 
 /*
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically

2016-12-18 Thread Michal Hocko
On Sat 17-12-16 20:17:07, Tetsuo Handa wrote:
[...]
> I feel that allowing access to memory reserves based on __GFP_NOFAIL might not
> make sense. My understanding is that actual I/O operation triggered by I/O
> requests by filesystem code are processed by other threads. Even if we grant
> access to memory reserves to GFP_NOFS | __GFP_NOFAIL allocations by fs code,
> I think that it is possible that memory allocations by underlying bio code
> fails to make a further progress unless memory reserves are granted as well.

IO layer should rely on mempools to guarantee a forward progress.

> Below is a typical trace which I observe under OOM lockuped situation (though
> this trace is from an OOM stress test using XFS).
> 
> 
> [ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 
> seq=48 gfp=0x240(GFP_NOIO) order=0 delay=430400 uninterruptible
> [ 1845.187248] kworker/2:1 D12712 14498  2 0x0080
> [ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1845.187252] Call Trace:
> [ 1845.187253]  ? __schedule+0x23f/0xba0
> [ 1845.187254]  schedule+0x38/0x90
> [ 1845.187255]  schedule_timeout+0x205/0x4a0
> [ 1845.187256]  ? del_timer_sync+0xd0/0xd0
> [ 1845.187257]  schedule_timeout_uninterruptible+0x25/0x30
> [ 1845.187258]  __alloc_pages_nodemask+0x1035/0x10e0
> [ 1845.187259]  ? alloc_request_struct+0x14/0x20
> [ 1845.187261]  alloc_pages_current+0x96/0x1b0
> [ 1845.187262]  ? bio_alloc_bioset+0x20f/0x2e0
> [ 1845.187264]  bio_copy_kern+0xc4/0x180
> [ 1845.187265]  blk_rq_map_kern+0x6f/0x120
> [ 1845.187268]  __scsi_execute.isra.23+0x12f/0x160
> [ 1845.187270]  scsi_execute_req_flags+0x8f/0x100
> [ 1845.187271]  sr_check_events+0xba/0x2b0 [sr_mod]
> [ 1845.187274]  cdrom_check_events+0x13/0x30 [cdrom]
> [ 1845.187275]  sr_block_check_events+0x25/0x30 [sr_mod]
> [ 1845.187276]  disk_check_events+0x5b/0x150
> [ 1845.187277]  disk_events_workfn+0x17/0x20
> [ 1845.187278]  process_one_work+0x1fc/0x750
> [ 1845.187279]  ? process_one_work+0x167/0x750
> [ 1845.187279]  worker_thread+0x126/0x4a0
> [ 1845.187280]  kthread+0x10a/0x140
> [ 1845.187281]  ? process_one_work+0x750/0x750
> [ 1845.187282]  ? kthread_create_on_node+0x60/0x60
> [ 1845.187283]  ret_from_fork+0x2a/0x40
> 
> 
> I think that this GFP_NOIO allocation request needs to consume more memory 
> reserves
> than GFP_NOFS allocation request to make progress. 

AFAIU, this is an allocation path which doesn't block a forward progress
on a regular IO. It is merely a check whether there is a new medium in
the CDROM (aka regular polling of the device). I really fail to see any
reason why this one should get any access to memory reserves at all.

I actually do not see any reason why it should be NOIO in the first
place but I am not familiar with this code much so there might be some
reasons for that. The fact that it might stall under a heavy memory
pressure is sad but who actually cares?

> Do we want to add __GFP_NOFAIL to this GFP_NOIO allocation request
> in order to allow access to memory reserves as well as GFP_NOFS |
> __GFP_NOFAIL allocation request?

Why?

-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 19:47:00, Nils Holland wrote:
[...]
> Despite the fact that I'm no expert, I can see that there's no more
> GFP_NOFS being logged, which seems to be what the patches tried to
> achieve. What the still present OOMs mean remains up for
> interpretation by the experts, all I can say is that in the (pre-4.8?)
> past, doing all of the things I just did would probably slow down my
> machine quite a bit, but I can't remember to have ever seen it OOM or
> even crash completely.
> 
> Dec 16 18:56:24 boerne.fritz.box kernel: Purging GPU memory, 37 pages freed, 
> 10219 pages still pinned.
> Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd invoked oom-killer: 
> gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, 
> order=1, oom_score_adj=0
> Dec 16 18:56:29 boerne.fritz.box kernel: kthreadd cpuset=/ mems_allowed=0
[...]
> Dec 16 18:56:29 boerne.fritz.box kernel: Normal free:41008kB min:41100kB 
> low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB 
> active_file:470556kB inactive_file:148kB unevictable:0kB writepending:1616kB 
> present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213172kB 
> slab_unreclaimable:86236kB kernel_stack:1864kB pagetables:3572kB bounce:0kB 
> free_pcp:532kB local_pcp:456kB free_cma:0kB

this is a GFP_KERNEL allocation so it cannot use the highmem zone again.
There is no anonymous memory in this zone but the allocation
context implies the full reclaim context so the file LRU should be
reclaimable. For some reason ~470MB of the active file LRU is still
there. This is quite unexpected. It is harder to tell more without
further data. It would be great if you could enable reclaim related
tracepoints:

mount -t tracefs none /debug/trace
echo 1 > /debug/trace/events/vmscan/enable
cat /debug/trace/trace_pipe > trace.log

should help
[...]

> Dec 16 18:56:31 boerne.fritz.box kernel: xfce4-terminal invoked oom-killer: 
> gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0

another allocation in a short time. Killing the task obviously didn't
help because the lowmem memory pressure hasn't been relieved.

[...]
> Dec 16 18:56:32 boerne.fritz.box kernel: Normal free:41028kB min:41100kB 
> low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB 
> active_file:472164kB inactive_file:108kB unevictable:0kB writepending:112kB 
> present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213236kB 
> slab_unreclaimable:86360kB kernel_stack:1584kB pagetables:2564kB bounce:32kB 
> free_pcp:180kB local_pcp:24kB free_cma:0kB

in fact we have even more pages on the file LRUs.

[...]

> Dec 16 18:56:32 boerne.fritz.box kernel: xfce4-terminal invoked oom-killer: 
> gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
[...]
> Dec 16 18:56:32 boerne.fritz.box kernel: Normal free:40988kB min:41100kB 
> low:51372kB high:61644kB active_anon:0kB inactive_anon:0kB 
> active_file:472436kB inactive_file:144kB unevictable:0kB writepending:312kB 
> present:897016kB managed:831480kB mlocked:0kB slab_reclaimable:213236kB 
> slab_unreclaimable:86360kB kernel_stack:1584kB pagetables:2464kB bounce:32kB 
> free_pcp:116kB local_pcp:0kB free_cma:0kB

same here. All that suggests that the page cache cannot be reclaimed for
some reason. It is hard to tell why but there is definitely something
bad going on.
-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 17:47:25, Chris Mason wrote:
> On 12/16/2016 05:14 PM, Michal Hocko wrote:
> > On Fri 16-12-16 13:15:18, Chris Mason wrote:
> > > On 12/16/2016 02:39 AM, Michal Hocko wrote:
> > [...]
> > > > I believe the right way to go around this is to pursue what I've started
> > > > in [1]. I will try to prepare something for testing today for you. Stay
> > > > tuned. But I would be really happy if somebody from the btrfs camp could
> > > > check the NOFS aspect of this allocation. We have already seen
> > > > allocation stalls from this path quite recently
> > > 
> > > Just double checking, are you asking why we're using GFP_NOFS to avoid 
> > > going
> > > into btrfs from the btrfs writepages call, or are you asking why we aren't
> > > allowing highmem?
> > 
> > I am more interested in the NOFS part. Why cannot this be a full
> > GFP_KERNEL context? What kind of locks we would lock up when recursing
> > to the fs via slab shrinkers?
> > 
> 
> Since this is our writepages call, any jump into direct reclaim would go to
> writepage, which would end up calling the same set of code to read metadata
> blocks, which would do a GFP_KERNEL allocation and end up back in writepage
> again.

But we have not been doing pageout on the page cache from direct reclaim
for a long time. So basically the only way to recurse back into the fs
code is via the slab ([di]cache) shrinkers. Are those a problem as well?

-- 
Michal Hocko
SUSE Labs


Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 13:15:18, Chris Mason wrote:
> On 12/16/2016 02:39 AM, Michal Hocko wrote:
[...]
> > I believe the right way to go around this is to pursue what I've started
> > in [1]. I will try to prepare something for testing today for you. Stay
> > tuned. But I would be really happy if somebody from the btrfs camp could
> > check the NOFS aspect of this allocation. We have already seen
> > allocation stalls from this path quite recently
> 
> Just double checking, are you asking why we're using GFP_NOFS to avoid going
> into btrfs from the btrfs writepages call, or are you asking why we aren't
> allowing highmem?

I am more interested in the NOFS part. Why cannot this be a full
GFP_KERNEL context? What kind of locks we would lock up when recursing
to the fs via slab shrinkers?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 12:31:51, Johannes Weiner wrote:
> On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote:
> > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> >  * make sure exclude 0 mask - all other users should have at least
> >  * ___GFP_DIRECT_RECLAIM to get here.
> >  */
> > -   if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > +   if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> > return true;
> 
> This makes sense, we should go back to what we had here. Because it's
> not that the reported OOMs are premature - there is genuinely no more
> memory reclaimable from the allocating context - but that this class
> of allocations should never invoke the OOM killer in the first place.

agreed, at least not with the current implementation. If we had proper
accounting where we know that the memory pinned by the fs is not really
there then we could invoke the oom killer and be safe

> > @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  */
> > WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
> >  
> > +   /*
> > +* Help non-failing allocations by giving them access to memory
> > +* reserves but do not use ALLOC_NO_WATERMARKS because this
> > +* could deplete whole memory reserves which would just make
> > +* the situation worse
> > +*/
> > +   page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> > +   if (page)
> > +   goto got_pg;
> > +
> 
> But this should be a separate patch, IMO.
> 
> Do we observe GFP_NOFS lockups when we don't do this? 

this is hard to tell, but considering users like grow_dev_page I believe
we can get stuck with very slow progress. Those allocations could use
some help.

> Don't we risk
> premature exhaustion of the memory reserves, and it's better to wait
> for other reclaimers to make some progress instead?

waiting for other reclaimers would be preferable but we should at least
give these some priority, which is what ALLOC_HARDER should help with.

> Should we give
> reserve access to all GFP_NOFS allocations, or just the ones from a
> reclaim/cleaning context?

I would focus only on those which are important enough. Which ones those
are is a harder question. But certainly those with GFP_NOFAIL are
important enough.

> All that should go into the changelog of a separate allocation booster
> patch, I think.

The reason I did both in the same patch is to address the concern about
potential lockups when NOFS|NOFAIL cannot make any progress. I've chosen
ALLOC_HARDER to give a minimum portion of the reserves so that we do
not risk blocking out other high priority users, while still helping
at least a bit and preventing starvation when other reclaimers are
faster to consume the reclaimed memory.

I can extend the changelog of course but I believe that having both
changes together makes some sense. NOFS|NOFAIL allocations are not all
that rare and sometimes we really depend on them making a further
progress.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 11:37:50, Brian Foster wrote:
> On Fri, Dec 16, 2016 at 04:40:41PM +0100, Michal Hocko wrote:
> > Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
> > ---
> > From 1497e713e11639157aef21cae29052cb3dc7ab44 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mho...@suse.com>
> > Date: Thu, 15 Dec 2016 13:06:43 +0100
> > Subject: [PATCH] xfs: introduce and use KM_NOLOCKDEP to silence reclaim
> >  lockdep false positives
> > 
> > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> > KM_NOFS tags to keep lockdep happy") and use the new flag for them
> > instead. There is really no reason to make these allocations contexts
> > weaker just because of the lockdep which even might not be enabled
> > in most cases.
> > 
> 
> Hi Michal,
> 
> I haven't gone back to fully grok b17cb364dbbb ("xfs: fix missing
> KM_NOFS tags to keep lockdep happy"), so I'm not really familiar with
> the original problem. FWIW, there was another KM_NOFS instance added by
> that commit in xlog_cil_prepare_log_vecs() that is now in
> xlog_cil_alloc_shadow_bufs(). Perhaps Dave can confirm whether the
> original issue still applies..?

Yes, I've noticed that but the reworked code looked sufficiently
different that I didn't dare to simply convert it.
-- 
Michal Hocko
SUSE Labs


[PATCH 5/9 v2] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 11:38:11, Brian Foster wrote:
> On Thu, Dec 15, 2016 at 03:07:11PM +0100, Michal Hocko wrote:
[...]
> > @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
> > break;
> > vm_unmap_aliases();
> > } while (retried++ <= 1);
> > -   memalloc_noio_restore(noio_flag);
> > +   memalloc_noio_restore(nofs_flag);
> 
> memalloc_nofs_restore() ?

Ups, you are right of course. Fixed.
---
>From 47826112e59014030ffe27a673c1b1de345dd9de Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Thu, 15 Dec 2016 13:10:53 +0100
Subject: [PATCH] xfs: use memalloc_nofs_{save,restore} instead of
 memalloc_noio*

kmem_zalloc_large and _xfs_buf_map_pages use the memalloc_noio_{save,restore}
API to prevent reclaim recursion into the fs, because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from NOFS contexts. memalloc_noio_save will enforce a GFP_NOIO
context, which is even weaker than GFP_NOFS, and that seems unnecessary.
Let's use memalloc_nofs_{save,restore} instead as it should provide
exactly what we need here - an implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.c| 10 +-
 fs/xfs/xfs_buf.c |  8 
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..d69ed5e76621 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-   unsigned noio_flag = 0;
+   unsigned nofs_flag = 0;
void*ptr;
gfp_t   lflags;
 
@@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   noio_flag = memalloc_noio_save();
+   if (flags & KM_NOFS)
+   nofs_flag = memalloc_nofs_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   memalloc_noio_restore(noio_flag);
+   if (flags & KM_NOFS)
+   memalloc_nofs_restore(nofs_flag);
 
return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f31ae592dcae..e9eec256056c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -441,17 +441,17 @@ _xfs_buf_map_pages(
bp->b_addr = NULL;
} else {
int retried = 0;
-   unsigned noio_flag;
+   unsigned nofs_flag;
 
/*
 * vm_map_ram() will allocate auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we are likely to be under
 * GFP_NOFS context here. Hence we need to tell memory reclaim
-* that we are in such a context via PF_MEMALLOC_NOIO to prevent
+* that we are in such a context via PF_MEMALLOC_NOFS to prevent
 * memory reclaim re-entering the filesystem here and
 * potentially deadlocking.
 */
-   noio_flag = memalloc_noio_save();
+   nofs_flag = memalloc_nofs_save();
do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-1, PAGE_KERNEL);
@@ -459,7 +459,7 @@ _xfs_buf_map_pages(
break;
vm_unmap_aliases();
} while (retried++ <= 1);
-   memalloc_noio_restore(noio_flag);
+   memalloc_nofs_restore(nofs_flag);
 
if (!bp->b_addr)
return -ENOMEM;
-- 
2.10.2


-- 
Michal Hocko
SUSE Labs


[PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically

2016-12-16 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be quite
surprising as well. E.g. GFP_NOFS requests do not invoke the OOM
killer but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
the existing open coded loops around the allocator to nofail requests
(and we have done that in the past) then such a change would have a
non-trivial side effect which is not obvious. Note that the primary
motivation for skipping the OOM killer is to prevent its premature
invocation.

The exception has been added by 82553a937f12 ("oom: invoke oom killer
for __GFP_NOFAIL"). The changelog points out that the oom killer has to
be invoked, otherwise the request would loop forever. But this
argument is rather weak because the OOM killer doesn't really guarantee
any forward progress for those exceptional cases:
- it will hardly help to satisfy a costly order request, which in
  turn can result in a system panic because no oom killable task is
  left in the end - I believe we certainly do not want to put the
  system down just because there is a nasty driver asking for an
  order-9 page with GFP_NOFAIL, not realizing all the consequences.
  It is much better for this request to loop for ever than to cause
  massive system disruption
- lowmem is also highly unlikely to be freed by the OOM killer
- GFP_NOFS request could trigger while there is still a lot of
  memory pinned by filesystems.

A premature OOM killer invocation is a real issue, as reported by Nils Holland:
kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 
08/06/2014
Workqueue: writeback wb_workfn (flush-btrfs-1)
 eff0b604 c142bcce eff0b734  eff0b634 c1163332  0292
 eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
 eff0b678 c110795f c1043895 eff0b664 c11075c7 0007  
Call Trace:
 [] dump_stack+0x47/0x69
 [] dump_header+0x60/0x178
 [] ? ___ratelimit+0x86/0xe0
 [] oom_kill_process+0x20f/0x3d0
 [] ? has_capability_noaudit+0x15/0x20
 [] ? oom_badness.part.13+0xb7/0x130
 [] out_of_memory+0xd9/0x260
 [] __alloc_pages_nodemask+0xbfb/0xc80
 [] pagecache_get_page+0xad/0x270
 [] alloc_extent_buffer+0x116/0x3e0
 [] btrfs_find_create_tree_block+0xe/0x10
[...]
Normal free:41332kB min:41368kB low:51708kB high:62048kB 
active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB 
slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB 
pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB 
active_anon:234740kB inactive_anon:360kB active_file:557232kB 
inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
free_cma:0kB

this is a GFP_NOFS|__GFP_NOFAIL request which invokes the OOM killer
because there is clearly nothing reclaimable in the zone Normal while
there is a lot of page cache which is most probably pinned by the fs but
GFP_NOFS cannot reclaim it.

This patch simply removes the __GFP_NOFAIL special case in order to have
a clearer semantic without surprising side effects. Instead we allow
nofail requests to access memory reserves to move forward in both
cases: when the OOM killer is invoked and when it should be suppressed.
In the latter case we are more careful and only allow partial access
because we do not want to risk depleting the whole reserves. There
are users doing GFP_NOFS|__GFP_NOFAIL heavily (e.g. __getblk_gfp ->
grow_dev_page).

Introduce the __alloc_pages_cpuset_fallback helper, which allows
bypassing allocation constraints for the given gfp mask while still
enforcing cpusets whenever possible.

Reported-by: Nils Holland <nholl...@tisys.org>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 97 -
 2 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_ki

[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath

2016-12-16 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom
detection") has subtly changed semantic for costly high order requests
with __GFP_NOFAIL and without __GFP_REPEAT, and those can fail right now.
My code inspection didn't reveal any such users in the tree but it is
true that this might lead to unexpected allocation failures and
subsequent OOPs.

__alloc_pages_slowpath is currently hard to follow wrt. GFP_NOFAIL.
There are a few special cases but we lack a catch-all place to be
sure we will not miss any case where a non-failing allocation might
fail. This patch reorganizes the code a bit and puts all those special
cases under the nopage label, which is the generic go-to-fail path.
Non-failing allocations are retried, while those that cannot retry,
like non-sleeping allocations, go to the failure point directly. This
should make the code flow much easier to follow and less error prone
for future changes.

While we are there we have to move the stall check up to catch
potentially looping non-failing allocations.

Changes since v1
- do not skip direct reclaim for TIF_MEMDIE && GFP_NOFAIL as per Hillf
- do not skip __alloc_pages_may_oom for TIF_MEMDIE && GFP_NOFAIL as
  per Tetsuo

Signed-off-by: Michal Hocko <mho...@suse.com>
Acked-by: Vlastimil Babka <vba...@suse.cz>
Acked-by: Johannes Weiner <han...@cmpxchg.org>
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
---
 mm/page_alloc.c | 75 +
 1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f2c9e535f7f..095e2fa286de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3640,35 +3640,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto got_pg;
 
/* Caller is not willing to reclaim, we can't balance anything */
-   if (!can_direct_reclaim) {
-   /*
-* All existing users of the __GFP_NOFAIL are blockable, so warn
-* of any new users that actually allow this type of allocation
-* to fail.
-*/
-   WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
+   if (!can_direct_reclaim)
goto nopage;
-   }
 
-   /* Avoid recursion of direct reclaim */
-   if (current->flags & PF_MEMALLOC) {
-   /*
-* __GFP_NOFAIL request from this context is rather bizarre
-* because we cannot reclaim anything and only can loop waiting
-* for somebody to do a work for us.
-*/
-   if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
-   cond_resched();
-   goto retry;
-   }
-   goto nopage;
+   /* Make sure we know about allocations which stall for too long */
+   if (time_after(jiffies, alloc_start + stall_timeout)) {
+   warn_alloc(gfp_mask,
+   "page allocation stalls for %ums, order:%u",
+   jiffies_to_msecs(jiffies-alloc_start), order);
+   stall_timeout += 10 * HZ;
}
 
-   /* Avoid allocations with no watermarks from looping endlessly */
-   if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+   /* Avoid recursion of direct reclaim */
+   if (current->flags & PF_MEMALLOC)
goto nopage;
 
-
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
&did_some_progress);
@@ -3692,14 +3678,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
goto nopage;
 
-   /* Make sure we know about allocations which stall for too long */
-   if (time_after(jiffies, alloc_start + stall_timeout)) {
-   warn_alloc(gfp_mask,
-   "page allocation stalls for %ums, order:%u",
-   jiffies_to_msecs(jiffies-alloc_start), order);
-   stall_timeout += 10 * HZ;
-   }
-
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
 did_some_progress > 0, &no_progress_loops))
goto retry;
@@ -3721,6 +3699,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;
 
+   /* Avoid allocations with no watermarks from looping endlessly */
+   if (test_thread_flag(TIF_MEMDIE))
+   goto nopage;
+
/* Retry as long as the OOM killer is making progress */
if (did_some_progress) {
no_progress_loops = 0;
@@ -3728,6 +3710,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, un

Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 08:39:41, Michal Hocko wrote:
[...]
> That being said, the OOM killer invocation is clearly pointless and
> premature. We normally do not invoke it for GFP_NOFS requests
> exactly for these reasons. But this is GFP_NOFS|__GFP_NOFAIL which
> behaves differently. I am about to change that but my last attempt [1]
> has to be rethought.
> 
> Now another thing is that the __GFP_NOFAIL which has this nasty side
> effect has been introduced by me d1b5c5671d01 ("btrfs: Prevent from
> early transaction abort") in 4.3 so I am quite surprised that this has
> shown up only in 4.8. Anyway there might be some other changes in the
> btrfs which could make it more subtle.
> 
> I believe the right way to go around this is to pursue what I've started
> in [1]. I will try to prepare something for testing today for you. Stay
> tuned. But I would be really happy if somebody from the btrfs camp could
> check the NOFS aspect of this allocation. We have already seen
> allocation stalls from this path quite recently

Could you try to run with the two following patches?


[PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

2016-12-16 Thread Michal Hocko
Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
---
>From 1497e713e11639157aef21cae29052cb3dc7ab44 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Thu, 15 Dec 2016 13:06:43 +0100
Subject: [PATCH] xfs: introduce and use KM_NOLOCKDEP to silence reclaim
 lockdep false positives

Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
KM_NOFS tags to keep lockdep happy") and use the new flag for them
instead. There is really no reason to make these allocations contexts
weaker just because of the lockdep which even might not be enabled
in most cases.

Changes since v1
- check for KM_NOLOCKDEP in kmem_flags_convert to not hit sanity BUG_ON
  as per Mike Galbraith

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/xfs/kmem.h| 6 +-
 fs/xfs/libxfs/xfs_da_btree.c | 4 ++--
 fs/xfs/xfs_buf.c | 2 +-
 fs/xfs/xfs_dir2_readdir.c| 2 +-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d5d634ef1f7f 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -33,6 +33,7 @@ typedef unsigned __bitwise xfs_km_flags_t;
 #define KM_NOFS((__force xfs_km_flags_t)0x0004u)
 #define KM_MAYFAIL ((__force xfs_km_flags_t)0x0008u)
 #define KM_ZERO((__force xfs_km_flags_t)0x0010u)
+#define KM_NOLOCKDEP   ((__force xfs_km_flags_t)0x0020u)
 
 /*
  * We use a special process flag to avoid recursive callbacks into
@@ -44,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 {
gfp_t   lflags;
 
-   BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
+   BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
 
if (flags & KM_NOSLEEP) {
lflags = GFP_ATOMIC | __GFP_NOWARN;
@@ -57,6 +58,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
if (flags & KM_ZERO)
lflags |= __GFP_ZERO;
 
+   if (flags & KM_NOLOCKDEP)
+   lflags |= __GFP_NOLOCKDEP;
+
return lflags;
 }
 
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index f2dc1a950c85..b8b5f6914863 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2429,7 +2429,7 @@ xfs_buf_map_from_irec(
 
if (nirecs > 1) {
map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map),
- KM_SLEEP | KM_NOFS);
+ KM_SLEEP | KM_NOLOCKDEP);
if (!map)
return -ENOMEM;
*mapp = map;
@@ -2488,7 +2488,7 @@ xfs_dabuf_map(
 */
if (nfsb != 1)
irecs = kmem_zalloc(sizeof(irec) * nfsb,
-   KM_SLEEP | KM_NOFS);
+   KM_SLEEP | KM_NOLOCKDEP);
 
nirecs = nfsb;
error = xfs_bmapi_read(dp, (xfs_fileoff_t)bno, nfsb, irecs,
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7f0a01f7b592..f31ae592dcae 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1785,7 +1785,7 @@ xfs_alloc_buftarg(
 {
xfs_buftarg_t   *btp;
 
-   btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOFS);
+   btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOLOCKDEP);
 
btp->bt_mount = mp;
btp->bt_dev =  bdev->bd_dev;
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 003a99b83bd8..033ed65d7ce6 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -503,7 +503,7 @@ xfs_dir2_leaf_getdents(
length = howmany(bufsize + geo->blksize, (1 << geo->fsblog));
map_info = kmem_zalloc(offsetof(struct xfs_dir2_leaf_map_info, map) +
(length * sizeof(struct xfs_bmbt_irec)),
-  KM_SLEEP | KM_NOFS);
+  KM_SLEEP | KM_NOLOCKDEP);
    map_info->map_size = length;
 
/*
-- 
2.10.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/9 v2] scope GFP_NOFS api

2016-12-16 Thread Michal Hocko
On Fri 16-12-16 16:05:58, Mike Galbraith wrote:
> On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> > Hi,
> > I have posted the previous version here [1]. Since then I have added a
> > support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> > removing GFP_NOFS usage motivated by the lockdep false positives. On top
> > of that I've tried to convert few KM_NOFS usages to use the new flag in
> > the xfs code base. This would need a review from somebody familiar with
> > xfs of course.
> 
> The wild ass guess below prevents the xfs explosion below when running
> ltp zram tests.

Yes, this looks correct. Thanks for noticing. I will fold it into
patch 2. Thanks for testing, Mike!
> 
> ---
>  fs/xfs/kmem.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -45,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  {
>   gfp_t   lflags;
>  
> - BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
> + BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
>  
>   if (flags & KM_NOSLEEP) {
>   lflags = GFP_ATOMIC | __GFP_NOWARN;
-- 
Michal Hocko
SUSE Labs


[DEBUG PATCH 2/2] silence warnings which we cannot do anything about

2016-12-16 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

There are some code paths used by all the filesystems which we cannot
change to drop the GFP_NOFS, yet they generate a lot of warnings.
Provide {disable,enable}_scope_gfp_check to silence those.
alloc_page_buffers and grow_dev_page are silenced right away.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 fs/buffer.c   |  4 
 include/linux/sched.h | 11 +++
 mm/page_alloc.c   |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index d21771fcf7d3..d27e8f05f736 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -873,7 +873,9 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
head = NULL;
offset = PAGE_SIZE;
while ((offset -= size) >= 0) {
+   disable_scope_gfp_check();
bh = alloc_buffer_head(GFP_NOFS);
+   enable_scope_gfp_check();
if (!bh)
goto no_grow;
 
@@ -1003,7 +1005,9 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 */
gfp_mask |= __GFP_NOFAIL;
 
+   disable_scope_gfp_check();
page = find_or_create_page(inode->i_mapping, index, gfp_mask);
+   enable_scope_gfp_check();
if (!page)
return ret;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 288946bfc326..b379ef9ed464 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,7 @@ struct task_struct {
/* A live task holds one reference. */
atomic_t stack_refcount;
 #endif
+   bool disable_scope_gfp_warn;
unsigned long nofs_caller;
unsigned long noio_caller;
 /* CPU-specific state of this task */
@@ -2390,6 +2391,16 @@ static inline unsigned int __memalloc_nofs_save(unsigned long caller)
return flags;
 }
 
+static inline void disable_scope_gfp_check(void)
+{
+   current->disable_scope_gfp_warn = true;
+}
+
+static inline void enable_scope_gfp_check(void)
+{
+   current->disable_scope_gfp_warn = false;
+}
+
 #define memalloc_nofs_save()   __memalloc_nofs_save(_RET_IP_)
 
 static inline void memalloc_nofs_restore(unsigned int flags)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e35fb2a8681..7ecae58abf74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3758,6 +3758,9 @@ void debug_scope_gfp_context(gfp_t gfp_mask)
if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
return;
 
+   if (current->disable_scope_gfp_warn)
+   return;
+
if (current->flags & PF_MEMALLOC_NOIO)
restrict_mask = __GFP_IO;
else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
-- 
2.10.2



[DEBUG PATCH 1/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

2016-12-16 Thread Michal Hocko
From: Michal Hocko <mho...@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce direct GFP_NO{FS,IO} usage to a minimum and
prefer the scope usage defined by the memalloc_no{fs,io}_{save,restore} API.

Let's help this process and add a debugging tool to catch when an
explicit allocation request for GFP_NO{FS,IO} is done from the scope
context. The printed stacktrace should help to identify the caller
and evaluate whether it can be changed to use a wider context or whether
it is called from another potentially dangerous context which needs
a scope protection as well.

The checks have to be enabled explicitly by debug_scope_gfp kernel
command line parameter.

Signed-off-by: Michal Hocko <mho...@suse.com>
---
 include/linux/sched.h | 14 +++--
 include/linux/slab.h  |  3 +++
 mm/page_alloc.c   | 58 +++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c9fbcbcfcc8..288946bfc326 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,8 @@ struct task_struct {
/* A live task holds one reference. */
atomic_t stack_refcount;
 #endif
+   unsigned long nofs_caller;
+   unsigned long noio_caller;
 /* CPU-specific state of this task */
struct thread_struct thread;
 /*
@@ -2345,6 +2347,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+extern void debug_scope_gfp_context(gfp_t gfp_mask);
+
 /*
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
@@ -2363,25 +2367,31 @@ static inline gfp_t current_gfp_context(gfp_t flags)
return flags;
 }
 
-static inline unsigned int memalloc_noio_save(void)
+static inline unsigned int __memalloc_noio_save(unsigned long caller)
 {
unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
current->flags |= PF_MEMALLOC_NOIO;
+   current->noio_caller = caller;
return flags;
 }
 
+#define memalloc_noio_save()   __memalloc_noio_save(_RET_IP_)
+
 static inline void memalloc_noio_restore(unsigned int flags)
 {
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
-static inline unsigned int memalloc_nofs_save(void)
+static inline unsigned int __memalloc_nofs_save(unsigned long caller)
 {
unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
current->flags |= PF_MEMALLOC_NOFS;
+   current->nofs_caller = caller;
return flags;
 }
 
+#define memalloc_nofs_save()   __memalloc_nofs_save(_RET_IP_)
+
 static inline void memalloc_nofs_restore(unsigned int flags)
 {
current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..6559668e29db 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -477,6 +477,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  */
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
+   debug_scope_gfp_context(flags);
if (__builtin_constant_p(size)) {
if (size > KMALLOC_MAX_CACHE_SIZE)
return kmalloc_large(size, flags);
@@ -517,6 +518,7 @@ static __always_inline int kmalloc_size(int n)
 
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
+   debug_scope_gfp_context(flags);
 #ifndef CONFIG_SLOB
if (__builtin_constant_p(size) &&
size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
@@ -575,6 +577,7 @@ int memcg_update_all_caches(int num_memcgs);
  */
 static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
 {
+   debug_scope_gfp_context(flags);
if (size != 0 && n > SIZE_MAX / size)
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e701be6b930a..9e35fb2a8681 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3734,6 +3734,63 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
return page;
 }
 
+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+   debug_scope_gfp = true;
+   return 0;
+}
+
+/*
+ * Spit out the stack trace if the given gfp_mask clears flags which are
+ * cleared context-wide. Such a caller can remove the special flag clearing
+ * and rely on the context wide mask.
+ */
+void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+   gfp_t restrict_mask;
+
+   if (likely(!debug_scope_gfp))
+   return;
+
+   /* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+   if (!(gfp_mask & __GFP_DIRECT_RECLAIM))

[DEBUG PATCH 0/2] debug explicit GFP_NO{FS,IO} usage from the scope context

2016-12-16 Thread Michal Hocko
Hi,
I've forgot to add the following two patches which should help to
identify explicit GFP_NO{FS,IO} usage from within a scope context. Such
a usage can be changed to the full GFP_KERNEL because all the calls
from within the NO{FS,IO} scope will drop the __GFP_FS resp. __GFP_IO
automatically and if the function is called outside of the scope then
we do not need to restrict it to NOFS/NOIO as long as all the reclaim
recursion unsafe contexts are marked properly. This means that each
such reported allocation site has to be checked before being converted.

The debugging has to be enabled explicitly by a kernel command line
parameter and then it reports the stack trace of the allocation and
also the function which has started the current scope.

These two patches are _not_ intended to be merged and they are only
aimed at debugging.



Re: OOM: Better, but still there on 4.9

2016-12-15 Thread Michal Hocko
> isolated(file):0kB mapped:29528kB dirty:2596kB 
> writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 
> 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
> Dec 15 19:02:18 teela kernel: DMA free:3952kB min:788kB low:984kB high:1180kB 
> active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB 
> unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
> slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
> pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 813 3474 3474
> Dec 15 19:02:18 teela kernel: Normal free:41332kB min:41368kB low:51708kB 
> high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB 
> inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB 
> managed:836248kB mlocked:0kB slab_reclaimable:159448kB 
> slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB 
> free_pcp:528kB local_pcp:340kB free_cma:0kB

And this shows that there is no anonymous memory in the lowmem zone.
Note that this request cannot use the highmem zone so no swap out would
help. So if we are not able to reclaim those pages on the file LRU then
we are out of luck.

> Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 21292 21292
> Dec 15 19:02:18 teela kernel: HighMem free:781660kB min:512kB low:34356kB 
> high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB 
> inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
> managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
> kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
> free_cma:0kB

That being said, the OOM killer invocation is clearly pointless and
premature. We normally do not invoke it for GFP_NOFS requests
exactly for these reasons. But this is GFP_NOFS|__GFP_NOFAIL which
behaves differently. I am about to change that but my last attempt [1]
has to be rethought.

Now another thing is that the __GFP_NOFAIL which has this nasty side
effect was introduced by me in d1b5c5671d01 ("btrfs: Prevent from
early transaction abort") in 4.3, so I am quite surprised that this has
shown up only in 4.8. Anyway, there might be some other changes in
btrfs which could make it more subtle.

I believe the right way to go around this is to pursue what I've started
in [1]. I will try to prepare something for testing today for you. Stay
tuned. But I would be really happy if somebody from the btrfs camp could
check the NOFS aspect of this allocation. We have already seen
allocation stalls from this path quite recently [2].

[1] http://lkml.kernel.org/r/20161201152517.27698-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/20161214101743.ga25...@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

