Re: [PATCH] zram: pass down the bvec we need to read into in the work struct

2019-04-16 Thread Andrew Morton
On Wed, 10 Apr 2019 15:43:50 -0400 Jerome Glisse  wrote:

> Adding more Cc and stable (I thought this was a 5.1 addition). Note that
> without this patch, on an arch/kernel where PAGE_SIZE != 4096, userspace
> could read random memory through a zram block device (though userspace
> probably would have no control over the address being read).

Looks good to me.

Minchan & Sergey, can you please review?


From: Jérôme Glisse 
Subject: zram: pass down the bvec we need to read into in the work struct

When scheduling a work item to read a page, we need to pass down the
proper bvec struct, which points to the page to read into.  Before this
patch it used a randomly initialized bvec (only if PAGE_SIZE != 4096),
which is wrong.

Note that without this patch, on an arch/kernel where PAGE_SIZE != 4096,
userspace could read random memory through a zram block device (though
userspace probably would have no control over the address being read).

Link: http://lkml.kernel.org/r/20190408183219.26377-1-jgli...@redhat.com
Signed-off-by: Jérôme Glisse 
Reviewed-by: Andrew Morton 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
Cc: 
Signed-off-by: Andrew Morton 
---

 drivers/block/zram/zram_drv.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/drivers/block/zram/zram_drv.c~zram-pass-down-the-bvec-we-need-to-read-into-in-the-work-struct
+++ a/drivers/block/zram/zram_drv.c
@@ -774,18 +774,18 @@ struct zram_work {
struct zram *zram;
unsigned long entry;
struct bio *bio;
+   struct bio_vec bvec;
 };
 
 #if PAGE_SIZE != 4096
 static void zram_sync_read(struct work_struct *work)
 {
-   struct bio_vec bvec;
struct zram_work *zw = container_of(work, struct zram_work, work);
struct zram *zram = zw->zram;
unsigned long entry = zw->entry;
struct bio *bio = zw->bio;
 
-   read_from_bdev_async(zram, &bvec, entry, bio);
+   read_from_bdev_async(zram, &zw->bvec, entry, bio);
 }
 
 /*
@@ -798,6 +798,7 @@ static int read_from_bdev_sync(struct zr
 {
struct zram_work work;
 
+   work.bvec = *bvec;
work.zram = zram;
work.entry = entry;
work.bio = bio;
_
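A minimal sketch of the pattern the fix adopts (field layout taken from
the patch above; the work_struct member is assumed to sit above the hunk
shown): data the worker needs travels inside the work container and is
recovered with container_of(), instead of living in an uninitialized
local of the worker.

struct zram_work {
	struct work_struct work;	/* assumed; above the hunk shown */
	struct zram *zram;
	unsigned long entry;
	struct bio *bio;
	struct bio_vec bvec;		/* copied in before the work is queued */
};

static void zram_sync_read(struct work_struct *work)
{
	struct zram_work *zw = container_of(work, struct zram_work, work);

	/* zw->bvec was filled in by the submitter; a local bvec here
	 * would hold whatever happened to be on the stack */
	read_from_bdev_async(zw->zram, &zw->bvec, zw->entry, zw->bio);
}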



Re: [PATCH 2/2] kernel: use sysctl shared variables for range check

2019-04-16 Thread Andrew Morton
On Wed, 10 Apr 2019 15:59:55 -0700 Kees Cook  wrote:

> On Wed, Apr 10, 2019 at 3:54 PM Matteo Croce  wrote:
> >
> > On Thu, Apr 11, 2019 at 12:34 AM Kees Cook  wrote:
> > >
> > > On Wed, Apr 10, 2019 at 3:30 PM Matteo Croce  wrote:
> > > >
> > > > FYI, these are the stats from my local repo, just to let you know the
> > > > size of a series with all the changes in it:
> > > >
> > > > $ git --no-pager log --stat --oneline linus/master
> > > >  2 files changed, 4 insertions(+), 9 deletions(-)
> > > >  2 files changed, 7 insertions(+), 14 deletions(-)
> > > >  6 files changed, 15 insertions(+), 30 deletions(-)
> > > >  1 file changed, 16 insertions(+), 19 deletions(-)
> > > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > > >  3 files changed, 15 insertions(+), 20 deletions(-)
> > > >  12 files changed, 93 insertions(+), 116 deletions(-)
> > > >  3 files changed, 98 insertions(+), 104 deletions(-)
> > > >  2 files changed, 9 insertions(+)
> > >
> > > Tiny! :) Seriously, though, I think this should be fine to take
> > > directly to Linus after patch 1 lands, or see if akpm wants to carry
> > > it directly.
> > >
> > > --
> > > Kees Cook
> >
> > Nice. Still as an 8 patches series, or squashed into a bigger one?
> 
> I suspect a single patch would be fine, but let's wait to hear from akpm.

Bring it!


Re: [PATCH] init: Initialize jump labels before command line option parsing

2019-04-16 Thread Andrew Morton
On Tue, 16 Apr 2019 13:54:04 -0700 Dan Williams  
wrote:

> When a module option, or core kernel argument, toggles a static-key it
> requires jump labels to be initialized early. While x86, PowerPC, and
> ARM64 arrange for jump_label_init() to be called before parse_args(),
> ARM does not.
> 
>   Kernel command line: rdinit=/sbin/init page_alloc.shuffle=1 panic=-1 console=ttyAMA0,115200 page_alloc.shuffle=1
>   [ cut here ]
>   WARNING: CPU: 0 PID: 0 at ./include/linux/jump_label.h:303 page_alloc_shuffle+0x12c/0x1ac
>   static_key_enable(): static key 'page_alloc_shuffle_key+0x0/0x4' used before call to jump_label_init()
>   Modules linked in:
>   CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc4-next-20190410-3-g3367c36ce744 #1
>   Hardware name: ARM Integrator/CP (Device Tree)
>   [] (unwind_backtrace) from [] (show_stack+0x10/0x18)
>   [] (show_stack) from [] (dump_stack+0x18/0x24)
>   [] (dump_stack) from [] (__warn+0xe0/0x108)
>   [] (__warn) from [] (warn_slowpath_fmt+0x44/0x6c)
>   [] (warn_slowpath_fmt) from [] (page_alloc_shuffle+0x12c/0x1ac)
>   [] (page_alloc_shuffle) from [] (shuffle_store+0x28/0x48)
>   [] (shuffle_store) from [] (parse_args+0x1f4/0x350)
>   [] (parse_args) from [] (start_kernel+0x1c0/0x488)
> 
> Move the fallback call to jump_label_init() to occur before
> parse_args(). The redundant calls to jump_label_init() in other archs
> are left intact in case they have static key toggling use cases that are
> even earlier than option parsing.

Has it been confirmed that this fixes
mm-shuffle-initial-free-memory-to-improve-memory-side-cache-utilization.patch
on beaglebone-black?
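For reference, the shape of the fix in init/main.c (a sketch only;
argument lists and unrelated calls elided as comments): the fallback
jump_label_init() moves ahead of parse_args() so option parsing can
toggle static keys on every architecture.

asmlinkage __visible void __init start_kernel(void)
{
	/* ... early setup ... */

	/* Architectures without an earlier call get the fallback here,
	 * before any command line option can flip a static key. */
	jump_label_init();

	parse_early_param();
	/* parse_args() may now reach handlers such as shuffle_store()
	 * that call static_key_enable() without warning. */

	/* ... remainder of boot ... */
}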



Re: [PATCH] binfmt_elf: Move brk out of mmap when doing direct loader exec

2019-04-16 Thread Andrew Morton
On Tue, 16 Apr 2019 18:14:00 -0500 Kees Cook  wrote:

> On Tue, Apr 16, 2019 at 6:04 PM Andrew Morton  
> wrote:
> >
> > >
> > > Reported-by: Ali Saidi 
> > > Link: 
> > > https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=omfhx_p0nckp...@mail.gmail.com
> > > Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
> > > Signed-off-by: Kees Cook 
> >
> > No cc:stable?
> 
> Probably it should be, yes. I think I'm just shy about that when
> poking ELF mappings. :)

Well, the -stable bots will backport anything that might look slightly
like a fix anyway.

I'll add cc:stable and shall hold it out until 5.2-rc1, so it should
get a bit of a spin before being backported.


Re: [PATCH 0/4] z3fold: support page migration

2019-04-16 Thread Andrew Morton
On Thu, 11 Apr 2019 17:32:12 +0200 Vitaly Wool  wrote:

> This patchset implements page migration support and slightly better
> buddy search. To implement page migration support, z3fold has to move
> away from the current scheme of handle encoding, i.e. stop encoding the
> page address in handles. Instead, a small per-page structure is created
> which will contain actual addresses for z3fold objects, while pointers
> to fields of that structure will be used as handles.

Can you please help find a reviewer for this work?

For some reason I'm seeing a massive number of rejects when trying to
apply these.  It looks like your mail client performed some sort of
selective space-stuffing.  I suggest you email a patch to yourself and
check that the result applies properly.





Re: [PATCH] binfmt_elf: Move brk out of mmap when doing direct loader exec

2019-04-16 Thread Andrew Morton
On Mon, 15 Apr 2019 21:23:20 -0700 Kees Cook  wrote:

> Commit eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"),
> made changes in the rare case when the ELF loader was directly invoked
> (e.g. to set a non-inheritable LD_LIBRARY_PATH, testing new versions of
> the loader), by moving into the mmap region to avoid both ET_EXEC and PIE
> binaries. This had the effect of also moving the brk region into mmap,
> which could lead to the stack and brk being arbitrarily close to each
> other. An unlucky process wouldn't get its requested stack size and stack
> allocations could end up scribbling on the heap.
> 
> This is illustrated here. In the case of using the loader directly, brk
> (so helpfully identified as "[heap]") is allocated with the _loader_
> not the binary. For example, with ASLR entirely disabled, you can see
> this more clearly:
> 
> $ /bin/cat /proc/self/maps
> 4000-c000 r-xp  ... /bin/cat
> 5575b000-5575c000 r--p 7000 ... /bin/cat
> 5575c000-5575d000 rw-p 8000 ... /bin/cat
> 5575d000-5577e000 rw-p  ... [heap]
> ...
> 77ff7000-77ffa000 r--p  ... [vvar]
> 77ffa000-77ffc000 r-xp  ... [vdso]
> 77ffc000-77ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77ffd000-77ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77ffe000-77fff000 rw-p  ...
> 7ffde000-7000 rw-p  ... [stack]
> 
> $ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps
> ...
> 77bcc000-77bd4000 r-xp  ... /bin/cat
> 77bd4000-77dd3000 ---p 8000 ... /bin/cat
> 77dd3000-77dd4000 r--p 7000 ... /bin/cat
> 77dd4000-77dd5000 rw-p 8000 ... /bin/cat
> 77dd5000-77dfc000 r-xp  ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77fb2000-77fd6000 rw-p  ...
> 77ff7000-77ffa000 r--p  ... [vvar]
> 77ffa000-77ffc000 r-xp  ... [vdso]
> 77ffc000-77ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77ffd000-77ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77ffe000-7802 rw-p  ... [heap]
> 7ffde000-7000 rw-p  ... [stack]
> 
> The solution is to move brk out of mmap and into ELF_ET_DYN_BASE since
> nothing is there in this direct loader case (and ET_EXEC still far
> away at 0x40). Anything that ran before should still work (i.e. the
> ultimately-launched binary already had the brk very far from its text,
> so this should be no different from a COMPAT_BRK standpoint). The only
> risk I see here is that if someone started to suddenly depend on the
> entire memory space above the mmap region being available when launching
> binaries via a direct loader exec, which seems highly unlikely, I'd hope:
> this would mean a binary would _not_ work when execed normally.
> 
> Reported-by: Ali Saidi 
> Link: 
> https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=omfhx_p0nckp...@mail.gmail.com
> Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
> Signed-off-by: Kees Cook 

No cc:stable?


Re: [PATCH v15 1/3] /proc/pid/status: Add support for architecture specific output

2019-04-16 Thread Andrew Morton
On Tue, 16 Apr 2019 14:32:48 +0800 Aubrey Li  wrote:

> The architecture specific information of the running processes could
> be useful to the userland. Add support to examine process architecture
> specific information externally.

The implementation looks just fine to me.  Have you had any feedback on
the overall desirability of adding this feature?

> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -96,6 +96,11 @@
>  #include 
>  #include "internal.h"
>  
> +/* Add support for architecture specific output in /proc/pid/status */
> +#ifndef  arch_proc_pid_status
> +#define  arch_proc_pid_status(m, task)
> +#endif

To this I suggest adding

/* arch_proc_pid_status() must be defined in asm/processor.h */

Because we've regularly had different architectures defining such things
in different headers, resulting in a mess.




Re: linux-next: build warning after merge of the akpm-current tree

2019-04-16 Thread Andrew Morton
On Tue, 16 Apr 2019 16:52:56 +1000 Stephen Rothwell  
wrote:

> Hi all,
> 
> On Fri, 29 Mar 2019 13:39:14 +1100 Stephen Rothwell  
> wrote:
> >
> > After merging the akpm-current tree, today's linux-next build (arm
> > multi_v7_defconfig) produced this warning:
> > 
> > lib/list_sort.c:17:36: warning: 'pure' attribute ignored [-Wattributes]
> >struct list_head const *, struct list_head const *);
> > ^
> > 
> > Introduced by commit
> > 
> >   820c81be5237 ("lib/list_sort: simplify and remove MAX_LIST_LENGTH_BITS")
> 
> I am still getting that warning :-(

Me too and I can't figure it out.

Shrug, I guess we take a pass on it until someone has time/inclination
to revisit it.

--- a/lib/list_sort.c~lib-list_sort-simplify-and-remove-max_list_length_bits-fix
+++ a/lib/list_sort.c
@@ -7,13 +7,7 @@
 #include 
 #include 
 
-/*
- * By declaring the compare function with the __pure attribute, we give
- * the compiler more opportunity to optimize.  Ideally, we'd use this in
- * the prototype of list_sort(), but that would involve a lot of churn
- * at all call sites, so just cast the function pointer passed in.
- */
-typedef int __pure __attribute__((nonnull(2,3))) (*cmp_func)(void *,
+typedef int __attribute__((nonnull(2,3))) (*cmp_func)(void *,
struct list_head const *, struct list_head const *);
 
 /*
_
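The warning is reproducible outside the kernel: GCC honours the
attribute on a function declaration but ignores it on a function-pointer
type, which is exactly what the typedef above creates (a standalone
illustration, not kernel code).

/* gcc -Wattributes -c pure_demo.c */
int f(int x) __attribute__((pure));			/* accepted */

typedef int (*cmp_t)(int) __attribute__((pure));	/* 'pure' attribute
							   ignored */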



Re: [PATCH v2] module: add stubs for within_module functions

2019-04-16 Thread Andrew Morton
On Tue, 16 Apr 2019 11:56:21 -0700 Tri Vo  wrote:

> On Tue, Apr 16, 2019 at 10:55 AM Tri Vo  wrote:
> >
> > On Tue, Apr 16, 2019 at 8:21 AM Jessica Yu  wrote:
> > >
> > > +++ Tri Vo [15/04/19 11:18 -0700]:
> > > >Provide stubs for within_module_core(), within_module_init(), and
> > > >within_module() to prevent build errors when !CONFIG_MODULES.
> > > >
> > > >v2:
> > > >- Generalized commit message, as per Jessica.
> > > >- Stubs for within_module_core() and within_module_init(), as per Nick.
> > > >
> > > >Suggested-by: Matthew Wilcox 
> > > >Reported-by: Randy Dunlap 
> > > >Reported-by: kbuild test robot 
> > > >Link: https://marc.info/?l=linux-mm&m=155384681109231&w=2
> > > >Signed-off-by: Tri Vo 
> > >
> > > Applied, thanks!
> >
> > Thank you!
> 
> Andrew,
> this patch fixes 8c3d220cb6b5 ("gcov: clang support"). Could you
> re-apply the gcov patch? Sorry, if it's a dumb question. I'm not
> familiar with how cross-tree patches are handled in Linux.

hm, I wonder what Jessica applied this patch to?

Please resend a new version of "gcov: clang support".


Re: [PATCH v1 1/4] mm/memory_hotplug: Release memory resource after arch_remove_memory()

2019-04-09 Thread Andrew Morton
On Tue,  9 Apr 2019 12:01:45 +0200 David Hildenbrand  wrote:

> __add_pages() doesn't add the memory resource, so __remove_pages()
> shouldn't remove it. Let's factor it out. Especially as it is a special
> case for memory used as system memory, added via add_memory() and
> friends.
> 
> We now remove the resource after removing the sections instead of doing
> it the other way around. I don't think this change is problematic.
> 
> add_memory()
>   register memory resource
>   arch_add_memory()
> 
> remove_memory
>   arch_remove_memory()
>   release memory resource
> 
> While at it, explain why we ignore errors and that it only happens if
> we remove memory in a different granularity than we added it.

Seems sane.

> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1820,6 +1806,25 @@ void try_offline_node(int nid)
>  }
>  EXPORT_SYMBOL(try_offline_node);
>  
> +static void __release_memory_resource(u64 start, u64 size)
> +{
> + int ret;
> +
> + /*
> +  * When removing memory in the same granularity as it was added,
> +  * this function never fails. It might only fail if resources
> +  * have to be adjusted or split. We'll ignore the error, as
> +  * removing of memory cannot fail.
> +  */
> + ret = release_mem_region_adjustable(&iomem_resource, start, size);
> + if (ret) {
> + resource_size_t endres = start + size - 1;
> +
> + pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> + &start, &endres, ret);
> + }
> +}

The types seem confused here.  Should `start' and `size' be
resource_size_t?  Or maybe phys_addr_t.

release_mem_region_adjustable() takes resource_size_t's.

Is %pa the way to print a resource_size_t?  I guess it happens to work
because resource_size_t happens to map onto phys_addr_t, which isn't
ideal.

Wanna have a headscratch over that?
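For what it's worth, %pa is documented for phys_addr_t and derived types
such as resource_size_t, and it takes the value by reference, hence the
'&' in the patch. A type-consistent sketch of the helper, under the
assumption that resource_size_t is the right type throughout:

static void __release_memory_resource(resource_size_t start,
				      resource_size_t size)
{
	resource_size_t end = start + size - 1;
	int ret;

	ret = release_mem_region_adjustable(&iomem_resource, start, size);
	if (ret)
		pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
			&start, &end, ret);
}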

>  /**
>   * remove_memory
>   * @nid: the node ID
> @@ -1854,6 +1859,7 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>   memblock_remove(start, size);
>  
>   arch_remove_memory(nid, start, size, NULL);
> + __release_memory_resource(start, size);
>  
>   try_offline_node(nid);



Re: [PATCH v5 2/7] slob: Respect list_head abstraction layer

2019-04-09 Thread Andrew Morton
On Wed, 10 Apr 2019 06:06:49 +1000 "Tobin C. Harding"  wrote:

> On Tue, Apr 09, 2019 at 02:59:52PM +0200, Vlastimil Babka wrote:
> > On 4/3/19 11:13 PM, Tobin C. Harding wrote:
> > 
> > > According to 0day test robot this is triggering an error from
> > > CHECK_DATA_CORRUPTION when the kernel is built with CONFIG_DEBUG_LIST.
> > 
> > FWIW, that report [1] was for commit 15c8410c67adef from next-20190401. I've
> > checked and it's still the v4 version, although the report came after you
> > submitted v5 (it wasn't testing the patches from mailing list, but mmotm). I
> > don't see any report for the v5 version so I'd expect it to be indeed
> > fixed by the new approach that adds a boolean return parameter to
> > slob_page_alloc().
> > 
> > Vlastimil
> 
> Oh man thanks!  That is super cool, thanks for letting me know
> Vlastimil.

Yes, thanks for the followup.


Re: [PATCH] mm/hmm: fix hmm_range_dma_map()/hmm_range_dma_unmap()

2019-04-09 Thread Andrew Morton
On Tue,  9 Apr 2019 13:53:40 -0400 jgli...@redhat.com wrote:

> Was using wrong field and wrong enum for read only versus read and
> write mapping.

For those who were wondering, this fixes
mm-hmm-add-an-helper-function-that-fault-pages-and-map-them-to-a-device-v3.patch,
which is presently queued in -mm.



Re: [PATCH] cpumask: fix double string traverse in cpumask_parse

2019-04-09 Thread Andrew Morton
On Tue,  9 Apr 2019 13:42:08 -0700 Yury Norov  wrote:

> From: Yury Norov 
> 
> cpumask_parse() finds first occurrence of either \n or \0 by calling
> strchr() and strlen(). We can do it better with a single call of
> strchrnul().

Fair enough.

> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -633,8 +633,7 @@ static inline int cpumask_parse(const char *buf, struct cpumask *dstp)
>   */
>  static inline int cpumask_parse(const char *buf, struct cpumask *dstp)
>  {
> - char *nl = strchr(buf, '\n');
> - unsigned int len = nl ? (unsigned int)(nl - buf) : strlen(buf);
> + unsigned int len = (unsigned int)(strchrnul(buf, '\n') - buf);

Does the cast do anything useful?  The compiler will convert a
ptrdiff_t to uint without issues, I think?

>   return bitmap_parse(buf, len, cpumask_bits(dstp), nr_cpumask_bits);
>  }
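strchrnul() returns a pointer to the first occurrence of the character
or, if none is found, to the terminating NUL, so a single call covers
both cases the old strchr()+strlen() pair handled. A userspace sketch of
the semantics (strchrnul() is a GNU extension):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *with_nl    = "0-3,7\njunk";
	const char *without_nl = "0-3,7";

	/* Both print 5: the length of the first line. */
	printf("%zu\n", (size_t)(strchrnul(with_nl, '\n') - with_nl));
	printf("%zu\n", (size_t)(strchrnul(without_nl, '\n') - without_nl));
	return 0;
}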



Re: [PATCH v2 2/2] mm, memory_hotplug: provide a more generic restrictions for memory hotplug

2019-04-08 Thread Andrew Morton
On Mon,  8 Apr 2019 10:26:33 +0200 Oscar Salvador  wrote:

> arch_add_memory, __add_pages take a want_memblock which controls whether
> the newly added memory should get the sysfs memblock user API (e.g.
> ZONE_DEVICE users do not want/need this interface). Some callers even
> want to control where we allocate the memmap from by configuring
> altmap.
> 
> Add a more generic hotplug context for arch_add_memory and __add_pages.
> struct mhp_restrictions contains a flags field carrying additional
> features to be enabled by the memory hotplug (currently
> MHP_MEMBLOCK_API) and an altmap for an alternative memmap allocator.
> 
> This patch shouldn't introduce any functional change.

From: Andrew Morton 
Subject: mm-memory_hotplug-provide-a-more-generic-restrictions-for-memory-hotplug-fix

x86_64 allnoconfig:

In file included from ./include/linux/mmzone.h:744:0,
 from ./include/linux/gfp.h:6,
 from ./include/linux/umh.h:4,
 from ./include/linux/kmod.h:22,
 from ./include/linux/module.h:13,
 from init/do_mounts.c:1:
./include/linux/memory_hotplug.h:353:11: warning: ‘struct mhp_restrictions’ declared inside parameter list will not be visible outside of this definition or declaration
struct mhp_restrictions *restrictions);
   ^~~~

Fix this by moving the arch_add_memory() definition inside
CONFIG_MEMORY_HOTPLUG and moving the mhp_restrictions definition to a more
appropriate place.

Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton 
---

 include/linux/memory_hotplug.h |   24 
 1 file changed, 12 insertions(+), 12 deletions(-)

--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-provide-a-more-generic-restrictions-for-memory-hotplug-fix
+++ a/include/linux/memory_hotplug.h
@@ -54,6 +54,16 @@ enum {
 };
 
 /*
+ * Restrictions for the memory hotplug:
+ * flags:  MHP_ flags
+ * altmap: alternative allocator for memmap array
+ */
+struct mhp_restrictions {
+   unsigned long flags;
+   struct vmem_altmap *altmap;
+};
+
+/*
  * Zone resizing functions
  *
  * Note: any attempt to resize a zone should has pgdat_resize_lock()
@@ -101,6 +111,8 @@ extern void __online_page_free(struct pa
 
 extern int try_online_node(int nid);
 
+extern int arch_add_memory(int nid, u64 start, u64 size,
+   struct mhp_restrictions *restrictions);
 extern u64 max_mem_size;
 
 extern bool memhp_auto_online;
@@ -126,16 +138,6 @@ extern int __remove_pages(struct zone *z
 
 #define MHP_MEMBLOCK_API   (1<<0)
 
-/*
- * Restrictions for the memory hotplug:
- * flags:  MHP_ flags
- * altmap: alternative allocator for memmap array
- */
-struct mhp_restrictions {
-   unsigned long flags;
-   struct vmem_altmap *altmap;
-};
-
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
   struct mhp_restrictions *restrictions);
@@ -349,8 +351,6 @@ extern int walk_memory_range(unsigned lo
 extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource);
-extern int arch_add_memory(int nid, u64 start, u64 size,
-   struct mhp_restrictions *restrictions);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
_



Re: [PATCH v1 bitops] bitops: Fix UBSAN undefined behavior warning for rotation right

2019-04-08 Thread Andrew Morton
(resend, cc Andrey)

On Sun,  7 Apr 2019 12:53:25 + Vadim Pasternak  wrote:

> The warning is caused by a call to rorXX() if the second parameter of
> this function, "shift", is zero. In such a case UBSAN reports a warning
> for the expression (word << (XX - shift)), where XX is
> 64, 32, 16, 8 for ror64, ror32, ror16, ror8 respectively.
> The fix adds validation of this parameter - in case it's equal to zero
> there is no need to rotate; the original "word" is returned to the caller.
> 
> The UBSAN undefined behavior warning has been reported for call to
> ror32():
> [   11.426543] UBSAN: Undefined behaviour in ./include/linux/bitops.h:93:33
> [   11.434045] shift exponent 32 is too large for 32-bit type 'unsigned int'

hm, do we care?

> ...
>

> --- a/include/linux/bitops.h
> +++ b/include/linux/bitops.h
> @@ -70,6 +70,9 @@ static inline __u64 rol64(__u64 word, unsigned int shift)
>   */
>  static inline __u64 ror64(__u64 word, unsigned int shift)
>  {
> + if (!shift)
> + return word;
> +
>   return (word >> shift) | (word << (64 - shift));
>  }

Is there any known architecture or compiler for which UL<<64 doesn't
reliably produce zero?  Is there any prospect that this will become a
problem in the future?
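Even on hardware where a 64-bit shift happens to produce zero, the C
expression remains formally undefined. A common branch-free alternative
(a sketch, not the patch under discussion) masks the shift so both shift
amounts stay within [0, 63]:

static inline __u64 ror64(__u64 word, unsigned int shift)
{
	shift &= 63;
	/* shift == 0 yields (word >> 0) | (word << 0) == word; no
	 * shift amount ever reaches the width of the type. */
	return (word >> shift) | (word << (-shift & 63));
}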




Re: [PATCH 1/2] fsl_hypervisor: dereferencing error pointers in ioctl

2019-04-04 Thread Andrew Morton
On Tue, 18 Dec 2018 11:20:03 +0300 Dan Carpenter  
wrote:

> The strndup_user() function returns error pointers on error, and then
> in the error handling we pass the error pointers to kfree().  It will
> cause an Oops.
> 

Looks good to me.

I guess we should fix this too?


From: Andrew Morton 
Subject: mm/util.c: fix strndup_user() comment

The kerneldoc misdescribes strndup_user()'s return value.

Cc: Dan Carpenter 
Cc: Timur Tabi 
Cc: Mihai Caraman 
Cc: Kumar Gala 
Signed-off-by: Andrew Morton 
---

 mm/util.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/util.c~mm-utilc-fix-strndup_user-comment
+++ a/mm/util.c
@@ -204,7 +204,7 @@ EXPORT_SYMBOL(vmemdup_user);
  * @s: The string to duplicate
  * @n: Maximum number of bytes to copy, including the trailing NUL.
  *
- * Return: newly allocated copy of @s or %NULL in case of error
+ * Return: newly allocated copy of @s or an ERR_PTR() in case of error
  */
 char *strndup_user(const char __user *s, long n)
 {
_
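A minimal usage sketch matching the corrected kerneldoc (the helper and
its names are hypothetical): callers must test with IS_ERR() rather than
for NULL, since passing the error pointer to kfree(), as the
fsl_hypervisor code did, oopses.

static long handle_user_string(const char __user *ubuf, long max)
{
	char *s = strndup_user(ubuf, max);

	if (IS_ERR(s))
		return PTR_ERR(s);	/* never kfree() an ERR_PTR */

	/* ... use s ... */

	kfree(s);
	return 0;
}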



Re: [PATCH v2] writeback: use exact memcg dirty counts

2019-04-02 Thread Andrew Morton
On Mon, 1 Apr 2019 14:20:44 -0400 Johannes Weiner  wrote:

> On Fri, Mar 29, 2019 at 10:46:09AM -0700, Greg Thelen wrote:
> > @@ -3907,10 +3923,10 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
> > struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
> > struct mem_cgroup *parent;
> >  
> > -   *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
> > +   *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
> >  
> > /* this should eventually include NR_UNSTABLE_NFS */
> > -   *pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
> > +   *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
> > *pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
> >  (1 << LRU_ACTIVE_FILE));
> 
> Andrew,
> 
> just a head-up: -mm has that LRU stat cleanup series queued ("mm:
> memcontrol: clean up the LRU counts tracking") that changes the
> mem_cgroup_nr_lru_pages() call here to two memcg_page_state().
> 
> I'm assuming Greg's fix here will get merged before the cleanup. When
> it gets picked up, it'll conflict with "mm: memcontrol: push down
> mem_cgroup_nr_lru_pages()".
> 
> "mm: memcontrol: push down mem_cgroup_nr_lru_pages()" will need to be
> changed to use memcg_exact_page_state() calls instead of the plain
> memcg_page_state() for *pfilepages.
> 

Thanks.  Like this?

void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 unsigned long *pheadroom, unsigned long *pdirty,
 unsigned long *pwriteback)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;

*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);

/* this should eventually include NR_UNSTABLE_NFS */
*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
*pheadroom = PAGE_COUNTER_MAX;

while ((parent = parent_mem_cgroup(memcg))) {
unsigned long ceiling = min(memcg->memory.max, memcg->high);
unsigned long used = page_counter_read(&memcg->memory);

*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
memcg = parent;
}
}



Re: [PATCH] mm/hugetlb: Get rid of NODEMASK_ALLOC

2019-04-02 Thread Andrew Morton
On Tue,  2 Apr 2019 15:34:15 +0200 Oscar Salvador  wrote:

> NODEMASK_ALLOC is used to allocate a nodemask bitmap, and it does it by
> first determining whether it should be allocated on the stack or dynamically,
> depending on NODES_SHIFT.
> Right now, it goes the dynamic path whenever the nodemask_t is above 32
> bytes.
> 
> Although we could bump it to a reasonable value, the largest a nodemask_t
> can get is 128 bytes, so since __nr_hugepages_store_common is called from
> a rather short stack we can just get rid of the NODEMASK_ALLOC call here.
> 
> This reduces some code churn and complexity.

It took a bit of sleuthing to figure out that this patch applies to
Mike's "hugetlbfs: fix potential over/underflow setting node specific
nr_hugepages".  Should they be folded together?  I'm thinking not.

(Also, should "hugetlbfs: fix potential over/underflow setting node
specific nr_hugepages" have been -stableified?  I also think not, but I
bet it happens anyway).



Re: [RESEND PATCH 0/3] improve vmap allocation

2019-04-02 Thread Andrew Morton
On Tue,  2 Apr 2019 18:25:28 +0200 "Uladzislau Rezki (Sony)"  
wrote:

> Changes in v3
> -------------
> - simplify the __get_va_next_sibling() and __find_va_links() functions;
> - remove "unlikely". Place the WARN_ON_ONCE directly to the "if" condition;
> - replace inline to __always_inline;
> - move the debug code to separate patches;

Does v3 address the report from the kernel test robot, Subject
"[mm/vmalloc.c] 7ae76449bd: kernel_BUG_at_lib/list_debug.c"?

For some reason I cannot find that email in my linux-kernel folder and
nor does google find it and
http://lkml.kernel.org/r/20190401132655.GD9217@shao2-debian doesn't
work.  Message-ID 20190401132655.GD9217@shao2-debian doesn't seem to
have got through the list server.  Ditto linux-mm.



Re: [PATCH 1/1] slob: Only use list functions when safe to do so

2019-04-02 Thread Andrew Morton
On Wed, 3 Apr 2019 06:05:38 +1100 "Tobin C. Harding"  wrote:

> > It's regrettable that this fixes
> > slob-respect-list_head-abstraction-layer.patch but doesn't apply to
> > that patch - slob-use-slab_list-instead-of-lru.patch gets in the way. 
> > So we end up with a patch series which introduces a bug and later
> > fixes it.
> 
> Yes I thought that also.  Do you rebase the mm tree?  Did you apply this
> right after slob-use-slab_list-instead-of-lru or to the current tip?

After slob-use-slab_list-instead-of-lru.patch

>  If
> it is applied to the tip does this effect the ability to later bisect in
> between these two commits (if the need arises for some unrelated reason)?

There is a bisection hole but it is short and the bug is hardish to
hit.

> > I guess we can live with that but if the need comes to respin this
> > series, please do simply fix
> > slob-respect-list_head-abstraction-layer.patch so we get a clean
> > series.
> 
> If its not too much work for you to apply the new series I'll do another
> version just to get this right.

I guess that would be best, thanks.


Re: [PATCH] lib: Fix possible incorrect result from rational fractions helper

2019-04-01 Thread Andrew Morton
On Sat, 30 Mar 2019 13:58:55 -0700 Trent Piepho  wrote:

> In some cases the previous algorithm would not return the closest
> approximation.  This would happen when a semi-convergent was the
> closest, as the previous algorithm would only consider convergents.
> 
> As an example, consider an initial value of 5/4, and trying to find the
> closest approximation with a maximum of 4 for numerator and denominator.
> The previous algorithm would return 1/1 as the closest approximation,
> while this version will return the correct answer of 4/3.
> 
> To do this, the main loop performs effectively the same operations as it
> did before.  It must now keep track of the last three approximations,
> n2/d2 .. n0/d0, while before it only needed the last two.
> 
> If an exact answer is not found, the algorithm will now calculate the
> best semi-convergent term, t, which is a single expression with two
> divisions:
> min((max_numerator - n0) / n1, (max_denominator - d0) / d1)
> 
> This will be used if it is better than previous convergent.  The test
> for this is generally a simple comparison, 2*t > a.  But in an edge
> case, where the convergent's final term is even and the best allowable
> semi-convergent has a final term of exactly half the convergent's final
> term, the more complex comparison (d0*dp > d1*d) is used.
> 
> I also wrote some comments explaining the code.  While one still needs
> to look up the math elsewhere, they should help a lot to follow how the
> code relates to that math.

What are the userspace-visible runtime effects of this change?
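The changelog's example is easy to sanity-check by brute force (a
standalone program, not the kernel helper): with numerator and
denominator both capped at 4, the closest fraction to 5/4 is indeed the
semi-convergent 4/3, not the convergent 1/1.

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double target = 5.0 / 4.0;
	double best_err = 1e9;
	int best_n = 0, best_d = 0;

	for (int d = 1; d <= 4; d++)
		for (int n = 1; n <= 4; n++) {
			double err = fabs((double)n / d - target);

			if (err < best_err) {
				best_err = err;
				best_n = n;
				best_d = d;
			}
		}
	printf("%d/%d\n", best_n, best_d);	/* prints 4/3 */
	return 0;
}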


Re: [PATCH 1/1] slob: Only use list functions when safe to do so

2019-04-01 Thread Andrew Morton
On Tue,  2 Apr 2019 14:29:57 +1100 "Tobin C. Harding"  wrote:

> Currently we call (indirectly) list_del() then we manually try to combat
> the fact that the list may be in an undefined state by getting 'prev'
> and 'next' pointers in a somewhat contrived manner.  It is hard to
> verify that this works for all initial states of the list.  Clearly the
> author (me) got it wrong the first time because the 0day kernel testing
> robot managed to crash the kernel thanks to this code.
> 
> All this is done in order to do an optimisation aimed at preventing
> fragmentation at the start of a slab.  We can just skip this
> optimisation any time the list is put into an undefined state since this
> only occurs when an allocation completely fills the slab and in this
> case the optimisation is unnecessary since we have not fragmented the slab
> by this allocation.
> 
> Change the page pointer passed to slob_alloc_page() to be a double
> pointer so that we can set it to NULL to indicate that the page was
> removed from the list.  Skip the optimisation if the page was removed.
> 
> Found thanks to the kernel test robot, email subject:
> 
>   340d3d6178 ("mm/slob.c: respect list_head abstraction layer"):  kernel BUG at lib/list_debug.c:31!
> 

It's regrettable that this fixes
slob-respect-list_head-abstraction-layer.patch but doesn't apply to
that patch - slob-use-slab_list-instead-of-lru.patch gets in the way. 
So we end up with a patch series which introduces a bug and later
fixes it.

I guess we can live with that but if the need comes to respin this
series, please do simply fix
slob-respect-list_head-abstraction-layer.patch so we get a clean
series.



Re: [RFC PATCH v2 0/1] improve vmap allocation

2019-04-01 Thread Andrew Morton
On Mon, 1 Apr 2019 13:03:47 +0200 Uladzislau Rezki  wrote:

> Hello, Andrew.
> 
> >
> > It's a lot of new code.  It looks decent and I'll toss it in there for
> > further testing.  Hopefully someone will be able to find the time for a
> > detailed review.
> > 
> I have got some proposals and comments about simplifying the code a bit.
> So i am about to upload the v3 for further review. I see that your commit
> message includes whole details and explanation:
> 
> http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=b6ac5ca4c512094f217a8140edeea17e6621b2ad
> 
> Should i base on and keep it in v3?
> 

It's probably best to do a clean resend at this stage.  I'll take a
look at the deltas and updating changelogs, etc.


Re: [RESEND PATCH v2 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-28 Thread Andrew Morton
On Tue, 19 Mar 2019 08:16:00 + George Spelvin  wrote:

> Rather than a fixed-size array of pending sorted runs, use the ->prev
> links to keep track of things.  This reduces stack usage, eliminates
> some ugly overflow handling, and reduces the code size.
> 
> Also:
> * merge() no longer needs to handle NULL inputs, so simplify.
> * The same applies to merge_and_restore_back_links(), which is renamed
>   to the less ponderous merge_final().  (It's a static helper function,
>   so we don't need a super-descriptive name; comments will do.)
> * Document the actual return value requirements on the (*cmp)()
>   function; some callers are already using this feature.
> 
> x86-64 code size 1086 -> 739 bytes (-347)
> 
> (Yes, I see checkpatch complaining about no space after comma in
> "__attribute__((nonnull(2,3,4,5)))".  Checkpatch is wrong.)

x86_64 allnoconfig with gcc-7.3.0:

lib/list_sort.c:17:36: warning: __pure__ attribute ignored [-Wattributes]
   struct list_head const *, struct list_head const *);



Re: [PATCH 2/6] bitmap_parselist: move non-parser logic to helpers

2019-03-25 Thread Andrew Morton
On Tue, 26 Mar 2019 00:07:44 +0300 Yury Norov  wrote:

> --- a/lib/bitmap.c
> +++ b/lib/bitmap.c
> @@ -477,6 +477,42 @@ int bitmap_print_to_pagebuf(bool list, char *buf, const unsigned long *maskp,
>  }
>  EXPORT_SYMBOL(bitmap_print_to_pagebuf);
>  
> +/*
> + * Region 9-38:4/10 describes the following bitmap structure:
> + * 0        9  12    18                 38
> + *          ^  ^     ^                   ^
> + * .........****......****......****......
> + *      start  off   grlen             end
> + */
> +struct region {
> + unsigned int start;
> + unsigned int off;
> + unsigned int grlen;
> + unsigned int end;
> +};

grlen means...  group_len, I think?

Perhaps it should be called group_len ;)
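For the record, a worked expansion of the region in the comment, with
the semantics inferred from the diagram (illustrative only):

/*
 * "9-38:4/10": starting at bit 9, set the first 4 bits of every
 * 10-bit group, without running past bit 38:
 *
 *	group at  9: bits  9-12 set
 *	group at 19: bits 19-22 set
 *	group at 29: bits 29-32 set
 *	(a group at 39 would start beyond end = 38, so it is dropped)
 */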



Re: [PATCH REBASED] mm, memcg: Make scan aggression always exclude protection

2019-03-22 Thread Andrew Morton
On Fri, 22 Mar 2019 16:03:07 + Chris Down  wrote:

> This patch is an incremental improvement on the existing
> memory.{low,min} relative reclaim work to base its scan pressure
> calculations on how much protection is available compared to the current
> usage, rather than how much the current usage is over some protection
> threshold.
> 
> Previously the way that memory.low protection works is that if you are
> 50% over a certain baseline, you get 50% of your normal scan pressure.
> This is certainly better than the previous cliff-edge behaviour, but it
> can be improved even further by always considering memory under the
> currently enforced protection threshold to be out of bounds. This means
> that we can set relatively low memory.low thresholds for variable or
> bursty workloads while still getting a reasonable level of protection,
> whereas with the previous version we may still trivially hit the 100%
> clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
> one is more concretely based on the currently enforced protection
> threshold, which is likely easier to reason about.
> 
> There is also a subtle issue with the way that proportional reclaim
> worked previously -- it promotes having no memory.low, since it makes
> pressure higher during low reclaim. This happens because we base our
> scan pressure modulation on how far memory.current is between memory.min
> and memory.low, but if memory.low is unset, we only use the overage
> method. In most cromulent configurations, this then means that we end up
> with *more* pressure than with no memory.low at all when we're in low
> reclaim, which is not really very usable or expected.
> 
> With this patch, memory.low and memory.min affect reclaim pressure in a
> more understandable and composable way. For example, from a user
> standpoint, "protected" memory now remains untouchable from a reclaim
> aggression standpoint, and users can also have more confidence that
> bursty workloads will still receive some amount of guaranteed
> protection.

Could you please provide more description of the effect this has upon
userspace?  Preferably in real-world cases.  What problems were being
observed and how does this improve things?



Re: [PATCH 3/6] mm: memcontrol: replace node summing with memcg_page_state()

2019-03-21 Thread Andrew Morton
On Thu, 28 Feb 2019 11:30:17 -0500 Johannes Weiner  wrote:

> Instead of adding up the node counters, use memcg_page_state() to get
> the memcg state directly. This is a bit cheaper and more stream-lined.
> 
> ...
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -746,10 +746,13 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
>   unsigned int lru_mask)
>  {
>   unsigned long nr = 0;
> - int nid;
> + enum lru_list lru;
>  
> - for_each_node_state(nid, N_MEMORY)
> - nr += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask);
> + for_each_lru(lru) {
> + if (!(BIT(lru) & lru_mask))
> + continue;
> + nr += memcg_page_state(memcg, NR_LRU_BASE + lru);
> + }

Might be able to use for_each_set_bit(&lru_mask) here, but it's much
of a muchness.



Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-21 Thread Andrew Morton
On Thu,  7 Mar 2019 08:56:32 -0800 Greg Thelen  wrote:

> Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
> memory.stat reporting") memcg dirty and writeback counters are managed
> as:
> 1) per-memcg per-cpu values in range of [-32..32]
> 2) per-memcg atomic counter
> When a per-cpu counter cannot fit in [-32..32] it's flushed to the
> atomic.  Stat readers only check the atomic.
> Thus readers such as balance_dirty_pages() may see a nontrivial error
> margin: 32 pages per cpu.
> Assuming 100 cpus:
>4k x86 page_size:  13 MiB error per memcg
>   64k ppc page_size: 200 MiB error per memcg
> Considering that dirty+writeback are used together for some decisions
> the errors double.
> 
> This inaccuracy can lead to undeserved oom kills.  One nasty case is
> when all per-cpu counters hold positive values offsetting an atomic
> negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
> balance_dirty_pages() only consults the atomic and does not consider
> throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
> 13..200 MiB range then there's absolutely no dirty throttling, which
> burdens vmscan with only dirty+writeback pages thus resorting to oom
> kill.
> 
> It could be argued that tiny containers are not supported, but it's more
> subtle.  It's the amount the space available for file lru that matters.
> If a container has memory.max-200MiB of non reclaimable memory, then it
> will also suffer such oom kills on a 100 cpu machine.
> 
> ...
> 
> Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
> collect exact per memcg counters when a memcg is close to the
> throttling/writeback threshold.  This avoids the aforementioned oom
> kills.
> 
> This does not affect the overhead of memory.stat, which still reads the
> single atomic counter.
> 
> Why not use percpu_counter?  memcg already handles cpus going offline,
> so no need for that overhead from percpu_counter.  And the
> percpu_counter spinlocks are more heavyweight than is required.
> 
> It probably also makes sense to include exact dirty and writeback
> counters in memcg oom reports.  But that is saved for later.

Nice changelog, thanks.

> Signed-off-by: Greg Thelen 

Did you consider cc:stable for this?  We may as well - the stablebots
backport everything which might look slightly like a fix anyway :(

> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
>   return x;
>  }
>  
> +/* idx can be of type enum memcg_stat_item or node_stat_item */
> +static inline unsigned long
> +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
> +{
> + long x = atomic_long_read(&memcg->stat[idx]);
> +#ifdef CONFIG_SMP
> + int cpu;
> +
> + for_each_online_cpu(cpu)
> + x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
> + if (x < 0)
> + x = 0;
> +#endif
> + return x;
> +}

This looks awfully heavyweight for an inline function.  Why not make it
a regular function and avoid the bloat and i-cache consumption?

Also, did you instead consider making this spill the percpu counters
into memcg->stat[idx]?  That might be more useful for potential future
callers.  It would become a little more expensive though.
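A sketch of the out-of-line variant suggested above: the identical loop
moved from the header into mm/memcontrol.c, so call sites pay a function
call rather than each inlining the per-cpu walk.

/* mm/memcontrol.c */
unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
{
	long x = atomic_long_read(&memcg->stat[idx]);
#ifdef CONFIG_SMP
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
	if (x < 0)
		x = 0;
#endif
	return x;
}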


Re: [PATCH] mm: page_mkclean vs MADV_DONTNEED race

2019-03-21 Thread Andrew Morton
On Thu, 21 Mar 2019 09:36:10 +0530 "Aneesh Kumar K.V" 
 wrote:

> MADV_DONTNEED is handled with mmap_sem taken in read mode.
> We call page_mkclean without holding mmap_sem.
> 
> MADV_DONTNEED implies that pages in the region are unmapped and subsequent
> access to the pages in that range is handled as a new page fault.
> This implies that if we don't have parallel access to the region when
> MADV_DONTNEED is run, we expect those ranges to be unallocated.
> 
> w.r.t page_mkclean we need to make sure that we don't break the MADV_DONTNEED
> semantics. MADV_DONTNEED check for pmd_none without holding pmd_lock.
> This implies we skip the pmd if we temporarily mark pmd none. Avoid doing
> that while marking the page clean.
> 
> Keep the sequence the same for dax too, even though we don't support
> MADV_DONTNEED for dax mappings.

What were the runtime effects of the bug?

Did you consider a -stable backport?


Re: [RFC PATCH v2 0/1] improve vmap allocation

2019-03-21 Thread Andrew Morton
On Thu, 21 Mar 2019 20:03:26 +0100 "Uladzislau Rezki (Sony)"  
wrote:

> Hello.
> 
> This is the v2 of the https://lkml.org/lkml/2018/10/19/786 rework. Instead of
> referring you to that link, i will go through it again describing the improved
> allocation method and provide changes between v1 and v2 in the end.
> 
> ...
>

> Performance analysis
> 

Impressive numbers.  But this is presumably a worst-case microbenchmark.

Are you able to describe the benefits which are observed in some
real-world workload which someone cares about?

It's a lot of new code.  It looks decent and I'll toss it in there for
further testing.  Hopefully someone will be able to find the time for a
detailed review.

Trivial point: the code uses "inline" a lot.  Nowadays gcc cheerfully
ignores that and does its own thing.  You might want to look at the
effects of simply deleting all that.  Is the generated code better or
worse or the same?  If something really needs to be inlined then use
__always_inline, preferably with a comment explaining why it is there.



Re: [PATCH v4] lib/string.c: implement a basic bcmp

2019-03-21 Thread Andrew Morton
On Thu, 21 Mar 2019 10:02:37 -0700 Nick Desaulniers  
wrote:

> Shall I send you a cleanup removing the undefs for bcmp, memcmp,
> strcat, strcpy, and strcmp?  Of those, I only see memcmp being
> `#defined` in arch/m68k/include/asm/string.h, arch/x86/boot/string.h,
> and arch/x86/include/asm/string_32.h.
> 
> Further, I can drop some of the __GNUC__ < 4 code in
> arch/x86/include/asm/string_32.h. (grepping for __GNUC__, looks like
> there's a fair amount of code that can be cleaned up).  We should
> probably check it since Clang lies about being GCC 4.2 compatible,
> which will surely break in horrific ways at some point.

All sounds good.  Some time, when convenient, thanks.


Re: [PATCH v4] lib/string.c: implement a basic bcmp

2019-03-20 Thread Andrew Morton
On Wed, 13 Mar 2019 14:13:31 -0700 Nick Desaulniers  
wrote:

> A recent optimization in Clang (r355672) lowers comparisons of the
> return value of memcmp against zero to comparisons of the return value
> of bcmp against zero.  This helps some platforms that implement bcmp
> more efficiently than memcmp. glibc simply aliases bcmp to memcmp, but
> an optimized implementation is in the works.
> 
> This results in linkage failures for all targets with Clang due to the
> undefined symbol.  For now, just implement bcmp as a tailcall to memcmp
> to unbreak the build.  This routine can be further optimized in the
> future.
> 
> Other ideas discussed:
> * A weak alias was discussed, but breaks for architectures that define
> their own implementations of memcmp since aliases to declarations are
> not permitted (only definitions).  Arch-specific memcmp implementations
> typically declare memcmp in C headers, but implement them in assembly.
> * -ffreestanding also is used sporadically throughout the kernel.
> * -fno-builtin-bcmp doesn't work when doing LTO.

I guess we should backport this into -stable so that older kernels can
be built with newer Clang.

> ...
>
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -866,6 +866,26 @@ __visible int memcmp(const void *cs, const void *ct, size_t count)
>  EXPORT_SYMBOL(memcmp);
>  #endif
>  
> +#ifndef __HAVE_ARCH_BCMP
> +/**
> + * bcmp - returns 0 if and only if the buffers have identical contents.
> + * @a: pointer to first buffer.
> + * @b: pointer to second buffer.
> + * @len: size of buffers.
> + *
> + * The sign or magnitude of a non-zero return value has no particular
> + * meaning, and architectures may implement their own more efficient bcmp(). So
> + * while this particular implementation is a simple (tail) call to memcmp, do
> + * not rely on anything but whether the return value is zero or non-zero.
> + */
> +#undef bcmp

What is the undef for?

> +int bcmp(const void *a, const void *b, size_t len)
> +{
> + return memcmp(a, b, len);
> +}
> +EXPORT_SYMBOL(bcmp);
> +#endif
> +
>  #ifndef __HAVE_ARCH_MEMSCAN
>  /**
>   * memscan - Find a character in an area of memory.
> -- 
> 2.21.0.360.g471c308f928-goog


Re: [PATCH] mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified

2019-03-20 Thread Andrew Morton
On Wed, 20 Mar 2019 11:23:03 +0530 Souptick Joarder  
wrote:

> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -447,6 +447,13 @@ static inline bool queue_pages_required(struct page *page,
> > return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
> >  }
> >
> > +/*
> > + * The queue_pages_pmd() may have three kind of return value.
> > + * 1 - pages are placed on he right node or queued successfully.
> 
> Minor typo -> s/he/the ?

Yes, that comment needs some help.  This?

--- a/mm/mempolicy.c~mm-mempolicy-make-mbind-return-eio-when-mpol_mf_strict-is-specified-fix
+++ a/mm/mempolicy.c
@@ -429,9 +429,9 @@ static inline bool queue_pages_required(
 }
 
 /*
- * The queue_pages_pmd() may have three kind of return value.
- * 1 - pages are placed on he right node or queued successfully.
- * 0 - THP get split.
+ * queue_pages_pmd() has three possible return values:
+ * 1 - pages are placed on the right node or queued successfully.
+ * 0 - THP was split.
  * -EIO - is migration entry or MPOL_MF_STRICT was specified and an existing
 *        page was already on a node that does not follow the policy.
  */
_



Re: [RESEND#2 PATCH] mm/compaction: fix an undefined behaviour

2019-03-20 Thread Andrew Morton
On Wed, 20 Mar 2019 16:33:38 -0400 Qian Cai  wrote:

> In a low-memory situation, cc->fast_search_fail can keep increasing as
> it is unable to find an available page to isolate in
> fast_isolate_freepages(). As a result, it could trigger the error
> below, so just compare against the maximum number of bits that can be
> shifted first.
> 
> UBSAN: Undefined behaviour in mm/compaction.c:1160:30
> shift exponent 64 is too large for 64-bit type 'unsigned long'
> CPU: 131 PID: 1308 Comm: kcompactd1 Kdump: loaded Tainted: G W L 5.0.0+ #17
> Call trace:
>  dump_backtrace+0x0/0x450
>  show_stack+0x20/0x2c
>  dump_stack+0xc8/0x14c
>  __ubsan_handle_shift_out_of_bounds+0x7e8/0x8c4
>  compaction_alloc+0x2344/0x2484
>  unmap_and_move+0xdc/0x1dbc
>  migrate_pages+0x274/0x1310
>  compact_zone+0x26ec/0x43bc
>  kcompactd+0x15b8/0x1a24
>  kthread+0x374/0x390
>  ret_from_fork+0x10/0x18
> 
> Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a 
> migration source")
> Acked-by: Vlastimil Babka 
> Acked-by: Mel Gorman 
> Signed-off-by: Qian Cai 
> ---
>  mm/compaction.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e1a08fc92353..0d1156578114 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1157,7 +1157,9 @@ static bool suitable_migration_target(struct compact_control *cc,
>  static inline unsigned int
>  freelist_scan_limit(struct compact_control *cc)
>  {
> - return (COMPACT_CLUSTER_MAX >> cc->fast_search_fail) + 1;
> + return (COMPACT_CLUSTER_MAX >>
> + min((unsigned short)(BITS_PER_LONG - 1), cc->fast_search_fail))
> + + 1;
>  }

That's rather an eyesore.  How about

static inline unsigned int
freelist_scan_limit(struct compact_control *cc)
{
unsigned short shift = BITS_PER_LONG - 1;

return (COMPACT_CLUSTER_MAX >> min(shift, cc->fast_search_fail)) + 1;
}



Re: [PATCH] mm, memory_hotplug: Fix the wrong usage of N_HIGH_MEMORY

2019-03-20 Thread Andrew Morton
On Wed, 20 Mar 2019 16:07:32 +0800 Baoquan He  wrote:

> In function node_states_check_changes_online(), N_HIGH_MEMORY is used
> to substitute ZONE_HIGHMEM directly. This is not right. N_HIGH_MEMORY
> always has the value '3' if CONFIG_HIGHMEM=y, while ZONE_HIGHMEM's value
> is not fixed. It depends on whether CONFIG_ZONE_DMA/CONFIG_ZONE_DMA32 are
> enabled. Obviously it's not true for CONFIG_ZONE_DMA32 on 32bit system,
> and CONFIG_ZONE_DMA is also optional.
> 
> Replace it with ZONE_HIGHMEM.
> 
> Fixes: 8efe33f40f3e ("mm/memory_hotplug.c: simplify 
> node_states_check_changes_online")

What are the runtime effects of this change?

> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -712,7 +712,7 @@ static void node_states_check_changes_online(unsigned long nr_pages,
>   if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY))
>   arg->status_change_nid_normal = nid;
>  #ifdef CONFIG_HIGHMEM
> - if (zone_idx(zone) <= N_HIGH_MEMORY && !node_state(nid, N_HIGH_MEMORY))
> + if (zone_idx(zone) <= ZONE_HIGHMEM && !node_state(nid, N_HIGH_MEMORY))
>   arg->status_change_nid_high = nid;
>  #endif
>  }



Re: [PATCH RESEND] eventfd: prepare id to userspace via fdinfo

2019-03-20 Thread Andrew Morton
On Wed, 20 Mar 2019 18:29:29 +0900 Masatake YAMATO  wrote:

> Finding the endpoints of an IPC channel is one of the essential tasks in
> understanding how a user program works. Procfs and netlink sockets provide
> enough hints to find endpoints for IPC channels like pipes, unix
> sockets, and pseudo terminals. However, there is no simple way to find
> endpoints for an eventfd file from userland. An inode number gives no
> hint: unlike pipes, all eventfd files share the same inode object.
> 
> To provide a way to find the endpoints of an eventfd file, this patch
> adds an "eventfd-id" field to /proc/PID/fdinfo of eventfd as an identifier.
> The address of the eventfd context is used as the id.
> 
> A tool like lsof can utilize the information to print endpoints.
> 
> ...
>
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -297,6 +297,7 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
>   seq_printf(m, "eventfd-count: %16llx\n",
>  (unsigned long long)ctx->count);
>   spin_unlock_irq(&ctx->wqh.lock);
> + seq_printf(m, "eventfd-id: %p\n", ctx);
>  }
>  #endif

Is it a good idea to use a bare kernel address for this?  How does this
interact with printk pointer randomization and hashing?



Re: [RESEND PATCH] ptrace: take into account saved_sigmask in PTRACE_{GET,SET}SIGMASK

2019-03-19 Thread Andrew Morton
On Wed, 20 Mar 2019 01:32:53 +0300 "Dmitry V. Levin"  wrote:

> On Tue, Mar 19, 2019 at 12:19:57PM -0700, Andrei Vagin wrote:
> > There are a few system calls (pselect, ppoll, etc) which replace a task
> > sigmask while they are running in a kernel-space
> > 
> > When a task calls one of these syscalls, the kernel saves a current
> > sigmask in task->saved_sigmask and sets a syscall sigmask.
> > 
> > On syscall-exit-stop, ptrace traps a task before restoring the
> > saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and
> > PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by
> > saved_sigmask, when the task returns to user-space.
> > 
> > This patch fixes this problem.  PTRACE_GET_SIGMASK returns saved_sigmask
> > is it's set.  PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag.
> 
> If it's not too late, could somebody tweak the commit message so that
> PTRACE_GET_SIGMASK becomes PTRACE_GETSIGMASK and "is it's set" is changed
> to "if it's set", please?

I made those changes to my copy.
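For context, the userspace side of the mechanism being fixed (a sketch):
pselect() installs the caller's mask only for the duration of the call,
and the kernel parks the old mask in task->saved_sigmask to restore on
return to userspace, which is the window where PTRACE_GETSIGMASK used to
report the wrong set.

#include <signal.h>
#include <sys/select.h>

static int wait_readable(int fd)
{
	fd_set rfds;
	sigset_t mask;

	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	sigemptyset(&mask);	/* temporarily unblock all signals */

	/* The old mask is saved in-kernel and restored on return. */
	return pselect(fd + 1, &rfds, NULL, NULL, NULL, &mask);
}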


Re: [RESEND PATCH 0/7] Add FOLL_LONGTERM to GUP fast and use it

2019-03-19 Thread Andrew Morton
On Sun, 17 Mar 2019 11:34:31 -0700 ira.we...@intel.com wrote:

> Resending after rebasing to the latest mm tree.
> 
> HFI1, qib, and mthca use get_user_pages_fast() due to its performance
> advantages.  These pages can be held for a significant time.  But
> get_user_pages_fast() does not protect against mapping FS DAX pages.
> 
> Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
> retains the performance while also adding the FS DAX checks.  XDP has also
> shown interest in using this functionality.[1]
> 
> In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and
> remove the specialized get_user_pages_longterm call.

It would be helpful to include your response to Christoph's question
(http://lkml.kernel.org/r/20190220180255.ga12...@iweiny-desk2.sc.intel.com)
in the changelog.  Because if one person was wondering about this,
others will likely do so.

We have no record of acks or reviewed-by's.  At least one was missed
(http://lkml.kernel.org/r/caog9msttcd-9bcsdfc0wryqfvrnb4twozl0c4+6qxi-n_y4...@mail.gmail.com),
but that is very very partial.

This patchset is fairly DAX-centered, but Dan wasn't cc'ed!

So ho hum.  I'll scoop them up and shall make the above changes to the
[1/n] changelog, but we still have some work to do.



Re: [PATCH v2 1/1] userfaultfd/sysctl: add vm.unprivileged_userfaultfd

2019-03-19 Thread Andrew Morton
On Tue, 19 Mar 2019 11:07:22 +0800 Peter Xu  wrote:

> Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> whether userfaultfd is allowed by unprivileged users.  When this is
> set to zero, only privileged users (root user, or users with the
> CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> syscalls.
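
If this lands as described, administration would presumably be along
these lines (the /proc path is assumed from the knob name above):

	echo 0 > /proc/sys/vm/unprivileged_userfaultfd	# root/CAP_SYS_PTRACE only
	echo 1 > /proc/sys/vm/unprivileged_userfaultfd	# anyone (current behaviour)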

Please send along a full description of why you believe Linux needs
this feature, for me to add to the changelog.  What is the benefit to
our users?  How will it be used?

etcetera.  As it was presented I'm seeing no justification for adding
the patch!


Re: [PATCH 00/10] HMM updates for 5.1

2019-03-19 Thread Andrew Morton
On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse  wrote:

> > So I think I'll throw up my hands, drop them all and shall await
> > developments :(
> 
> What more do you want to see? I can repost with the acks already given
> and the improved commit wording on some of the patches. But from a user
> point of view nouveau is already upstream, ODP RDMA depends on this
> patchset and is posted and I have given a link to it. amdgpu is queued
> up. What more do I need?

I guess I can ignore linux-next for a few days.  

Yes, a resend against mainline with those various updates will be
helpful.  Please go through the various fixes which we had as well:


mm-hmm-use-reference-counting-for-hmm-struct.patch
mm-hmm-do-not-erase-snapshot-when-a-range-is-invalidated.patch
mm-hmm-improve-and-rename-hmm_vma_get_pfns-to-hmm_range_snapshot.patch
mm-hmm-improve-and-rename-hmm_vma_fault-to-hmm_range_fault.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix-fix.patch
mm-hmm-add-default-fault-flags-to-avoid-the-need-to-pre-fill-pfns-arrays.patch
mm-hmm-add-an-helper-function-that-fault-pages-and-map-them-to-a-device.patch
mm-hmm-support-hugetlbfs-snap-shoting-faulting-and-dma-mapping.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix-2.patch
mm-hmm-add-helpers-for-driver-to-safely-take-the-mmap_sem.patch

Also, the discussion regarding [07/10] is substantial and is ongoing, so
please let's push along with that.

What is the review/discussion status of "[PATCH 09/10] mm/hmm: allow to
mirror vma of a file on a DAX backed filesystem"?



Re: [PATCH 00/10] HMM updates for 5.1

2019-03-19 Thread Andrew Morton
On Mon, 18 Mar 2019 13:04:04 -0400 Jerome Glisse  wrote:

> On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse  wrote:
> > 
> > > Andrew you will not be pushing this patchset in 5.1 ?
> > 
> > I'd like to.  It sounds like we're converging on a plan.
> > 
> > It would be good to hear more from the driver developers who will be
> > consuming these new features - links to patchsets, review feedback,
> > etc.  Which individuals should we be asking?  Felix, Christian and
> > Jason, perhaps?
> > 
> 
> So i am guessing you will not send this to Linus ?

I was waiting to see how the discussion proceeds.  Was also expecting
various changelog updates (at least) - more acks from driver
developers, additional pointers to client driver patchsets, description
of their readiness, etc.

Today I discover that Alex has cherrypicked "mm/hmm: use reference
counting for HMM struct" into a tree which is fed into linux-next which
rather messes things up from my end and makes it hard to feed a
(possibly modified version of) that into Linus.

So I think I'll throw up my hands, drop them all and shall await
developments :(



Re: [PATCH 00/10] HMM updates for 5.1

2019-03-13 Thread Andrew Morton
On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse  wrote:

> Andrew you will not be pushing this patchset in 5.1 ?

I'd like to.  It sounds like we're converging on a plan.

It would be good to hear more from the driver developers who will be
consuming these new features - links to patchsets, review feedback,
etc.  Which individuals should we be asking?  Felix, Christian and
Jason, perhaps?



Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem

2019-03-13 Thread Andrew Morton
On Tue, 12 Mar 2019 20:10:19 -0400 Jerome Glisse  wrote:

> > You're correct.  We chose to go this way because the HMM code is so
> > large and all-over-the-place that developing it in a standalone tree
> > seemed impractical - better to feed it into mainline piecewise.
> > 
> > This decision very much assumed that HMM users would definitely be
> > merged, and that it would happen soon.  I was skeptical for a long time
> > and was eventually persuaded by quite a few conversations with various
> > architecture and driver maintainers indicating that these HMM users
> > would be forthcoming.
> > 
> > In retrospect, the arrival of HMM clients took quite a lot longer than
> > was anticipated and I'm not sure that all of the anticipated usage
> > sites will actually be using it.  I wish I'd kept records of
> > who-said-what, but I didn't and the info is now all rather dissipated.
> > 
> > So the plan didn't really work out as hoped.  Lesson learned, I would
> > now very much prefer that new HMM feature work's changelogs include
> > links to the driver patchsets which will be using those features and
> > acks and review input from the developers of those driver patchsets.
> 
> This is what i am doing now and this patchset falls into that. I did
> post the ODP and nouveau bits to use the 2 new functions (dma map and
> unmap). I expect to merge both ODP and nouveau bits for that during
> the next merge window.
> 
> Also with 5.1 everything that is upstream is use by nouveau at least.
> They are posted patches to use HMM for AMD, Intel, Radeon, ODP, PPC.
> Some are going through several revisions so i do not know exactly when
> each will make it upstream but i keep working on all this.
> 
> So the guidelines we agree on:
> - no new infrastructure without a user
> - the device driver maintainer for which new infrastructure is done
>   must either sign off or review it, or explicitly say that they want
>   the feature. I do not expect all driver maintainers will have
>   the bandwidth to do a proper review of the mm part of the infra-
>   structure and it would not be fair to ask that of them. They
>   can still provide feedback on the API exposed to the device
>   driver.

The patchset in -mm ("HMM updates for 5.1") has review from Ralph
Campbell @ nvidia.  Are there any other maintainers who we should have
feedback from?

> - driver bits must be posted at the same time as the new infra-
>   structure even if they target the next release cycle to avoid
>   inter-tree dependency
> - driver bits must be merged as soon as possible

Are there links to driver patchsets which we can add to the changelogs?

> Things we do not agree on:
> - If the driver bits miss the +1 target for any reason, directly
>   revert the new infrastructure. I think it should not be black
>   and white, and the reasons why the driver bits missed the merge
>   window should be taken into account. If the feature is still
>   wanted and the driver bits missed the window for simple reasons,
>   then it means that we push everything back by 2 releases, i.e. the
>   revert is done in +1, then we re-post the infrastructure in
>   +2 and finally re-push the driver bits in +3, so we lose 1 cycle.
>   Hence I would rather that the revert only happen if
>   it is clear that the infrastructure is not ready or cannot
>   be used in a timely fashion (over a couple of kernel releases) by
>   any drivers.

I agree that this should be more a philosophy than a set of hard rules.



Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem

2019-03-12 Thread Andrew Morton
On Tue, 12 Mar 2019 12:30:52 -0700 Dan Williams  
wrote:

> On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse  wrote:
> > On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> > > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse  wrote:
> [..]
> > > > Spirit of the rule is better than blind application of rule.
> > >
> > > Again, I fail to see why HMM is suddenly unable to make forward
> > > progress when the infrastructure that came before it was merged with
> > > consumers in the same development cycle.
> > >
> > > A gate to upstream merge is about the only lever a reviewer has to
> > > push for change, and these requests to uncouple the consumer only
> > > serve to weaken that review tool in my mind.
> >
> > Well let just agree to disagree and leave it at that and stop
> > wasting each other time
> 
> I'm fine to continue this discussion if you are. Please be specific
> about where we disagree and what aspect of the proposed rules about
> merge staging are either acceptable, painful-but-doable, or
> show-stoppers. Do you agree that HMM is doing something novel with
> merge staging, am I off base there?

You're correct.  We chose to go this way because the HMM code is so
large and all-over-the-place that developing it in a standalone tree
seemed impractical - better to feed it into mainline piecewise.

This decision very much assumed that HMM users would definitely be
merged, and that it would happen soon.  I was skeptical for a long time
and was eventually persuaded by quite a few conversations with various
architecture and driver maintainers indicating that these HMM users
would be forthcoming.

In retrospect, the arrival of HMM clients took quite a lot longer than
was anticipated and I'm not sure that all of the anticipated usage
sites will actually be using it.  I wish I'd kept records of
who-said-what, but I didn't and the info is now all rather dissipated.

So the plan didn't really work out as hoped.  Lesson learned, I would
now very much prefer that new HMM feature work's changelogs include
links to the driver patchsets which will be using those features and
acks and review input from the developers of those driver patchsets.



Re: [PATCH] mm: remove unused variable

2019-03-12 Thread Andrew Morton
On Tue, 12 Mar 2019 15:03:52 +0100 Bartosz Golaszewski 
 wrote:

> Tue, 12 Mar 2019 at 14:59 Khalid Aziz  wrote:
> >
> > On 3/12/19 7:28 AM, Bartosz Golaszewski wrote:
> > > From: Bartosz Golaszewski 
> > >
> > > The mm variable is set but unused. Remove it.
> >
> > It is used. Look further down for calls to set_pte_at().
> >
> > --
> > Khalid
> >
> > >
> > > Signed-off-by: Bartosz Golaszewski 
> > > ---
> > >  mm/mprotect.c | 1 -
> > >  1 file changed, 1 deletion(-)
> > >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 028c724dcb1a..130dac3ad04f 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -39,7 +39,6 @@ static unsigned long change_pte_range(struct 
> > > vm_area_struct *vma, pmd_t *pmd,
> > >   unsigned long addr, unsigned long end, pgprot_t newprot,
> > >   int dirty_accountable, int prot_numa)
> > >  {
> > > - struct mm_struct *mm = vma->vm_mm;
> > >   pte_t *pte, oldpte;
> > >   spinlock_t *ptl;
> > >   unsigned long pages = 0;
> > >
> >
> >
> 
> Oops, I blindly assumed the compiler is right, sorry for that. GCC
> complains it's unused when building usermode linux. I guess it's a
> matter of how set_pte_at() is defined for ARCH=um. I'll take a second
> look.
> 

The problem is that set_pte_at() is implemented as a macro on some
architectures.

The appropriate fix is to make all architectures use static inline C
functions in all cases.  That will make the compiler think that the
`mm' arg is used, even if it is not.
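
Schematically (a sketch, not any particular architecture's code):

	/* macro form: `mm' vanishes at preprocessing time, so the caller's
	 * variable looks set-but-unused to gcc */
	#define set_pte_at(mm, addr, ptep, pte)	set_pte(ptep, pte)

	/* static inline form: `mm' is a real function parameter, so
	 * passing it counts as a use and the warning goes away */
	static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		set_pte(ptep, pte);
	}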



Re: [PATCH] Drop -Wdeclaration-after-statement

2019-03-12 Thread Andrew Morton
> > > 
> > > It is not good in my opinion to stick to -Wdeclaration-after-statement.
> > 
> > Why?
> 
> It is useful to have declarations mixed with code.

Am inclined to agree.  Maybe.

> Once the kernel switches to C99 or C11 it _will_ be used, to the point
> of being required at the coding-style level. The superstition of
> declaring everything at the beginning of a function will fall, so we
> might as well start earlier.

Sure, it's a matter of kernel coding conventions.

But this isn't the way to get those changed!  The process for making such
changes involves large numbers of people arguing at each other (perhaps
at kernel summit) until Linus comes in and tells everyone how it's going
to be.



Re: [PATCH] Drop -Wdeclaration-after-statement

2019-03-12 Thread Andrew Morton
On Tue, 12 Mar 2019 20:24:47 +0300 Alexey Dobriyan  wrote:

> On Mon, Mar 11, 2019 at 05:38:45PM -0700, Andrew Morton wrote:
> > On Sun, 10 Mar 2019 16:35:35 +0300 Alexey Dobriyan  
> > wrote:
> > 
> > > Newly added static_assert() is formally a declaration, which will give
> > > a warning if used in the middle of the function.
> > > 
> > > ...
> > >
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -792,9 +792,6 @@ endif
> > >  # arch Makefile may override CC so keep this after arch Makefile is 
> > > included
> > >  NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) 
> > > -print-file-name=include)
> > >  
> > > -# warn about C99 declaration after statement
> > > -KBUILD_CFLAGS += -Wdeclaration-after-statement
> > > -
> > >  # Variable Length Arrays (VLAs) should not be used anywhere in the kernel
> > >  KBUILD_CFLAGS += $(call cc-option,-Wvla)
> > 
> > I do wish your changelogs were more elaborate :(
> 
> > So the proposal is to disable -Wdeclaration-after-statement in all
> > cases for all time because static_assert() doesn't work correctly?
> 
> Yes. I converted 2 cases in /proc to static_assert() and you can't write
> 
>   {
>   [code]
>   static_assert()
>   }
> 
> without a warning because static_assert() is declaration.
> So people would move BUILD_BUG_ON() to where it doesn't belong.

Sure.

> > Surely there's something we can do to squish the static_assert() issue
> > while retaining -Wdeclaration-after-statement?
> 
> It is not good in my opinion to stick to -Wdeclaration-after-statement.

Why?

> > Perhaps by making
> > static_assert() a nop if -Wdeclaration-after-statement is in use. 
> > Perhaps simply by putting { } around the static_assert()?
> 
> Making a statement out of it would disable current cases where it is
> placed in headers.

I think you mean cases where static_assert() is used outside functions?

We could have two versions of it, one for use inside functions, one for
use outside functions?
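
Something like this, perhaps (a sketch only -- the _stmt name is made
up, and `struct foo' is just for illustration):

	/* declaration form: file scope, inside structs, etc */
	static_assert(sizeof(struct foo) == 64);

	/* statement form: legal after code inside a function, because the
	 * declaration sits at the start of its own compound statement */
	#define static_assert_stmt(expr)	do { static_assert(expr); } while (0)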



Re: KASAN: null-ptr-deref Read in reclaim_high

2019-03-12 Thread Andrew Morton
On Tue, 12 Mar 2019 07:08:38 +0100 Dmitry Vyukov  wrote:

> On Tue, Mar 12, 2019 at 12:37 AM Andrew Morton
>  wrote:
> >
> > On Mon, 11 Mar 2019 06:08:01 -0700 syzbot 
> >  wrote:
> >
> > > syzbot has bisected this bug to:
> > >
> > > commit 29a4b8e275d1f10c51c7891362877ef6cffae9e7
> > > Author: Shakeel Butt 
> > > Date:   Wed Jan 9 22:02:21 2019 +
> > >
> > >  memcg: schedule high reclaim for remote memcgs on high_work
> > >
> > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=155bf5db20
> > > start commit:   29a4b8e2 memcg: schedule high reclaim for remote memcgs on..
> > > git tree:   linux-next
> > > final crash:    https://syzkaller.appspot.com/x/report.txt?x=175bf5db20
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=135bf5db20
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=611f89e5b6868db
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=fa11f9da42b46cea3b4a
> > > userspace arch: amd64
> > > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1425901740
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=141630a0c0
> > >
> > > Reported-by: syzbot+fa11f9da42b46cea3...@syzkaller.appspotmail.com
> > > Fixes: 29a4b8e2 ("memcg: schedule high reclaim for remote memcgs on
> > > high_work")
> >
> > The following patch
> > memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work-v3.patch
> > might have fixed this.  Was it applied?
> 
> Hi Andrew,
> 
> You mean if the patch was applied during the bisection?
> No, it wasn't. Bisection is very specifically done on the same tree
> where the bug was hit. There are already too many factors that make
> the result flaky/wrong/inconclusive without changing the tree state.
> Now, if syzbot knew about any pending fix for this bug, then it
> would not do the bisection at all. But it has not seen any patch in
> upstream/linux-next with the Reported-by tag, nor has it received any syz
> fix commands for this bug. Should it have been aware of the fix? How?

memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work-v3.patch was
added to linux-next on Jan 10.  I take it that this bug was hit when
testing the entire linux-next tree, so we can assume that
memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work-v3.patch
does not fix it, correct?

In which case, over to Shakeel!


Re: [PATCH v9 1/9] bitops: Introduce the for_each_set_clump8 macro

2019-03-11 Thread Andrew Morton
On Fri, 8 Mar 2019 09:31:00 +0100 Linus Walleij  
wrote:

> On Sun, Mar 3, 2019 at 8:47 AM William Breathitt Gray
>  wrote:
> 
> > This macro iterates for each 8-bit group of bits (clump) with set bits,
> > within a bitmap memory region. For each iteration, "start" is set to the
> > bit offset of the found clump, while the respective clump value is
> > stored to the location pointed by "clump". Additionally, the
> > bitmap_get_value8 and bitmap_set_value8 functions are introduced to
> > respectively get and set an 8-bit value in a bitmap memory region.
> >
> > Suggested-by: Andy Shevchenko 
> > Suggested-by: Rasmus Villemoes 
> > Cc: Arnd Bergmann 
> > Cc: Andrew Morton 
> > Reviewed-by: Andy Shevchenko 
> > Reviewed-by: Linus Walleij 
> > Signed-off-by: William Breathitt Gray 
> 
> Andrew: would you be OK with this being merged in v5.1?

Yup.  We have quite a few users there.  I assume this will go via the
gpio tree?

Feel free to add Acked-by: Andrew Morton ,
although it probably isn't worth churning the git tree to do so at this
late stage - your call.
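
For anyone following along, usage per the description above is roughly
(a sketch, untested):

	DECLARE_BITMAP(bits, 64);
	unsigned int start;
	unsigned long clump;

	for_each_set_clump8(start, clump, bits, 64)
		pr_info("clump of value %02lx found at bit offset %u\n",
			clump, start);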



Re: [PATCH] Drop -Wdeclaration-after-statement

2019-03-11 Thread Andrew Morton
On Sun, 10 Mar 2019 16:35:35 +0300 Alexey Dobriyan  wrote:

> Newly added static_assert() is formally a declaration, which will give
> a warning if used in the middle of the function.
> 
> ...
>
> --- a/Makefile
> +++ b/Makefile
> @@ -792,9 +792,6 @@ endif
>  # arch Makefile may override CC so keep this after arch Makefile is included
>  NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) -print-file-name=include)
>  
> -# warn about C99 declaration after statement
> -KBUILD_CFLAGS += -Wdeclaration-after-statement
> -
>  # Variable Length Arrays (VLAs) should not be used anywhere in the kernel
>  KBUILD_CFLAGS += $(call cc-option,-Wvla)

I do wish your changelogs were more elaborate :(

So the proposal is to disable -Wdeclaration-after-statement in all
cases for all time because static_assert() doesn't work correctly?

Surely there's something we can do to squish the static_assert() issue
while retaining -Wdeclaration-after-statement?  Perhaps by making
static_assert() a nop if -Wdeclaration-after-statement is in use. 
Perhaps simply by putting { } around the static_assert()?


Re: KASAN: null-ptr-deref Read in reclaim_high

2019-03-11 Thread Andrew Morton
On Mon, 11 Mar 2019 06:08:01 -0700 syzbot 
 wrote:

> syzbot has bisected this bug to:
> 
> commit 29a4b8e275d1f10c51c7891362877ef6cffae9e7
> Author: Shakeel Butt 
> Date:   Wed Jan 9 22:02:21 2019 +
> 
>  memcg: schedule high reclaim for remote memcgs on high_work
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=155bf5db20
> start commit:   29a4b8e2 memcg: schedule high reclaim for remote memcgs on..
> git tree:   linux-next
> final crash:https://syzkaller.appspot.com/x/report.txt?x=175bf5db20
> console output: https://syzkaller.appspot.com/x/log.txt?x=135bf5db20
> kernel config:  https://syzkaller.appspot.com/x/.config?x=611f89e5b6868db
> dashboard link: https://syzkaller.appspot.com/bug?extid=fa11f9da42b46cea3b4a
> userspace arch: amd64
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1425901740
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=141630a0c0
> 
> Reported-by: syzbot+fa11f9da42b46cea3...@syzkaller.appspotmail.com
> Fixes: 29a4b8e2 ("memcg: schedule high reclaim for remote memcgs on  
> high_work")

The following patch
memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work-v3.patch
might have fixed this.  Was it applied?


Re: [LKP] [selftests/vm] a05ef00c97: kernel_selftests.vm.vmalloc_stability_smoke_test.fail

2019-03-11 Thread Andrew Morton
On Mon, 11 Mar 2019 10:20:06 -0700 Linus Torvalds 
 wrote:

> On Mon, Mar 11, 2019 at 12:43 AM kernel test robot
>  wrote:
> >
> > ./run_vmtests: line 217: ./test_vmalloc.sh: Permission denied
> 
> I marked that script executable:
> 
>   6bc3fe8e7e17 ("tools: mark 'test_vmalloc.sh' executable")
> 
> although perhaps the better fix would have been to explicitly use the
> shell to start it in the 'run_vmtests' script instead?
> 
> We have a lot of files that aren't marked executable, probably because
> they came in as (or were generated as) non-git patches. Often through
> Andrew's workflow.

Not only that.  patch(1) doesn't set the x bit.  So if someone
downloads and applies patch-5.0-rc1.xz (as we say to do in the
documentation), their kernel won't work correctly.

This happens fairly regularly and the fix is to replace

test_vmalloc.sh ...

with

/bin/sh test_vmalloc.sh ...




Re: [PATCH] proc: test with vsyscall in mind

2019-03-07 Thread Andrew Morton
On Thu, 7 Mar 2019 21:32:04 +0300 Alexey Dobriyan  wrote:

> Read from vsyscall page to tell if vsyscall is being used.
> 
> Reported-by: kernel test robot
> Signed-off-by: Alexey Dobriyan 

There's a lot of information missing from this changelog :(


I rewrote it to this:

: selftests: proc: proc-pid-vm
: 
: proc-pid-vm: proc-pid-vm.c:277: main: Assertion `rv == strlen(buf0)' failed.
: Aborted

Because the vsyscall mapping is enabled.  Read from vsyscall page to tell
if vsyscall is being used.

Link: http://lkml.kernel.org/r/20190307183204.GA11405@avx2
Link: http://lkml.kernel.org/r/20190219094722.GB28258@shao2-debian
Fixes: 34aab6bec23e7e9 ("proc: test /proc/*/maps, smaps, smaps_rollup, statm")
Signed-off-by: Alexey Dobriyan 
Reported-by: kernel test robot 
Cc: Shuah Khan 
Signed-off-by: Andrew Morton 




Re: [PATCH] ipc: prevent lockup on alloc_msg and free_msg

2019-03-07 Thread Andrew Morton
On Thu,  7 Mar 2019 16:10:22 +0800 Li RongQing  wrote:

> From: Li Rongqing 
> 
> msgctl10 of ltp triggers the following lockup when CONFIG_KASAN
> is enabled on large memory SMP systems: the page initialization
> can take a long time if msgctl10 requests a huge block of memory,
> and it will block the rcu scheduler, so release the cpu actively.
> 
> ...
>
> Signed-off-by: Zhang Yu 
> Signed-off-by: Li RongQing 

This signoff ordering somewhat implies that Zhang Yu was the author. 
But you added "From: Li Rongqing", so you will be recorded as the
patch's author.  Is this correct?

> --- a/ipc/msgutil.c
> +++ b/ipc/msgutil.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "util.h"
>  
> @@ -72,6 +73,7 @@ static struct msg_msg *alloc_msg(size_t len)
>   seg->next = NULL;
>   pseg = &seg->next;
>   len -= alen;
> + cond_resched();
>   }

This looks OK.

>   return msg;
> @@ -178,5 +180,6 @@ void free_msg(struct msg_msg *msg)
>   struct msg_msgseg *tmp = seg->next;
>   kfree(seg);
>   seg = tmp;
> + cond_resched();
>   }

This does not.  mqueue_evict_inode() (at least) calls free_msg() from
under spin_lock().
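
I.e., with this patch we would schematically end up with (a sketch from
memory, not the actual mqueue code):

	spin_lock(&info->lock);
	...
	free_msg(msg);	/* now calls cond_resched(): may sleep while atomic */
	...
	spin_unlock(&info->lock);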



Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem

2019-03-07 Thread Andrew Morton
On Tue, 5 Mar 2019 20:20:10 -0800 Dan Williams  wrote:

> My hesitation would be drastically reduced if there was a plan to
> avoid dangling unconsumed symbols and functionality. Specifically one
> or more of the following suggestions:
> 
> * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> surface for out-of-tree consumers to come grumble at us when we
> continue to refactor the kernel as we are wont to do.

The existing patches use EXPORT_SYMBOL() so that's a sticking point. 
Jerome, what would happen is we made these EXPORT_SYMBOL_GPL()?

> * A commitment to consume newly exported symbols in the same merge
> window, or the following merge window. When that goal is missed revert
> the functionality until such time that it can be consumed, or
> otherwise abandoned.

It sounds like we can tick this box.

> * No new symbol exports and functionality while existing symbols go 
> unconsumed.

Unsure about this one?


Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

2019-03-06 Thread Andrew Morton
On Thu, 7 Mar 2019 00:32:09 +0100 Dominique Martinet  
wrote:

> Andrew Morton wrote on Wed, Mar 06, 2019:
> > On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina  
> > wrote:
> > 
> > > 3/3 is actually waiting for your decision, see
> > > 
> > >   https://lore.kernel.org/lkml/20190212063643.gl15...@dhcp22.suse.cz/
> > 
> > I pity anyone who tried to understand this code by reading this code. 
> > Can we please get some careful commentary in there explaining what is
> > going on, and why things are thus?
> > 
> > I guess the [3/3] change makes sense, although it's unclear whether
> > anyone really needs it?  5.0 was released with 574823bfab8 ("Change
> > mincore() to count "mapped" pages rather than "cached" pages") so we'll
> > have a release cycle to somewhat determine how much impact 574823bfab8
> > has on users.  How about I queue up [3/3] and we reevaluate its
> > desirability in a couple of months?
> 
> FWIW,
> 
> 574823bfab8 has been reverted in 30bac164aca750, included in 5.0-rc4, so
> the controversial change has only been there from 5.0-rc1 to 5.0-rc3

Ah, OK, thanks, I misread.

Linus, do you have thoughts on
http://lkml.kernel.org/r/20190130124420.1834-4-vba...@suse.cz ?



Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

2019-03-06 Thread Andrew Morton
On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina  wrote:

> 3/3 is actually waiting for your decision, see
> 
>   https://lore.kernel.org/lkml/20190212063643.gl15...@dhcp22.suse.cz/

I pity anyone who tried to understand this code by reading this code. 
Can we please get some careful commentary in there explaining what is
going on, and why things are thus?

I guess the [3/3] change makes sense, although it's unclear whether
anyone really needs it?  5.0 was released with 574823bfab8 ("Change
mincore() to count "mapped" pages rather than "cached" pages") so we'll
have a release cycle to somewhat determine how much impact 574823bfab8
has on users.  How about I queue up [3/3] and we reevaluate its
desirability in a couple of months?





Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-03-06 Thread Andrew Morton
On Wed, 30 Jan 2019 13:44:18 +0100 Vlastimil Babka  wrote:

> From: Jiri Kosina 
> 
> The semantics of what mincore() considers to be resident is not completely
> clear, but Linux has always (since 2.3.52, which is when mincore() was
> initially done) treated it as "page is available in page cache".
> 
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly
> belonging to the process executing the syscall, opening possibilities
> for sidechannel attacks.
> 
> Change the semantics of mincore() so that it only reveals pagecache
> information for non-anonymous mappings that belong to files that the
> calling process could (if it tried to) successfully open for writing.

"for writing" comes as a bit of a surprise.  Why not for reading?

Could we please explain the reasoning in the changelog and in the
(presently absent) comments which describe can_do_mincore()?

> @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
>   vma = find_vma(current->mm, addr);
>   if (!vma || addr < vma->vm_start)
>   return -ENOMEM;
> - mincore_walk.mm = vma->vm_mm;
>   end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> + if (!can_do_mincore(vma)) {
> + unsigned long pages = (end - addr) >> PAGE_SHIFT;

I'm not sure this is correct in all cases.   If

addr = 4095
vma->vm_end = 4096
pages = 1000

then `end' is 4096 and `(end - addr) >> PAGE_SHIFT' is zero, but it
should have been 1.

Please check?
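
Something like the below would cover a trailing partial page, assuming
addr can indeed be unaligned at this point (untested):

	unsigned long pages = (PAGE_ALIGN(end) - (addr & PAGE_MASK)) >> PAGE_SHIFT;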

A mincore test suite in tools/testing/selftests would be useful,
methinks.  To exercise such corner cases, check for future breakage,
etc.

> + memset(vec, 1, pages);
> + return pages;
> + }
> + mincore_walk.mm = vma->vm_mm;
>   err = walk_page_range(addr, end, &mincore_walk);
>   if (err < 0)
>   return err;




Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

2019-03-06 Thread Andrew Morton
On Wed, 6 Mar 2019 13:11:39 +0100 (CET) Jiri Kosina  wrote:

> On Wed, 30 Jan 2019, Vlastimil Babka wrote:
> 
> > I've collected the patches from the discussion for formal posting. The first
> > two should be settled already, third one is the possible improvement I've
> > mentioned earlier, where only in restricted case we resort to existence of 
> > page
> > table mapping (the original and later reverted approach from Linus) instead 
> > of
> > faking the result completely. Review and testing welcome.
> > 
> > The consensus seems to be going through -mm tree for 5.1, unless Linus wants
> > them alredy for 5.0.
> > 
> > Jiri Kosina (2):
> >   mm/mincore: make mincore() more conservative
> >   mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O
> > 
> > Vlastimil Babka (1):
> >   mm/mincore: provide mapped status when cached status is not allowed
> 
> Andrew,
> 
> could you please take at least the correct and straightforward fix for 
> mincore() before we figure out how to deal with the slightly less 
> practical RWF_NOWAIT? Thanks.

I assume we're talking about [1/3] and [2/3] from this thread?

Can we have a resend please?  Gather the various acks and revisions,
make changelog changes to address the review questions and comments?

Thanks.


Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem

2019-03-06 Thread Andrew Morton
On Wed, 6 Mar 2019 10:49:04 -0500 Jerome Glisse  wrote:

> On Tue, Mar 05, 2019 at 02:16:35PM -0800, Andrew Morton wrote:
> > On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams  
> > wrote:
> > 
> > > >
> > > > > Another way to help allay these worries is commit to no new exports
> > > > > without in-tree users. In general, that should go without saying for
> > > > > any core changes for new or future hardware.
> > > >
> > > > I always intend to have an upstream user the issue is that the device
> > > > driver tree and the mm tree move a different pace and there is always
> > > > a chicken and egg problem. I do not think Andrew wants to have to
> > > > merge driver patches through its tree, nor Linus want to have to merge
> > > > drivers and mm trees in specific order. So it is easier to introduce
> > > > mm change in one release and driver change in the next. This is what
> > > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > > with driver patches, push to merge mm bits one release and have the
> > > > driver bits in the next. I do hope this sound fine to everyone.
> > > 
> > > The track record to date has not been "merge HMM patch in one release
> > > and merge the driver updates the next". If that is the plan going
> > > forward that's great, and I do appreciate that this set came with
> > > driver changes, and maintain hope the existing exports don't go
> > > user-less for too much longer.
> > 
> > Decision time.  Jerome, how are things looking for getting these driver
> > changes merged in the next cycle?
> 
> nouveau is merged already.

Confused.  Nouveau in mainline is dependent upon "mm/hmm: allow to
mirror vma of a file on a DAX backed filesystem"?  That can't be the
case?

> > 
> > Dan, what's your overall take on this series for a 5.1-rc1 merge?
> > 
> > Jerome, what would be the risks in skipping just this [09/10] patch?
> 
> As nouveau is a new user it does not regress anything, but for RDMA
> mlx5 (which I expect to merge next window) it would regress that
> driver.

Also confused.  How can omitting "mm/hmm: allow to mirror vma of a file
on a DAX backed filesystem" from 5.1-rc1 cause an mlx5 regression?



Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem

2019-03-05 Thread Andrew Morton
On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams  
wrote:

> >
> > > Another way to help allay these worries is commit to no new exports
> > > without in-tree users. In general, that should go without saying for
> > > any core changes for new or future hardware.
> >
> > I always intend to have an upstream user the issue is that the device
> > driver tree and the mm tree move a different pace and there is always
> > a chicken and egg problem. I do not think Andrew wants to have to
> > merge driver patches through its tree, nor Linus want to have to merge
> > drivers and mm trees in specific order. So it is easier to introduce
> > mm change in one release and driver change in the next. This is what
> > i am doing with ODP. Adding things necessary in 5.1 and working with
> > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > 5.2 (the patch is available today and Mellanox have begin testing it
> > AFAIK). So this is the guideline i will be following. Post mm bits
> > with driver patches, push to merge mm bits one release and have the
> > driver bits in the next. I do hope this sound fine to everyone.
> 
> The track record to date has not been "merge HMM patch in one release
> and merge the driver updates the next". If that is the plan going
> forward that's great, and I do appreciate that this set came with
> driver changes, and maintain hope the existing exports don't go
> user-less for too much longer.

Decision time.  Jerome, how are things looking for getting these driver
changes merged in the next cycle?

Dan, what's your overall take on this series for a 5.1-rc1 merge?

Jerome, what would be the risks in skipping just this [09/10] patch?


Re: [PATCH v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()

2019-03-05 Thread Andrew Morton
On Mon, 4 Mar 2019 20:15:40 -0800 Mike Kravetz  wrote:

> Andrew, this is on top of Alexandre Ghiti's "hugetlb: allow to free gigantic
> pages regardless of the configuration" patch.  Both patches modify
> __nr_hugepages_store_common().  Alex's patch is going to change slightly
> in this area.

OK, thanks, I missed that.  Are the changes significant?


Re: [PATCH] selftest/proc: Remove duplicate header

2019-03-04 Thread Andrew Morton
On Mon, 4 Mar 2019 23:57:19 +0530 Souptick Joarder  wrote:

> Remove duplicate header which is included twice.
> 
> Signed-off-by: Sabyasachi Gupta 
> Signed-off-by: Souptick Joarder 

This signoff order makes me suspect that Sabyasachi was the author.  If
so, the patch should have had Sabyasachi's From: line at the start of
the changelog.

Please clarify?



Re: [PATCH v6] panic: Avoid the extra noise dmesg

2019-03-01 Thread Andrew Morton
On Fri, 1 Mar 2019 16:57:52 -0500 Steven Rostedt  wrote:

> Looks good to me.
> 
> Acked-by: Steven Rostedt (VMware) 
> 
> Andrew, you want to take this patch?

Yup.


Re: [PATCH] mm/hugepages: fix "orig_pud" set but not used

2019-03-01 Thread Andrew Morton
On Thu, 28 Feb 2019 19:49:03 -0500 Qian Cai  wrote:

> The commit a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent
> hugepages") introduced pudp_huge_get_and_clear_full() but no one uses
> its return code, so just make it void.
> 
> mm/huge_memory.c: In function 'zap_huge_pud':
> mm/huge_memory.c:1982:8: warning: variable 'orig_pud' set but not used
> [-Wunused-but-set-variable]
>   pud_t orig_pud;
> ^~~~
> 
> ...
>
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -167,11 +167,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct mm_struct *mm,
>  #endif
>  
>  #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
> -static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm,
> - unsigned long address, pud_t *pudp,
> - int full)
> +static inline void pudp_huge_get_and_clear_full(struct mm_struct *mm,
> + unsigned long address,
> + pud_t *pudp, int full)
>  {
> - return pudp_huge_get_and_clear(mm, address, pudp);
> + pudp_huge_get_and_clear(mm, address, pudp);
>  }

Not sure this is a good change.  Future callers might want that return
value.

> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1979,7 +1979,6 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
>  int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>pud_t *pud, unsigned long addr)
>  {
> - pud_t orig_pud;
>   spinlock_t *ptl;
>  
>   ptl = __pud_trans_huge_lock(pud, vma);
> @@ -1991,8 +1990,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>* pgtable_trans_huge_withdraw after finishing pudp related
>* operations.
>*/
> - orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud,
> - tlb->fullmm);
> + pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);

In fact this code perhaps should be passing orig_pud into
pudp_huge_get_and_clear_full().  That could depend on what future
per-arch implementations of pudp_huge_get_and_clear_full() choose to
do.
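
If the concern is just the warning, one alternative which keeps the
return value in place would be (sketch):

	pud_t orig_pud __maybe_unused;

	orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud,
						tlb->fullmm);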

Anyway, I'll await Matthew's feedback.


Re: [PATCH v2] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC

2019-03-01 Thread Andrew Morton
On Fri,  1 Mar 2019 15:19:50 -0500 Qian Cai  wrote:

> When onlining a memory block with DEBUG_PAGEALLOC, it unmaps the pages
> in the block from the kernel. However, it does not map those pages while
> offlining at the beginning. As a result, it triggers the panic below
> while onlining on ppc64le, as it checks whether the pages are mapped before
> unmapping them. However, the imbalance exists for all arches where
> double-unmappings could happen. Therefore, let the kernel map those pages in
> generic_online_page() before they are freed into the page
> allocator for the first time, where the page count will be set to one.
> 
> On the other hand, it works fine during the boot, because at least for
> IBM POWER8, it does,
> 
> early_setup
>   early_init_mmu
>     hash__early_init_mmu
>       htab_initialize [1]
>         htab_bolt_mapping [2]
> 
> where it effectively map all memblock regions just like
> kernel_map_linear_page(), so later mem_init() -> memblock_free_all()
> will unmap them just fine without any imbalance. On other arches without
> this imbalance checking, it still unmaps them at most once.
> 
> [1]
> for_each_memblock(memory, reg) {
> base = (unsigned long)__va(reg->base);
> size = reg->size;
> 
> DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
> base, size, prot);
> 
> BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
> prot, mmu_linear_psize, mmu_kernel_ssize));
> }
> 
> [2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
> 
> kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815!
>
> ...
>
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -660,6 +660,7 @@ static void generic_online_page(struct page *page)
>  {
>   __online_page_set_limits(page);
>   __online_page_increment_counters(page);
> + kernel_map_pages(page, 1, 1);
>   __online_page_free(page);
>  }

This code was changed a lot by Arun's "mm/page_alloc.c: memory hotplug:
free pages as higher order".

I don't think hotplug+DEBUG_PAGEALLOC is important enough to disrupt
memory_hotplug-free-pages-as-higher-order.patch, which took a long time
to sort out.  So could you please take a look at linux-next, determine
whether the problem is still there and propose a suitable patch?

Thanks.


Re: next/master boot bisection: next-20190215 on beaglebone-black

2019-03-01 Thread Andrew Morton
On Fri, 1 Mar 2019 09:25:24 +0100 Guillaume Tucker 
 wrote:

> >>> Michal had asked if the free space accounting fix up addressed this
> >>> boot regression? I was awaiting word on that.
> >>
> >> hm, does b...@kernelci.org actually read emails?  Let's try info@ as well..
> 
> b...@kernelci.org is not person, it's a send-only account for
> automated reports.  So no, it doesn't read emails.
> 
> I guess the tricky point here is that the authors of the commits
> found by bisections may not always have the hardware needed to
> reproduce the problem.  So it needs to be dealt with on a
> case-by-case basis: sometimes they do have the hardware,
> sometimes someone else on the list or on CC does, and sometimes
> it's better for the people who have access to the test lab which
> ran the KernelCI test to deal with it.
> 
> This case seems to fall into the last category.  As I have access
> to the Collabora lab, I can do some quick checks to confirm
> whether the proposed patch does fix the issue.  I hadn't realised
> that someone was waiting for this to happen, especially as the
> BeagleBone Black is a very common platform.  Sorry about that,
> I'll take a look today.
> 
> It may be a nice feature to be able to give access to the
> KernelCI test infrastructure to anyone who wants to debug an
> issue reported by KernelCI or verify a fix, so they won't need to
> have the hardware locally.  Something to think about for the
> future.

Thanks, that all sounds good.

> >> Is it possible to determine whether this regression is still present in
> >> current linux-next?
> 
> I'll try to re-apply the patch that caused the issue, then see if
> the suggested change fixes it.  As far as the current linux-next
> master branch is concerned, KernelCI boot tests are passing fine
> on that platform.

They would, because I dropped
mm-shuffle-default-enable-all-shuffling.patch, so your tests presumably
now have shuffling disabled.

Is it possible to add the below to linux-next and try again?

Or I can re-add this to linux-next.  Where should we go to determine
the results of such a change?  There are a heck of a lot of results on
https://kernelci.org/boot/ and entering "beaglebone-black" doesn't get
me anything.

Thanks.



From: Dan Williams 
Subject: mm/shuffle: default enable all shuffling

Per Andrew's request arrange for all memory allocation shuffling code to
be enabled by default.

The page_alloc.shuffle command line parameter can still be used to disable
shuffling at boot, but the kernel will default enable the shuffling if the
command line option is not specified.

Link: 
http://lkml.kernel.org/r/154943713572.3858443.11206307988382889377.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
Cc: Kees Cook 
Cc: Michal Hocko 
Cc: Dave Hansen 
Cc: Keith Busch 

Signed-off-by: Andrew Morton 
---

 init/Kconfig |4 ++--
 mm/shuffle.c |4 ++--
 mm/shuffle.h |2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

--- a/init/Kconfig~mm-shuffle-default-enable-all-shuffling
+++ a/init/Kconfig
@@ -1709,7 +1709,7 @@ config SLAB_MERGE_DEFAULT
  command line.
 
 config SLAB_FREELIST_RANDOM
-   default n
+   default y
depends on SLAB || SLUB
bool "SLAB freelist randomization"
help
@@ -1728,7 +1728,7 @@ config SLAB_FREELIST_HARDENED
 
 config SHUFFLE_PAGE_ALLOCATOR
bool "Page allocator randomization"
-   default SLAB_FREELIST_RANDOM && ACPI_NUMA
+   default y
help
  Randomization of the page allocator improves the average
  utilization of a direct-mapped memory-side-cache. See section
--- a/mm/shuffle.c~mm-shuffle-default-enable-all-shuffling
+++ a/mm/shuffle.c
@@ -9,8 +9,8 @@
 #include "internal.h"
 #include "shuffle.h"
 
-DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
-static unsigned long shuffle_state __ro_after_init;
+DEFINE_STATIC_KEY_TRUE(page_alloc_shuffle_key);
+static unsigned long shuffle_state __ro_after_init = 1 << SHUFFLE_ENABLE;
 
 /*
  * Depending on the architecture, module parameter parsing may run
--- a/mm/shuffle.h~mm-shuffle-default-enable-all-shuffling
+++ a/mm/shuffle.h
@@ -19,7 +19,7 @@ enum mm_shuffle_ctl {
 #define SHUFFLE_ORDER (MAX_ORDER-1)
 
 #ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
-DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+DECLARE_STATIC_KEY_TRUE(page_alloc_shuffle_key);
 extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
 extern void __shuffle_free_memory(pg_data_t *pgdat);
 static inline void shuffle_free_memory(pg_data_t *pgdat)
_



Re: [PATCH] log2: make is_power_of_2() integer constant expression when possible

2019-03-01 Thread Andrew Morton
On Fri,  1 Mar 2019 14:52:07 +0200 Jani Nikula  wrote:

> While is_power_of_2() is an inline function and likely gets optimized
> for compile time constant arguments, it still doesn't produce an integer
> constant expression that could be used in, say, static data
> initialization or case labels.

hm, what code wants to do these things?

> Make is_power_of_2() an integer constant expression when possible,
> otherwise using the inline function to avoid multiple evaluation of the
> parameter.

Spose so.  But I fear that some gcc versions will get it right and
others will mess it up.  While this patch is under test I think it
would be best to also have at least one or two callsites which actually
use the compile-time evaluation feature.  Possible?
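
I'd expect the macro to be shaped something like this (a sketch -- the
inline helper name is made up and the real patch may differ):

	#define is_power_of_2(n)					\
		__builtin_choose_expr(__builtin_constant_p(n),		\
			((n) != 0 && (((n) & ((n) - 1)) == 0)),		\
			__is_power_of_2_inline(n))

__builtin_choose_expr() keeps the constant case an integer constant
expression, while the inline function keeps the non-constant case
single-evaluation.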



Re: next/master boot bisection: next-20190215 on beaglebone-black

2019-02-28 Thread Andrew Morton
On Tue, 26 Feb 2019 16:04:04 -0800 Dan Williams  
wrote:

> On Tue, Feb 26, 2019 at 4:00 PM Andrew Morton  
> wrote:
> >
> > On Fri, 15 Feb 2019 18:51:51 + Mark Brown  wrote:
> >
> > > On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote:
> > > > On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" 
> > > >  wrote:
> > >
> > > > >   Details:https://kernelci.org/boot/id/5c666ea959b514b017fe6017
> > > > >   Plain log:  
> > > > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
> > > > >   HTML log:   
> > > > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
> > >
> > > > Thanks.
> > >
> > > > But what actually went wrong?  Kernel doesn't boot?
> > >
> > > The linked logs show the kernel dying early in boot before the console
> > > comes up so yeah.  There should be kernel output at the bottom of the
> > > logs.
> >
> > I assume Dan is distracted - I'll keep this patchset on hold until we
> > can get to the bottom of this.
> 
> Michal had asked if the free space accounting fix up addressed this
> boot regression? I was awaiting word on that.

hm, does b...@kernelci.org actually read emails?  Let's try info@ as well..

Is it possible to determine whether this regression is still present in
current linux-next?

> I assume you're not willing to entertain a "depends
> NOT_THIS_ARM_BOARD" hack in the meantime?

We'd probably never be able to remove it.  And we don't know whether
other systems might be affected.


Re: [PATCH v5] panic: Avoid the extra noise dmesg

2019-02-28 Thread Andrew Morton
On Fri, 22 Feb 2019 14:09:59 +0800 Feng Tang  wrote:

> When kernel panic happens, it will first print the panic call stack,
> then the ending msg like:
> 
> [   35.743249] ---[ end Kernel panic - not syncing: Fatal exception
> [   35.749975] [ cut here ]
> 
> The above message are very useful for debugging.
> 
> But if system is configured to not reboot on panic, say the "panic_timeout"
> parameter equals 0, it will likely print out many noisy message like
> WARN() call stack for each and every CPU except the panic one, messages
> like below:
> 
>   WARNING: CPU: 1 PID: 280 at kernel/sched/core.c:1198 
> set_task_cpu+0x183/0x190
>   Call Trace:
>   
>   try_to_wake_up
>   default_wake_function
>   autoremove_wake_function
>   __wake_up_common
>   __wake_up_common_lock
>   __wake_up
>   wake_up_klogd_work_func
>   irq_work_run_list
>   irq_work_tick
>   update_process_times
>   tick_sched_timer
>   __hrtimer_run_queues
>   hrtimer_interrupt
>   smp_apic_timer_interrupt
>   apic_timer_interrupt

It's a fairly ugly-looking patch but I am inclined to agree.

The panicing CPU is spinning and blinking a LED and all CPUs have
interrupts enabled and the system is known to be in a messed up state. 
All sorts of kernel code could emit all sorts of output in such
circumstances.  So a global printk-killing knob seems appropriate.

Thoughts:

- why do the suppression in vprintk_emit()?  Doing it right at entry
  to printk() seems cleaner, more explicit?

- Other code sites may wish to suppress all printks.  Perhaps
  `panic_suppress_printk' should just be called `suppress_printk'?
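
  i.e. roughly this, with the rename applied (a sketch against current
  printk.c):

	int suppress_printk;	/* set when the system should go quiet */

	asmlinkage __visible int printk(const char *fmt, ...)
	{
		va_list args;
		int r;

		if (suppress_printk)
			return 0;

		va_start(args, fmt);
		r = vprintk_func(fmt, args);
		va_end(args);

		return r;
	}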




Re: [PATCH v2 2/4] mm: remove zone_lru_lock() function access ->lru_lock directly

2019-02-28 Thread Andrew Morton
On Thu, 28 Feb 2019 05:53:37 -0700 William Kucharski 
 wrote:

> > On Feb 28, 2019, at 1:33 AM, Andrey Ryabinin  
> > wrote:
> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a9852ed7b97f..2d081a32c6a8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1614,8 +1614,8 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
> > 
> > }
> > 
> > -/*
> > - * zone_lru_lock is heavily contended.  Some of the functions that
> > +/**
> 
> Nit: Remove the extra asterisk here; the line should then revert to being 
> unchanged from
> the original.

I don't think so.  This kerneldoc comment was missing its leading /**.
The patch fixes that.



Re: next/master boot bisection: next-20190215 on beaglebone-black

2019-02-26 Thread Andrew Morton
On Fri, 15 Feb 2019 18:51:51 + Mark Brown  wrote:

> On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote:
> > On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" 
> >  wrote:
> 
> > >   Details:https://kernelci.org/boot/id/5c666ea959b514b017fe6017
> > >   Plain log:  
> > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
> > >   HTML log:   
> > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
> 
> > Thanks.
> 
> > But what actually went wrong?  Kernel doesn't boot?
> 
> The linked logs show the kernel dying early in boot before the console
> comes up so yeah.  There should be kernel output at the bottom of the
> logs.

I assume Dan is distracted - I'll keep this patchset on hold until we
can get to the bottom of this.



Re: [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region

2019-02-26 Thread Andrew Morton
[patch 1/4]: OK.  I guess.  Was this worth consuming our last PF_ flag?
[patch 2/4]: unreviewed
[patch 3/4]: unreviewed, mpe still unhappy, I expect?
[patch 4/4]: unreviewed


Re: [PATCH] mm, memcg: Handle cgroup_disable=memory when getting memcg protection

2019-02-26 Thread Andrew Morton
a desirable value because otherwise as soon as
you exceed memory.low, all protection is lost, and all pages are eligible
to scan again.  By contrast, having a gradual ramp in reclaim pressure
means that you now still get some protection when thresholds are exceeded,
which means that one can now be more comfortable setting memory.low to
lower values without worrying that all protection will be lost.  This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary.

   To test this, I experimented stepping further and further down
   memory.low protection on a workload that floats around 19G workingset
   when under memory.low protection, watching page scan rates for the
   workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 | N/A          |
   |        17G |             867 |               3799 | 23%          |
   |        12G |            1203 |               3543 | 34%          |
   |         8G |            2534 |               3979 | 64%          |
   |         4G |            3980 |               4147 | 96%          |
   |          0 |            3799 |               3980 | 95%          |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this
   patch) ramps up page scanning significantly more gradually than the
   control kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in
   premature OOMs.

   To test this, I wrote a script that slowly increments the number of
   pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention.  This script runs in a highly
   protected slice taking up the majority of available system memory. 
   Watching vmstat revealed that page scanning continued essentially
   nominally between test and control, without causing forward reclaim
   progress to become arrested.

[0]: 
https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

[a...@linux-foundation.org: reflow block comments to fit in 80 cols]
[ch...@chrisdown.name: handle cgroup_disable=memory when getting memcg 
protection]
  Link: http://lkml.kernel.org/r/20190201045711.ga18...@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.ga6...@chrisdown.name
Signed-off-by: Chris Down 
Acked-by: Johannes Weiner 
Reviewed-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Tejun Heo 
Cc: Dennis Zhou 
Cc: Tetsuo Handa 
Signed-off-by: Andrew Morton 
---

 Documentation/admin-guide/cgroup-v2.rst |   20 +++--
 include/linux/memcontrol.h  |   20 +
 mm/memcontrol.c |5 +
 mm/vmscan.c |   82 --
 4 files changed, 115 insertions(+), 12 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst~mm-proportional-memorylowmin-reclaim
+++ a/Documentation/admin-guide/cgroup-v2.rst
 on an IO device and is an example of this
 Protections
 -----------
 
-A cgroup is protected to be allocated upto the configured amount of
-the resource if the usages of all its ancestors are under their
+A cgroup is protected upto the configured amount of the resource
+as long as the usages of all its ancestors are under their
 protected levels.  Protections can be hard guarantees or best effort
 soft boundaries.  Protections can also be over-committed in which case
 only upto the amount available to the parent is protected among
@@ -1020,7 +1020,10 @@ PAGE_SIZE multiple when read back.
is within its effective min boundary, the cgroup's memory
won't be reclaimed under any conditions. If there is no
unprotected reclaimable memory available, OOM killer
-   is invoked.
+   is invoked. Above the effective min boundary (or
+   effective low boundary if it is higher), pages are reclaimed
+   proportionally to the overage, reducing reclaim pressure for
+   smaller overages.
 
Effective min boundary is limited by memory.min values of
all ancestor cgroups. If there is memory.min overcommitment
@@ -1042,7 +1045,10 @@ PAGE_SIZE multiple when read back.
Best-effort memory protection.  If the memory usage of a
cgroup is within its effective low boundary, the cgroup's

Re: [PATCH v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()

2019-02-26 Thread Andrew Morton
> 
> The number of node specific huge pages can be set via a file such as:
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> When a node specific value is specified, the global number of huge
> pages must also be adjusted.  This adjustment is calculated as the
> specified node specific value + (global value - current node value).
> If the node specific value provided by the user is large enough, this
> calculation could overflow an unsigned long leading to a smaller
> than expected number of huge pages.
> 
> To fix, check the calculation for overflow.  If overflow is detected,
> use ULONG_MAX as the requested value.  This is inline with the user
> request to allocate as many huge pages as possible.
> 
> It was also noticed that the above calculation was done outside the
> hugetlb_lock.  Therefore, the values could be inconsistent and result
> in underflow.  To fix, the calculation is moved to within the routine
> set_max_huge_pages() where the lock is held.
> 
> ...
>
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
> nodemask_t *nodes_allowed,

Please tweak that email client to prevent the wordwraps.

> + /*
> +  * Check for a node specific request.  Adjust global count, but
> +  * restrict alloc/free to the specified node.
> +  */
> + if (nid != NUMA_NO_NODE) {
> + unsigned long old_count = count;
> + count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> + /*
> +  * If user specified count causes overflow, set to
> +  * largest possible value.
> +  */
> + if (count < old_count)
> + count = ULONG_MAX;
> + }

The above two comments explain the code, but do not reveal the
reasoning behind the policy decisions which that code implements.

> ...
>
> + } else {
>   /*
> -  * per node hstate attribute: adjust count to global,
> -  * but restrict alloc/free to the specified node.
> +  * Node specific request, but we could not allocate
> +  * node mask.  Pass in ALL nodes, and clear nid.
>*/

Ditto here, somewhat.

The old mantra: comments should explain "why", not "what".  Reading the
code tells us the "what".

Thanks.
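
For readers skimming the quoted hunk, the overflow check boils down to a
saturating add.  A standalone sketch of the same idiom (illustrative
only, not the patch itself):

	#include <limits.h>

	/*
	 * Unsigned wraparound makes the sum smaller than either operand
	 * when the addition overflows, so "sum < a" detects it without
	 * any extra state.
	 */
	static unsigned long saturating_add(unsigned long a, unsigned long b)
	{
		unsigned long sum = a + b;

		return sum < a ? ULONG_MAX : sum;
	}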


Re: [PATCH] mm,mremap: Bail out earlier in mremap_to under map pressure

2019-02-26 Thread Andrew Morton
On Tue, 26 Feb 2019 10:13:14 +0100 Oscar Salvador  wrote:

> When using mremap() syscall in addition to MREMAP_FIXED flag,
> mremap() calls mremap_to() which does the following:
> 
> 1) unmaps the destination region where we are going to move the map
> 2) If the new region is going to be smaller, we unmap the last part
>of the old region
> 
> Then, we will eventually call move_vma() to do the actual move.
> 
> move_vma() checks whether we are at least 4 maps below max_map_count
> before going further, otherwise it bails out with -ENOMEM.
> The problem is that we might have already unmapped the vma's in steps
> 1) and 2), so it is not possible for userspace to figure out the state
> of the vma's after it gets -ENOMEM, and it gets tricky for userspace
> to clean up properly on error path.
> 
> While it is true that we can return -ENOMEM for more reasons
> (e.g: see may_expand_vm() or move_page_tables()), I think that we can
> avoid this particular scenario if we check early in mremap_to() whether the
> operation has high chances to succeed map-wise.
> 
> Should that not be the case, we can bail out before we even try to unmap
> anything, so we make sure the vma's are left untouched when we are likely
> to be short of maps.
> 
> The rule of thumb now is to rely on the worst case we can have.
> That is when both vma's (old region and new region) are going to be split
> in 3, so we get two more maps to the ones we already hold (one per each).
> If current map count + 2 maps still leads us to 4 maps below the threshold,
> we are going to pass the check in move_vma().
> 
> Of course, this is not free, as it might generate false positives when it is
> true that we are tight map-wise, but the unmap operation can release several
> vma's leading us to a good state.
> 
> Another approach was also investigated [1], but it may be too much hassle
> for what it brings.
> 

How is this going to affect existing userspace which is aware of the
current behaviour?

And how does it affect your existing cleanup code, come to that?  Does
it work as well or better after this change?
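
For context, the early check the changelog describes amounts to
something like the following at the top of mremap_to(); this is a
sketch of the idea, not the exact patch text.

	/*
	 * Sketch: move_vma() needs us to stay 4 maps below the threshold.
	 * Worst case, both the old and the new region split in three,
	 * adding two maps to the ones we already hold, so bail out before
	 * unmapping anything if even that head-room is missing.
	 */
	if (mm->map_count + 2 >= sysctl_max_map_count - 3)
		return -ENOMEM;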


Re: [PATCH] kernel/hung_task.c: Use continuously blocked time when reporting.

2019-02-26 Thread Andrew Morton
On Tue, 26 Feb 2019 18:58:03 +0900 Tetsuo Handa wrote:

> Since commit a2e514453861dd39 ("kernel/hung_task.c: allow to set checking
> interval separately from timeout") added hung_task_check_interval_secs,
> setting a value different from hung_task_timeout_secs
> 
>   echo 0 > /proc/sys/kernel/hung_task_panic
>   echo 120 > /proc/sys/kernel/hung_task_timeout_secs
>   echo 5 > /proc/sys/kernel/hung_task_check_interval_secs
> 
> causes confusing output as if the task was blocked for
> hung_task_timeout_secs seconds from the previous report.
> 
>   [  399.395930] INFO: task kswapd0:75 blocked for more than 120 seconds.
>   [  405.027637] INFO: task kswapd0:75 blocked for more than 120 seconds.
>   [  410.659725] INFO: task kswapd0:75 blocked for more than 120 seconds.
>   [  416.292860] INFO: task kswapd0:75 blocked for more than 120 seconds.
>   [  421.932305] INFO: task kswapd0:75 blocked for more than 120 seconds.
> 
> Although we could update t->last_switch_time after sched_show_task(t) if
> we want to report only every 120 seconds, reporting every 5 seconds might
> not be very bad for monitoring after a problematic situation has started.
> Thus, let's use continuously blocked time instead of updating previously
> reported time.
> 
>   [  677.985011] INFO: task kswapd0:80 blocked for more than 122 seconds.
>   [  693.856126] INFO: task kswapd0:80 blocked for more than 138 seconds.
>   [  709.728075] INFO: task kswapd0:80 blocked for more than 154 seconds.
>   [  725.600018] INFO: task kswapd0:80 blocked for more than 170 seconds.
>   [  741.473133] INFO: task kswapd0:80 blocked for more than 186 seconds.

hm, maybe.

The user asked for a report if a task was blocked for more than 120
seconds, so the kernel is accurately reporting that this happened.  The
actual blockage time is presumably useful information, but that's a
different thing.

We could report both:

"blocked for 122 seconds which is more than 120 seconds"

But on the other hand, what is the point in telling the user how they
configured their own kernel?

I think I'll stop arguing with myself and apply the patch ;)
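
For reference, the reported figure under the patch is essentially the
continuously blocked time; a sketch based on the changelog rather than
the exact diff:

	/*
	 * Report how long the task has actually been blocked, measured
	 * from t->last_switch_time, instead of repeating the configured
	 * timeout every check interval.
	 */
	pr_err("INFO: task %s:%d blocked for more than %lu seconds.\n",
	       t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);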


Re: [PATCH 2/2] test_firmware: silence underflow warning in test_dev_config_update_u8()

2019-02-21 Thread Andrew Morton
On Thu, 21 Feb 2019 22:18:56 +0300 Dan Carpenter wrote:

> On Thu, Feb 21, 2019 at 10:54:58AM -0800, Andrew Morton wrote:
> > On Thu, 21 Feb 2019 21:38:26 +0300 Dan Carpenter wrote:
> > 
> > > We put an upper bound on "new" but we don't check for negatives.
> > 
> > U8_MAX has unsigned type, so `if (new > U8_MAX)' does check for negative.
> > 
> 
> No, doesn't work in this case.
> 
> #define U8_MAX  ((u8)~0U)
> 
> It would need to be unsigned long for the type promotion to prevent
> negatives, but it starts as unsigned int, then unsigned char, which is
> type promoted to int.

OK.

> It would be more clear to just write it as:
> 
> #define U8_MAX 0xff

That doesn't work either.  Tricky.


#include <stdio.h>

typedef unsigned char u8;

#define U8_MAX 0xff

int main(int argc, char *argv[])
{
	long new;

	new = -20;

	/* 0xff is an int, so both sides promote to long: -20 > 255 is
	 * false and nothing is printed, even though -20 is not a valid
	 * u8 value. */
	if (new > U8_MAX)
		printf("over\n");
}



Re: [PATCH] huegtlbfs: fix races and page leaks during migration

2019-02-21 Thread Andrew Morton
On Thu, 21 Feb 2019 11:11:06 -0800 Mike Kravetz  wrote:

> 
> Sorry for the churn.  As I find and fix one issue I seem to discover another.
> There is still at least one more issue with private pages when COW comes into
> play.  I continue to work that.  I wanted to send this patch earlier as it
> is pretty easy to hit the bugs if you try.  If you would prefer another
> approach, let me know.
> 

No probs, the bug doesn't seem to be causing a lot of bother out there
and it's cc:stable; there's time to get this right ;)

Here's the delta I queued:

--- a/mm/hugetlb.c~huegtlbfs-fix-races-and-page-leaks-during-migration-update
+++ a/mm/hugetlb.c
@@ -3729,6 +3729,7 @@ static vm_fault_t hugetlb_no_page(struct
pte_t new_pte;
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
+   bool new_page = false;
 
/*
 * Currently, we are forced to kill the process in the event the
@@ -3790,6 +3791,7 @@ retry:
}
clear_huge_page(page, address, pages_per_huge_page(h));
__SetPageUptodate(page);
+   new_page = true;
 
if (vma->vm_flags & VM_MAYSHARE) {
int err = huge_add_to_page_cache(page, mapping, idx);
@@ -3861,8 +3863,9 @@ retry:
 
spin_unlock(ptl);
 
-   /* May already be set if not newly allocated page */
-   set_page_huge_active(page);
+   /* Make newly allocated pages active */
+   if (new_page)
+   set_page_huge_active(page);
 
unlock_page(page);
 out:
--- a/mm/migrate.c~huegtlbfs-fix-races-and-page-leaks-during-migration-update
+++ a/mm/migrate.c
@@ -1315,6 +1315,16 @@ static int unmap_and_move_huge_page(new_
lock_page(hpage);
}
 
+   /*
+* Check for pages which are in the process of being freed.  Without
+* page_mapping() set, hugetlbfs specific move page routine will not
+* be called and we could leak usage counts for subpools.
+*/
+   if (page_private(hpage) && !page_mapping(hpage)) {
+   rc = -EBUSY;
+   goto out_unlock;
+   }
+
if (PageAnon(hpage))
anon_vma = page_get_anon_vma(hpage);
 
@@ -1345,6 +1355,7 @@ put_anon:
put_new_page = NULL;
}
 
+out_unlock:
unlock_page(hpage);
 out:
if (rc != -EAGAIN)
_




Re: BUG: unable to handle kernel NULL pointer dereference in __generic_file_write_iter

2019-02-21 Thread Andrew Morton
On Thu, 21 Feb 2019 06:52:04 -0800 syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:4aa9fc2a435a Revert "mm, memory_hotplug: initialize struct..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1101382f40
> kernel config:  https://syzkaller.appspot.com/x/.config?x=4fceea9e2d99ac20
> dashboard link: https://syzkaller.appspot.com/bug?extid=ca95b2b7aef9e7cbd6ab
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.

Not understanding.  That seems to be saying that we got a NULL pointer
deref in __generic_file_write_iter() at

written = generic_perform_write(file, from, iocb->ki_pos);

which isn't possible.

I'm not seeing recent changes in there which could have caused this.  Help.


> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+ca95b2b7aef9e7cbd...@syzkaller.appspotmail.com
> 
> BUG: unable to handle kernel NULL pointer dereference at 
> #PF error: [INSTR]
> PGD a7ea0067 P4D a7ea0067 PUD 81535067 PMD 0
> Oops: 0010 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 15924 Comm: syz-executor0 Not tainted 5.0.0-rc4+ #50
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> RIP: 0010:  (null)
> Code: Bad RIP value.
> RSP: 0018:88804c3d7858 EFLAGS: 00010246
> RAX:  RBX: 885aeb60 RCX: 005b
> RDX:  RSI: 88807ec22930 RDI: 8880a59bdcc0
> RBP: 88804c3d79b8 R08:  R09: 88804c3d7910
> R10: 8880835ca200 R11:  R12: 005b
> R13: 88804c3d7c98 R14: dc00 R15: 
> FS:  7f3456db4700() GS:8880ae60() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: ffd6 CR3: 814ac000 CR4: 001406f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>   __generic_file_write_iter+0x25e/0x630 mm/filemap.c:
>   ext4_file_write_iter+0x37a/0x1410 fs/ext4/file.c:266
>   call_write_iter include/linux/fs.h:1862 [inline]
>   new_sync_write fs/read_write.c:474 [inline]
>   __vfs_write+0x764/0xb40 fs/read_write.c:487
>   vfs_write+0x20c/0x580 fs/read_write.c:549
>   ksys_write+0x105/0x260 fs/read_write.c:598
>   __do_sys_write fs/read_write.c:610 [inline]
>   __se_sys_write fs/read_write.c:607 [inline]
>   __x64_sys_write+0x73/0xb0 fs/read_write.c:607
>   do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x458089
> Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
> ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7f3456db3c78 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 0003 RCX: 00458089
> RDX: 005b RSI: 2240 RDI: 0003
> RBP: 0073bf00 R08:  R09: 
> R10:  R11: 0246 R12: 7f3456db46d4
> R13: 004c7450 R14: 004dce68 R15: 
> Modules linked in:
> CR2: 
> ---[ end trace 5cac9d2c75a59916 ]---
> kobject: 'loop5' (4426a409): kobject_uevent_env
> RIP: 0010:  (null)
> Code: Bad RIP value.
> RSP: 0018:88804c3d7858 EFLAGS: 00010246
> RAX:  RBX: 885aeb60 RCX: 005b
> kobject: 'loop5' (4426a409): fill_kobj_path: path  
> = '/devices/virtual/block/loop5'
> RDX:  RSI: 88807ec22930 RDI: 8880a59bdcc0
> kobject: 'loop2' (b82e0c58): kobject_uevent_env
> kobject: 'loop2' (b82e0c58): fill_kobj_path: path  
> = '/devices/virtual/block/loop2'
> RBP: 88804c3d79b8 R08:  R09: 88804c3d7910
> R10: 8880835ca200 R11:  R12: 005b
> R13: 88804c3d7c98 R14: dc00 R15: 
> kobject: 'loop5' (4426a409): kobject_uevent_env
> FS:  7f3456db4700() GS:8880ae60() knlGS:
> kobject: 'loop5' (4426a409): fill_kobj_path: path  
> = '/devices/virtual/block/loop5'
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 022029a0 CR3: 814ac000 CR4: 001406f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> 
> 
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this bug report. See:
> 

Re: general protection fault in relay_switch_subbuf

2019-02-21 Thread Andrew Morton
(cc Jens ;))

On Thu, 21 Feb 2019 06:54:03 -0800 syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:c04e2a780caf Merge tag 'fsnotify_for_v5.0-rc4' of git://gi..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=133424c0c0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=505743eba4e4f68
> dashboard link: https://syzkaller.appspot.com/bug?extid=8e78f280ccd6930f
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+8e78f280ccd69...@syzkaller.appspotmail.com
> 
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 11600 Comm: syz-executor2 Not tainted 5.0.0-rc3+ #42
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> RIP: 0010:relay_switch_subbuf+0x27a/0xad0 kernel/relay.c:753
> Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 e2 07 00 00 48 b8 00 00 00 00  
> 00 fc ff df 48 8b 4b 58 48 8d 79 50 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f  
> 85 c9 07 00 00 48 ba 00 00 00 00 00 fc ff df 49 8b
> RSP: 0018:88805e3ef790 EFLAGS: 00010206
> RAX: dc00 RBX: 88804bc7db40 RCX: 
> RDX: 000a RSI: 8182d552 RDI: 0050
> RBP: 88805e3ef850 R08: 8880637f0600 R09: fbfff133b0b1
> R10: 88805e3ef850 R11: 899d8587 R12: 
> R13: 8880a7e05e00 R14:  R15: 
> FS:  7f1fdbddd700() GS:8880ae70() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 00a50489 CR3: 943a CR4: 001406e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>   relay_flush kernel/relay.c:881 [inline]
>   relay_flush+0x1c4/0x280 kernel/relay.c:865
>   __blk_trace_startstop.isra.0+0x28b/0x8b0 kernel/trace/blktrace.c:668
>   blk_trace_ioctl+0x1c6/0x300 kernel/trace/blktrace.c:727
>   blkdev_ioctl+0x141/0x2120 block/ioctl.c:591
>   block_ioctl+0xee/0x130 fs/block_dev.c:1914
>   vfs_ioctl fs/ioctl.c:46 [inline]
>   file_ioctl fs/ioctl.c:509 [inline]
>   do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
>   ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
>   __do_sys_ioctl fs/ioctl.c:720 [inline]
>   __se_sys_ioctl fs/ioctl.c:718 [inline]
>   __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
>   do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x458099
> Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
> ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7f1fdbddcc78 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 0003 RCX: 00458099
> RDX:  RSI: 1275 RDI: 0005
> RBP: 0073bf00 R08:  R09: 
> R10:  R11: 0246 R12: 7f1fdbddd6d4
> R13: 004bf595 R14: 004d0e88 R15: 
> Modules linked in:
> kobject: 'loop4' (5903bd4d): kobject_uevent_env
> kobject: 'loop4' (5903bd4d): fill_kobj_path: path  
> = '/devices/virtual/block/loop4'
> ---[ end trace d4380d594e5099d1 ]---
> RIP: 0010:relay_switch_subbuf+0x27a/0xad0 kernel/relay.c:753
> Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 e2 07 00 00 48 b8 00 00 00 00  
> 00 fc ff df 48 8b 4b 58 48 8d 79 50 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f  
> 85 c9 07 00 00 48 ba 00 00 00 00 00 fc ff df 49 8b
> kobject: 'loop0' (83b22ec7): kobject_uevent_env
> kobject: 'loop0' (83b22ec7): fill_kobj_path: path  
> = '/devices/virtual/block/loop0'
> Unknown ioctl 19304
> kobject: 'loop3' (50016e38): kobject_uevent_env
> RSP: 0018:88805e3ef790 EFLAGS: 00010206
> kobject: 'loop3' (50016e38): fill_kobj_path: path  
> = '/devices/virtual/block/loop3'
> RAX: dc00 RBX: 88804bc7db40 RCX: 
> kobject: 'loop1' (2962ffee): kobject_uevent_env
> kobject: 'loop1' (2962ffee): fill_kobj_path: path  
> = '/devices/virtual/block/loop1'
> RDX: 000a RSI: 8182d552 RDI: 0050
> kobject: 'loop4' (5903bd4d): kobject_uevent_env
> kobject: 'loop4' (5903bd4d): fill_kobj_path: path  
> = '/devices/virtual/block/loop4'
> RBP: 88805e3ef850 R08: 8880637f0600 R09: fbfff133b0b1
> kobject: 'loop4' (5903bd4d): kobject_uevent_env
> R10: 88805e3ef850 R11: 899d8587 R12: 
> kobject: 'loop4' (5903bd4d): fill_kobj_path: path  
> = '/devices/virtual/block/loop4'
> R13: 

Re: [PATCH 2/2] test_firmware: silence underflow warning in test_dev_config_update_u8()

2019-02-21 Thread Andrew Morton
On Thu, 21 Feb 2019 21:38:26 +0300 Dan Carpenter wrote:

> We put an upper bound on "new" but we don't check for negatives.

U8_MAX has unsigned type, so `if (new > U8_MAX)' does check for negative.

> In
> this case the underflow doesn't matter very much, but we may as well
> make the static checker happy.
> 
> ...
>
> --- a/lib/test_firmware.c
> +++ b/lib/test_firmware.c
> @@ -326,15 +326,12 @@ static ssize_t test_dev_config_show_int(char *buf, int cfg)
>  static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
>  {
>   int ret;
> - long new;
> + u8 new;
>  
> - ret = kstrtol(buf, 10, &new);
> + ret = kstrtou8(buf, 10, &new);
>   if (ret)
>   return ret;
>  
> - if (new > U8_MAX)
> - return -EINVAL;
> -
>   mutex_lock(&test_fw_mutex);
>   *(u8 *)cfg = new;
>   mutex_unlock(&test_fw_mutex);

if *buf=="257",

previous behavior: -EINVAL
new behavior: *cfg = 1

yes?

The old behavior seems better.


Re: [PATCH] huegtlbfs: fix races and page leaks during migration

2019-02-20 Thread Andrew Morton
On Tue, 12 Feb 2019 14:14:00 -0800 Mike Kravetz  wrote:

> hugetlb pages should only be migrated if they are 'active'.  The routines
> set/clear_page_huge_active() modify the active state of hugetlb pages.
> When a new hugetlb page is allocated at fault time, set_page_huge_active
> is called before the page is locked.  Therefore, another thread could
> race and migrate the page while it is being added to page table by the
> fault code.  This race is somewhat hard to trigger, but can be seen by
> strategically adding udelay to simulate worst case scheduling behavior.
> Depending on 'how' the code races, various BUG()s could be triggered.
> 
> To address this issue, simply delay the set_page_huge_active call until
> after the page is successfully added to the page table.
> 
> Hugetlb pages can also be leaked at migration time if the pages are
> associated with a file in an explicitly mounted hugetlbfs filesystem.
> For example, a test program which hole punches, faults and migrates
> pages in such a file (1G in size) will eventually fail because it
> can not allocate a page.  Reported counts and usage at time of failure:
> 
> node0
> 537 free_hugepages
> 1024nr_hugepages
> 0   surplus_hugepages
> node1
> 1000free_hugepages
> 1024nr_hugepages
> 0   surplus_hugepages
> 
> Filesystem Size  Used Avail Use% Mounted on
> nodev  4.0G  4.0G 0 100% /var/opt/hugepool
> 
> Note that the filesystem shows 4G of pages used, while actual usage is
> 511 pages (just under 1G).  Failed trying to allocate page 512.
> 
> If a hugetlb page is associated with an explicitly mounted filesystem,
> this information is contained in the page_private field.  At migration
> time, this information is not preserved.  To fix, simply transfer
> page_private from old to new page at migration time if necessary.
> 
> Cc: 
> Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
> Signed-off-by: Mike Kravetz 

cc:stable.  It would be nice to get some review of this one, please?

> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -859,6 +859,18 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
>   rc = migrate_huge_page_move_mapping(mapping, newpage, page);
>   if (rc != MIGRATEPAGE_SUCCESS)
>   return rc;
> +
> + /*
> +  * page_private is subpool pointer in hugetlb pages.  Transfer to
> +  * new page.  PagePrivate is not associated with page_private for
> +  * hugetlb pages and can not be set here as only page_huge_active
> +  * pages can be migrated.
> +  */
> + if (page_private(page)) {
> + set_page_private(newpage, page_private(page));
> + set_page_private(page, 0);
> + }
> +
>   if (mode != MIGRATE_SYNC_NO_COPY)
>   migrate_page_copy(newpage, page);
>   else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a80832487981..f859e319e3eb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3625,7 +3625,6 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>   copy_user_huge_page(new_page, old_page, address, vma,
>   pages_per_huge_page(h));
>   __SetPageUptodate(new_page);
> - set_page_huge_active(new_page);
>  
>   mmun_start = haddr;
>   mmun_end = mmun_start + huge_page_size(h);
> @@ -3647,6 +3646,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>   make_huge_pte(vma, new_page, 1));
>   page_remove_rmap(old_page, true);
>   hugepage_add_new_anon_rmap(new_page, vma, haddr);
> + set_page_huge_active(new_page);
>   /* Make the old page be freed below */
>   new_page = old_page;
>   }
> @@ -3792,7 +3792,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   }
>   clear_huge_page(page, address, pages_per_huge_page(h));
>   __SetPageUptodate(page);
> - set_page_huge_active(page);
>  
>   if (vma->vm_flags & VM_MAYSHARE) {
>   int err = huge_add_to_page_cache(page, mapping, idx);
> @@ -3863,6 +3862,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>   }
>  
>   spin_unlock(ptl);
> +
> + /* May already be set if not newly allocated page */
> + set_page_huge_active(page);
> +
>   unlock_page(page);
>  out:
>   return ret;
> @@ -4097,7 +4100,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>* the set_pte_at() write.
>*/
>   __SetPageUptodate(page);
> - set_page_huge_active(page);
>  
>   mapping = dst_vma->vm_file->f_mapping;
>   idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> @@ -4165,6 +4167,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>   update_mmu_cache(dst_vma, dst_addr, dst_pte);
>  
>   spin_unlock(ptl);
> + 

Re: [PATCH -next] mm: fix set but not used warning

2019-02-20 Thread Andrew Morton
On Tue, 19 Feb 2019 18:28:30 + "Kani, Toshi"  wrote:

> On Mon, 2019-02-18 at 13:57 +, YueHaibing wrote:
> > Fixes gcc '-Wunused-but-set-variable' warning:
> > 
> > lib/ioremap.c: In function 'ioremap_page_range':
> > lib/ioremap.c:203:16: warning:
> >  variable 'start' set but not used [-Wunused-but-set-variable]
> > 
> > flush_cache_vmap may not need its parameters, so add __maybe_unused to fix this warning.
> 
> I think flush_cache_vmap should have a proper prototype with
> __maybe_unused for its args.  But, there are so many arch-specific
> definitions at a glance, and I am not sure how manageable such a change is,
> though...

This is one of the reasons why things like flush_cache_vmap() should
be implemented as static inline C functions, not as macros.  
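
For example, an architecture whose flush_cache_vmap() is a no-op could
define it along these lines (a sketch of the suggestion, not any
particular arch's actual header); the arguments are then type-checked
and count as used, so warnings like the one above never appear:

	static inline void flush_cache_vmap(unsigned long start,
					    unsigned long end)
	{
		/* no-op on this architecture, but the args are still "used" */
	}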


Re: next/master boot bisection: next-20190215 on beaglebone-black

2019-02-15 Thread Andrew Morton
On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" wrote:

> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> * This automated bisection report was sent to you on the basis  *
> * that you may be involved with the breaking commit it has  *
> * found.  No manual investigation has been done to verify it,   *
> * and the root cause of the problem may be somewhere else.  *
> * Hope this helps!  *
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> 
> next/master boot bisection: next-20190215 on beaglebone-black
> 
> Summary:
>   Start:  7a92eb7cc1dc Add linux-next specific files for 20190215
>   Details:https://kernelci.org/boot/id/5c666ea959b514b017fe6017
>   Plain log:  
> https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
>   HTML log:   
> https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
>   Result: 8dd037cc97d9 mm/shuffle: default enable all shuffling
> 
> Checks:
>   revert: PASS
>   verify: PASS
> 
> Parameters:
>   Tree:   next
>   URL:
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>   Branch: master
>   Target: beaglebone-black
>   CPU arch:   arm
>   Lab:lab-collabora
>   Compiler:   gcc-7
>   Config: multi_v7_defconfig+CONFIG_SMP=n
>   Test suite: boot
> 
> Breaking commit found:
> 
> ---
> commit 8dd037cc97d9226c97c2ee1abb4e97eff71e0c8d
> Author: Dan Williams 
> Date:   Fri Feb 15 11:28:30 2019 +1100
> 
> mm/shuffle: default enable all shuffling

Thanks.

But what actually went wrong?  Kernel doesn't boot?




Re: next/master boot bisection: next-20190215 on beaglebone-black

2019-02-15 Thread Andrew Morton
On Fri, 15 Feb 2019 18:51:51 + Mark Brown  wrote:

> On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote:
> > On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" 
> >  wrote:
> 
> > >   Details:https://kernelci.org/boot/id/5c666ea959b514b017fe6017
> > >   Plain log:  
> > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
> > >   HTML log:   
> > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
> 
> > Thanks.
> 
> > But what actually went wrong?  Kernel doesn't boot?
> 
> The linked logs show the kernel dying early in boot before the console
> comes up so yeah.  There should be kernel output at the bottom of the
> logs.

OK, thanks.

Well, we have a result.  Stephen, can we please drop
mm-shuffle-default-enable-all-shuffling.patch for now?


Re: tmpfs inode leakage when opening file with O_TMP_FILE

2019-02-14 Thread Andrew Morton
(cc linux-fsdevel)

On Mon, 11 Feb 2019 15:18:11 +0100 Matej Kupljen wrote:

> Hi,
> 
> it seems that when opening file on file system that is mounted on
> tmpfs with the O_TMPFILE flag and using linkat call after that, it
> uses 2 inodes instead of 1.
> 
> This is simple test case:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <limits.h>
> #include <errno.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> 
> #define TEST_STRING "Testing\n"
> 
> #define TMP_PATH"/tmp/ping/"
> #define TMP_FILE"file.txt"
> 
> 
> int main(int argc, char* argv[])
> {
> char path[PATH_MAX];
> int fd;
> int rc;
> 
> fd = open(TMP_PATH, __O_TMPFILE | O_RDWR,
> S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
> S_IROTH | S_IWOTH);
> 
> rc = write(fd, TEST_STRING, strlen(TEST_STRING));
> 
> snprintf(path, PATH_MAX,  "/proc/self/fd/%d", fd);
> linkat(AT_FDCWD, path, AT_FDCWD, TMP_PATH TMP_FILE, AT_SYMLINK_FOLLOW);
> close(fd);
> 
> return 0;
> }
> 
> I have checked inodes with the "df -i" tool. The first inode is used when
> the call to open is executed and the second one when the call to
> linkat is executed.
> It is not decreased when close is executed.
> 
> I have also tested this on an ext4 mounted fs and there only one inode is 
> used.
> 
> I tested this on:
> $ cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=18.04
> DISTRIB_CODENAME=bionic
> DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
> 
> $ uname -a
> Linux Orion 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC
> 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> If you need any more information, please let me know.
> 
> And please CC me when replying, I am not subscribed to the list.
> 
> Thanks and BR,
> Matej


Re: [PATCH] psi: avoid divide-by-zero crash inside virtual machines

2019-02-14 Thread Andrew Morton
On Thu, 14 Feb 2019 15:53:43 -0500 Johannes Weiner  wrote:

> 
>   if (now < expires)
> 
> vs.
> 
>   if (time_before64(now, expires))
> 
> These macros always have me double check the argument order.

Yeah, me too.


Re: [PATCH v2 1/4] mm: Move recent_rotated pages calculation to shrink_inactive_list()

2019-02-14 Thread Andrew Morton
On Thu, 14 Feb 2019 13:35:21 +0300 Kirill Tkhai  wrote:

> Currently, struct reclaim_stat::nr_activate is a local variable,
> used only in shrink_page_list(). This patch introduces another
> local variable pgactivate to use instead of it, and reuses
> nr_activate to account number of active pages.
> 
> Note, that we need nr_activate to be an array, since type of page
> may change during shrink_page_list() (see ClearPageSwapBacked()).

The patch has nothing to do with the Subject:

reclaim_stat::nr_activate is not a local variable - it is a struct member.

I can kinda see what the patch is doing but the changelog needs more
care, please.



Re: [PATCH] mm/cma: cma_declare_contiguous: correct err handling

2019-02-14 Thread Andrew Morton
On Thu, 14 Feb 2019 12:45:51 + Peng Fan  wrote:

> In case cma_init_reserved_mem failed, need to free the memblock allocated
> by memblock_reserve or memblock_alloc_range.
> 
> ...
>
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -353,12 +353,14 @@ int __init cma_declare_contiguous(phys_addr_t base,
>  
>   ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma);
>   if (ret)
> - goto err;
> + goto free_mem;
>  
>   pr_info("Reserved %ld MiB at %pa\n", (unsigned long)size / SZ_1M,
>   &base);
>   return 0;
>  
> +free_mem:
> + memblock_free(base, size);
>  err:
>   pr_err("Failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
>   return ret;

This doesn't look right to me.  In the `fixed==true' case we didn't
actually allocate anything and in the `fixed==false' case, the
allocated memory is at `addr', not at `base'.



Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations

2019-02-14 Thread Andrew Morton
On Thu, 14 Feb 2019 15:33:18 +0100 Michal Hocko  wrote:

> > Because swapoff() is a very rare code path, to make the normal path run as
> > fast as possible, disabling preemption + stop_machine() instead of
> > reference count is used to implement get/put_swap_device().  From
> > get_swap_device() to put_swap_device(), the preemption is disabled, so
> > stop_machine() in swapoff() will wait until put_swap_device() is called.
> > 
> > In addition to swap_map, cluster_info, etc.  data structure in the struct
> > swap_info_struct, the swap cache radix tree will be freed after swapoff,
> > so this patch fixes the race between swap cache looking up and swapoff
> > too.
> > 
> > Races between some other swap cache usages protected via disabling
> > preemption and swapoff are fixed too via calling stop_machine() between
> > clearing PageSwapCache() and freeing swap cache data structure.
> > 
> > An alternative implementation could replace disabling preemption with
> > rcu_read_lock_sched() and stop_machine() with synchronize_sched().
> 
> using stop_machine is generally discouraged. It is a gross
> synchronization.

This was discussed to death and I think the changelog explains the
conclusions adequately.  swapoff is super-rare so a stop_machine() in
that path is appropriate if its use permits more efficiency in the
regular swap code paths.  

> Besides that, since when do we have this problem?

What problem??


Re: Userspace regression in LTS and stable kernels

2019-02-14 Thread Andrew Morton
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds wrote:

> On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger
>  wrote:
> >
> > Your shebang line exceeds BINPRM_BUF_SIZE.
> > Before the said commit the kernel silently truncated the shebang line
> > (and corrupted it),
> > now it tells the user that the line is too long.
> 
> It doesn't matter if it "corrupted" things by truncating it. All that
> matters is "it used to work, now it doesn't"
> 
> Yes, maybe it never *should* have worked. And yes, it's sad that
> people apparently had cases that depended on this odd behavior, but
> there we are.
> 
> I see that Kees has a patch to fix it up.
> 

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
string") wasn't marked for backporting.  And, presumably as a
consequence, Kees's fix "exec: load_script: allow interpreter argument
truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

I don't know if Oleg considered backporting that patch.  I certainly
did (I always do), and I decided against doing so.  Yet there it is.



Re: [PATCH] psi: avoid divide-by-zero crash inside virtual machines

2019-02-14 Thread Andrew Morton
On Thu, 14 Feb 2019 14:31:57 -0500 Johannes Weiner  wrote:

> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -322,7 +322,7 @@ static bool update_stats(struct psi_group *group)
>   expires = group->next_update;
>   if (now < expires)
>   goto out;
> - if (now - expires > psi_period)
> + if (now - expires >= psi_period)
>   missed_periods = div_u64(now - expires, psi_period);
>  
>   /*

It seems appropriate to use time_after64() and friends in this code.
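
For reference, the 64-bit time helpers compare via a signed difference,
which stays correct across wraparound where a plain `<' does not.
Roughly, from include/linux/jiffies.h (the real macros also typecheck
their arguments):

	#define time_after64(a, b)	((__s64)((b) - (a)) < 0)
	#define time_before64(a, b)	time_after64(b, a)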



Re: linux-next: build failure after merge of the akpm-current tree

2019-02-13 Thread Andrew Morton
On Wed, 13 Feb 2019 17:25:18 +1100 Stephen Rothwell wrote:

> Hi Andrew,
> 
> After merging the akpm-current tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
> 
> fs/io_uring.c: In function 'io_async_list_note':
> fs/io_uring.c:931:16: error: 'VM_MAX_READAHEAD' undeclared (first use in this 
> function); did you mean 'VM_MAYREAD'?
> max_pages = VM_MAX_READAHEAD >> (PAGE_SHIFT - 10);
> ^~~~
> VM_MAYREAD
> 
> Caused by commit
> 
>   4c416502dae2 ("mm: Refactor readahead defines in mm.h")
> 
> interacting with commit
> 
>   6eff5fa16a2e ("io_uring: allow workqueue item to handle multiple buffered 
> requests")
> 
> from the block tree.
> 

Thanks.  I'll fix mm-refactor-readahead-defines-in-mmh.patch and shall
stage it after the block tree so everything lands nicely.



Re: [PATCH v3] of: fix kmemleak crash caused by imbalance in early memory reservation

2019-02-13 Thread Andrew Morton
On Wed, 13 Feb 2019 23:13:29 +0200 Mike Rapoport  wrote:

> > > As a bonus, since memblock_find_in_range() ensures the allocation in the
> > > specified range, the bounds check can be removed.
> > 
> > hm, why is this against -mm rather than against mainline?
> > 
> > Do the OF maintainers intend to merge the fix?
> 
> There's a conflict between this fix and recent memblock-related changes in -mm.
> Rob said he anyway wasn't planning to send this for 5.0 [1] and
> suggested to merge it via your tree.
> 
> [1] 
> https://lore.kernel.org/lkml/cal_jsqk-cc6ovz9mkp+exogjcrha0xxgsgqgkl3w9bff3rk...@mail.gmail.com/

OK, thanks, no probs.


Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs

2019-02-13 Thread Andrew Morton
On Wed, 13 Feb 2019 22:11:58 +0100 Jann Horn  wrote:

> > This is probably more a davem patch than a -mm one.
> 
> Ah, sorry. I assumed that I just should go by which directory the
> patched code is in.
> 
> You did just add it to the -mm tree though, right? So I shouldn't
> resend it to davem?

Yes, please send to Dave.  I'll autodrop the -mm copy if/when it turns
up in -next.



Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs

2019-02-13 Thread Andrew Morton
On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn  wrote:

> The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> number of references that we might need to create in the fastpath later,
> the bump-allocation fastpath only has to modify the non-atomic bias value
> that tracks the number of extra references we hold instead of the atomic
> refcount. The maximum number of allocations we can serve (under the
> assumption that no allocation is made with size 0) is nc->size, so that's
> the bias used.
> 
> However, even when all memory in the allocation has been given away, a
> reference to the page is still held; and in the `offset < 0` slowpath, the
> page may be reused if everyone else has dropped their references.
> This means that the necessary number of references is actually
> `nc->size+1`.
> 
> Luckily, from a quick grep, it looks like the only path that can call
> page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> requires CAP_NET_ADMIN in the init namespace and is only intended to be
> used for kernel testing and fuzzing.

For the net-naive, what is TAP?  It doesn't appear to mean
drivers/net/tap.c.

> To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> with a vector consisting of 15 elements containing 1 byte each.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>   /* Even if we own the page, we do not use atomic_set().
>* This would break get_page_unless_zero() users.
>*/
> - page_ref_add(page, size - 1);
> + page_ref_add(page, size);
>  
>   /* reset page count bias and offset to start of new frag */
>   nc->pfmemalloc = page_is_pfmemalloc(page);
> - nc->pagecnt_bias = size;
> + nc->pagecnt_bias = size + 1;
>   nc->offset = size;
>   }
>  
> @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>   size = nc->size;
>  #endif
>   /* OK, page count is 0, we can safely set it */
> - set_page_count(page, size);
> + set_page_count(page, size + 1);
>  
>   /* reset page count bias and offset to start of new frag */
> - nc->pagecnt_bias = size;
> + nc->pagecnt_bias = size + 1;
>   offset = size - fragsz;
>   }

This is probably more a davem patch than a -mm one.
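
To make the off-by-one concrete with illustrative numbers: with size =
4096 and 1-byte allocations, the fastpath hands out 4096 fragments (each
consuming one bias credit) while the allocator keeps one reference of
its own for the `offset < 0' reuse path -- 4097 = size + 1 references in
total.  Pre-charging only `size' leaves that last reference uncounted,
which is exactly the bias the patch corrects.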


Re: [PATCH v2 0/5] kasan: more tag based mode fixes

2019-02-13 Thread Andrew Morton
On Wed, 13 Feb 2019 14:58:25 +0100 Andrey Konovalov wrote:

> Changes in v2:
> - Add comments about kmemleak vs KASAN hooks order.

I assume this refers to Vincenzo's review of "kasan, kmemleak: pass
tagged pointers to kmemleak".  But v2 of that patch is unchanged.

> - Fix compilation error when CONFIG_SLUB_DEBUG is not defined.


Re: BUG: Bad page state (5)

2019-02-13 Thread Andrew Morton
On Wed, 13 Feb 2019 09:56:04 -0800 syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:c4f3ef3eb53f Add linux-next specific files for 20190213
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1130a124c0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=9ec67976eb2df882
> dashboard link: https://syzkaller.appspot.com/bug?extid=2cd2887ea471ed6e6995
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=14ecdaa8c0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=12ebe178c0
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+2cd2887ea471ed6e6...@syzkaller.appspotmail.com

It looks like a memfd page was freed with a non-NULL ->mapping.

Joel touched the memfd code with "mm/memfd: add an F_SEAL_FUTURE_WRITE
seal to memfd" but it would be surprising if syzbot tickled that code?


> BUG: Bad page state in process udevd  pfn:472f0
> name:"memfd:"
> page:ea00011cbc00 count:0 mapcount:0 mapping:88800df2ad40 index:0xf
> shmem_aops
> flags: 0x1fffc08000c(uptodate|dirty|swapbacked)
> raw: 01fffc08000c eaac4f08 8880a85af890 88800df2ad40
> raw: 000f   
> page dumped because: non-NULL mapping
> Modules linked in:
> CPU: 1 PID: 7586 Comm: udevd Not tainted 5.0.0-rc6-next-20190213 #34
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>   bad_page.cold+0xda/0xff mm/page_alloc.c:586
>   free_pages_check_bad+0x142/0x1a0 mm/page_alloc.c:1014
>   free_pages_check mm/page_alloc.c:1023 [inline]
>   free_pages_prepare mm/page_alloc.c:1113 [inline]
>   free_pcp_prepare mm/page_alloc.c:1138 [inline]
>   free_unref_page_prepare mm/page_alloc.c:2991 [inline]
>   free_unref_page_list+0x31d/0xc40 mm/page_alloc.c:3060
> name:"memfd:"
>   release_pages+0x60d/0x1940 mm/swap.c:791
>   pagevec_lru_move_fn+0x218/0x2a0 mm/swap.c:213
>   __pagevec_lru_add mm/swap.c:917 [inline]
>   lru_add_drain_cpu+0x2f7/0x520 mm/swap.c:581
>   lru_add_drain+0x20/0x60 mm/swap.c:652
>   exit_mmap+0x290/0x530 mm/mmap.c:3134
>   __mmput kernel/fork.c:1047 [inline]
>   mmput+0x15f/0x4c0 kernel/fork.c:1068
>   exec_mmap fs/exec.c:1046 [inline]
>   flush_old_exec+0x8d9/0x1c20 fs/exec.c:1279
>   load_elf_binary+0x9bc/0x53f0 fs/binfmt_elf.c:864
>   search_binary_handler fs/exec.c:1656 [inline]
>   search_binary_handler+0x17f/0x570 fs/exec.c:1634
>   exec_binprm fs/exec.c:1698 [inline]
>   __do_execve_file.isra.0+0x1394/0x23f0 fs/exec.c:1818
>   do_execveat_common fs/exec.c:1865 [inline]
>   do_execve fs/exec.c:1882 [inline]
>   __do_sys_execve fs/exec.c:1958 [inline]
>   __se_sys_execve fs/exec.c:1953 [inline]
>   __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953
>   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7fc7001ba207
> Code: Bad RIP value.
> RSP: 002b:7ffe06aa13b8 EFLAGS: 0206 ORIG_RAX: 003b
> RAX: ffda RBX:  RCX: 7fc7001ba207
> RDX: 01fd5fd0 RSI: 7ffe06aa14b0 RDI: 7ffe06aa24c0
> RBP: 00625500 R08: 1c49 R09: 1c49
> R10:  R11: 0206 R12: 01fd5fd0
> R13: 0007 R14: 01fc6250 R15: 0005
> BUG: Bad page state in process udevd  pfn:2b13c
> page:eaac4f00 count:0 mapcount:0 mapping:88800df2ad40 index:0xe
> shmem_aops
> flags: 0x1fffc08000c(uptodate|dirty|swapbacked)
> raw: 01fffc08000c 8880a85af890 8880a85af890 88800df2ad40
> raw: 000e   
> page dumped because: non-NULL mapping
> Modules linked in:
> CPU: 1 PID: 7586 Comm: udevd Tainted: GB  
> 5.0.0-rc6-next-20190213 #34
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>   bad_page.cold+0xda/0xff mm/page_alloc.c:586
> name:"memfd:"
>   free_pages_check_bad+0x142/0x1a0 mm/page_alloc.c:1014
>   free_pages_check mm/page_alloc.c:1023 [inline]
>   free_pages_prepare mm/page_alloc.c:1113 [inline]
>   free_pcp_prepare mm/page_alloc.c:1138 [inline]
>   free_unref_page_prepare mm/page_alloc.c:2991 [inline]
>   free_unref_page_list+0x31d/0xc40 mm/page_alloc.c:3060
>   release_pages+0x60d/0x1940 mm/swap.c:791
>   pagevec_lru_move_fn+0x218/0x2a0 mm/swap.c:213
>   __pagevec_lru_add mm/swap.c:917 [inline]
>   lru_add_drain_cpu+0x2f7/0x520 mm/swap.c:581
>   lru_add_drain+0x20/0x60 mm/swap.c:652
>   exit_mmap+0x290/0x530 mm/mmap.c:3134
>   __mmput kernel/fork.c:1047 [inline]
>   
