Re: [PATCH] zram: pass down the bvec we need to read into in the work struct
On Wed, 10 Apr 2019 15:43:50 -0400 Jerome Glisse wrote:

> Adding more Cc and stable (i thought this was 5.1 addition). Note that
> without this patch on arch/kernel where PAGE_SIZE != 4096 userspace
> could read random memory through a zram block device (though userspace
> probably would have no control on the address being read).

Looks good to me.  Minchan & Sergey, can you please review?

From: Jérôme Glisse
Subject: zram: pass down the bvec we need to read into in the work struct

When scheduling a work item to read a page we need to pass down the
proper bvec struct which points to the page to read into.  Before this
patch it uses a randomly initialized bvec (only if PAGE_SIZE != 4096),
which is wrong.

Note that without this patch, on arch/kernel where PAGE_SIZE != 4096,
userspace could read random memory through a zram block device (though
userspace probably would have no control on the address being read).

Link: http://lkml.kernel.org/r/20190408183219.26377-1-jgli...@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Andrew Morton
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Sergey Senozhatsky
Cc:
Signed-off-by: Andrew Morton
---

 drivers/block/zram/zram_drv.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/drivers/block/zram/zram_drv.c~zram-pass-down-the-bvec-we-need-to-read-into-in-the-work-struct
+++ a/drivers/block/zram/zram_drv.c
@@ -774,18 +774,18 @@ struct zram_work {
 	struct zram *zram;
 	unsigned long entry;
 	struct bio *bio;
+	struct bio_vec bvec;
 };
 
 #if PAGE_SIZE != 4096
 static void zram_sync_read(struct work_struct *work)
 {
-	struct bio_vec bvec;
 	struct zram_work *zw = container_of(work, struct zram_work, work);
 	struct zram *zram = zw->zram;
 	unsigned long entry = zw->entry;
 	struct bio *bio = zw->bio;
 
-	read_from_bdev_async(zram, &bvec, entry, bio);
+	read_from_bdev_async(zram, &zw->bvec, entry, bio);
 }
 
 /*
@@ -798,6 +798,7 @@ static int read_from_bdev_sync(struct zr
 {
 	struct zram_work work;
 
+	work.bvec = *bvec;
 	work.zram = zram;
 	work.entry = entry;
 	work.bio = bio;
_
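The pattern the patch above relies on is worth spelling out: everything the deferred worker needs, including the bvec, is embedded in the work item, and the callback recovers it with container_of() instead of reading an uninitialized local.  A minimal userspace sketch of that idea (the struct members and names here are simplified stand-ins, not the kernel types):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types involved. */
struct bio_vec { void *page; unsigned int len; };
struct work_struct { int pending; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * As in the patch: the bvec lives in the work item itself, so the
 * callback can recover it via container_of() rather than inventing a
 * fresh, uninitialized one on its own stack.
 */
struct zram_work {
	struct work_struct work;
	struct bio_vec bvec;		/* copied in by the submitter */
};

/* The worker callback only receives the embedded work_struct... */
static struct bio_vec *sync_read(struct work_struct *work)
{
	struct zram_work *zw = container_of(work, struct zram_work, work);

	/* ...but can reach the submitter's bvec through the container. */
	return &zw->bvec;
}
```

The submitter fills in `zw->bvec` before queueing, exactly as the `work.bvec = *bvec;` hunk does in the real fix.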
Re: [PATCH 2/2] kernel: use sysctl shared variables for range check
On Wed, 10 Apr 2019 15:59:55 -0700 Kees Cook wrote:

> On Wed, Apr 10, 2019 at 3:54 PM Matteo Croce wrote:
> >
> > On Thu, Apr 11, 2019 at 12:34 AM Kees Cook wrote:
> > >
> > > On Wed, Apr 10, 2019 at 3:30 PM Matteo Croce wrote:
> > > >
> > > > FYI, these are the stats from my local repo, just to give you the
> > > > size of a series with all the changes in it:
> > > >
> > > > $ git --no-pager log --stat --oneline linus/master
> > > >  2 files changed, 4 insertions(+), 9 deletions(-)
> > > >  2 files changed, 7 insertions(+), 14 deletions(-)
> > > >  6 files changed, 15 insertions(+), 30 deletions(-)
> > > >  1 file changed, 16 insertions(+), 19 deletions(-)
> > > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > > >  3 files changed, 15 insertions(+), 20 deletions(-)
> > > >  12 files changed, 93 insertions(+), 116 deletions(-)
> > > >  3 files changed, 98 insertions(+), 104 deletions(-)
> > > >  2 files changed, 9 insertions(+)
> > >
> > > Tiny! :)  Seriously, though, I think this should be fine to take
> > > directly to Linus after patch 1 lands, or see if akpm wants to carry
> > > it directly.
> > >
> > > --
> > > Kees Cook
> >
> > Nice.  Still as an 8-patch series, or squashed into a bigger one?
>
> I suspect a single patch would be fine, but let's wait to hear from akpm.

Bring it!
Re: [PATCH] init: Initialize jump labels before command line option parsing
On Tue, 16 Apr 2019 13:54:04 -0700 Dan Williams wrote:

> When a module option, or core kernel argument, toggles a static-key it
> requires jump labels to be initialized early.  While x86, PowerPC, and
> ARM64 arrange for jump_label_init() to be called before parse_args(),
> ARM does not.
>
>   Kernel command line: rdinit=/sbin/init page_alloc.shuffle=1 panic=-1
>     console=ttyAMA0,115200 page_alloc.shuffle=1
>   [ cut here ]
>   WARNING: CPU: 0 PID: 0 at ./include/linux/jump_label.h:303
>     page_alloc_shuffle+0x12c/0x1ac
>   static_key_enable(): static key 'page_alloc_shuffle_key+0x0/0x4' used
>     before call to jump_label_init()
>   Modules linked in:
>   CPU: 0 PID: 0 Comm: swapper Not tainted
>     5.1.0-rc4-next-20190410-3-g3367c36ce744 #1
>   Hardware name: ARM Integrator/CP (Device Tree)
>   [] (unwind_backtrace) from [] (show_stack+0x10/0x18)
>   [] (show_stack) from [] (dump_stack+0x18/0x24)
>   [] (dump_stack) from [] (__warn+0xe0/0x108)
>   [] (__warn) from [] (warn_slowpath_fmt+0x44/0x6c)
>   [] (warn_slowpath_fmt) from [] (page_alloc_shuffle+0x12c/0x1ac)
>   [] (page_alloc_shuffle) from [] (shuffle_store+0x28/0x48)
>   [] (shuffle_store) from [] (parse_args+0x1f4/0x350)
>   [] (parse_args) from [] (start_kernel+0x1c0/0x488)
>
> Move the fallback call to jump_label_init() to occur before
> parse_args().  The redundant calls to jump_label_init() in other archs
> are left intact in case they have static key toggling use cases that are
> even earlier than option parsing.

Has it been confirmed that this fixes
mm-shuffle-initial-free-memory-to-improve-memory-side-cache-utilization.patch
on beaglebone-black?
Re: [PATCH] binfmt_elf: Move brk out of mmap when doing direct loader exec
On Tue, 16 Apr 2019 18:14:00 -0500 Kees Cook wrote:

> On Tue, Apr 16, 2019 at 6:04 PM Andrew Morton wrote:
> >
> > > Reported-by: Ali Saidi
> > > Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=omfhx_p0nckp...@mail.gmail.com
> > > Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
> > > Signed-off-by: Kees Cook
> >
> > No cc:stable?
>
> Probably it should be, yes.  I think I'm just shy about that when
> poking ELF mappings. :)

Well, the -stable bots will backport anything that might look slightly
like a fix anyway.  I'll add cc:stable and shall hold it out until
5.2-rc1, so it should get a bit of a spin before being backported.
Re: [PATCH 0/4] z3fold: support page migration
On Thu, 11 Apr 2019 17:32:12 +0200 Vitaly Wool wrote:

> This patchset implements page migration support and slightly better
> buddy search.  To implement page migration support, z3fold has to move
> away from the current scheme of handle encoding, i.e. stop encoding
> page address in handles.  Instead, a small per-page structure is created
> which will contain actual addresses for z3fold objects, while pointers
> to fields of that structure will be used as handles.

Can you please help find a reviewer for this work?

For some reason I'm seeing a massive number of rejects when trying to
apply these.  It looks like your mail client performed some sort of
selective space-stuffing.  I suggest you email a patch to yourself and
check that the result applies properly.
Re: [PATCH] binfmt_elf: Move brk out of mmap when doing direct loader exec
On Mon, 15 Apr 2019 21:23:20 -0700 Kees Cook wrote:

> Commit eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
> made changes in the rare case when the ELF loader is directly invoked
> (e.g. to set a non-inheritable LD_LIBRARY_PATH, or to test new versions
> of the loader), by moving into the mmap region to avoid both ET_EXEC
> and PIE binaries.  This had the effect of also moving the brk region
> into mmap, which could lead to the stack and brk being arbitrarily
> close to each other.  An unlucky process wouldn't get its requested
> stack size and stack allocations could end up scribbling on the heap.
>
> This is illustrated here.  In the case of using the loader directly,
> brk (so helpfully identified as "[heap]") is allocated with the
> _loader_, not the binary.  For example, with ASLR entirely disabled,
> you can see this more clearly:
>
> $ /bin/cat /proc/self/maps
> 4000-c000 r-xp ... /bin/cat
> 5575b000-5575c000 r--p 7000 ... /bin/cat
> 5575c000-5575d000 rw-p 8000 ... /bin/cat
> 5575d000-5577e000 rw-p ... [heap]
> ...
> 77ff7000-77ffa000 r--p ... [vvar]
> 77ffa000-77ffc000 r-xp ... [vdso]
> 77ffc000-77ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77ffd000-77ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77ffe000-77fff000 rw-p ...
> 7ffde000-7000 rw-p ... [stack]
>
> $ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps
> ...
> 77bcc000-77bd4000 r-xp ... /bin/cat
> 77bd4000-77dd3000 ---p 8000 ... /bin/cat
> 77dd3000-77dd4000 r--p 7000 ... /bin/cat
> 77dd4000-77dd5000 rw-p 8000 ... /bin/cat
> 77dd5000-77dfc000 r-xp ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77fb2000-77fd6000 rw-p ...
> 77ff7000-77ffa000 r--p ... [vvar]
> 77ffa000-77ffc000 r-xp ... [vdso]
> 77ffc000-77ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77ffd000-77ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
> 77ffe000-7802 rw-p ... [heap]
> 7ffde000-7000 rw-p ... [stack]
>
> The solution is to move brk out of mmap and into ELF_ET_DYN_BASE, since
> nothing is there in this direct loader case (and ET_EXEC is still far
> away at 0x40).  Anything that ran before should still work (i.e. the
> ultimately-launched binary already had the brk very far from its text,
> so this should be no different from a COMPAT_BRK standpoint).  The only
> risk I see here is that someone could have started to depend on the
> entire memory space above the mmap region being available when launching
> binaries via a direct loader exec, which seems highly unlikely, I'd
> hope: this would mean a binary would _not_ work when execed normally.
>
> Reported-by: Ali Saidi
> Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=omfhx_p0nckp...@mail.gmail.com
> Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
> Signed-off-by: Kees Cook

No cc:stable?
Re: [PATCH v15 1/3] /proc/pid/status: Add support for architecture specific output
On Tue, 16 Apr 2019 14:32:48 +0800 Aubrey Li wrote:

> The architecture specific information of the running processes could
> be useful to the userland.  Add support to examine process architecture
> specific information externally.

The implementation looks just fine to me.  Have you had any feedback on
the overall desirability of adding this feature?

> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -96,6 +96,11 @@
>  #include
>  #include "internal.h"
>  
> +/* Add support for architecture specific output in /proc/pid/status */
> +#ifndef arch_proc_pid_status
> +#define arch_proc_pid_status(m, task)
> +#endif

To this I suggest adding

	/* arch_proc_pid_status() must be defined in asm/processor.h */

Because we've regularly had different architectures defining such things
in different headers, resulting in a mess.
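The `#ifndef`/`#define` idiom in the hunk above is the kernel's usual way of giving architectures an optional hook: an arch header defines the macro, and generic code falls back to a no-op when it doesn't.  A toy userspace sketch of the same pattern (all names here are invented for illustration, not the real `/proc` code):

```c
#include <assert.h>
#include <string.h>

/*
 * An "arch header" that wants to contribute extra status output defines
 * the hook before the generic code tests for it.  Comment this out and
 * the generic fallback below turns the hook into a no-op.
 */
#define arch_status(buf) strcat(buf, "CpuFeature:\tyes\n")	/* "arch" override */

/* Generic code: default to doing nothing if no arch provided the hook. */
#ifndef arch_status
#define arch_status(buf) do { } while (0)
#endif

/* Generic status printer; the arch hook slots in at the end. */
static void show_status(char *buf)
{
	strcpy(buf, "Name:\tcat\n");
	arch_status(buf);	/* expands to the arch version, or to nothing */
}
```

Because the fallback is a macro taking the same arguments, generic callers need no `#ifdef`s at the call site, which is the point of akpm's "must be defined in asm/processor.h" convention: every arch's override is found in one predictable header.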
Re: linux-next: build warning after merge of the akpm-current tree
On Tue, 16 Apr 2019 16:52:56 +1000 Stephen Rothwell wrote:

> Hi all,
>
> On Fri, 29 Mar 2019 13:39:14 +1100 Stephen Rothwell wrote:
> >
> > After merging the akpm-current tree, today's linux-next build (arm
> > multi_v7_defconfig) produced this warning:
> >
> > lib/list_sort.c:17:36: warning: 'pure' attribute ignored [-Wattributes]
> >     struct list_head const *, struct list_head const *);
> >                                    ^
> >
> > Introduced by commit
> >
> >   820c81be5237 ("lib/list_sort: simplify and remove MAX_LIST_LENGTH_BITS")
>
> I am still getting that warning :-(

Me too and I can't figure it out.  Shrug, I guess we take a pass on it
until someone has time/inclination to revisit it.

--- a/lib/list_sort.c~lib-list_sort-simplify-and-remove-max_list_length_bits-fix
+++ a/lib/list_sort.c
@@ -7,13 +7,7 @@
 #include
 #include
 
-/*
- * By declaring the compare function with the __pure attribute, we give
- * the compiler more opportunity to optimize.  Ideally, we'd use this in
- * the prototype of list_sort(), but that would involve a lot of churn
- * at all call sites, so just cast the function pointer passed in.
- */
-typedef int __pure __attribute__((nonnull(2,3))) (*cmp_func)(void *,
+typedef int __attribute__((nonnull(2,3))) (*cmp_func)(void *,
 			struct list_head const *, struct list_head const *);
 
 /*
_
Re: [PATCH v2] module: add stubs for within_module functions
On Tue, 16 Apr 2019 11:56:21 -0700 Tri Vo wrote:

> On Tue, Apr 16, 2019 at 10:55 AM Tri Vo wrote:
> >
> > On Tue, Apr 16, 2019 at 8:21 AM Jessica Yu wrote:
> > >
> > > +++ Tri Vo [15/04/19 11:18 -0700]:
> > > >Provide stubs for within_module_core(), within_module_init(), and
> > > >within_module() to prevent build errors when !CONFIG_MODULES.
> > > >
> > > >v2:
> > > >- Generalized commit message, as per Jessica.
> > > >- Stubs for within_module_core() and within_module_init(), as per Nick.
> > > >
> > > >Suggested-by: Matthew Wilcox
> > > >Reported-by: Randy Dunlap
> > > >Reported-by: kbuild test robot
> > > >Link: https://marc.info/?l=linux-mm=155384681109231=2
> > > >Signed-off-by: Tri Vo
> > >
> > > Applied, thanks!
> >
> > Thank you!
>
> Andrew,
> this patch fixes 8c3d220cb6b5 ("gcov: clang support").  Could you
> re-apply the gcov patch?  Sorry if it's a dumb question.  I'm not
> familiar with how cross-tree patches are handled in Linux.

hm, I wonder what tree Jessica applied this patch to?

Please resend a new version of "gcov: clang support".
Re: [PATCH v1 1/4] mm/memory_hotplug: Release memory resource after arch_remove_memory()
On Tue, 9 Apr 2019 12:01:45 +0200 David Hildenbrand wrote:

> __add_pages() doesn't add the memory resource, so __remove_pages()
> shouldn't remove it.  Let's factor it out.  Especially as it is a
> special case for memory used as system memory, added via add_memory()
> and friends.
>
> We now remove the resource after removing the sections instead of doing
> it the other way around.  I don't think this change is problematic.
>
> add_memory()
>	register memory resource
>	arch_add_memory()
>
> remove_memory
>	arch_remove_memory()
>	release memory resource
>
> While at it, explain why we ignore errors and that it only happens if
> we remove memory in a different granularity as we added it.

Seems sane.

> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1820,6 +1806,25 @@ void try_offline_node(int nid)
>  }
>  EXPORT_SYMBOL(try_offline_node);
>  
> +static void __release_memory_resource(u64 start, u64 size)
> +{
> +	int ret;
> +
> +	/*
> +	 * When removing memory in the same granularity as it was added,
> +	 * this function never fails. It might only fail if resources
> +	 * have to be adjusted or split. We'll ignore the error, as
> +	 * removing of memory cannot fail.
> +	 */
> +	ret = release_mem_region_adjustable(&iomem_resource, start, size);
> +	if (ret) {
> +		resource_size_t endres = start + size - 1;
> +
> +		pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> +			&start, &endres, ret);
> +	}
> +}

The types seem confused here.  Should `start' and `size' be
resource_size_t?  Or maybe phys_addr_t.  release_mem_region_adjustable()
takes resource_size_t's.

Is %pa the way to print a resource_size_t?  I guess it happens to work
because resource_size_t happens to map onto phys_addr_t, which isn't
ideal.

Wanna have a headscratch over that?
> /**
>  * remove_memory
>  * @nid: the node ID
> @@ -1854,6 +1859,7 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>  	memblock_remove(start, size);
>  
>  	arch_remove_memory(nid, start, size, NULL);
> +	__release_memory_resource(start, size);
>  
>  	try_offline_node(nid);
Re: [PATCH v5 2/7] slob: Respect list_head abstraction layer
On Wed, 10 Apr 2019 06:06:49 +1000 "Tobin C. Harding" wrote:

> On Tue, Apr 09, 2019 at 02:59:52PM +0200, Vlastimil Babka wrote:
> > On 4/3/19 11:13 PM, Tobin C. Harding wrote:
> >
> > > According to 0day test robot this is triggering an error from
> > > CHECK_DATA_CORRUPTION when the kernel is built with CONFIG_DEBUG_LIST.
> >
> > FWIW, that report [1] was for commit 15c8410c67adef from next-20190401.
> > I've checked and it's still the v4 version, although the report came
> > after you submitted v5 (it wasn't testing the patches from the mailing
> > list, but mmotm).  I don't see any report for the v5 version so I'd
> > expect it to be indeed fixed by the new approach that adds a boolean
> > return parameter to slob_page_alloc().
> >
> > Vlastimil
>
> Oh man thanks!  That is super cool, thanks for letting me know
> Vlastimil.

Yes, thanks for the followup.
Re: [PATCH] mm/hmm: fix hmm_range_dma_map()/hmm_range_dma_unmap()
On Tue, 9 Apr 2019 13:53:40 -0400 jgli...@redhat.com wrote:

> Was using the wrong field and wrong enum for read-only versus read and
> write mapping.

For those who were wondering, this fixes
mm-hmm-add-an-helper-function-that-fault-pages-and-map-them-to-a-device-v3.patch,
which is presently queued in -mm.
Re: [PATCH] cpumask: fix double string traverse in cpumask_parse
On Tue, 9 Apr 2019 13:42:08 -0700 Yury Norov wrote:

> From: Yury Norov
>
> cpumask_parse() finds the first occurrence of either \n or \0 by calling
> strchr() and strlen().  We can do it better with a single call of
> strchrnul().

Fair enough.

> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -633,8 +633,7 @@ static inline int cpumask_parselist_user(const char __user *buf, int len,
>   */
>  static inline int cpumask_parse(const char *buf, struct cpumask *dstp)
>  {
> -	char *nl = strchr(buf, '\n');
> -	unsigned int len = nl ? (unsigned int)(nl - buf) : strlen(buf);
> +	unsigned int len = (unsigned int)(strchrnul(buf, '\n') - buf);

Does the cast do anything useful?  The compiler will convert a
ptrdiff_t to uint without issues, I think?

>  	return bitmap_parse(buf, len, cpumask_bits(dstp), nr_cpumask_bits);
>  }
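The trick is that strchrnul() returns a pointer to the terminating NUL when the character is absent, so the "found a newline" and "hit end of string" cases collapse into one subtraction.  strchrnul() is a GNU/kernel extension, so this sketch carries its own portable stand-in (`my_strchrnul` is an illustrative name, not a real API) and compares the old two-call computation with the new one:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Portable stand-in for strchrnul(): like strchr(), but returns a
 * pointer to the trailing '\0' instead of NULL when c is not found.
 */
static const char *my_strchrnul(const char *s, int c)
{
	while (*s && *s != (char)c)
		s++;
	return s;		/* points at c, or at the trailing '\0' */
}

/* Length of the first line, computed the old way: strchr() + strlen(). */
static size_t len_two_calls(const char *buf)
{
	const char *nl = strchr(buf, '\n');

	return nl ? (size_t)(nl - buf) : strlen(buf);
}

/* ...and the new way: a single strchrnul()-style traversal. */
static size_t len_one_call(const char *buf)
{
	return (size_t)(my_strchrnul(buf, '\n') - buf);
}
```

Both give the same answer whether or not the buffer ends in a newline, which is exactly why the patch can drop the ternary.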
Re: [PATCH v2 2/2] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
On Mon, 8 Apr 2019 10:26:33 +0200 Oscar Salvador wrote:

> arch_add_memory, __add_pages take a want_memblock which controls whether
> the newly added memory should get the sysfs memblock user API (e.g.
> ZONE_DEVICE users do not want/need this interface).  Some callers even
> want to control where do we allocate the memmap from by configuring
> altmap.
>
> Add a more generic hotplug context for arch_add_memory and __add_pages.
> struct mhp_restrictions contains flags which contains additional
> features to be enabled by the memory hotplug (MHP_MEMBLOCK_API
> currently) and altmap for alternative memmap allocator.
>
> This patch shouldn't introduce any functional change.

From: Andrew Morton
Subject: mm-memory_hotplug-provide-a-more-generic-restrictions-for-memory-hotplug-fix

x86_64 allnoconfig:

In file included from ./include/linux/mmzone.h:744:0,
                 from ./include/linux/gfp.h:6,
                 from ./include/linux/umh.h:4,
                 from ./include/linux/kmod.h:22,
                 from ./include/linux/module.h:13,
                 from init/do_mounts.c:1:
./include/linux/memory_hotplug.h:353:11: warning: ‘struct mhp_restrictions’ declared inside parameter list will not be visible outside of this definition or declaration
    struct mhp_restrictions *restrictions);
           ^~~~

Fix this by moving the arch_add_memory() definition inside
CONFIG_MEMORY_HOTPLUG and moving the mhp_restrictions definition to a
more appropriate place.

Cc: Dan Williams
Cc: David Hildenbrand
Cc: Michal Hocko
Cc: Oscar Salvador
Signed-off-by: Andrew Morton
---

 include/linux/memory_hotplug.h |   24 
 1 file changed, 12 insertions(+), 12 deletions(-)

--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-provide-a-more-generic-restrictions-for-memory-hotplug-fix
+++ a/include/linux/memory_hotplug.h
@@ -54,6 +54,16 @@ enum {
 };
 
 /*
+ * Restrictions for the memory hotplug:
+ * flags:  MHP_ flags
+ * altmap: alternative allocator for memmap array
+ */
+struct mhp_restrictions {
+	unsigned long flags;
+	struct vmem_altmap *altmap;
+};
+
+/*
  * Zone resizing functions
  *
  * Note: any attempt to resize a zone should has pgdat_resize_lock()
@@ -101,6 +111,8 @@ extern void __online_page_free(struct pa
 
 extern int try_online_node(int nid);
 
+extern int arch_add_memory(int nid, u64 start, u64 size,
+		struct mhp_restrictions *restrictions);
 extern u64 max_mem_size;
 
 extern bool memhp_auto_online;
@@ -126,16 +138,6 @@ extern int __remove_pages(struct zone *z
 
 #define MHP_MEMBLOCK_API               (1<<0)
 
-/*
- * Restrictions for the memory hotplug:
- * flags:  MHP_ flags
- * altmap: alternative allocator for memmap array
- */
-struct mhp_restrictions {
-	unsigned long flags;
-	struct vmem_altmap *altmap;
-};
-
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 		       struct mhp_restrictions *restrictions);
@@ -349,8 +351,6 @@ extern int walk_memory_range(unsigned lo
 extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource);
-extern int arch_add_memory(int nid, u64 start, u64 size,
-		struct mhp_restrictions *restrictions);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
_
Re: [PATCH v1 bitops] bitops: Fix UBSAN undefined behavior warning for rotation right
(resend, cc Andrey)

On Sun, 7 Apr 2019 12:53:25 +0000 Vadim Pasternak wrote:

> The warning is caused by a call to rorXX(), if the second parameter of
> this function, "shift", is zero.  In such case UBSAN reports the warning
> for the next expression: (word << (XX - shift)), where XX is
> 64, 32, 16, 8 for respectively ror64, ror32, ror16, ror8.
> Fix adds validation of this parameter - in case it's equal zero, no
> need to rotate, just the original "word" is to be returned to caller.
>
> The UBSAN undefined behavior warning has been reported for a call to
> ror32():
> [   11.426543] UBSAN: Undefined behaviour in ./include/linux/bitops.h:93:33
> [   11.434045] shift exponent 32 is too large for 32-bit type 'unsigned int'

hm, do we care?

> ...
>
> --- a/include/linux/bitops.h
> +++ b/include/linux/bitops.h
> @@ -70,6 +70,9 @@ static inline __u64 rol64(__u64 word, unsigned int shift)
>   */
>  static inline __u64 ror64(__u64 word, unsigned int shift)
>  {
> +	if (!shift)
> +		return word;
> +
>  	return (word >> shift) | (word << (64 - shift));
>  }

Is there any known architecture or compiler for which UL<<64 doesn't
reliably produce zero?  Is there any prospect that this will become a
problem in the future?
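For context, the undefined behaviour is that C leaves a shift by the full type width (here `word << 64` when shift is 0) undefined, even though it "usually" yields zero.  The patch sidesteps it with a branch; a common branch-free alternative (not what the patch does) masks the shift count so no out-of-range shift is ever evaluated.  A sketch of both, for comparison:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Branch-free variant: for shift == 0, (-shift & 63) is also 0, and
 * (x >> 0) | (x << 0) == x, so neither shift can reach 64.  Compilers
 * typically recognize this idiom and emit a single rotate instruction.
 */
static inline uint64_t ror64_masked(uint64_t word, unsigned int shift)
{
	return (word >> (shift & 63)) | (word << (-shift & 63));
}

/* The patched kernel version, for comparison: branch out the UB case. */
static inline uint64_t ror64_branch(uint64_t word, unsigned int shift)
{
	if (!shift)
		return word;

	return (word >> shift) | (word << (64 - shift));
}
```

Both versions agree for all shift values in 0..63; which one the kernel should prefer is exactly the style question akpm raises above.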
Re: [PATCH 1/2] fsl_hypervisor: dereferencing error pointers in ioctl
On Tue, 18 Dec 2018 11:20:03 +0300 Dan Carpenter wrote:

> The strndup_user() function returns error pointers on error, and then
> in the error handling we pass the error pointers to kfree().  It will
> cause an Oops.

Looks good to me.

I guess we should fix this too?

From: Andrew Morton
Subject: mm/util.c: fix strndup_user() comment

The kerneldoc misdescribes strndup_user()'s return value.

Cc: Dan Carpenter
Cc: Timur Tabi
Cc: Mihai Caraman
Cc: Kumar Gala
Signed-off-by: Andrew Morton
---

 mm/util.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/util.c~mm-utilc-fix-strndup_user-comment
+++ a/mm/util.c
@@ -204,7 +204,7 @@ EXPORT_SYMBOL(vmemdup_user);
  * @s: The string to duplicate
  * @n: Maximum number of bytes to copy, including the trailing NUL.
  *
- * Return: newly allocated copy of @s or %NULL in case of error
+ * Return: newly allocated copy of @s or an ERR_PTR() in case of error
 */
 char *strndup_user(const char __user *s, long n)
 {
_
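The bug class here is worth a sketch: the kernel's ERR_PTR convention encodes a small negative errno in the top (never-mappable) page of the address space, so callers must check IS_ERR() before passing the pointer to kfree(); freeing an error pointer is the Oops Dan describes.  Below is a simplified userspace model of that convention (`dup_string` is a hypothetical strndup_user()-style helper, not a real API):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Simplified model of the kernel's ERR_PTR machinery: errnos up to 4095
 * travel disguised as pointers into the top page of the address space,
 * which no valid allocation can occupy.
 */
#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)     { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical strndup_user()-style helper, for demonstration only. */
static char *dup_string(const char *s, long n)
{
	if (!s)
		return ERR_PTR(-EFAULT);	/* error travels as a pointer */
	return strndup(s, n);			/* real allocation on success */
}
```

The caller-side rule the original bug violated: test IS_ERR() first, and only free() (kfree() in the kernel) pointers that pass that test.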
Re: [PATCH v2] writeback: use exact memcg dirty counts
On Mon, 1 Apr 2019 14:20:44 -0400 Johannes Weiner wrote:

> On Fri, Mar 29, 2019 at 10:46:09AM -0700, Greg Thelen wrote:
> > @@ -3907,10 +3923,10 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
> >  	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
> >  	struct mem_cgroup *parent;
> >  
> > -	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
> > +	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
> >  
> >  	/* this should eventually include NR_UNSTABLE_NFS */
> > -	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
> > +	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
> >  	*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
> >  					      (1 << LRU_ACTIVE_FILE));
>
> Andrew,
>
> just a head-up: -mm has that LRU stat cleanup series queued ("mm:
> memcontrol: clean up the LRU counts tracking") that changes the
> mem_cgroup_nr_lru_pages() call here to two memcg_page_state().
>
> I'm assuming Greg's fix here will get merged before the cleanup.  When
> it gets picked up, it'll conflict with "mm: memcontrol: push down
> mem_cgroup_nr_lru_pages()".
>
> "mm: memcontrol: push down mem_cgroup_nr_lru_pages()" will need to be
> changed to use memcg_exact_page_state() calls instead of the plain
> memcg_page_state() for *pfilepages.

Thanks.  Like this?
void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
			 unsigned long *pheadroom, unsigned long *pdirty,
			 unsigned long *pwriteback)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
	struct mem_cgroup *parent;

	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);

	/* this should eventually include NR_UNSTABLE_NFS */
	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
	*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
		      memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
	*pheadroom = PAGE_COUNTER_MAX;

	while ((parent = parent_mem_cgroup(memcg))) {
		unsigned long ceiling = min(memcg->memory.max, memcg->high);
		unsigned long used = page_counter_read(&memcg->memory);

		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
		memcg = parent;
	}
}
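A small detail in the headroom loop deserves a note: `ceiling - min(ceiling, used)` is an underflow-safe subtraction.  When a cgroup is over its ceiling, a plain `ceiling - used` would wrap an unsigned long to a huge value; the inner min() clamps the result to zero instead.  A tiny sketch with illustrative values (the real inputs come from page counters):

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Underflow-safe headroom, as used in the loop above: if used exceeds
 * ceiling, the inner MIN() pins the subtrahend to ceiling, so the
 * unsigned subtraction bottoms out at 0 rather than wrapping around.
 */
static unsigned long headroom(unsigned long ceiling, unsigned long used)
{
	return ceiling - MIN(ceiling, used);
}
```

The outer `min(*pheadroom, ...)` in the real function then takes the tightest headroom across all ancestors.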
Re: [PATCH] mm/hugetlb: Get rid of NODEMASK_ALLOC
On Tue, 2 Apr 2019 15:34:15 +0200 Oscar Salvador wrote:

> NODEMASK_ALLOC is used to allocate a nodemask bitmap, and it does it by
> first determining whether it should be allocated on the stack or
> dynamically, depending on NODES_SHIFT.  Right now, it goes the dynamic
> path whenever the nodemask_t is above 32 bytes.
>
> Although we could bump it to a reasonable value, the largest a nodemask_t
> can get is 128 bytes, so since __nr_hugepages_store_common is called from
> a rather short stack we can just get rid of the NODEMASK_ALLOC call here.
>
> This reduces some code churn and complexity.

It took a bit of sleuthing to figure out that this patch applies to
Mike's "hugetlbfs: fix potential over/underflow setting node specific
nr_hugepages".  Should they be folded together?  I'm thinking not.

(Also, should "hugetlbfs: fix potential over/underflow setting node
specific nr_hugepages" have been -stableified?  I also think not, but I
bet it happens anyway).
Re: [RESEND PATCH 0/3] improve vmap allocation
On Tue, 2 Apr 2019 18:25:28 +0200 "Uladzislau Rezki (Sony)" wrote:

> Changes in v3
> -------------
> - simplify the __get_va_next_sibling() and __find_va_links() functions;
> - remove "unlikely".  Place the WARN_ON_ONCE directly in the "if" condition;
> - replace inline with __always_inline;
> - move the debug code to separate patches;

Does v3 address the report from the kernel test robot, Subject
"[mm/vmalloc.c] 7ae76449bd: kernel_BUG_at_lib/list_debug.c"?

For some reason I cannot find that email in my linux-kernel folder, and
nor does google find it.
http://lkml.kernel.org/r/20190401132655.GD9217@shao2-debian doesn't
work.  Message-ID 20190401132655.GD9217@shao2-debian doesn't seem to
have got through the list server.  Ditto linux-mm.
Re: [PATCH 1/1] slob: Only use list functions when safe to do so
On Wed, 3 Apr 2019 06:05:38 +1100 "Tobin C. Harding" wrote:

> > It's regrettable that this fixes
> > slob-respect-list_head-abstraction-layer.patch but doesn't apply to
> > that patch - slob-use-slab_list-instead-of-lru.patch gets in the way.
> > So we end up with a patch series which introduces a bug and later
> > fixes it.
>
> Yes I thought that also.  Do you rebase the mm tree?  Did you apply this
> right after slob-use-slab_list-instead-of-lru or to the current tip?

After slob-use-slab_list-instead-of-lru.patch

> If it is applied to the tip does this effect the ability to later
> bisect in between these two commits (if the need arises for some
> unrelated reason)?

There is a bisection hole but it is short and the bug is hardish to hit.

> > I guess we can live with that but if the need comes to respin this
> > series, please do simply fix
> > slob-respect-list_head-abstraction-layer.patch so we get a clean
> > series.
>
> If its not too much work for you to apply the new series I'll do another
> version just to get this right.

I guess that would be best, thanks.
Re: [PATCH] lib: Fix possible incorrect result from rational fractions helper
On Sat, 30 Mar 2019 13:58:55 -0700 Trent Piepho wrote:

> In some cases the previous algorithm would not return the closest
> approximation.  This would happen when a semi-convergent was the
> closest, as the previous algorithm would only consider convergents.
>
> As an example, consider an initial value of 5/4, and trying to find the
> closest approximation with a maximum of 4 for numerator and denominator.
> The previous algorithm would return 1/1 as the closest approximation,
> while this version will return the correct answer of 4/3.
>
> To do this, the main loop performs effectively the same operations as it
> did before.  It must now keep track of the last three approximations,
> n2/d2 .. n0/d0, while before it only needed the last two.
>
> If an exact answer is not found, the algorithm will now calculate the
> best semi-convergent term, t, which is a single expression with two
> divisions:
>
>   min((max_numerator - n0) / n1, (max_denominator - d0) / d1)
>
> This will be used if it is better than the previous convergent.  The
> test for this is generally a simple comparison, 2*t > a.  But in an
> edge case, where the convergent's final term is even and the best
> allowable semi-convergent has a final term of exactly half the
> convergent's final term, the more complex comparison (d0*dp > d1*d) is
> used.
>
> I also wrote some comments explaining the code.  While one still needs
> to look up the math elsewhere, they should help a lot to follow how the
> code relates to that math.

What are the userspace-visible runtime effects of this change?
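The changelog's algorithm can be sketched end to end.  The following is a self-contained model of what it describes (close in shape to lib/math/rational.c after this patch, but simplified for illustration; the names are not the kernel's, and it adds a `d1 == 0` guard for the degenerate first-iteration case that the prose does not discuss):

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Best rational approximation of num/den with numerator <= max_n and
 * denominator <= max_d, considering semi-convergents as the changelog
 * describes.  n1/d1 and n0/d0 are the previous and second-previous
 * convergents of the continued-fraction expansion.
 */
static void best_rational(unsigned long num, unsigned long den,
			  unsigned long max_n, unsigned long max_d,
			  unsigned long *best_n, unsigned long *best_d)
{
	unsigned long n = num, d = den, n0 = 0, d0 = 1, n1 = 1, d1 = 0;

	while (d != 0) {
		unsigned long dp = d;		/* d from this iteration */
		unsigned long a = n / d;	/* next continued-fraction term */
		unsigned long n2, d2;

		d = n % d;			/* Euclidean step */
		n = dp;
		n2 = n0 + a * n1;		/* next convergent n2/d2 */
		d2 = d0 + a * d1;

		if (n2 > max_n || d2 > max_d) {
			/* Largest allowed final term of a semi-convergent. */
			unsigned long t = (max_n - n0) / n1;

			if (d1)			/* guard: d1 == 0 on iteration 1 */
				t = MIN(t, (max_d - d0) / d1);

			/*
			 * Take the semi-convergent only if it is closer than
			 * the previous convergent: t must exceed a/2, or tie
			 * with the remainder-based tie-break from the text.
			 */
			if (2 * t > a || (2 * t == a && d0 * dp > d1 * d)) {
				n1 = n0 + t * n1;
				d1 = d0 + t * d1;
			}
			break;
		}
		n0 = n1; n1 = n2;
		d0 = d1; d1 = d2;
	}
	*best_n = n1;
	*best_d = d1;
}
```

Running the changelog's own example, 5/4 bounded by 4/4, exercises the semi-convergent path: the convergent 5/4 overflows the bounds, t = min((4-1)/1, (4-0)/1) = 3, and since 2*3 > 4 the answer becomes (1+3*1)/(0+3*1) = 4/3 rather than the old 1/1.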
Re: [PATCH 1/1] slob: Only use list functions when safe to do so
On Tue, 2 Apr 2019 14:29:57 +1100 "Tobin C. Harding" wrote:

> Currently we call (indirectly) list_del() then we manually try to combat
> the fact that the list may be in an undefined state by getting 'prev'
> and 'next' pointers in a somewhat contrived manner.  It is hard to
> verify that this works for all initial states of the list.  Clearly the
> author (me) got it wrong the first time because the 0day kernel testing
> robot managed to crash the kernel thanks to this code.
>
> All this is done in order to do an optimisation aimed at preventing
> fragmentation at the start of a slab.  We can just skip this
> optimisation any time the list is put into an undefined state since this
> only occurs when an allocation completely fills the slab, and in this
> case the optimisation is unnecessary since we have not fragmented the
> slab by this allocation.
>
> Change the page pointer passed to slob_alloc_page() to be a double
> pointer so that we can set it to NULL to indicate that the page was
> removed from the list.  Skip the optimisation if the page was removed.
>
> Found thanks to the kernel test robot, email subject:
>
>   340d3d6178 ("mm/slob.c: respect list_head abstraction layer"): kernel
>   BUG at lib/list_debug.c:31!

It's regrettable that this fixes
slob-respect-list_head-abstraction-layer.patch but doesn't apply to
that patch - slob-use-slab_list-instead-of-lru.patch gets in the way.
So we end up with a patch series which introduces a bug and later
fixes it.

I guess we can live with that but if the need comes to respin this
series, please do simply fix
slob-respect-list_head-abstraction-layer.patch so we get a clean
series.
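The double-pointer idea in the changelog can be shown in miniature: rather than letting the caller poke at a list node that may already have been unlinked (and poisoned), the callee clears the caller's pointer when it removes the node, so "page was unlinked" and "page is still listed" become unambiguous.  A toy model with invented stand-in types, not the real slob code:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in: a "slab page" that may be unlinked from its list. */
struct page { struct page *prev, *next; int becomes_full; };

/*
 * Sketch of the fix's double-pointer protocol: on the path where the
 * allocation fills the page, the allocator unlinks it and NULLs the
 * caller's pointer.  The caller then skips any list-rotation
 * optimisation instead of touching poisoned prev/next pointers.
 */
static void alloc_from_page(struct page **pp)
{
	if ((*pp)->becomes_full) {
		/* list_del() analogue: unlink and poison the node */
		(*pp)->prev = (*pp)->next = NULL;
		*pp = NULL;		/* tell the caller the page is gone */
	}
	/* otherwise the page stays on the list and *pp stays valid */
}
```

The caller-side contract is then simply `if (page) { /* optimisation */ }`, which is what "skip the optimisation if the page was removed" amounts to.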
Re: [RFC PATCH v2 0/1] improve vmap allocation
On Mon, 1 Apr 2019 13:03:47 +0200 Uladzislau Rezki wrote: > Hello, Andrew. > > > > > It's a lot of new code. It looks decent and I'll toss it in there for > > further testing. Hopefully someone will be able to find the time for a > > detailed review. > > > I have got some proposals and comments about simplifying the code a bit. > So I am about to upload the v3 for further review. I see that your commit > message includes whole details and explanation: > > http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=b6ac5ca4c512094f217a8140edeea17e6621b2ad > > Should I base on it and keep it in v3? > It's probably best to do a clean resend at this stage. I'll take a look at the deltas and update changelogs, etc.
Re: [RESEND PATCH v2 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS
On Tue, 19 Mar 2019 08:16:00 + George Spelvin wrote: > Rather than a fixed-size array of pending sorted runs, use the ->prev > links to keep track of things. This reduces stack usage, eliminates > some ugly overflow handling, and reduces the code size. > > Also: > * merge() no longer needs to handle NULL inputs, so simplify. > * The same applies to merge_and_restore_back_links(), which is renamed > to the less ponderous merge_final(). (It's a static helper function, > so we don't need a super-descriptive name; comments will do.) > * Document the actual return value requirements on the (*cmp)() > function; some callers are already using this feature. > > x86-64 code size 1086 -> 739 bytes (-347) > > (Yes, I see checkpatch complaining about no space after comma in > "__attribute__((nonnull(2,3,4,5)))". Checkpatch is wrong.) x86_64 allnoconfig with gcc-7.3.0: lib/list_sort.c:17:36: warning: __pure__ attribute ignored [-Wattributes] struct list_head const *, struct list_head const *);
Re: [PATCH 2/6] bitmap_parselist: move non-parser logic to helpers
On Tue, 26 Mar 2019 00:07:44 +0300 Yury Norov wrote:

> --- a/lib/bitmap.c
> +++ b/lib/bitmap.c
> @@ -477,6 +477,42 @@ int bitmap_print_to_pagebuf(bool list, char *buf, const unsigned long *maskp,
>  }
>  EXPORT_SYMBOL(bitmap_print_to_pagebuf);
>
> +/*
> + * Region 9-38:4/10 describes the following bitmap structure:
> + * 0       9  12    18              38
> + * .........****......****......****....
> + *         ^  ^     ^               ^
> + *     start  off  grlen           end
> + */
> +struct region {
> +	unsigned int start;
> +	unsigned int off;
> +	unsigned int grlen;
> +	unsigned int end;
> +};

grlen means... group_len, I think?  Perhaps it should be called group_len ;)
Re: [PATCH REBASED] mm, memcg: Make scan aggression always exclude protection
On Fri, 22 Mar 2019 16:03:07 + Chris Down wrote: > This patch is an incremental improvement on the existing > memory.{low,min} relative reclaim work to base its scan pressure > calculations on how much protection is available compared to the current > usage, rather than how much the current usage is over some protection > threshold. > > Previously the way that memory.low protection works is that if you are > 50% over a certain baseline, you get 50% of your normal scan pressure. > This is certainly better than the previous cliff-edge behaviour, but it > can be improved even further by always considering memory under the > currently enforced protection threshold to be out of bounds. This means > that we can set relatively low memory.low thresholds for variable or > bursty workloads while still getting a reasonable level of protection, > whereas with the previous version we may still trivially hit the 100% > clamp. The previous 100% clamp is also somewhat arbitrary, whereas this > one is more concretely based on the currently enforced protection > threshold, which is likely easier to reason about. > > There is also a subtle issue with the way that proportional reclaim > worked previously -- it promotes having no memory.low, since it makes > pressure higher during low reclaim. This happens because we base our > scan pressure modulation on how far memory.current is between memory.min > and memory.low, but if memory.low is unset, we only use the overage > method. In most cromulent configurations, this then means that we end up > with *more* pressure than with no memory.low at all when we're in low > reclaim, which is not really very usable or expected. > > With this patch, memory.low and memory.min affect reclaim pressure in a > more understandable and composable way. 
For example, from a user > standpoint, "protected" memory now remains untouchable from a reclaim > aggression standpoint, and users can also have more confidence that > bursty workloads will still receive some amount of guaranteed > protection. Could you please provide more description of the effect this has upon userspace? Preferably in real-world cases. What problems were being observed and how does this improve things?
Re: [PATCH 3/6] mm: memcontrol: replace node summing with memcg_page_state()
On Thu, 28 Feb 2019 11:30:17 -0500 Johannes Weiner wrote:

> Instead of adding up the node counters, use memcg_page_state() to get
> the memcg state directly.  This is a bit cheaper and more stream-lined.
>
> ...
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -746,10 +746,13 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
> 					      unsigned int lru_mask)
>  {
>  	unsigned long nr = 0;
> -	int nid;
> +	enum lru_list lru;
>
> -	for_each_node_state(nid, N_MEMORY)
> -		nr += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask);
> +	for_each_lru(lru) {
> +		if (!(BIT(lru) & lru_mask))
> +			continue;
> +		nr += memcg_page_state(memcg, NR_LRU_BASE + lru);
> +	}

Might be able to use for_each_set_bit() over lru_mask here, but it's much of a muchness.
Re: [PATCH] writeback: sum memcg dirty counters as needed
On Thu, 7 Mar 2019 08:56:32 -0800 Greg Thelen wrote: > Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in > memory.stat reporting") memcg dirty and writeback counters are managed > as: > 1) per-memcg per-cpu values in range of [-32..32] > 2) per-memcg atomic counter > When a per-cpu counter cannot fit in [-32..32] it's flushed to the > atomic. Stat readers only check the atomic. > Thus readers such as balance_dirty_pages() may see a nontrivial error > margin: 32 pages per cpu. > Assuming 100 cpus: >4k x86 page_size: 13 MiB error per memcg > 64k ppc page_size: 200 MiB error per memcg > Considering that dirty+writeback are used together for some decisions > the errors double. > > This inaccuracy can lead to undeserved oom kills. One nasty case is > when all per-cpu counters hold positive values offsetting an atomic > negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). > balance_dirty_pages() only consults the atomic and does not consider > throttling the next n_cpu*32 dirty pages. If the file_lru is in the > 13..200 MiB range then there's absolutely no dirty throttling, which > burdens vmscan with only dirty+writeback pages thus resorting to oom > kill. > > It could be argued that tiny containers are not supported, but it's more > subtle. It's the amount the space available for file lru that matters. > If a container has memory.max-200MiB of non reclaimable memory, then it > will also suffer such oom kills on a 100 cpu machine. > > ... > > Make balance_dirty_pages() and wb_over_bg_thresh() work harder to > collect exact per memcg counters when a memcg is close to the > throttling/writeback threshold. This avoids the aforementioned oom > kills. > > This does not affect the overhead of memory.stat, which still reads the > single atomic counter. > > Why not use percpu_counter? memcg already handles cpus going offline, > so no need for that overhead from percpu_counter. And the > percpu_counter spinlocks are more heavyweight than is required. 
> It probably also makes sense to include exact dirty and writeback
> counters in memcg oom reports.  But that is saved for later.

Nice changelog, thanks.

> Signed-off-by: Greg Thelen

Did you consider cc:stable for this?  We may as well - the stablebots backport everything which might look slightly like a fix anyway :(

> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
>  	return x;
>  }
>
> +/* idx can be of type enum memcg_stat_item or node_stat_item */
> +static inline unsigned long
> +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
> +{
> +	long x = atomic_long_read(&memcg->stat[idx]);
> +#ifdef CONFIG_SMP
> +	int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
> +	if (x < 0)
> +		x = 0;
> +#endif
> +	return x;
> +}

This looks awfully heavyweight for an inline function.  Why not make it a regular function and avoid the bloat and i-cache consumption?

Also, did you instead consider making this spill the percpu counters into memcg->stat[idx]?  That might be more useful for potential future callers.  It would become a little more expensive though.
Re: [PATCH] mm: page_mkclean vs MADV_DONTNEED race
On Thu, 21 Mar 2019 09:36:10 +0530 "Aneesh Kumar K.V" wrote: > MADV_DONTNEED is handled with mmap_sem taken in read mode. > We call page_mkclean without holding mmap_sem. > > MADV_DONTNEED implies that pages in the region are unmapped and subsequent > access to the pages in that range is handled as a new page fault. > This implies that if we don't have parallel access to the region when > MADV_DONTNEED is run we expect those ranges to be unallocated. > > w.r.t page_mkclean we need to make sure that we don't break the MADV_DONTNEED > semantics. MADV_DONTNEED checks for pmd_none without holding pmd_lock. > This implies we skip the pmd if we temporarily mark the pmd none. Avoid doing > that while marking the page clean. > > Keep the sequence the same for dax too, even though we don't support MADV_DONTNEED > for dax mappings. What were the runtime effects of the bug? Did you consider a -stable backport?
Re: [RFC PATCH v2 0/1] improve vmap allocation
On Thu, 21 Mar 2019 20:03:26 +0100 "Uladzislau Rezki (Sony)" wrote: > Hello. > > This is the v2 of the https://lkml.org/lkml/2018/10/19/786 rework. Instead of > referring you to that link, I will go through it again describing the improved > allocation method and provide changes between v1 and v2 in the end. > > ... > > Performance analysis > Impressive numbers. But this is presumably a worst-case microbenchmark. Are you able to describe the benefits which are observed in some real-world workload which someone cares about? It's a lot of new code. It looks decent and I'll toss it in there for further testing. Hopefully someone will be able to find the time for a detailed review. Trivial point: the code uses "inline" a lot. Nowadays gcc cheerfully ignores that and does its own thing. You might want to look at the effects of simply deleting all that. Is the generated code better or worse or the same? If something really needs to be inlined then use __always_inline, preferably with a comment explaining why it is there.
Re: [PATCH v4] lib/string.c: implement a basic bcmp
On Thu, 21 Mar 2019 10:02:37 -0700 Nick Desaulniers wrote: > Shall I send you a cleanup removing the undefs for bcmp, memcmp, > strcat, strcpy, and strcmp? Of those, I only see memcmp being > `#defined` in arch/m68k/include/asm/string.h, arch/x86/boot/string.h, > and arch/x86/include/asm/string_32.h. > > Further, I can drop some of the __GNUC__ < 4 code in > arch/x86/include/asm/string_32.h. (grepping for __GNUC__, looks like > there's a fair amount of code that can be cleaned up). We should > probably check it since Clang lies about being GCC 4.2 compatible, > which will surely break in horrific ways at some point. All sounds good. Some time, when convenient, thanks.
Re: [PATCH v4] lib/string.c: implement a basic bcmp
On Wed, 13 Mar 2019 14:13:31 -0700 Nick Desaulniers wrote:

> A recent optimization in Clang (r355672) lowers comparisons of the
> return value of memcmp against zero to comparisons of the return value
> of bcmp against zero.  This helps some platforms that implement bcmp
> more efficiently than memcmp.  glibc simply aliases bcmp to memcmp, but
> an optimized implementation is in the works.
>
> This results in linkage failures for all targets with Clang due to the
> undefined symbol.  For now, just implement bcmp as a tail call to memcmp
> to unbreak the build.  This routine can be further optimized in the
> future.
>
> Other ideas discussed:
> * A weak alias was discussed, but breaks for architectures that define
>   their own implementations of memcmp since aliases to declarations are
>   not permitted (only definitions).  Arch-specific memcmp implementations
>   typically declare memcmp in C headers, but implement them in assembly.
> * -ffreestanding also is used sporadically throughout the kernel.
> * -fno-builtin-bcmp doesn't work when doing LTO.

I guess we should backport this into -stable so that older kernels can be built with newer Clang.

> ...
>
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -866,6 +866,26 @@ __visible int memcmp(const void *cs, const void *ct, size_t count)
>  EXPORT_SYMBOL(memcmp);
>  #endif
>
> +#ifndef __HAVE_ARCH_BCMP
> +/**
> + * bcmp - returns 0 if and only if the buffers have identical contents.
> + * @a: pointer to first buffer.
> + * @b: pointer to second buffer.
> + * @len: size of buffers.
> + *
> + * The sign or magnitude of a non-zero return value has no particular
> + * meaning, and architectures may implement their own more efficient bcmp(). So
> + * while this particular implementation is a simple (tail) call to memcmp, do
> + * not rely on anything but whether the return value is zero or non-zero.
> + */
> +#undef bcmp

What is the undef for?

> +int bcmp(const void *a, const void *b, size_t len)
> +{
> +	return memcmp(a, b, len);
> +}
> +EXPORT_SYMBOL(bcmp);
> +#endif
> +
>  #ifndef __HAVE_ARCH_MEMSCAN
>  /**
>   * memscan - Find a character in an area of memory.
> --
> 2.21.0.360.g471c308f928-goog
Re: [PATCH] mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
On Wed, 20 Mar 2019 11:23:03 +0530 Souptick Joarder wrote:

> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -447,6 +447,13 @@ static inline bool queue_pages_required(struct page *page,
> >         return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
> >  }
> >
> > +/*
> > + * The queue_pages_pmd() may have three kind of return value.
> > + * 1 - pages are placed on he right node or queued successfully.
>
> Minor typo -> s/he/the ?

Yes, that comment needs some help.  This?

--- a/mm/mempolicy.c~mm-mempolicy-make-mbind-return-eio-when-mpol_mf_strict-is-specified-fix
+++ a/mm/mempolicy.c
@@ -429,9 +429,9 @@ static inline bool queue_pages_required(
 }
 
 /*
- * The queue_pages_pmd() may have three kind of return value.
- * 1 - pages are placed on he right node or queued successfully.
- * 0 - THP get split.
+ * queue_pages_pmd() has three possible return values:
+ * 1 - pages are placed on the right node or queued successfully.
+ * 0 - THP was split.
  * -EIO - is migration entry or MPOL_MF_STRICT was specified and an existing
  *	page was already on a node that does not follow the policy.
  */
_
Re: [RESEND#2 PATCH] mm/compaction: fix an undefined behaviour
On Wed, 20 Mar 2019 16:33:38 -0400 Qian Cai wrote:

> In a low-memory situation, cc->fast_search_fail can keep increasing as
> it is unable to find an available page to isolate in
> fast_isolate_freepages().  As the result, it could trigger an error
> below, so just compare with the maximum bits can be shifted first.
>
> UBSAN: Undefined behaviour in mm/compaction.c:1160:30
> shift exponent 64 is too large for 64-bit type 'unsigned long'
> CPU: 131 PID: 1308 Comm: kcompactd1 Kdump: loaded Tainted: G W L 5.0.0+ #17
> Call trace:
>  dump_backtrace+0x0/0x450
>  show_stack+0x20/0x2c
>  dump_stack+0xc8/0x14c
>  __ubsan_handle_shift_out_of_bounds+0x7e8/0x8c4
>  compaction_alloc+0x2344/0x2484
>  unmap_and_move+0xdc/0x1dbc
>  migrate_pages+0x274/0x1310
>  compact_zone+0x26ec/0x43bc
>  kcompactd+0x15b8/0x1a24
>  kthread+0x374/0x390
>  ret_from_fork+0x10/0x18
>
> Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a migration source")
> Acked-by: Vlastimil Babka
> Acked-by: Mel Gorman
> Signed-off-by: Qian Cai
> ---
>  mm/compaction.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e1a08fc92353..0d1156578114 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1157,7 +1157,9 @@ static bool suitable_migration_target(struct compact_control *cc,
>  static inline unsigned int
>  freelist_scan_limit(struct compact_control *cc)
>  {
> -	return (COMPACT_CLUSTER_MAX >> cc->fast_search_fail) + 1;
> +	return (COMPACT_CLUSTER_MAX >>
> +		min((unsigned short)(BITS_PER_LONG - 1), cc->fast_search_fail))
> +		+ 1;
>  }

That's rather an eyesore.  How about

static inline unsigned int freelist_scan_limit(struct compact_control *cc)
{
	unsigned short shift = BITS_PER_LONG - 1;

	return (COMPACT_CLUSTER_MAX >> min(shift, cc->fast_search_fail)) + 1;
}
Re: [PATCH] mm, memory_hotplug: Fix the wrong usage of N_HIGH_MEMORY
On Wed, 20 Mar 2019 16:07:32 +0800 Baoquan He wrote: > In function node_states_check_changes_online(), N_HIGH_MEMORY is used > to substitute ZONE_HIGHMEM directly. This is not right. N_HIGH_MEMORY > always has value '3' if CONFIG_HIGHMEM=y, while ZONE_HIGHMEM's value > is not. It depends on whether CONFIG_ZONE_DMA/CONFIG_ZONE_DMA32 are > enabled. Obviously it's not true for CONFIG_ZONE_DMA32 on 32bit system, > and CONFIG_ZONE_DMA is also optional. > > Replace it with ZONE_HIGHMEM. > > Fixes: 8efe33f40f3e ("mm/memory_hotplug.c: simplify > node_states_check_changes_online") What are the runtime effects of this change? > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -712,7 +712,7 @@ static void node_states_check_changes_online(unsigned > long nr_pages, > if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY)) > arg->status_change_nid_normal = nid; > #ifdef CONFIG_HIGHMEM > - if (zone_idx(zone) <= N_HIGH_MEMORY && !node_state(nid, N_HIGH_MEMORY)) > + if (zone_idx(zone) <= ZONE_HIGHMEM && !node_state(nid, N_HIGH_MEMORY)) > arg->status_change_nid_high = nid; > #endif > }
Re: [PATCH RESEND] eventfd: prepare id to userspace via fdinfo
On Wed, 20 Mar 2019 18:29:29 +0900 Masatake YAMATO wrote:

> Finding endpoints of an IPC channel is one of the essential tasks to
> understand how a user program works.  Procfs and netlink socket provide
> enough hints to find endpoints for IPC channels like pipes, unix
> sockets, and pseudo terminals.  However, there is no simple way to find
> endpoints for an eventfd file from userland.  An inode number doesn't
> hint.  Unlike pipe, all eventfd files share the same inode object.
>
> To provide the way to find endpoints of an eventfd file, this patch
> adds "eventfd-id" field to /proc/PID/fdinfo of eventfd as identifier.
> Address for eventfd context is used as id.
>
> A tool like lsof can utilize the information to print endpoints.
>
> ...
>
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -297,6 +297,7 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
>  	seq_printf(m, "eventfd-count: %16llx\n",
>  		   (unsigned long long)ctx->count);
>  	spin_unlock_irq(&ctx->wqh.lock);
> +	seq_printf(m, "eventfd-id: %p\n", ctx);
>  }
>  #endif

Is it a good idea to use a bare kernel address for this?  How does this interact with printk pointer randomization and hashing?
Re: [RESEND PATCH] ptrace: take into account saved_sigmask in PTRACE_{GET,SET}SIGMASK
On Wed, 20 Mar 2019 01:32:53 +0300 "Dmitry V. Levin" wrote: > On Tue, Mar 19, 2019 at 12:19:57PM -0700, Andrei Vagin wrote: > > There are a few system calls (pselect, ppoll, etc) which replace a task > > sigmask while they are running in a kernel-space > > > > When a task calls one of these syscalls, the kernel saves a current > > sigmask in task->saved_sigmask and sets a syscall sigmask. > > > > On syscall-exit-stop, ptrace traps a task before restoring the > > saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and > > PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by > > saved_sigmask, when the task returns to user-space. > > > > This patch fixes this problem. PTRACE_GET_SIGMASK returns saved_sigmask > > is it's set. PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag. > > If it's not too late, could somebody tweak the commit message so that > PTRACE_GET_SIGMASK becomes PTRACE_GETSIGMASK and "is it's set" is changed > to "if it's set", please? I made those changes to my copy.
Re: [RESEND PATCH 0/7] Add FOLL_LONGTERM to GUP fast and use it
On Sun, 17 Mar 2019 11:34:31 -0700 ira.we...@intel.com wrote: > Resending after rebasing to the latest mm tree. > > HFI1, qib, and mthca use get_user_pages_fast() due to its performance > advantages. These pages can be held for a significant time. But > get_user_pages_fast() does not protect against mapping FS DAX pages. > > Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which > retains the performance while also adding the FS DAX checks. XDP has also > shown interest in using this functionality.[1] > > In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and > remove the specialized get_user_pages_longterm call. It would be helpful to include your response to Christoph's question (http://lkml.kernel.org/r/20190220180255.ga12...@iweiny-desk2.sc.intel.com) in the changelog. Because if one person was wondering about this, others will likely do so. We have no record of acks or reviewed-by's. At least one was missed (http://lkml.kernel.org/r/caog9msttcd-9bcsdfc0wryqfvrnb4twozl0c4+6qxi-n_y4...@mail.gmail.com), but that is very very partial. This patchset is fairly DAX-centered, but Dan wasn't cc'ed! So ho hum. I'll scoop them up and shall make the above changes to the [1/n] changelog, but we still have some work to do.
Re: [PATCH v2 1/1] userfaultfd/sysctl: add vm.unprivileged_userfaultfd
On Tue, 19 Mar 2019 11:07:22 +0800 Peter Xu wrote: > Add a global sysctl knob "vm.unprivileged_userfaultfd" to control > whether userfaultfd is allowed by unprivileged users. When this is > set to zero, only privileged users (root user, or users with the > CAP_SYS_PTRACE capability) will be able to use the userfaultfd > syscalls. Please send along a full description of why you believe Linux needs this feature, for me to add to the changelog. What is the benefit to our users? How will it be used? etcetera. As it was presented I'm seeing no justification for adding the patch!
Re: [PATCH 00/10] HMM updates for 5.1
On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse wrote: > > So I think I'll throw up my hands, drop them all and shall await > > developments :( > > What more do you want to see? I can repost with the acks already given > and the improved commit wording on some of the patches. But from a user point > of view nouveau is already upstream, ODP RDMA depends on this patchset > and is posted and I have given a link to it. amdgpu is queued up. What more > do I need? I guess I can ignore linux-next for a few days. Yes, a resend against mainline with those various updates will be helpful. Please go through the various fixes which we had as well:

mm-hmm-use-reference-counting-for-hmm-struct.patch
mm-hmm-do-not-erase-snapshot-when-a-range-is-invalidated.patch
mm-hmm-improve-and-rename-hmm_vma_get_pfns-to-hmm_range_snapshot.patch
mm-hmm-improve-and-rename-hmm_vma_fault-to-hmm_range_fault.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix-fix.patch
mm-hmm-add-default-fault-flags-to-avoid-the-need-to-pre-fill-pfns-arrays.patch
mm-hmm-add-an-helper-function-that-fault-pages-and-map-them-to-a-device.patch
mm-hmm-support-hugetlbfs-snap-shoting-faulting-and-dma-mapping.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix-2.patch
mm-hmm-add-helpers-for-driver-to-safely-take-the-mmap_sem.patch

Also, the discussion regarding [07/10] is substantial and is ongoing, so please let's push along with that. What is the review/discussion status of "[PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem"?
Re: [PATCH 00/10] HMM updates for 5.1
On Mon, 18 Mar 2019 13:04:04 -0400 Jerome Glisse wrote: > On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote: > > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse wrote: > > > > > Andrew you will not be pushing this patchset in 5.1 ? > > > > I'd like to. It sounds like we're converging on a plan. > > > > It would be good to hear more from the driver developers who will be > > consuming these new features - links to patchsets, review feedback, > > etc. Which individuals should we be asking? Felix, Christian and > > Jason, perhaps? > > > > So i am guessing you will not send this to Linus ? I was waiting to see how the discussion proceeds. Was also expecting various changelog updates (at least) - more acks from driver developers, additional pointers to client driver patchsets, description of their readiness, etc. Today I discover that Alex has cherrypicked "mm/hmm: use reference counting for HMM struct" into a tree which is fed into linux-next which rather messes things up from my end and makes it hard to feed a (possibly modified version of) that into Linus. So I think I'll throw up my hands, drop them all and shall await developments :(
Re: [PATCH 00/10] HMM updates for 5.1
On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse wrote: > Andrew you will not be pushing this patchset in 5.1 ? I'd like to. It sounds like we're converging on a plan. It would be good to hear more from the driver developers who will be consuming these new features - links to patchsets, review feedback, etc. Which individuals should we be asking? Felix, Christian and Jason, perhaps?
Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
On Tue, 12 Mar 2019 20:10:19 -0400 Jerome Glisse wrote: > > You're correct. We chose to go this way because the HMM code is so > > large and all-over-the-place that developing it in a standalone tree > > seemed impractical - better to feed it into mainline piecewise. > > > > This decision very much assumed that HMM users would definitely be > > merged, and that it would happen soon. I was skeptical for a long time > > and was eventually persuaded by quite a few conversations with various > > architecture and driver maintainers indicating that these HMM users > > would be forthcoming. > > > > In retrospect, the arrival of HMM clients took quite a lot longer than > > was anticipated and I'm not sure that all of the anticipated usage > > sites will actually be using it. I wish I'd kept records of > > who-said-what, but I didn't and the info is now all rather dissipated. > > > > So the plan didn't really work out as hoped. Lesson learned, I would > > now very much prefer that new HMM feature work's changelogs include > > links to the driver patchsets which will be using those features and > > acks and review input from the developers of those driver patchsets. > > This is what I am doing now and this patchset falls into that. I did > post the ODP and nouveau bits to use the 2 new functions (dma map and > unmap). I expect to merge both ODP and nouveau bits for that during > the next merge window. > > Also with 5.1 everything that is upstream is used by nouveau at least. > There are posted patches to use HMM for AMD, Intel, Radeon, ODP, PPC. > Some are going through several revisions so I do not know exactly when > each will make it upstream but I keep working on all this.
> > So the guidelines we agree on: > - no new infrastructure without a user > - the device driver maintainer for which new infrastructure is done > must either sign off or review, or explicitly say that they want > the feature. I do not expect all driver maintainers will have > the bandwidth to do proper review of the mm part of the infrastructure, > and it would not be fair to ask that of them. They > can still provide feedback on the API exposed to the device > driver. The patchset in -mm ("HMM updates for 5.1") has review from Ralph Campbell @ nvidia. Are there any other maintainers who we should have feedback from? > - driver bits must be posted at the same time as the new infrastructure, > even if they target the next release cycle, to avoid > inter-tree dependencies > - driver bits must be merged as soon as possible Are there links to driver patchsets which we can add to the changelogs? > Things we do not agree on: > - If the driver bits miss the +1 target for any reason, directly > revert the new infrastructure. I think it should not be black > and white, and the reasons why the driver bits missed the merge > window should be taken into account. If the feature is still > wanted and the driver bits missed the window for simple reasons, > then it means that we push everything back by 2 releases, i.e. the > revert is done in +1, then we reupload the infrastructure in > +2 and finally repush the driver bits in +3, so we lose 1 cycle. > Hence why I would rather the revert only happen if > it is clear that the infrastructure is not ready or cannot > be used in a timely fashion (over a couple of kernel releases) by any > drivers. I agree that this should be more a philosophy than a set of hard rules.
Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
On Tue, 12 Mar 2019 12:30:52 -0700 Dan Williams wrote: > On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse wrote: > > On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote: > > > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse wrote: > [..] > > > > Spirit of the rule is better than blind application of rule. > > > > > > Again, I fail to see why HMM is suddenly unable to make forward > > > progress when the infrastructure that came before it was merged with > > > consumers in the same development cycle. > > > > > > A gate to upstream merge is about the only lever a reviewer has to > > > push for change, and these requests to uncouple the consumer only > > > serve to weaken that review tool in my mind. > > > > Well let just agree to disagree and leave it at that and stop > > wasting each other time > > I'm fine to continue this discussion if you are. Please be specific > about where we disagree and what aspect of the proposed rules about > merge staging are either acceptable, painful-but-doable, or > show-stoppers. Do you agree that HMM is doing something novel with > merge staging, am I off base there? You're correct. We chose to go this way because the HMM code is so large and all-over-the-place that developing it in a standalone tree seemed impractical - better to feed it into mainline piecewise. This decision very much assumed that HMM users would definitely be merged, and that it would happen soon. I was skeptical for a long time and was eventually persuaded by quite a few conversations with various architecture and driver maintainers indicating that these HMM users would be forthcoming. In retrospect, the arrival of HMM clients took quite a lot longer than was anticipated and I'm not sure that all of the anticipated usage sites will actually be using it. I wish I'd kept records of who-said-what, but I didn't and the info is now all rather dissipated. So the plan didn't really work out as hoped. 
Lesson learned, I would now very much prefer that new HMM feature work's changelogs include links to the driver patchsets which will be using those features and acks and review input from the developers of those driver patchsets.
Re: [PATCH] mm: remove unused variable
On Tue, 12 Mar 2019 15:03:52 +0100 Bartosz Golaszewski wrote:

> Tue, 12 Mar 2019 at 14:59 Khalid Aziz wrote:
> >
> > On 3/12/19 7:28 AM, Bartosz Golaszewski wrote:
> > > From: Bartosz Golaszewski
> > >
> > > The mm variable is set but unused. Remove it.
> >
> > It is used. Look further down for calls to set_pte_at().
> >
> > --
> > Khalid
> >
> > >
> > > Signed-off-by: Bartosz Golaszewski
> > > ---
> > >  mm/mprotect.c | 1 -
> > >  1 file changed, 1 deletion(-)
> > >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 028c724dcb1a..130dac3ad04f 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -39,7 +39,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  		unsigned long addr, unsigned long end, pgprot_t newprot,
> > >  		int dirty_accountable, int prot_numa)
> > >  {
> > > -	struct mm_struct *mm = vma->vm_mm;
> > >  	pte_t *pte, oldpte;
> > >  	spinlock_t *ptl;
> > >  	unsigned long pages = 0;
> > >
>
> Oops, I blindly assumed the compiler is right, sorry for that. GCC
> complains it's unused when building usermode linux. I guess it's a
> matter of how set_pte_at() is defined for ARCH=um. I'll take a second
> look.

The problem is that set_pte_at() is implemented as a macro on some architectures.  The appropriate fix is to make all architectures use static inline C functions in all cases.  That will make the compiler think that the `mm' arg is used, even if it is not.
Re: [PATCH] Drop -Wdeclaration-after-statement
> > > > > > It is not good in my opinion to stick to -Wdeclaration-after-statement. > > > > Why? > > It is useful to have declarations mixed with code. Am inclined to agree. Maybe. > Once kernel will switch to C99 or C11 it _will_ be used to the point of > requiring it on the coding style level. The superstition of declaring > everything in the beginning of a function will fall, so might as well > start earlier. Sure, it's a matter of kernel coding conventions. But this isn't the way to get those changed! The process for making such changes involves large numbers of people arguing at each other (perhaps at kernel summit) until Linus comes in and tells everyone how it's going to be.
Re: [PATCH] Drop -Wdeclaration-after-statement
On Tue, 12 Mar 2019 20:24:47 +0300 Alexey Dobriyan wrote: > On Mon, Mar 11, 2019 at 05:38:45PM -0700, Andrew Morton wrote: > > On Sun, 10 Mar 2019 16:35:35 +0300 Alexey Dobriyan > > wrote: > > > > > Newly added static_assert() is formally a declaration, which will give > > > a warning if used in the middle of the function. > > > > > > ... > > > > > > --- a/Makefile > > > +++ b/Makefile > > > @@ -792,9 +792,6 @@ endif > > > # arch Makefile may override CC so keep this after arch Makefile is > > > included > > > NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) > > > -print-file-name=include) > > > > > > -# warn about C99 declaration after statement > > > -KBUILD_CFLAGS += -Wdeclaration-after-statement > > > - > > > # Variable Length Arrays (VLAs) should not be used anywhere in the kernel > > > KBUILD_CFLAGS += $(call cc-option,-Wvla) > > > > I do wish your changelogs were more elaborate :( > > > So the proposal is to disable -Wdeclaration-after-statement in all > > cases for all time because static_assert() doesn't work correctly? > > Yes. I converted 2 cases in /proc to static_assert() and you can't write > > { > [code] > static_assert() > } > > without a warning because static_assert() is declaration. > So people would move BUILD_BUG_ON() to where it doesn't belong. Sure. > > Surely there's something we can do to squish the static_assert() issue > > while retaining -Wdeclaration-after-statement? > > It is not good in my opinion to stick to -Wdeclaration-after-statement. Why? > > Perhaps by making > > static_assert() a nop if -Wdeclaration-after-statement is in use. > > Perhaps simply by putting { } around the static_assert()? > > Making a statement out of it would disable current cases where it is > placed in headers. I think you mean cases where static_assert() is used outside functions? We could have two versions of it, one for use inside functions, one for use outside functions?
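A sketch of the two-versions idea, assuming plain C11 _Static_assert (the kernel's static_assert() wrapper behaves the same way for this purpose). The declaration form works at file scope, i.e. in headers; wrapping it in its own block turns it into a statement that can sit mid-function without tripping -Wdeclaration-after-statement:

```c
#include <assert.h>

/* Declaration form: fine at file scope or at the top of a block, but
 * in the middle of a function it trips -Wdeclaration-after-statement
 * because it is formally a declaration. */
#define static_assert_decl(expr) _Static_assert(expr, #expr)

/* Statement form: the surrounding do/while block makes it a statement,
 * so it can follow ordinary code without the warning.  It cannot be
 * used at file scope, hence the need for both versions. */
#define static_assert_stmt(expr) do { _Static_assert(expr, #expr); } while (0)

static_assert_decl(sizeof(int) >= 2);	/* header-style use, outside a function */

int demo_assert(int x)
{
	int y = x + 1;
	static_assert_stmt(sizeof(long) >= sizeof(int));  /* mid-function use */
	return y;
}
```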
Re: KASAN: null-ptr-deref Read in reclaim_high
On Tue, 12 Mar 2019 07:08:38 +0100 Dmitry Vyukov wrote: > On Tue, Mar 12, 2019 at 12:37 AM Andrew Morton > wrote: > > > > On Mon, 11 Mar 2019 06:08:01 -0700 syzbot > > wrote: > > > > > syzbot has bisected this bug to: > > > > > > commit 29a4b8e275d1f10c51c7891362877ef6cffae9e7 > > > Author: Shakeel Butt > > > Date: Wed Jan 9 22:02:21 2019 + > > > > > > memcg: schedule high reclaim for remote memcgs on high_work > > > > > > bisection log: > > > https://syzkaller.appspot.com/x/bisect.txt?x=155bf5db20 > > > start commit: 29a4b8e2 memcg: schedule high reclaim for remote memcgs > > > on.. > > > git tree: linux-next > > > final crash: > > > https://syzkaller.appspot.com/x/report.txt?x=175bf5db20 > > > console output: https://syzkaller.appspot.com/x/log.txt?x=135bf5db20 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=611f89e5b6868db > > > dashboard link: > > > https://syzkaller.appspot.com/bug?extid=fa11f9da42b46cea3b4a > > > userspace arch: amd64 > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1425901740 > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=141630a0c0 > > > > > > Reported-by: syzbot+fa11f9da42b46cea3...@syzkaller.appspotmail.com > > > Fixes: 29a4b8e2 ("memcg: schedule high reclaim for remote memcgs on > > > high_work") > > > > The following patch > > memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work-v3.patch > > might have fixed this. Was it applied? > > Hi Andrew, > > You mean if the patch was applied during the bisection? > No, it wasn't. Bisection is very specifically done on the same tree > where the bug was hit. There are already too many factors that make > the result flaky/wrong/inconclusive without changing the tree state. > Now, if syzbot would know about any pending fix for this bug, then it > would not do the bisection at all. But it have not seen any patch in > upstream/linux-next with the Reported-by tag, nor it received any syz > fix commands for this bugs. Should have been it aware of the fix? 
How? memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work-v3.patch was added to linux-next on Jan 10. I take it that this bug was hit when testing the entire linux-next tree, so we can assume that memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work-v3.patch does not fix it, correct? In which case, over to Shakeel!
Re: [PATCH v9 1/9] bitops: Introduce the for_each_set_clump8 macro
On Fri, 8 Mar 2019 09:31:00 +0100 Linus Walleij wrote: > On Sun, Mar 3, 2019 at 8:47 AM William Breathitt Gray > wrote: > > > This macro iterates for each 8-bit group of bits (clump) with set bits, > > within a bitmap memory region. For each iteration, "start" is set to the > > bit offset of the found clump, while the respective clump value is > > stored to the location pointed by "clump". Additionally, the > > bitmap_get_value8 and bitmap_set_value8 functions are introduced to > > respectively get and set an 8-bit value in a bitmap memory region. > > > > Suggested-by: Andy Shevchenko > > Suggested-by: Rasmus Villemoes > > Cc: Arnd Bergmann > > Cc: Andrew Morton > > Reviewed-by: Andy Shevchenko > > Reviewed-by: Linus Walleij > > Signed-off-by: William Breathitt Gray > > Andrew: would you be OK with this being merged in v5.1? Yup. We have quite a few users there. I assume this will go via the gpio tree? Feel free to add Acked-by: Andrew Morton , although it probably isn't worth churning the git tree to do so at this late stage - your call.
Re: [PATCH] Drop -Wdeclaration-after-statement
On Sun, 10 Mar 2019 16:35:35 +0300 Alexey Dobriyan wrote: > Newly added static_assert() is formally a declaration, which will give > a warning if used in the middle of the function. > > ... > > --- a/Makefile > +++ b/Makefile > @@ -792,9 +792,6 @@ endif > # arch Makefile may override CC so keep this after arch Makefile is included > NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) -print-file-name=include) > > -# warn about C99 declaration after statement > -KBUILD_CFLAGS += -Wdeclaration-after-statement > - > # Variable Length Arrays (VLAs) should not be used anywhere in the kernel > KBUILD_CFLAGS += $(call cc-option,-Wvla) I do wish your changelogs were more elaborate :( So the proposal is to disable -Wdeclaration-after-statement in all cases for all time because static_assert() doesn't work correctly? Surely there's something we can do to squish the static_assert() issue while retaining -Wdeclaration-after-statement? Perhaps by making static_assert() a nop if -Wdeclaration-after-statement is in use. Perhaps simply by putting { } around the static_assert()?
Re: KASAN: null-ptr-deref Read in reclaim_high
On Mon, 11 Mar 2019 06:08:01 -0700 syzbot wrote: > syzbot has bisected this bug to: > > commit 29a4b8e275d1f10c51c7891362877ef6cffae9e7 > Author: Shakeel Butt > Date: Wed Jan 9 22:02:21 2019 + > > memcg: schedule high reclaim for remote memcgs on high_work > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=155bf5db20 > start commit: 29a4b8e2 memcg: schedule high reclaim for remote memcgs on.. > git tree: linux-next > final crash:https://syzkaller.appspot.com/x/report.txt?x=175bf5db20 > console output: https://syzkaller.appspot.com/x/log.txt?x=135bf5db20 > kernel config: https://syzkaller.appspot.com/x/.config?x=611f89e5b6868db > dashboard link: https://syzkaller.appspot.com/bug?extid=fa11f9da42b46cea3b4a > userspace arch: amd64 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1425901740 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=141630a0c0 > > Reported-by: syzbot+fa11f9da42b46cea3...@syzkaller.appspotmail.com > Fixes: 29a4b8e2 ("memcg: schedule high reclaim for remote memcgs on > high_work") The following patch memcg-schedule-high-reclaim-for-remote-memcgs-on-high_work-v3.patch might have fixed this. Was it applied?
Re: [LKP] [selftests/vm] a05ef00c97: kernel_selftests.vm.vmalloc_stability_smoke_test.fail
On Mon, 11 Mar 2019 10:20:06 -0700 Linus Torvalds wrote: > On Mon, Mar 11, 2019 at 12:43 AM kernel test robot > wrote: > > > > ./run_vmtests: line 217: ./test_vmalloc.sh: Permission denied > > I marked that script executable: > > 6bc3fe8e7e17 ("tools: mark 'test_vmalloc.sh' executable") > > although perhaps the better fix would have been to explicitly use the > shell to start it in the 'run_vmtests' script instead? > > We have a lot of files that aren't marked executable, probably because > they came in as (or were generated as) non-git patches. Often through > Andrew's workflow. Not only that. patch(1) doesn't set the x bit. So if someone downloads and applies patch-5.0-rc1.xz (as we say to do in the documentation), their kernel won't work correctly. This happens fairly regularly and the fix is to replace test_vmalloc.sh ... with /bin/sh test_vmalloc.sh ...
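The workaround can be demonstrated in a few lines of shell; the filename here is illustrative, not the real selftest script:

```shell
# Simulate what patch(1) leaves behind: the script exists but the
# executable bit is not set.
printf '%s\n' '#!/bin/sh' 'echo ok' > test_vmalloc_demo.sh
chmod -x test_vmalloc_demo.sh

# Direct execution fails with "Permission denied"...
./test_vmalloc_demo.sh 2>/dev/null || echo "direct run failed"

# ...but invoking the interpreter explicitly sidesteps the missing x bit,
# which is why "/bin/sh test_vmalloc.sh" is the robust spelling:
sh test_vmalloc_demo.sh
```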
Re: [PATCH] proc: test with vsyscall in mind
On Thu, 7 Mar 2019 21:32:04 +0300 Alexey Dobriyan wrote: > Read from vsyscall page to tell if vsyscall is being used. > > Reported-by: kernel test robot > Signed-off-by: Alexey Dobriyan There's a lot of information missing from this changelog :( I rewrote it to this: : selftests: proc: proc-pid-vm : : proc-pid-vm: proc-pid-vm.c:277: main: Assertion `rv == strlen(buf0)' failed. : Aborted Because the vsyscall mapping is enabled. Read from vsyscall page to tell if vsyscall is being used. Link: http://lkml.kernel.org/r/20190307183204.GA11405@avx2 Link: http://lkml.kernel.org/r/20190219094722.GB28258@shao2-debian Fixes: 34aab6bec23e7e9 ("proc: test /proc/*/maps, smaps, smaps_rollup, statm") Signed-off-by: Alexey Dobriyan Reported-by: kernel test robot Cc: Shuah Khan Signed-off-by: Andrew Morton
Re: [PATCH] ipc: prevent lockup on alloc_msg and free_msg
On Thu, 7 Mar 2019 16:10:22 +0800 Li RongQing wrote: > From: Li Rongqing > > msgctl10 of ltp triggers the following lockup When CONFIG_KASAN > is enabled on large memory SMP systems, the pages initialization > can take a long time, if msgctl10 requests a huge block memory, > and it will block rcu scheduler, so release cpu actively. > > ... > > Signed-off-by: Zhang Yu > Signed-off-by: Li RongQing This signoff ordering somewhat implies that Zhang Yu was the author. But you added "From: Li Rongqing", so you will be recorded as the patch's author. Is this correct? > --- a/ipc/msgutil.c > +++ b/ipc/msgutil.c > @@ -18,6 +18,7 @@ > #include > #include > #include > +#include > > #include "util.h" > > @@ -72,6 +73,7 @@ static struct msg_msg *alloc_msg(size_t len) > seg->next = NULL; > pseg = &seg->next; > len -= alen; > + cond_resched(); > } This looks OK. > return msg; > @@ -178,5 +180,6 @@ void free_msg(struct msg_msg *msg) > struct msg_msgseg *tmp = seg->next; > kfree(seg); > seg = tmp; > + cond_resched(); > } This does not. mqueue_evict_inode() (at least) calls free_msg() from under spin_lock().
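A userspace mock of the alloc_msg() segment loop, with cond_resched() stubbed out as a counter, showing where the voluntary preemption point lands -- one yield per allocated segment. The segment size and helper names are invented for the sketch; as noted above, the same treatment is only safe in free_msg() where the caller does not hold a spinlock:

```c
#include <assert.h>
#include <stdlib.h>

#define DATALEN_SEG 64		/* illustrative segment payload size */

struct seg {
	struct seg *next;
	unsigned char data[DATALEN_SEG];
};

static int resched_calls;	/* stands in for cond_resched() */
static void cond_resched_stub(void) { resched_calls++; }

/* Build a segment list covering `len' bytes, yielding once per
 * segment so that a huge request cannot monopolize the CPU and stall
 * the RCU scheduler (mirrors the fix to alloc_msg() above). */
static struct seg *alloc_segs(size_t len, int *nsegs)
{
	struct seg *head = NULL, **pseg = &head;

	*nsegs = 0;
	while (len > 0) {
		size_t alen = len < DATALEN_SEG ? len : DATALEN_SEG;
		struct seg *s = malloc(sizeof(*s));

		if (!s)
			return head;
		s->next = NULL;
		*pseg = s;
		pseg = &s->next;
		len -= alen;
		(*nsegs)++;
		cond_resched_stub();	/* the added preemption point */
	}
	return head;
}
```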
Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
On Tue, 5 Mar 2019 20:20:10 -0800 Dan Williams wrote: > My hesitation would be drastically reduced if there was a plan to > avoid dangling unconsumed symbols and functionality. Specifically one > or more of the following suggestions: > > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability > surface for out-of-tree consumers to come grumble at us when we > continue to refactor the kernel as we are wont to do. The existing patches use EXPORT_SYMBOL() so that's a sticking point. Jerome, what would happen is we made these EXPORT_SYMBOL_GPL()? > * A commitment to consume newly exported symbols in the same merge > window, or the following merge window. When that goal is missed revert > the functionality until such time that it can be consumed, or > otherwise abandoned. It sounds like we can tick this box. > * No new symbol exports and functionality while existing symbols go > unconsumed. Unsure about this one?
Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments
On Thu, 7 Mar 2019 00:32:09 +0100 Dominique Martinet wrote: > Andrew Morton wrote on Wed, Mar 06, 2019: > > On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina > > wrote: > > > > > 3/3 is actually waiting for your decision, see > > > > > > https://lore.kernel.org/lkml/20190212063643.gl15...@dhcp22.suse.cz/ > > > > I pity anyone who tried to understand this code by reading this code. > > Can we please get some careful commentary in there explaining what is > > going on, and why things are thus? > > > > I guess the [3/3] change makes sense, although it's unclear whether > > anyone really needs it? 5.0 was released with 574823bfab8 ("Change > > mincore() to count "mapped" pages rather than "cached" pages") so we'll > > have a release cycle to somewhat determine how much impact 574823bfab8 > > has on users. How about I queue up [3/3] and we reevaluate its > > desirability in a couple of months? > > FWIW, > > 574823bfab8 has been reverted in 30bac164aca750, included in 5.0-rc4, so > the controversial change has only been there from 5.0-rc1 to 5.0-rc3 Ah, OK, thanks, I misread. Linus, do you have thoughts on http://lkml.kernel.org/r/20190130124420.1834-4-vba...@suse.cz ?
Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments
On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina wrote: > 3/3 is actually waiting for your decision, see > > https://lore.kernel.org/lkml/20190212063643.gl15...@dhcp22.suse.cz/ I pity anyone who tried to understand this code by reading this code. Can we please get some careful commentary in there explaining what is going on, and why things are thus? I guess the [3/3] change makes sense, although it's unclear whether anyone really needs it? 5.0 was released with 574823bfab8 ("Change mincore() to count "mapped" pages rather than "cached" pages") so we'll have a release cycle to somewhat determine how much impact 574823bfab8 has on users. How about I queue up [3/3] and we reevaluate its desirability in a couple of months?
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
On Wed, 30 Jan 2019 13:44:18 +0100 Vlastimil Babka wrote: > From: Jiri Kosina > > The semantics of what mincore() considers to be resident is not completely > clear, but Linux has always (since 2.3.52, which is when mincore() was > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly > belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache > information > for non-anonymous mappings that belong to files that the calling process could > (if it tried to) successfully open for writing. "for writing" comes as a bit of a surprise. Why not for reading? Could we please explain the reasoning in the changelog and in the (presently absent) comments which describe can_do_mincore()? > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long > pages, unsigned char *v > vma = find_vma(current->mm, addr); > if (!vma || addr < vma->vm_start) > return -ENOMEM; > - mincore_walk.mm = vma->vm_mm; > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); > + if (!can_do_mincore(vma)) { > + unsigned long pages = (end - addr) >> PAGE_SHIFT; I'm not sure this is correct in all cases. If addr = 4095 vma->vm_end = 4096 pages = 1000 then `end' is 4096 and `(end - addr) >> PAGE_SHIFT' is zero, but it should have been 1. Please check? A mincore test suite in tools/testing/selftests would be useful, methinks. To exercise such corner cases, check for future breakage, etc. > + memset(vec, 1, pages); > + return pages; > + } > + mincore_walk.mm = vma->vm_mm; > err = walk_page_range(addr, end, &mincore_walk); > if (err < 0) > return err;
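The arithmetic in the corner case above can be checked directly. Whether a misaligned `addr' can actually reach do_mincore() depends on validation at syscall entry, but this sketch shows the truncating expression undercounting exactly as described, against a version that counts every page the byte range overlaps:

```c
#include <assert.h>

#define PAGE_SHIFT 12

/* Truncating version, as in the quoted hunk: undercounts whenever
 * `addr' is not page aligned. */
static unsigned long pages_truncated(unsigned long addr, unsigned long end)
{
	return (end - addr) >> PAGE_SHIFT;
}

/* Count every page the byte range [addr, end) overlaps by comparing
 * the page indices of its first and last bytes. */
static unsigned long pages_touched(unsigned long addr, unsigned long end)
{
	return ((end - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
}
```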
Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments
On Wed, 6 Mar 2019 13:11:39 +0100 (CET) Jiri Kosina wrote: > On Wed, 30 Jan 2019, Vlastimil Babka wrote: > > > I've collected the patches from the discussion for formal posting. The first > > two should be settled already, third one is the possible improvement I've > > mentioned earlier, where only in restricted case we resort to existence of > > page > > table mapping (the original and later reverted approach from Linus) instead > > of > > faking the result completely. Review and testing welcome. > > > > The consensus seems to be going through -mm tree for 5.1, unless Linus wants > > them alredy for 5.0. > > > > Jiri Kosina (2): > > mm/mincore: make mincore() more conservative > > mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O > > > > Vlastimil Babka (1): > > mm/mincore: provide mapped status when cached status is not allowed > > Andrew, > > could you please take at least the correct and straightforward fix for > mincore() before we figure out how to deal with the slightly less > practical RWF_NOWAIT? Thanks. I assume we're talking about [1/3] and [2/3] from this thread? Can we have a resend please? Gather the various acks and revisions, make changelog changes to address the review questions and comments? Thanks.
Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
On Wed, 6 Mar 2019 10:49:04 -0500 Jerome Glisse wrote: > On Tue, Mar 05, 2019 at 02:16:35PM -0800, Andrew Morton wrote: > > On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams > > wrote: > > > > > > > > > > > Another way to help allay these worries is commit to no new exports > > > > > without in-tree users. In general, that should go without saying for > > > > > any core changes for new or future hardware. > > > > > > > > I always intend to have an upstream user the issue is that the device > > > > driver tree and the mm tree move a different pace and there is always > > > > a chicken and egg problem. I do not think Andrew wants to have to > > > > merge driver patches through its tree, nor Linus want to have to merge > > > > drivers and mm trees in specific order. So it is easier to introduce > > > > mm change in one release and driver change in the next. This is what > > > > i am doing with ODP. Adding things necessary in 5.1 and working with > > > > Mellanox to have the ODP HMM patch fully tested and ready to go in > > > > 5.2 (the patch is available today and Mellanox have begin testing it > > > > AFAIK). So this is the guideline i will be following. Post mm bits > > > > with driver patches, push to merge mm bits one release and have the > > > > driver bits in the next. I do hope this sound fine to everyone. > > > > > > The track record to date has not been "merge HMM patch in one release > > > and merge the driver updates the next". If that is the plan going > > > forward that's great, and I do appreciate that this set came with > > > driver changes, and maintain hope the existing exports don't go > > > user-less for too much longer. > > > > Decision time. Jerome, how are things looking for getting these driver > > changes merged in the next cycle? > > nouveau is merge already. Confused. Nouveau in mainline is dependent upon "mm/hmm: allow to mirror vma of a file on a DAX backed filesystem"? That can't be the case? 
> > > > Dan, what's your overall take on this series for a 5.1-rc1 merge? > > > > Jerome, what would be the risks in skipping just this [09/10] patch? > > As nouveau is a new user it does not regress anything but for RDMA > mlx5 (which i expect to merge new window) it would regress that > driver. Also confused. How can omitting "mm/hmm: allow to mirror vma of a file on a DAX backed filesystem" from 5.1-rc1 cause an mlx5 regression?
Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams wrote: > > > > > Another way to help allay these worries is commit to no new exports > > > without in-tree users. In general, that should go without saying for > > > any core changes for new or future hardware. > > > > I always intend to have an upstream user the issue is that the device > > driver tree and the mm tree move a different pace and there is always > > a chicken and egg problem. I do not think Andrew wants to have to > > merge driver patches through its tree, nor Linus want to have to merge > > drivers and mm trees in specific order. So it is easier to introduce > > mm change in one release and driver change in the next. This is what > > i am doing with ODP. Adding things necessary in 5.1 and working with > > Mellanox to have the ODP HMM patch fully tested and ready to go in > > 5.2 (the patch is available today and Mellanox have begin testing it > > AFAIK). So this is the guideline i will be following. Post mm bits > > with driver patches, push to merge mm bits one release and have the > > driver bits in the next. I do hope this sound fine to everyone. > > The track record to date has not been "merge HMM patch in one release > and merge the driver updates the next". If that is the plan going > forward that's great, and I do appreciate that this set came with > driver changes, and maintain hope the existing exports don't go > user-less for too much longer. Decision time. Jerome, how are things looking for getting these driver changes merged in the next cycle? Dan, what's your overall take on this series for a 5.1-rc1 merge? Jerome, what would be the risks in skipping just this [09/10] patch?
Re: [PATCH v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()
On Mon, 4 Mar 2019 20:15:40 -0800 Mike Kravetz wrote: > Andrew, this is on top of Alexandre Ghiti's "hugetlb: allow to free gigantic > pages regardless of the configuration" patch. Both patches modify > __nr_hugepages_store_common(). Alex's patch is going to change slightly > in this area. OK, thanks, I missed that. Are the changes significant?
Re: [PATCH] selftest/proc: Remove duplicate header
On Mon, 4 Mar 2019 23:57:19 +0530 Souptick Joarder wrote: > Remove duplicate header which is included twice. > > Signed-off-by: Sabyasachi Gupta > Signed-off-by: Souptick Joarder This signoff order makes me suspect that Sabyasachi was the author. If so, the patch should have had Sabyasachi's From: line at the start of the changelog. Please clarify?
Re: [PATCH v6] panic: Avoid the extra noise dmesg
On Fri, 1 Mar 2019 16:57:52 -0500 Steven Rostedt wrote: > Looks good to me. > > Acked-by: Steven Rostedt (VMware) > > Andrew, you want to take this patch? Yup.
Re: [PATCH] mm/hugepages: fix "orig_pud" set but not used
On Thu, 28 Feb 2019 19:49:03 -0500 Qian Cai wrote: > The commit a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent > hugepages") introduced pudp_huge_get_and_clear_full() but no one uses > its return code, so just make it void. > > mm/huge_memory.c: In function 'zap_huge_pud': > mm/huge_memory.c:1982:8: warning: variable 'orig_pud' set but not used > [-Wunused-but-set-variable] > pud_t orig_pud; > ^~~~ > > ... > > --- a/include/asm-generic/pgtable.h > +++ b/include/asm-generic/pgtable.h > @@ -167,11 +167,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct > mm_struct *mm, > #endif > > #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL > -static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm, > - unsigned long address, pud_t *pudp, > - int full) > +static inline void pudp_huge_get_and_clear_full(struct mm_struct *mm, > + unsigned long address, > + pud_t *pudp, int full) > { > - return pudp_huge_get_and_clear(mm, address, pudp); > + pudp_huge_get_and_clear(mm, address, pudp); > } Not sure this is a good change. Future callers might want that return value. > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1979,7 +1979,6 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct > vm_area_struct *vma) > int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, >pud_t *pud, unsigned long addr) > { > - pud_t orig_pud; > spinlock_t *ptl; > > ptl = __pud_trans_huge_lock(pud, vma); > @@ -1991,8 +1990,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct > vm_area_struct *vma, >* pgtable_trans_huge_withdraw after finishing pudp related >* operations. >*/ > - orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud, > - tlb->fullmm); > + pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); In fact this code perhaps should be passing orig_pud into pudp_huge_get_and_clear_full(). That could depend on what future per-arch implementations of pudp_huge_get_and_clear_full() choose to do. 
Anyway, I'll await Matthew's feedback.
Re: [PATCH v2] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
On Fri, 1 Mar 2019 15:19:50 -0500 Qian Cai wrote: > When onlining a memory block with DEBUG_PAGEALLOC, it unmaps the pages > in the block from kernel, However, it does not map those pages while > offlining at the beginning. As the result, it triggers a panic below > while onlining on ppc64le as it checks if the pages are mapped before > unmapping. However, the imbalance exists for all arches where > double-unmappings could happen. Therefore, let kernel map those pages in > generic_online_page() before they have being freed into the page > allocator for the first time where it will set the page count to one. > > On the other hand, it works fine during the boot, because at least for > IBM POWER8, it does, > > early_setup > early_init_mmu > hash__early_init_mmu > htab_initialize [1] > htab_bolt_mapping [2] > > where it effectively map all memblock regions just like > kernel_map_linear_page(), so later mem_init() -> memblock_free_all() > will unmap them just fine without any imbalance. On other arches without > this imbalance checking, it still unmap them once at the most. > > [1] > for_each_memblock(memory, reg) { > base = (unsigned long)__va(reg->base); > size = reg->size; > > DBG("creating mapping for region: %lx..%lx (prot: %lx)\n", > base, size, prot); > > BUG_ON(htab_bolt_mapping(base, base + size, __pa(base), > prot, mmu_linear_psize, mmu_kernel_ssize)); > } > > [2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80; > > kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815! > > ... > > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -660,6 +660,7 @@ static void generic_online_page(struct page *page) > { > __online_page_set_limits(page); > __online_page_increment_counters(page); > + kernel_map_pages(page, 1, 1); > __online_page_free(page); > } This code was changed a lot by Arun's "mm/page_alloc.c: memory hotplug: free pages as higher order".
I don't think hotplug+DEBUG_PAGEALLOC is important enough to disrupt memory_hotplug-free-pages-as-higher-order.patch, which took a long time to sort out. So could you please take a look at linux-next, determine whether the problem is still there and propose a suitable patch? Thanks.
Re: next/master boot bisection: next-20190215 on beaglebone-black
On Fri, 1 Mar 2019 09:25:24 +0100 Guillaume Tucker wrote: > >>> Michal had asked if the free space accounting fix up addressed this > >>> boot regression? I was awaiting word on that. > >> > >> hm, does b...@kernelci.org actually read emails? Let's try info@ as well.. > > b...@kernelci.org is not person, it's a send-only account for > automated reports. So no, it doesn't read emails. > > I guess the tricky point here is that the authors of the commits > found by bisections may not always have the hardware needed to > reproduce the problem. So it needs to be dealt with on a > case-by-case basis: sometimes they do have the hardware, > sometimes someone else on the list or on CC does, and sometimes > it's better for the people who have access to the test lab which > ran the KernelCI test to deal with it. > > This case seems to fall into the last category. As I have access > to the Collabora lab, I can do some quick checks to confirm > whether the proposed patch does fix the issue. I hadn't realised > that someone was waiting for this to happen, especially as the > BeagleBone Black is a very common platform. Sorry about that, > I'll take a look today. > > It may be a nice feature to be able to give access to the > KernelCI test infrastructure to anyone who wants to debug an > issue reported by KernelCI or verify a fix, so they won't need to > have the hardware locally. Something to think about for the > future. Thanks, that all sounds good. > >> Is it possible to determine whether this regression is still present in > >> current linux-next? > > I'll try to re-apply the patch that caused the issue, then see if > the suggested change fixes it. As far as the current linux-next > master branch is concerned, KernelCI boot tests are passing fine > on that platform. They would, because I dropped mm-shuffle-default-enable-all-shuffling.patch, so your tests presumably now have shuffling disabled. Is it possible to add the below to linux-next and try again? 
Or I can re-add this to linux-next. Where should we go to determine the results of such a change? There are a heck of a lot of results on https://kernelci.org/boot/ and entering "beaglebone-black" doesn't get me anything. Thanks. From: Dan Williams Subject: mm/shuffle: default enable all shuffling Per Andrew's request arrange for all memory allocation shuffling code to be enabled by default. The page_alloc.shuffle command line parameter can still be used to disable shuffling at boot, but the kernel will default enable the shuffling if the command line option is not specified. Link: http://lkml.kernel.org/r/154943713572.3858443.11206307988382889377.st...@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Cc: Kees Cook Cc: Michal Hocko Cc: Dave Hansen Cc: Keith Busch Signed-off-by: Andrew Morton --- init/Kconfig |4 ++-- mm/shuffle.c |4 ++-- mm/shuffle.h |2 +- 3 files changed, 5 insertions(+), 5 deletions(-) --- a/init/Kconfig~mm-shuffle-default-enable-all-shuffling +++ a/init/Kconfig @@ -1709,7 +1709,7 @@ config SLAB_MERGE_DEFAULT command line. config SLAB_FREELIST_RANDOM - default n + default y depends on SLAB || SLUB bool "SLAB freelist randomization" help @@ -1728,7 +1728,7 @@ config SLAB_FREELIST_HARDENED config SHUFFLE_PAGE_ALLOCATOR bool "Page allocator randomization" - default SLAB_FREELIST_RANDOM && ACPI_NUMA + default y help Randomization of the page allocator improves the average utilization of a direct-mapped memory-side-cache. 
See section --- a/mm/shuffle.c~mm-shuffle-default-enable-all-shuffling +++ a/mm/shuffle.c @@ -9,8 +9,8 @@ #include "internal.h" #include "shuffle.h" -DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key); -static unsigned long shuffle_state __ro_after_init; +DEFINE_STATIC_KEY_TRUE(page_alloc_shuffle_key); +static unsigned long shuffle_state __ro_after_init = 1 << SHUFFLE_ENABLE; /* * Depending on the architecture, module parameter parsing may run --- a/mm/shuffle.h~mm-shuffle-default-enable-all-shuffling +++ a/mm/shuffle.h @@ -19,7 +19,7 @@ enum mm_shuffle_ctl { #define SHUFFLE_ORDER (MAX_ORDER-1) #ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR -DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key); +DECLARE_STATIC_KEY_TRUE(page_alloc_shuffle_key); extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl); extern void __shuffle_free_memory(pg_data_t *pgdat); static inline void shuffle_free_memory(pg_data_t *pgdat) _
Re: [PATCH] log2: make is_power_of_2() integer constant expression when possible
On Fri, 1 Mar 2019 14:52:07 +0200 Jani Nikula wrote: > While is_power_of_2() is an inline function and likely gets optimized > for compile time constant arguments, it still doesn't produce an integer > constant expression that could be used in, say, static data > initialization or case labels. hm, what code wants to do these things? > Make is_power_of_2() an integer constant expression when possible, > otherwise using the inline function to avoid multiple evaluation of the > parameter. Spose so. But I fear that some gcc versions will get it right and others will mess it up. While this patch is under test I think it would be best to also have at least one or two callsites which actually use the compile-time evaluation feature. Possible?
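A sketch of the difference, using the standard bit trick. The real kernel patch additionally dispatches on __builtin_constant_p() so that non-constant arguments still go through the inline function and are evaluated only once; this shows just the constant-expression half and one compile-time use site of the kind the changelog has in mind:

```c
#include <assert.h>
#include <stdbool.h>

/* Inline form: correct, evaluates `n' once, but the call is not an
 * integer constant expression, so it cannot appear in case labels or
 * static data initializers. */
static inline bool is_power_of_2_fn(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Constant-expression form: pure arithmetic on `n', so when `n' is a
 * compile-time constant the whole expression is one too.  Note `n' is
 * evaluated three times, which is why the kernel macro falls back to
 * the inline function for non-constant arguments. */
#define IS_POWER_OF_2(n) ((n) != 0 && (((n) & ((n) - 1)) == 0))

/* Usable where a constant expression is required, e.g. static data
 * initialization -- the inline function could not be used here: */
static const int table_ok = IS_POWER_OF_2(64);
```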
Re: next/master boot bisection: next-20190215 on beaglebone-black
On Tue, 26 Feb 2019 16:04:04 -0800 Dan Williams wrote: > On Tue, Feb 26, 2019 at 4:00 PM Andrew Morton > wrote: > > > > On Fri, 15 Feb 2019 18:51:51 + Mark Brown wrote: > > > > > On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote: > > > > On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" > > > > wrote: > > > > > > > > Details:https://kernelci.org/boot/id/5c666ea959b514b017fe6017 > > > > > Plain log: > > > > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt > > > > > HTML log: > > > > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html > > > > > > > Thanks. > > > > > > > But what actually went wrong? Kernel doesn't boot? > > > > > > The linked logs show the kernel dying early in boot before the console > > > comes up so yeah. There should be kernel output at the bottom of the > > > logs. > > > > I assume Dan is distracted - I'll keep this patchset on hold until we > > can get to the bottom of this. > > Michal had asked if the free space accounting fix up addressed this > boot regression? I was awaiting word on that. hm, does b...@kernelci.org actually read emails? Let's try info@ as well.. Is it possible to determine whether this regression is still present in current linux-next? > I assume you're not willing to entertain a "depends > NOT_THIS_ARM_BOARD" hack in the meantime? We'd probably never be able to remove it. And we don't know whether other systems might be affected.
Re: [PATCH v5] panic: Avoid the extra noise dmesg
On Fri, 22 Feb 2019 14:09:59 +0800 Feng Tang wrote: > When kernel panic happens, it will first print the panic call stack, > then the ending msg like: > > [ 35.743249] ---[ end Kernel panic - not syncing: Fatal exception > [ 35.749975] [ cut here ] > > The above message are very useful for debugging. > > But if system is configured to not reboot on panic, say the "panic_timeout" > parameter equals 0, it will likely print out many noisy message like > WARN() call stack for each and every CPU except the panic one, messages > like below: > > WARNING: CPU: 1 PID: 280 at kernel/sched/core.c:1198 > set_task_cpu+0x183/0x190 > Call Trace: > > try_to_wake_up > default_wake_function > autoremove_wake_function > __wake_up_common > __wake_up_common_lock > __wake_up > wake_up_klogd_work_func > irq_work_run_list > irq_work_tick > update_process_times > tick_sched_timer > __hrtimer_run_queues > hrtimer_interrupt > smp_apic_timer_interrupt > apic_timer_interrupt It's a fairly ugly-looking patch but I am inclined to agree. The panicing CPU is spinning and blinking a LED and all CPUs have interrupts enabled and the system is known to be in a messed up state. All sorts of kernel code could emit all sorts of output in such circumstances. So a global printk-killing knob seems appropriate. Thoughts: - why do the suppression in vprintk_emit()? Doing it right at entry to printk() seems cleaner, more explicit? - Other code sites may wish to suppress all printks. Perhaps `panic_suppress_printk' should just be called `suppress_printk'?
Re: [PATCH v2 2/4] mm: remove zone_lru_lock() function access ->lru_lock directly
On Thu, 28 Feb 2019 05:53:37 -0700 William Kucharski wrote: > > On Feb 28, 2019, at 1:33 AM, Andrey Ryabinin > > wrote: > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index a9852ed7b97f..2d081a32c6a8 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1614,8 +1614,8 @@ static __always_inline void update_lru_sizes(struct > > lruvec *lruvec, > > > > } > > > > -/* > > - * zone_lru_lock is heavily contended. Some of the functions that > > +/** > > Nit: Remove the extra asterisk here; the line should then revert to being > unchanged from > the original. I don't think so. This kerneldoc comment was missing its leading /**. The patch fixes that.
Re: next/master boot bisection: next-20190215 on beaglebone-black
On Fri, 15 Feb 2019 18:51:51 + Mark Brown wrote: > On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote: > > On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" > > wrote: > > > > Details:https://kernelci.org/boot/id/5c666ea959b514b017fe6017 > > > Plain log: > > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt > > > HTML log: > > > https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html > > > Thanks. > > > But what actually went wrong? Kernel doesn't boot? > > The linked logs show the kernel dying early in boot before the console > comes up so yeah. There should be kernel output at the bottom of the > logs. I assume Dan is distracted - I'll keep this patchset on hold until we can get to the bottom of this.
Re: [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region
[patch 1/4]: OK. I guess. Was this worth consuming our last PF_ flag? [patch 2/4]: unreviewed [patch 3/4]: unreviewed, mpe still unhappy, I expect? [patch 4/4]: unreviewed
Re: [PATCH] mm, memcg: Handle cgroup_disable=memory when getting memcg protection
a desirable value because otherwise as soon as you exceed memory.low, all protection is lost, and all pages are eligible to scan again. By contrast, having a gradual ramp in reclaim pressure means that you now still get some protection when thresholds are exceeded, which means that one can now be more comfortable setting memory.low to lower values without worrying that all protection will be lost. This is important because workingset size is really hard to know exactly, especially with variable workloads, so at least getting *some* protection if your workingset size grows larger than you expect increases user confidence in setting memory.low without a huge buffer on top being needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary. To test this, I experimented stepping further and further down
   memory.low protection on a workload that floats around 19G workingset
   when under memory.low protection, watching page scan rates for the
   workload cgroup:

    +------------+-----------------+--------------------+--------------+
    | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
    +------------+-----------------+--------------------+--------------+
    |        21G |               0 |                  0 |          N/A |
    |        17G |             867 |               3799 |          23% |
    |        12G |            1203 |               3543 |          34% |
    |         8G |            2534 |               3979 |          64% |
    |         4G |            3980 |               4147 |          96% |
    |          0 |            3799 |               3980 |          95% |
    +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this patch)
   ramps up page scanning significantly more gradually than the control
   kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in premature
   OOMs. To test this, I wrote a script that slowly increments the number
   of pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention. This script runs in a highly
   protected slice taking up the majority of available system memory.
Watching vmstat revealed that page scanning continued essentially nominally between test and control, without causing forward reclaim progress to become arrested. [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project [a...@linux-foundation.org: reflow block comments to fit in 80 cols] [ch...@chrisdown.name: handle cgroup_disable=memory when getting memcg protection] Link: http://lkml.kernel.org/r/20190201045711.ga18...@chrisdown.name Link: http://lkml.kernel.org/r/20190124014455.ga6...@chrisdown.name Signed-off-by: Chris Down Acked-by: Johannes Weiner Reviewed-by: Roman Gushchin Cc: Michal Hocko Cc: Tejun Heo Cc: Dennis Zhou Cc: Tetsuo Handa Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 20 +++-- include/linux/memcontrol.h | 20 + mm/memcontrol.c |5 + mm/vmscan.c | 82 -- 4 files changed, 115 insertions(+), 12 deletions(-) --- a/Documentation/admin-guide/cgroup-v2.rst~mm-proportional-memorylowmin-reclaim +++ a/Documentation/admin-guide/cgroup-v2.rst @@ -606,8 +606,8 @@ on an IO device and is an example of thi Protections --- -A cgroup is protected to be allocated upto the configured amount of -the resource if the usages of all its ancestors are under their +A cgroup is protected upto the configured amount of the resource +as long as the usages of all its ancestors are under their protected levels. Protections can be hard guarantees or best effort soft boundaries. Protections can also be over-committed in which case only upto the amount available to the parent is protected among @@ -1020,7 +1020,10 @@ PAGE_SIZE multiple when read back. is within its effective min boundary, the cgroup's memory won't be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer - is invoked. + is invoked. 
Above the effective min boundary (or + effective low boundary if it is higher), pages are reclaimed + proportionally to the overage, reducing reclaim pressure for + smaller overages. Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment @@ -1042,7 +1045,10 @@ PAGE_SIZE multiple when read back. Best-effort memory protection. If the memory usage of a cgroup is within its effective low boundary, the cgroup's
Re: [PATCH v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()
> > The number of node specific huge pages can be set via a file such as: > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages > When a node specific value is specified, the global number of huge > pages must also be adjusted. This adjustment is calculated as the > specified node specific value + (global value - current node value). > If the node specific value provided by the user is large enough, this > calculation could overflow an unsigned long leading to a smaller > than expected number of huge pages. > > To fix, check the calculation for overflow. If overflow is detected, > use ULONG_MAX as the requested value. This is inline with the user > request to allocate as many huge pages as possible. > > It was also noticed that the above calculation was done outside the > hugetlb_lock. Therefore, the values could be inconsistent and result > in underflow. To fix, the calculation is moved to within the routine > set_max_huge_pages() where the lock is held. > > ... > > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h, > nodemask_t *nodes_allowed, Please tweak that email client to prevent the wordwraps. > + /* > + * Check for a node specific request. Adjust global count, but > + * restrict alloc/free to the specified node. > + */ > + if (nid != NUMA_NO_NODE) { > + unsigned long old_count = count; > + count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; > + /* > + * If user specified count causes overflow, set to > + * largest possible value. > + */ > + if (count < old_count) > + count = ULONG_MAX; > + } The above two comments explain the code, but do not reveal the reasoning behind the policy decisions which that code implements. > ... > > + } else { > /* > - * per node hstate attribute: adjust count to global, > - * but restrict alloc/free to the specified node. > + * Node specific request, but we could not allocate > + * node mask. Pass in ALL nodes, and clear nid. 
>*/ Ditto here, somewhat. The old mantra: comments should explain "why", not "what". Reading the code tells us the "what". Thanks.
Re: [PATCH] mm,mremap: Bail out earlier in mremap_to under map pressure
On Tue, 26 Feb 2019 10:13:14 +0100 Oscar Salvador wrote: > When using mremap() syscall in addition to MREMAP_FIXED flag, > mremap() calls mremap_to() which does the following: > > 1) unmaps the destination region where we are going to move the map > 2) If the new region is going to be smaller, we unmap the last part >of the old region > > Then, we will eventually call move_vma() to do the actual move. > > move_vma() checks whether we are at least 4 maps below max_map_count > before going further, otherwise it bails out with -ENOMEM. > The problem is that we might have already unmapped the vma's in steps > 1) and 2), so it is not possible for userspace to figure out the state > of the vma's after it gets -ENOMEM, and it gets tricky for userspace > to clean up properly on error path. > > While it is true that we can return -ENOMEM for more reasons > (e.g: see may_expand_vm() or move_page_tables()), I think that we can > avoid this scenario in concret if we check early in mremap_to() if the > operation has high chances to succeed map-wise. > > Should not be that the case, we can bail out before we even try to unmap > anything, so we make sure the vma's are left untouched in case we are likely > to be short of maps. > > The thumb-rule now is to rely on the worst-scenario case we can have. > That is when both vma's (old region and new region) are going to be split > in 3, so we get two more maps to the ones we already hold (one per each). > If current map count + 2 maps still leads us to 4 maps below the threshold, > we are going to pass the check in move_vma(). > > Of course, this is not free, as it might generate false positives when it is > true that we are tight map-wise, but the unmap operation can release several > vma's leading us to a good state. > > Another approach was also investigated [1], but it may be too much hassle > for what it brings. > How is this going to affect existing userspace which is aware of the current behaviour? 
And how does it affect your existing cleanup code, come to that? Does it work as well or better after this change?
Re: [PATCH] kernel/hung_task.c: Use continuously blocked time when reporting.
On Tue, 26 Feb 2019 18:58:03 +0900 Tetsuo Handa wrote: > Since commit a2e514453861dd39 ("kernel/hung_task.c: allow to set checking > interval separately from timeout") added hung_task_check_interval_secs, > setting a value different from hung_task_timeout_secs > > echo 0 > /proc/sys/kernel/hung_task_panic > echo 120 > /proc/sys/kernel/hung_task_timeout_secs > echo 5 > /proc/sys/kernel/hung_task_check_interval_secs > > causes confusing output as if the task was blocked for > hung_task_timeout_secs seconds from the previous report. > > [ 399.395930] INFO: task kswapd0:75 blocked for more than 120 seconds. > [ 405.027637] INFO: task kswapd0:75 blocked for more than 120 seconds. > [ 410.659725] INFO: task kswapd0:75 blocked for more than 120 seconds. > [ 416.292860] INFO: task kswapd0:75 blocked for more than 120 seconds. > [ 421.932305] INFO: task kswapd0:75 blocked for more than 120 seconds. > > Although we could update t->last_switch_time after sched_show_task(t) if > we want to report only every 120 seconds, reporting every 5 seconds might > not be very bad for monitoring after a problematic situation has started. > Thus, let's use continuously blocked time instead of updating previously > reported time. > > [ 677.985011] INFO: task kswapd0:80 blocked for more than 122 seconds. > [ 693.856126] INFO: task kswapd0:80 blocked for more than 138 seconds. > [ 709.728075] INFO: task kswapd0:80 blocked for more than 154 seconds. > [ 725.600018] INFO: task kswapd0:80 blocked for more than 170 seconds. > [ 741.473133] INFO: task kswapd0:80 blocked for more than 186 seconds. hm, maybe. The user asked for a report if a task was blocked for more than 120 seconds, so the kernel is accurately reporting that this happened. The actual blockage time is presumably useful information, but that's a different thing. 
We could report both: "blocked for 122 seconds which is more than 120 seconds" But on the other hand, what is the point in telling the user how they configured their own kernel? I think I'll stop arguing with myself and apply the patch ;)
Re: [PATCH 2/2] test_firmware: silence underflow warning in test_dev_config_update_u8()
On Thu, 21 Feb 2019 22:18:56 +0300 Dan Carpenter wrote:
> On Thu, Feb 21, 2019 at 10:54:58AM -0800, Andrew Morton wrote:
> > On Thu, 21 Feb 2019 21:38:26 +0300 Dan Carpenter wrote:
> >
> > > We put an upper bound on "new" but we don't check for negatives.
> >
> > U8_MAX has unsigned type, so `if (new > U8_MAX)' does check for negative.
>
> No, doesn't work in this case.
>
> #define U8_MAX ((u8)~0U)
>
> It would need to unsigned long for the type promotion to prevent
> negatives, but it starts as unsigned int, then unsigned char, which is
> type promoted to int.

OK.

> It would be more clear to just write it as:
>
> #define U8_MAX 0xff

That doesn't work either.  Tricky.

#include <stdio.h>

typedef unsigned char u8;
#define U8_MAX 0xff

int main(int argc, char *argv[])
{
	long new;

	new = -20;
	if (new > U8_MAX)
		printf("over\n");
}
Re: [PATCH] huegtlbfs: fix races and page leaks during migration
On Thu, 21 Feb 2019 11:11:06 -0800 Mike Kravetz wrote: > > Sorry for the churn. As I find and fix one issue I seem to discover another. > There is still at least one more issue with private pages when COW comes into > play. I continue to work that. I wanted to send this patch earlier as it > is pretty easy to hit the bugs if you try. If you would prefer another > approach, let me know. > No probs, the bug doesn't seem to be causing a lot of bother out there and it's cc:stable; there's time to get this right ;) Here's the delta I queued: --- a/mm/hugetlb.c~huegtlbfs-fix-races-and-page-leaks-during-migration-update +++ a/mm/hugetlb.c @@ -3729,6 +3729,7 @@ static vm_fault_t hugetlb_no_page(struct pte_t new_pte; spinlock_t *ptl; unsigned long haddr = address & huge_page_mask(h); + bool new_page = false; /* * Currently, we are forced to kill the process in the event the @@ -3790,6 +3791,7 @@ retry: } clear_huge_page(page, address, pages_per_huge_page(h)); __SetPageUptodate(page); + new_page = true; if (vma->vm_flags & VM_MAYSHARE) { int err = huge_add_to_page_cache(page, mapping, idx); @@ -3861,8 +3863,9 @@ retry: spin_unlock(ptl); - /* May already be set if not newly allocated page */ - set_page_huge_active(page); + /* Make newly allocated pages active */ + if (new_page) + set_page_huge_active(page); unlock_page(page); out: --- a/mm/migrate.c~huegtlbfs-fix-races-and-page-leaks-during-migration-update +++ a/mm/migrate.c @@ -1315,6 +1315,16 @@ static int unmap_and_move_huge_page(new_ lock_page(hpage); } + /* +* Check for pages which are in the process of being freed. Without +* page_mapping() set, hugetlbfs specific move page routine will not +* be called and we could leak usage counts for subpools. +*/ + if (page_private(hpage) && !page_mapping(hpage)) { + rc = -EBUSY; + goto out_unlock; + } + if (PageAnon(hpage)) anon_vma = page_get_anon_vma(hpage); @@ -1345,6 +1355,7 @@ put_anon: put_new_page = NULL; } +out_unlock: unlock_page(hpage); out: if (rc != -EAGAIN) _
Re: BUG: unable to handle kernel NULL pointer dereference in __generic_file_write_iter
On Thu, 21 Feb 2019 06:52:04 -0800 syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit:4aa9fc2a435a Revert "mm, memory_hotplug: initialize struct.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=1101382f40 > kernel config: https://syzkaller.appspot.com/x/.config?x=4fceea9e2d99ac20 > dashboard link: https://syzkaller.appspot.com/bug?extid=ca95b2b7aef9e7cbd6ab > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > > Unfortunately, I don't have any reproducer for this crash yet. Not understanding. That seems to be saying that we got a NULL pointer deref in __generic_file_write_iter() at written = generic_perform_write(file, from, iocb->ki_pos); which isn't possible. I'm not seeing recent changes in there which could have caused this. Help. > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+ca95b2b7aef9e7cbd...@syzkaller.appspotmail.com > > BUG: unable to handle kernel NULL pointer dereference at > #PF error: [INSTR] > PGD a7ea0067 P4D a7ea0067 PUD 81535067 PMD 0 > Oops: 0010 [#1] PREEMPT SMP KASAN > CPU: 0 PID: 15924 Comm: syz-executor0 Not tainted 5.0.0-rc4+ #50 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010: (null) > Code: Bad RIP value. 
> RSP: 0018:88804c3d7858 EFLAGS: 00010246 > RAX: RBX: 885aeb60 RCX: 005b > RDX: RSI: 88807ec22930 RDI: 8880a59bdcc0 > RBP: 88804c3d79b8 R08: R09: 88804c3d7910 > R10: 8880835ca200 R11: R12: 005b > R13: 88804c3d7c98 R14: dc00 R15: > FS: 7f3456db4700() GS:8880ae60() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: ffd6 CR3: 814ac000 CR4: 001406f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > __generic_file_write_iter+0x25e/0x630 mm/filemap.c: > ext4_file_write_iter+0x37a/0x1410 fs/ext4/file.c:266 > call_write_iter include/linux/fs.h:1862 [inline] > new_sync_write fs/read_write.c:474 [inline] > __vfs_write+0x764/0xb40 fs/read_write.c:487 > vfs_write+0x20c/0x580 fs/read_write.c:549 > ksys_write+0x105/0x260 fs/read_write.c:598 > __do_sys_write fs/read_write.c:610 [inline] > __se_sys_write fs/read_write.c:607 [inline] > __x64_sys_write+0x73/0xb0 fs/read_write.c:607 > do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x458089 > Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff > ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 > RSP: 002b:7f3456db3c78 EFLAGS: 0246 ORIG_RAX: 0001 > RAX: ffda RBX: 0003 RCX: 00458089 > RDX: 005b RSI: 2240 RDI: 0003 > RBP: 0073bf00 R08: R09: > R10: R11: 0246 R12: 7f3456db46d4 > R13: 004c7450 R14: 004dce68 R15: > Modules linked in: > CR2: > ---[ end trace 5cac9d2c75a59916 ]--- > kobject: 'loop5' (4426a409): kobject_uevent_env > RIP: 0010: (null) > Code: Bad RIP value. 
> RSP: 0018:88804c3d7858 EFLAGS: 00010246 > RAX: RBX: 885aeb60 RCX: 005b > kobject: 'loop5' (4426a409): fill_kobj_path: path > = '/devices/virtual/block/loop5' > RDX: RSI: 88807ec22930 RDI: 8880a59bdcc0 > kobject: 'loop2' (b82e0c58): kobject_uevent_env > kobject: 'loop2' (b82e0c58): fill_kobj_path: path > = '/devices/virtual/block/loop2' > RBP: 88804c3d79b8 R08: R09: 88804c3d7910 > R10: 8880835ca200 R11: R12: 005b > R13: 88804c3d7c98 R14: dc00 R15: > kobject: 'loop5' (4426a409): kobject_uevent_env > FS: 7f3456db4700() GS:8880ae60() knlGS: > kobject: 'loop5' (4426a409): fill_kobj_path: path > = '/devices/virtual/block/loop5' > CS: 0010 DS: ES: CR0: 80050033 > CR2: 022029a0 CR3: 814ac000 CR4: 001406f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > > > --- > This bug is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this bug report. See: >
Re: general protection fault in relay_switch_subbuf
(cc Jens ;)) On Thu, 21 Feb 2019 06:54:03 -0800 syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit:c04e2a780caf Merge tag 'fsnotify_for_v5.0-rc4' of git://gi.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=133424c0c0 > kernel config: https://syzkaller.appspot.com/x/.config?x=505743eba4e4f68 > dashboard link: https://syzkaller.appspot.com/bug?extid=8e78f280ccd6930f > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > > Unfortunately, I don't have any reproducer for this crash yet. > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+8e78f280ccd69...@syzkaller.appspotmail.com > > kasan: GPF could be caused by NULL-ptr deref or user memory access > general protection fault: [#1] PREEMPT SMP KASAN > CPU: 1 PID: 11600 Comm: syz-executor2 Not tainted 5.0.0-rc3+ #42 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010:relay_switch_subbuf+0x27a/0xad0 kernel/relay.c:753 > Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 e2 07 00 00 48 b8 00 00 00 00 > 00 fc ff df 48 8b 4b 58 48 8d 79 50 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f > 85 c9 07 00 00 48 ba 00 00 00 00 00 fc ff df 49 8b > RSP: 0018:88805e3ef790 EFLAGS: 00010206 > RAX: dc00 RBX: 88804bc7db40 RCX: > RDX: 000a RSI: 8182d552 RDI: 0050 > RBP: 88805e3ef850 R08: 8880637f0600 R09: fbfff133b0b1 > R10: 88805e3ef850 R11: 899d8587 R12: > R13: 8880a7e05e00 R14: R15: > FS: 7f1fdbddd700() GS:8880ae70() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 00a50489 CR3: 943a CR4: 001406e0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > relay_flush kernel/relay.c:881 [inline] > relay_flush+0x1c4/0x280 kernel/relay.c:865 > __blk_trace_startstop.isra.0+0x28b/0x8b0 kernel/trace/blktrace.c:668 > blk_trace_ioctl+0x1c6/0x300 kernel/trace/blktrace.c:727 > blkdev_ioctl+0x141/0x2120 block/ioctl.c:591 > block_ioctl+0xee/0x130 fs/block_dev.c:1914 > vfs_ioctl fs/ioctl.c:46 
[inline] > file_ioctl fs/ioctl.c:509 [inline] > do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696 > ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 > __do_sys_ioctl fs/ioctl.c:720 [inline] > __se_sys_ioctl fs/ioctl.c:718 [inline] > __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 > do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x458099 > Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff > ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 > RSP: 002b:7f1fdbddcc78 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 0003 RCX: 00458099 > RDX: RSI: 1275 RDI: 0005 > RBP: 0073bf00 R08: R09: > R10: R11: 0246 R12: 7f1fdbddd6d4 > R13: 004bf595 R14: 004d0e88 R15: > Modules linked in: > kobject: 'loop4' (5903bd4d): kobject_uevent_env > kobject: 'loop4' (5903bd4d): fill_kobj_path: path > = '/devices/virtual/block/loop4' > ---[ end trace d4380d594e5099d1 ]--- > RIP: 0010:relay_switch_subbuf+0x27a/0xad0 kernel/relay.c:753 > Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 e2 07 00 00 48 b8 00 00 00 00 > 00 fc ff df 48 8b 4b 58 48 8d 79 50 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f > 85 c9 07 00 00 48 ba 00 00 00 00 00 fc ff df 49 8b > kobject: 'loop0' (83b22ec7): kobject_uevent_env > kobject: 'loop0' (83b22ec7): fill_kobj_path: path > = '/devices/virtual/block/loop0' > Unknown ioctl 19304 > kobject: 'loop3' (50016e38): kobject_uevent_env > RSP: 0018:88805e3ef790 EFLAGS: 00010206 > kobject: 'loop3' (50016e38): fill_kobj_path: path > = '/devices/virtual/block/loop3' > RAX: dc00 RBX: 88804bc7db40 RCX: > kobject: 'loop1' (2962ffee): kobject_uevent_env > kobject: 'loop1' (2962ffee): fill_kobj_path: path > = '/devices/virtual/block/loop1' > RDX: 000a RSI: 8182d552 RDI: 0050 > kobject: 'loop4' (5903bd4d): kobject_uevent_env > kobject: 'loop4' (5903bd4d): fill_kobj_path: path > = '/devices/virtual/block/loop4' > RBP: 88805e3ef850 R08: 8880637f0600 R09: fbfff133b0b1 > 
kobject: 'loop4' (5903bd4d): kobject_uevent_env > R10: 88805e3ef850 R11: 899d8587 R12: > kobject: 'loop4' (5903bd4d): fill_kobj_path: path > = '/devices/virtual/block/loop4' > R13:
Re: [PATCH 2/2] test_firmware: silence underflow warning in test_dev_config_update_u8()
On Thu, 21 Feb 2019 21:38:26 +0300 Dan Carpenter wrote: > We put an upper bound on "new" but we don't check for negatives. U8_MAX has unsigned type, so `if (new > U8_MAX)' does check for negative. > In > this case the underflow doesn't matter very much, but we may as well > make the static checker happy. > > ... > > --- a/lib/test_firmware.c > +++ b/lib/test_firmware.c > @@ -326,15 +326,12 @@ static ssize_t test_dev_config_show_int(char *buf, int > cfg) > static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg) > { > int ret; > - long new; > + u8 new; > > - ret = kstrtol(buf, 10, ); > + ret = kstrtou8(buf, 10, ); > if (ret) > return ret; > > - if (new > U8_MAX) > - return -EINVAL; > - > mutex_lock(_fw_mutex); > *(u8 *)cfg = new; > mutex_unlock(_fw_mutex); if *buf=="257", previous behavior: -EINVAL new behavior: *cfg = 1 yes? The old behavior seems better.
Re: [PATCH] huegtlbfs: fix races and page leaks during migration
On Tue, 12 Feb 2019 14:14:00 -0800 Mike Kravetz wrote: > hugetlb pages should only be migrated if they are 'active'. The routines > set/clear_page_huge_active() modify the active state of hugetlb pages. > When a new hugetlb page is allocated at fault time, set_page_huge_active > is called before the page is locked. Therefore, another thread could > race and migrate the page while it is being added to page table by the > fault code. This race is somewhat hard to trigger, but can be seen by > strategically adding udelay to simulate worst case scheduling behavior. > Depending on 'how' the code races, various BUG()s could be triggered. > > To address this issue, simply delay the set_page_huge_active call until > after the page is successfully added to the page table. > > Hugetlb pages can also be leaked at migration time if the pages are > associated with a file in an explicitly mounted hugetlbfs filesystem. > For example, a test program which hole punches, faults and migrates > pages in such a file (1G in size) will eventually fail because it > can not allocate a page. Reported counts and usage at time of failure: > > node0 > 537 free_hugepages > 1024nr_hugepages > 0 surplus_hugepages > node1 > 1000free_hugepages > 1024nr_hugepages > 0 surplus_hugepages > > Filesystem Size Used Avail Use% Mounted on > nodev 4.0G 4.0G 0 100% /var/opt/hugepool > > Note that the filesystem shows 4G of pages used, while actual usage is > 511 pages (just under 1G). Failed trying to allocate page 512. > > If a hugetlb page is associated with an explicitly mounted filesystem, > this information in contained in the page_private field. At migration > time, this information is not preserved. To fix, simply transfer > page_private from old to new page at migration time if necessary. > > Cc: > Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active") > Signed-off-by: Mike Kravetz cc:stable. It would be nice to get some review of this one, please? 
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -859,6 +859,18 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
>  	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
>  	if (rc != MIGRATEPAGE_SUCCESS)
>  		return rc;
> +
> +	/*
> +	 * page_private is subpool pointer in hugetlb pages.  Transfer to
> +	 * new page.  PagePrivate is not associated with page_private for
> +	 * hugetlb pages and can not be set here as only page_huge_active
> +	 * pages can be migrated.
> +	 */
> +	if (page_private(page)) {
> +		set_page_private(newpage, page_private(page));
> +		set_page_private(page, 0);
> +	}
> +
>  	if (mode != MIGRATE_SYNC_NO_COPY)
>  		migrate_page_copy(newpage, page);
>  	else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a80832487981..f859e319e3eb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3625,7 +3625,6 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  	copy_user_huge_page(new_page, old_page, address, vma,
>  			    pages_per_huge_page(h));
>  	__SetPageUptodate(new_page);
> -	set_page_huge_active(new_page);
>
>  	mmun_start = haddr;
>  	mmun_end = mmun_start + huge_page_size(h);
> @@ -3647,6 +3646,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  				make_huge_pte(vma, new_page, 1));
>  		page_remove_rmap(old_page, true);
>  		hugepage_add_new_anon_rmap(new_page, vma, haddr);
> +		set_page_huge_active(new_page);
>  		/* Make the old page be freed below */
>  		new_page = old_page;
>  	}
> @@ -3792,7 +3792,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  	}
>  	clear_huge_page(page, address, pages_per_huge_page(h));
>  	__SetPageUptodate(page);
> -	set_page_huge_active(page);
>
>  	if (vma->vm_flags & VM_MAYSHARE) {
>  		int err = huge_add_to_page_cache(page, mapping, idx);
> @@ -3863,6 +3862,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  	}
>
>  	spin_unlock(ptl);
> +
> +	/* May already be set if not newly allocated page */
> +	set_page_huge_active(page);
> +
>  	unlock_page(page);
> out:
>  	return ret;
> @@ -4097,7 +4100,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	 * the set_pte_at() write.
>  	 */
>  	__SetPageUptodate(page);
> -	set_page_huge_active(page);
>
>  	mapping = dst_vma->vm_file->f_mapping;
>  	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> @@ -4165,6 +4167,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	update_mmu_cache(dst_vma, dst_addr, dst_pte);
>
>  	spin_unlock(ptl);
> +
Re: [PATCH -next] mm: fix set but not used warning
On Tue, 19 Feb 2019 18:28:30 + "Kani, Toshi" wrote:

> On Mon, 2019-02-18 at 13:57 +, YueHaibing wrote:
> > Fixes gcc '-Wunused-but-set-variable' warning:
> >
> > lib/ioremap.c: In function 'ioremap_page_range':
> > lib/ioremap.c:203:16: warning:
> >  variable 'start' set but not used [-Wunused-but-set-variable]
> >
> > flush_cache_vmap may not need its params, so add __maybe_unused to fix
> > this warning.
>
> I think flush_cache_vmap should have a proper prototype with
> __maybe_unused for its args.  But, there are so many arch-specific
> definitions at a glance, and I am not sure how manageable such a change
> is, though...

This is one of the reasons why things like flush_cache_vmap() should be
implemented as static inline C functions, not as macros.
Re: next/master boot bisection: next-20190215 on beaglebone-black
On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" wrote:

> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> * This automated bisection report was sent to you on the basis  *
> * that you may be involved with the breaking commit it has      *
> * found.  No manual investigation has been done to verify it,   *
> * and the root cause of the problem may be somewhere else.      *
> * Hope this helps!                                              *
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
>
> next/master boot bisection: next-20190215 on beaglebone-black
>
> Summary:
>   Start:      7a92eb7cc1dc Add linux-next specific files for 20190215
>   Details:    https://kernelci.org/boot/id/5c666ea959b514b017fe6017
>   Plain log:  https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
>   HTML log:   https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
>   Result:     8dd037cc97d9 mm/shuffle: default enable all shuffling
>
> Checks:
>   revert:     PASS
>   verify:     PASS
>
> Parameters:
>   Tree:       next
>   URL:        git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>   Branch:     master
>   Target:     beaglebone-black
>   CPU arch:   arm
>   Lab:        lab-collabora
>   Compiler:   gcc-7
>   Config:     multi_v7_defconfig+CONFIG_SMP=n
>   Test suite: boot
>
> Breaking commit found:
>
> ---
> commit 8dd037cc97d9226c97c2ee1abb4e97eff71e0c8d
> Author: Dan Williams
> Date:   Fri Feb 15 11:28:30 2019 +1100
>
>     mm/shuffle: default enable all shuffling

Thanks.  But what actually went wrong?  Kernel doesn't boot?
Re: next/master boot bisection: next-20190215 on beaglebone-black
On Fri, 15 Feb 2019 18:51:51 + Mark Brown wrote:

> On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote:
> > On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" wrote:
> > > Details:    https://kernelci.org/boot/id/5c666ea959b514b017fe6017
> > > Plain log:  https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
> > > HTML log:   https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
> >
> > Thanks.
> > But what actually went wrong?  Kernel doesn't boot?
>
> The linked logs show the kernel dying early in boot before the console
> comes up so yeah.  There should be kernel output at the bottom of the
> logs.

OK, thanks.  Well, we have a result.  Stephen, can we please drop
mm-shuffle-default-enable-all-shuffling.patch for now?
Re: tmpfs inode leakage when opening file with O_TMP_FILE
(cc linux-fsdevel)

On Mon, 11 Feb 2019 15:18:11 +0100 Matej Kupljen wrote:

> Hi,
>
> it seems that when opening a file on a file system that is mounted on
> tmpfs with the O_TMPFILE flag and using a linkat call after that, it
> uses 2 inodes instead of 1.
>
> This is a simple test case:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <errno.h>
> #include <limits.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
>
> #define TEST_STRING "Testing\n"
>
> #define TMP_PATH    "/tmp/ping/"
> #define TMP_FILE    "file.txt"
>
> int main(int argc, char* argv[])
> {
>     char path[PATH_MAX];
>     int fd;
>     int rc;
>
>     fd = open(TMP_PATH, __O_TMPFILE | O_RDWR,
>               S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
>               S_IROTH | S_IWOTH);
>
>     rc = write(fd, TEST_STRING, strlen(TEST_STRING));
>
>     snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
>     linkat(AT_FDCWD, path, AT_FDCWD, TMP_PATH TMP_FILE,
>            AT_SYMLINK_FOLLOW);
>     close(fd);
>
>     return 0;
> }
>
> I have checked inodes with the "df -i" tool.  The first inode is used when
> the call to open is executed and the second one when the call to
> linkat is executed.
> It is not decreased when close is executed.
>
> I have also tested this on an ext4 mounted fs and there only one inode is
> used.
>
> I tested this on:
> $ cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=18.04
> DISTRIB_CODENAME=bionic
> DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
>
> $ uname -a
> Linux Orion 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC
> 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> If you need any more information, please let me know.
>
> And please CC me when replying, I am not subscribed to the list.
>
> Thanks and BR,
> Matej
Re: [PATCH] psi: avoid divide-by-zero crash inside virtual machines
On Thu, 14 Feb 2019 15:53:43 -0500 Johannes Weiner wrote:

> >	if (now < expires)
> >
> > vs.
> >
> >	if (time_before64(now, expires))
>
> These macros always have me double check the argument order.

Yeah, me too.
Re: [PATCH v2 1/4] mm: Move recent_rotated pages calculation to shrink_inactive_list()
On Thu, 14 Feb 2019 13:35:21 +0300 Kirill Tkhai wrote:

> Currently, struct reclaim_stat::nr_activate is a local variable,
> used only in shrink_page_list().  This patch introduces another
> local variable pgactivate to use instead of it, and reuses
> nr_activate to account number of active pages.
>
> Note, that we need nr_activate to be an array, since type of page
> may change during shrink_page_list() (see ClearPageSwapBacked()).

The patch has nothing to do with the Subject:

reclaim_stat::nr_activate is not a local variable - it is a struct
member.

I can kinda see what the patch is doing but the changelog needs more
care, please.
Re: [PATCH] mm/cma: cma_declare_contiguous: correct err handling
On Thu, 14 Feb 2019 12:45:51 + Peng Fan wrote:

> In case cma_init_reserved_mem failed, need to free the memblock allocated
> by memblock_reserve or memblock_alloc_range.
>
> ...
>
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -353,12 +353,14 @@ int __init cma_declare_contiguous(phys_addr_t base,
>
>  	ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma);
>  	if (ret)
> -		goto err;
> +		goto free_mem;
>
>  	pr_info("Reserved %ld MiB at %pa\n", (unsigned long)size / SZ_1M,
>  		&base);
>  	return 0;
>
> +free_mem:
> +	memblock_free(base, size);
>  err:
>  	pr_err("Failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
>  	return ret;

This doesn't look right to me.  In the `fixed==true' case we didn't
actually allocate anything and in the `fixed==false' case, the
allocated memory is at `addr', not at `base'.
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Thu, 14 Feb 2019 15:33:18 +0100 Michal Hocko wrote:

> > Because swapoff() is very rare code path, to make the normal path runs as
> > fast as possible, disabling preemption + stop_machine() instead of
> > reference count is used to implement get/put_swap_device().  From
> > get_swap_device() to put_swap_device(), the preemption is disabled, so
> > stop_machine() in swapoff() will wait until put_swap_device() is called.
> >
> > In addition to swap_map, cluster_info, etc. data structure in the struct
> > swap_info_struct, the swap cache radix tree will be freed after swapoff,
> > so this patch fixes the race between swap cache looking up and swapoff
> > too.
> >
> > Races between some other swap cache usages protected via disabling
> > preemption and swapoff are fixed too via calling stop_machine() between
> > clearing PageSwapCache() and freeing swap cache data structure.
> >
> > Alternative implementation could be replacing disable preemption with
> > rcu_read_lock_sched and stop_machine() with synchronize_sched().
>
> using stop_machine is generally discouraged.  It is a gross
> synchronization.

This was discussed to death and I think the changelog explains the
conclusions adequately.  swapoff is super-rare so a stop_machine() in
that path is appropriate if its use permits more efficiency in the
regular swap code paths.

> Besides that, since when do we have this problem?

What problem??
Re: Userspace regression in LTS and stable kernels
On Thu, 14 Feb 2019 09:56:46 -0800 Linus Torvalds wrote:

> On Wed, Feb 13, 2019 at 3:37 PM Richard Weinberger wrote:
> >
> > Your shebang line exceeds BINPRM_BUF_SIZE.
> > Before the said commit the kernel silently truncated the shebang line
> > (and corrupted it),
> > now it tells the user that the line is too long.
>
> It doesn't matter if it "corrupted" things by truncating it.  All that
> matters is "it used to work, now it doesn't"
>
> Yes, maybe it never *should* have worked.  And yes, it's sad that
> people apparently had cases that depended on this odd behavior, but
> there we are.
>
> I see that Kees has a patch to fix it up.

Greg, I think we have a problem here.

8099b047ecc431518 ("exec: load_script: don't blindly truncate shebang
string") wasn't marked for backporting.  And, presumably as a
consequence, Kees's fix "exec: load_script: allow interpreter argument
truncation" was not marked for backporting.

8099b047ecc431518 hasn't even appeared in a Linus released kernel, yet
it is now present in 4.9.x, 4.14.x, 4.19.x and 4.20.x.

I don't know if Oleg considered backporting that patch.  I certainly
did (I always do), and I decided against doing so.  Yet there it is.
Re: [PATCH] psi: avoid divide-by-zero crash inside virtual machines
On Thu, 14 Feb 2019 14:31:57 -0500 Johannes Weiner wrote:

> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -322,7 +322,7 @@ static bool update_stats(struct psi_group *group)
>  	expires = group->next_update;
>  	if (now < expires)
>  		goto out;
> -	if (now - expires > psi_period)
> +	if (now - expires >= psi_period)
>  		missed_periods = div_u64(now - expires, psi_period);
>
>  	/*

It seems appropriate to use time_after64() and friends in this code.
Re: linux-next: build failure after merge of the akpm-current tree
On Wed, 13 Feb 2019 17:25:18 +1100 Stephen Rothwell wrote:

> Hi Andrew,
>
> After merging the akpm-current tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
>
> fs/io_uring.c: In function 'io_async_list_note':
> fs/io_uring.c:931:16: error: 'VM_MAX_READAHEAD' undeclared (first use in this function); did you mean 'VM_MAYREAD'?
>   max_pages = VM_MAX_READAHEAD >> (PAGE_SHIFT - 10);
>               ^~~~
>               VM_MAYREAD
>
> Caused by commit
>
>   4c416502dae2 ("mm: Refactor readahead defines in mm.h")
>
> interacting with commit
>
>   6eff5fa16a2e ("io_uring: allow workqueue item to handle multiple buffered requests")
>
> from the block tree.

Thanks.  I'll fix mm-refactor-readahead-defines-in-mmh.patch and shall
stage it after the block tree so everything lands nicely.
Re: [PATCH v3] of: fix kmemleak crash caused by imbalance in early memory reservation
On Wed, 13 Feb 2019 23:13:29 +0200 Mike Rapoport wrote:

> > > As a bonus, since memblock_find_in_range() ensures the allocation in the
> > > specified range, the bounds check can be removed.
> >
> > hm, why is this against -mm rather than against mainline?
> >
> > Do the OF maintainers intend to merge the fix?
>
> There's a conflict between this fix and recent memblock related changes
> in -mm.  Rob said he anyway wasn't planning to send this for 5.0 [1] and
> suggested to merge it via your tree.
>
> [1] https://lore.kernel.org/lkml/cal_jsqk-cc6ovz9mkp+exogjcrha0xxgsgqgkl3w9bff3rk...@mail.gmail.com/

OK, thanks, no probs.
Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
On Wed, 13 Feb 2019 22:11:58 +0100 Jann Horn wrote:

> > This is probably more a davem patch than a -mm one.
>
> Ah, sorry.  I assumed that I just should go by which directory the
> patched code is in.
>
> You did just add it to the -mm tree though, right?  So I shouldn't
> resend it to davem?

Yes, please send to Dave.  I'll autodrop the -mm copy if/when it turns
up in -next.
Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn wrote:

> The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> number of references that we might need to create in the fastpath later,
> the bump-allocation fastpath only has to modify the non-atomic bias value
> that tracks the number of extra references we hold instead of the atomic
> refcount.  The maximum number of allocations we can serve (under the
> assumption that no allocation is made with size 0) is nc->size, so that's
> the bias used.
>
> However, even when all memory in the allocation has been given away, a
> reference to the page is still held; and in the `offset < 0` slowpath, the
> page may be reused if everyone else has dropped their references.
> This means that the necessary number of references is actually
> `nc->size+1`.
>
> Luckily, from a quick grep, it looks like the only path that can call
> page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> requires CAP_NET_ADMIN in the init namespace and is only intended to be
> used for kernel testing and fuzzing.

For the net-naive, what is TAP?  It doesn't appear to mean
drivers/net/tap.c.

> To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> with a vector consisting of 15 elements containing 1 byte each.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  		/* Even if we own the page, we do not use atomic_set().
>  		 * This would break get_page_unless_zero() users.
>  		 */
> -		page_ref_add(page, size - 1);
> +		page_ref_add(page, size);
>
>  		/* reset page count bias and offset to start of new frag */
>  		nc->pfmemalloc = page_is_pfmemalloc(page);
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
>  		nc->offset = size;
>  	}
>
> @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  		size = nc->size;
>  #endif
>  		/* OK, page count is 0, we can safely set it */
> -		set_page_count(page, size);
> +		set_page_count(page, size + 1);
>
>  		/* reset page count bias and offset to start of new frag */
> -		nc->pagecnt_bias = size;
> +		nc->pagecnt_bias = size + 1;
>  		offset = size - fragsz;
>  	}

This is probably more a davem patch than a -mm one.
Re: [PATCH v2 0/5] kasan: more tag based mode fixes
On Wed, 13 Feb 2019 14:58:25 +0100 Andrey Konovalov wrote:

> Changes in v2:
> - Add comments about kmemleak vs KASAN hooks order.

I assume this refers to Vincenzo's review of "kasan, kmemleak: pass
tagged pointers to kmemleak".  But v2 of that patch is unchanged.

> - Fix compilation error when CONFIG_SLUB_DEBUG is not defined.
Re: BUG: Bad page state (5)
On Wed, 13 Feb 2019 09:56:04 -0800 syzbot wrote:

> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    c4f3ef3eb53f Add linux-next specific files for 20190213
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1130a124c0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=9ec67976eb2df882
> dashboard link: https://syzkaller.appspot.com/bug?extid=2cd2887ea471ed6e6995
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14ecdaa8c0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=12ebe178c0
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+2cd2887ea471ed6e6...@syzkaller.appspotmail.com

It looks like a memfd page was freed with a non-NULL ->mapping.  Joel
touched the memfd code with "mm/memfd: add an F_SEAL_FUTURE_WRITE seal
to memfd" but it would be surprising if syzbot tickled that code?

> BUG: Bad page state in process udevd  pfn:472f0
> name:"memfd:"
> page:ea00011cbc00 count:0 mapcount:0 mapping:88800df2ad40 index:0xf
> shmem_aops
> flags: 0x1fffc08000c(uptodate|dirty|swapbacked)
> raw: 01fffc08000c eaac4f08 8880a85af890 88800df2ad40
> raw: 000f
> page dumped because: non-NULL mapping
> Modules linked in:
> CPU: 1 PID: 7586 Comm: udevd Not tainted 5.0.0-rc6-next-20190213 #34
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>  bad_page.cold+0xda/0xff mm/page_alloc.c:586
>  free_pages_check_bad+0x142/0x1a0 mm/page_alloc.c:1014
>  free_pages_check mm/page_alloc.c:1023 [inline]
>  free_pages_prepare mm/page_alloc.c:1113 [inline]
>  free_pcp_prepare mm/page_alloc.c:1138 [inline]
>  free_unref_page_prepare mm/page_alloc.c:2991 [inline]
>  free_unref_page_list+0x31d/0xc40 mm/page_alloc.c:3060
> name:"memfd:"
>  release_pages+0x60d/0x1940 mm/swap.c:791
>  pagevec_lru_move_fn+0x218/0x2a0 mm/swap.c:213
>  __pagevec_lru_add mm/swap.c:917 [inline]
>  lru_add_drain_cpu+0x2f7/0x520 mm/swap.c:581
>  lru_add_drain+0x20/0x60 mm/swap.c:652
>  exit_mmap+0x290/0x530 mm/mmap.c:3134
>  __mmput kernel/fork.c:1047 [inline]
>  mmput+0x15f/0x4c0 kernel/fork.c:1068
>  exec_mmap fs/exec.c:1046 [inline]
>  flush_old_exec+0x8d9/0x1c20 fs/exec.c:1279
>  load_elf_binary+0x9bc/0x53f0 fs/binfmt_elf.c:864
>  search_binary_handler fs/exec.c:1656 [inline]
>  search_binary_handler+0x17f/0x570 fs/exec.c:1634
>  exec_binprm fs/exec.c:1698 [inline]
>  __do_execve_file.isra.0+0x1394/0x23f0 fs/exec.c:1818
>  do_execveat_common fs/exec.c:1865 [inline]
>  do_execve fs/exec.c:1882 [inline]
>  __do_sys_execve fs/exec.c:1958 [inline]
>  __se_sys_execve fs/exec.c:1953 [inline]
>  __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953
>  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7fc7001ba207
> Code: Bad RIP value.
> RSP: 002b:7ffe06aa13b8 EFLAGS: 0206 ORIG_RAX: 003b
> RAX: ffda RBX: RCX: 7fc7001ba207
> RDX: 01fd5fd0 RSI: 7ffe06aa14b0 RDI: 7ffe06aa24c0
> RBP: 00625500 R08: 1c49 R09: 1c49
> R10: R11: 0206 R12: 01fd5fd0
> R13: 0007 R14: 01fc6250 R15: 0005
> BUG: Bad page state in process udevd  pfn:2b13c
> page:eaac4f00 count:0 mapcount:0 mapping:88800df2ad40 index:0xe
> shmem_aops
> flags: 0x1fffc08000c(uptodate|dirty|swapbacked)
> raw: 01fffc08000c 8880a85af890 8880a85af890 88800df2ad40
> raw: 000e
> page dumped because: non-NULL mapping
> Modules linked in:
> CPU: 1 PID: 7586 Comm: udevd Tainted: G    B 5.0.0-rc6-next-20190213 #34
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>  bad_page.cold+0xda/0xff mm/page_alloc.c:586
> name:"memfd:"
>  free_pages_check_bad+0x142/0x1a0 mm/page_alloc.c:1014
>  free_pages_check mm/page_alloc.c:1023 [inline]
>  free_pages_prepare mm/page_alloc.c:1113 [inline]
>  free_pcp_prepare mm/page_alloc.c:1138 [inline]
>  free_unref_page_prepare mm/page_alloc.c:2991 [inline]
>  free_unref_page_list+0x31d/0xc40 mm/page_alloc.c:3060
>  release_pages+0x60d/0x1940 mm/swap.c:791
>  pagevec_lru_move_fn+0x218/0x2a0 mm/swap.c:213
>  __pagevec_lru_add mm/swap.c:917 [inline]
>  lru_add_drain_cpu+0x2f7/0x520 mm/swap.c:581
>  lru_add_drain+0x20/0x60 mm/swap.c:652
>  exit_mmap+0x290/0x530 mm/mmap.c:3134
>  __mmput kernel/fork.c:1047 [inline]
>