fix filler_t callback type mismatches v2
Casting mapping->a_ops->readpage to filler_t causes an indirect call type mismatch with Control-Flow Integrity checking. This series fixes the mismatch in read_cache_page_gfp and read_mapping_page by using a NULL filler argument as an indication to call ->readpage directly, and by passing callbacks with the correct prototype in 9p and jffs2.

Changes since v1:
- add the 9p patch to the series
- drop the nfs patch that has been merged
Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest
On Mon, May 20, 2019 at 02:48:35PM +1000, Nicholas Piggin wrote:
> >> > git bisect points to
> >> >
> >> > commit 4231aba000f5a4583dd9f67057aadb68c3eca99d
> >> > Author: Nicholas Piggin
> >> > Date: Fri Jul 27 21:48:17 2018 +1000
> >> >
> >> > powerpc/64s: Fix page table fragment refcount race vs speculative
> >> > references
> >> >
> >> > The page table fragment allocator uses the main page refcount racily
> >> > with respect to speculative references. A customer observed a BUG due
> >> > to page table page refcount underflow in the fragment allocator. This
> >> > can be caused by the fragment allocator set_page_count stomping on a
> >> > speculative reference, and then the speculative failure handler
> >> > decrements the new reference, and the underflow eventually pops when
> >> > the page tables are freed.
> >> >
> >> > Fix this by using a dedicated field in the struct page for the page
> >> > table fragment allocator.
> >> >
> >> > Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
> >> > Cc: sta...@vger.kernel.org # v3.10+
> >>
> >> That's the commit that added the BUG_ON(), so prior to that you won't
> >> see the crash.
> >
> > Right, but the commit says it fixes page table page refcount underflow by
> > introducing a new field pt_frag_refcount. Now we are hitting the underflow
> > for this pt_frag_refcount.
>
> The fixed underflow is caused by a bug (race on page count) that got
> fixed by that patch. You are hitting a different underflow here. It's
> not certain my patch caused it, I'm just trying to reproduce now.

Ok.

> > > > BTW, if I go below this commit, I don't hit the pagecount
> > > >
> > > > VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> > > >
> > > > which is in pte_fragment_free() path.
>
> Do you have CONFIG_DEBUG_VM=y?

Yes.

Regards,
Bharata.
[PATCH 3/4] jffs2: pass the correct prototype to read_cache_page
Fix the callback jffs2 passes to read_cache_page to actually have the proper type expected. Casting around function pointers can easily hide typing bugs, and defeats control flow protection.

Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
---
 fs/jffs2/file.c     | 4 ++--
 fs/jffs2/fs.c       | 2 +-
 fs/jffs2/os-linux.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index 7d8654a1472e..f8fb89b10227 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -109,9 +109,9 @@ static int jffs2_do_readpage_nolock (struct inode *inode, struct page *pg)
 	return ret;
 }

-int jffs2_do_readpage_unlock(struct inode *inode, struct page *pg)
+int jffs2_do_readpage_unlock(void *data, struct page *pg)
 {
-	int ret = jffs2_do_readpage_nolock(inode, pg);
+	int ret = jffs2_do_readpage_nolock(data, pg);
 	unlock_page(pg);
 	return ret;
 }
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 112d85849db1..8a20ddd25f2d 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -687,7 +687,7 @@ unsigned char *jffs2_gc_fetch_page(struct jffs2_sb_info *c,
 	struct page *pg;

 	pg = read_cache_page(inode->i_mapping, offset >> PAGE_SHIFT,
-			     (void *)jffs2_do_readpage_unlock, inode);
+			     jffs2_do_readpage_unlock, inode);
 	if (IS_ERR(pg))
 		return (void *)pg;
diff --git a/fs/jffs2/os-linux.h b/fs/jffs2/os-linux.h
index a2dbbb3f4c74..bd3d5f0ddc34 100644
--- a/fs/jffs2/os-linux.h
+++ b/fs/jffs2/os-linux.h
@@ -155,7 +155,7 @@ extern const struct file_operations jffs2_file_operations;
 extern const struct inode_operations jffs2_file_inode_operations;
 extern const struct address_space_operations jffs2_file_address_operations;
 int jffs2_fsync(struct file *, loff_t, loff_t, int);
-int jffs2_do_readpage_unlock (struct inode *inode, struct page *pg);
+int jffs2_do_readpage_unlock(void *data, struct page *pg);

 /* ioctl.c */
 long jffs2_ioctl(struct file *, unsigned int, unsigned long);
--
2.20.1
[PATCH 4/4] 9p: pass the correct prototype to read_cache_page
Fix the callback 9p passes to read_cache_page to actually have the proper type expected. Casting around function pointers can easily hide typing bugs, and defeats control flow protection.

Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
---
 fs/9p/vfs_addr.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 0bcbcc20f769..02e0fc51401e 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -50,8 +50,9 @@
  * @page: structure to page
  *
  */
-static int v9fs_fid_readpage(struct p9_fid *fid, struct page *page)
+static int v9fs_fid_readpage(void *data, struct page *page)
 {
+	struct p9_fid *fid = data;
 	struct inode *inode = page->mapping->host;
 	struct bio_vec bvec = {.bv_page = page, .bv_len = PAGE_SIZE};
 	struct iov_iter to;
@@ -122,7 +123,8 @@ static int v9fs_vfs_readpages(struct file *filp, struct address_space *mapping,
 	if (ret == 0)
 		return ret;

-	ret = read_cache_pages(mapping, pages, (void *)v9fs_vfs_readpage, filp);
+	ret = read_cache_pages(mapping, pages, v9fs_fid_readpage,
+			filp->private_data);
 	p9_debug(P9_DEBUG_VFS, "  = %d\n", ret);
 	return ret;
 }
--
2.20.1
[PATCH 2/4] mm: don't cast ->readpage to filler_t for do_read_cache_page
We can just pass a NULL filler and do the right thing inside of do_read_cache_page based on the NULL parameter.

Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
---
 include/linux/pagemap.h |  3 +--
 mm/filemap.c            | 10 ++++++----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9ec3544baee2..6dd7ec95c778 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -396,8 +396,7 @@ extern int read_cache_pages(struct address_space *mapping,
 static inline struct page *read_mapping_page(struct address_space *mapping,
 				pgoff_t index, void *data)
 {
-	filler_t *filler = (filler_t *)mapping->a_ops->readpage;
-	return read_cache_page(mapping, index, filler, data);
+	return read_cache_page(mapping, index, NULL, data);
 }

 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 6a8048477bc6..3bec6e18b763 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2772,7 +2772,11 @@ static struct page *do_read_cache_page(struct address_space *mapping,
 	}

 filler:
-	err = filler(data, page);
+	if (filler)
+		err = filler(data, page);
+	else
+		err = mapping->a_ops->readpage(data, page);
+
 	if (err < 0) {
 		put_page(page);
 		return ERR_PTR(err);
@@ -2884,9 +2888,7 @@ struct page *read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index,
 				gfp_t gfp)
 {
-	filler_t *filler = (filler_t *)mapping->a_ops->readpage;
-
-	return do_read_cache_page(mapping, index, filler, NULL, gfp);
+	return do_read_cache_page(mapping, index, NULL, NULL, gfp);
 }
 EXPORT_SYMBOL(read_cache_page_gfp);
--
2.20.1
[PATCH 1/4] mm: fix an overly long line in read_cache_page
Signed-off-by: Christoph Hellwig
Reviewed-by: Kees Cook
---
 mm/filemap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c5af80c43d36..6a8048477bc6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2862,7 +2862,8 @@ struct page *read_cache_page(struct address_space *mapping,
 				int (*filler)(void *, struct page *),
 				void *data)
 {
-	return do_read_cache_page(mapping, index, filler, data, mapping_gfp_mask(mapping));
+	return do_read_cache_page(mapping, index, filler, data,
+			mapping_gfp_mask(mapping));
 }
 EXPORT_SYMBOL(read_cache_page);
--
2.20.1
Re: [RFC PATCH] usb: host: xhci: allow __GFP_FS in dma allocation
Folks, you can't just pass arbitrary GFP_ flags to DMA allocation routines, because very often they are not just wrappers around the page allocator. So no, you can't fine-grained control the flags, and the existing code is just as buggy. Please switch to using memalloc_noio_save() instead.
Re: arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of function 'uaccess_kernel'; did you mean 'getname_kernel'?
Hello,

On Sun, May 19, 2019 at 10:39 PM kbuild test robot wrote:
> Hi Matt,
>
> FYI, the error/warning still remains.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> head: a188339ca5a396acc588e5851ed7e19f66b0ebd9
> commit: 7df95299b94a63ec67a6389fc02dc25019a80ee8 staging: kpc2000: Add DMA driver
> date: 4 weeks ago
> config: xtensa-allmodconfig (attached as .config)
> compiler: xtensa-linux-gcc (GCC) 8.1.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         git checkout 7df95299b94a63ec67a6389fc02dc25019a80ee8
>         # save the attached .config to linux build tree
>         GCC_VERSION=8.1.0 make.cross ARCH=xtensa
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot
>
> All errors (new ones prefixed by >>):
>
>    In file included from drivers/staging/kpc2000/kpc_dma/fileops.c:11:
>    arch/xtensa/include/asm/uaccess.h: In function 'clear_user':
> >> arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of function 'uaccess_kernel'; did you mean 'getname_kernel'? [-Werror=implicit-function-declaration]
>     #define __kernel_ok (uaccess_kernel())
>                          ^~

I've posted a fix for this issue here: https://lkml.org/lkml/2019/5/8/956

If there are post merge window pull requests planned for the staging tree please consider including this fix. Alternatively if I could get an ack for this fix I'd submit it in a pull request from the xtensa tree.

-- Thanks. -- Max
Re: [PATCH] intel_th: msu: Fix unused variable warning on arm64 platform
please ignore this patch, I will update it and resend it. On 2019/5/20 11:23, Shaokun Zhang wrote: > Fix this compiler warning on arm64 platform. > > Cc: Alexander Shishkin > Cc: Greg Kroah-Hartman > Signed-off-by: Shaokun Zhang > --- > drivers/hwtracing/intel_th/msu.c | 8 +++- > 1 file changed, 3 insertions(+), 5 deletions(-) > > diff --git a/drivers/hwtracing/intel_th/msu.c > b/drivers/hwtracing/intel_th/msu.c > index 81bb54fa3ce8..e15ed5c308e1 100644 > --- a/drivers/hwtracing/intel_th/msu.c > +++ b/drivers/hwtracing/intel_th/msu.c > @@ -780,7 +780,7 @@ static int __msc_buffer_win_alloc(struct msc_window *win, > static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks) > { > struct msc_window *win; > - int ret = -ENOMEM, i; > + int ret = -ENOMEM; > > if (!nr_blocks) > return 0; > @@ -812,7 +812,7 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned > int nr_blocks) > goto err_nomem; > > #ifdef CONFIG_X86 > - for (i = 0; i < ret; i++) > + for (int i = 0; i < ret; i++) > /* Set the page as uncached */ > set_memory_uc((unsigned long)msc_win_block(win, i), 1); > #endif > @@ -860,8 +860,6 @@ static void __msc_buffer_win_free(struct msc *msc, struct > msc_window *win) > */ > static void msc_buffer_win_free(struct msc *msc, struct msc_window *win) > { > - int i; > - > msc->nr_pages -= win->nr_blocks; > > list_del(>entry); > @@ -871,7 +869,7 @@ static void msc_buffer_win_free(struct msc *msc, struct > msc_window *win) > } > > #ifdef CONFIG_X86 > - for (i = 0; i < win->nr_blocks; i++) > + for (int i = 0; i < win->nr_blocks; i++) > /* Reset the page to write-back */ > set_memory_wb((unsigned long)msc_win_block(win, i), 1); > #endif >
[RFC PATCH v5 12/16] slub: Enable moving objects to/from specific nodes
We have just implemented Slab Movable Objects (object migration). Currently object migration is used to defrag a cache. On NUMA systems it would be nice to be able to control the source and destination nodes when moving objects. Add CONFIG_SMO_NODE to guard this feature. CONFIG_SMO_NODE depends on CONFIG_SLUB_DEBUG because we use the full list. Implement moving all objects (including those in full slabs) to a specific node. Expose this functionality to userspace via a sysfs entry. Add sysfs entry: /sysfs/kernel/slab//move With this users get access to the following functionality: - Move all objects to specified node. echo "N1" > move - Move all objects from specified node to other specified node (from N1 -> to N2): echo "N1 N2" > move This also enables shrinking slabs on a specific node: echo "N1 N1" > move Signed-off-by: Tobin C. Harding --- mm/Kconfig | 7 ++ mm/slub.c | 249 + 2 files changed, 256 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index ee8d1f311858..aa8d60e69a01 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -258,6 +258,13 @@ config ARCH_ENABLE_THP_MIGRATION config CONTIG_ALLOC def_bool (MEMORY_ISOLATION && COMPACTION) || CMA +config SMO_NODE + bool "Enable per node control of Slab Movable Objects" + depends on SLUB && SYSFS + select SLUB_DEBUG + help + On NUMA systems enable moving objects to and from a specified node. + config PHYS_ADDR_T_64BIT def_bool 64BIT diff --git a/mm/slub.c b/mm/slub.c index 2157205df7ba..9582f2fc97d2 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4336,6 +4336,106 @@ static void move_slab_page(struct page *page, void *scratch, int node) s->migrate(s, vector, count, node, private); } +#ifdef CONFIG_SMO_NODE +/* + * kmem_cache_move() - Attempt to move all slab objects. + * @s: The cache we are working on. + * @node: The node to move objects away from. + * @target_node: The node to move objects on to. + * + * Attempts to move all objects (partial slabs and full slabs) to target + * node. + * + * Context: Takes the list_lock. 
+ * Return: The number of slabs remaining on node. + */ +static unsigned long kmem_cache_move(struct kmem_cache *s, +int node, int target_node) +{ + struct kmem_cache_node *n = get_node(s, node); + LIST_HEAD(move_list); + struct page *page, *page2; + unsigned long flags; + void **scratch; + + if (!s->migrate) { + pr_warn("%s SMO not enabled, cannot move objects\n", s->name); + goto out; + } + + scratch = alloc_scratch(s); + if (!scratch) + goto out; + + spin_lock_irqsave(>list_lock, flags); + + list_for_each_entry_safe(page, page2, >partial, lru) { + if (!slab_trylock(page)) + /* Busy slab. Get out of the way */ + continue; + + if (page->inuse) { + list_move(>lru, _list); + /* Stop page being considered for allocations */ + n->nr_partial--; + page->frozen = 1; + + slab_unlock(page); + } else {/* Empty slab page */ + list_del(>lru); + n->nr_partial--; + slab_unlock(page); + discard_slab(s, page); + } + } + list_for_each_entry_safe(page, page2, >full, lru) { + if (!slab_trylock(page)) + continue; + + list_move(>lru, _list); + page->frozen = 1; + slab_unlock(page); + } + + spin_unlock_irqrestore(>list_lock, flags); + + list_for_each_entry(page, _list, lru) { + if (page->inuse) + move_slab_page(page, scratch, target_node); + } + kfree(scratch); + + /* Bail here to save taking the list_lock */ + if (list_empty(_list)) + goto out; + + /* Inspect results and dispose of pages */ + spin_lock_irqsave(>list_lock, flags); + list_for_each_entry_safe(page, page2, _list, lru) { + list_del(>lru); + slab_lock(page); + page->frozen = 0; + + if (page->inuse) { + if (page->inuse == page->objects) { + list_add(>lru, >full); + slab_unlock(page); + } else { + n->nr_partial++; + list_add_tail(>lru, >partial); + slab_unlock(page); + } + } else { + slab_unlock(page); + discard_slab(s, page); + } + } +
[RFC PATCH v5 15/16] dcache: Implement partial shrink via Slab Movable Objects
The dentry slab cache is susceptible to internal fragmentation. Now that we have Slab Movable Objects we can attempt to defragment the dcache. Dentry objects are inherently _not_ relocatable however under some conditions they can be free'd. This is the same as shrinking the dcache but instead of shrinking the whole cache we only attempt to free those objects that are located in partially full slab pages. There is no guarantee that this will reduce the memory usage of the system, it is a compromise between fragmented memory and total cache shrinkage with the hope that some memory pressure can be alleviated. This is implemented using the newly added Slab Movable Objects infrastructure. The dcache 'migration' function is intentionally _not_ called 'd_migrate' because we only free, we do not migrate. Call it 'd_partial_shrink' to make explicit that no reallocation is done. Implement isolate and 'migrate' functions for the dentry slab cache. Signed-off-by: Tobin C. Harding --- fs/dcache.c | 76 + 1 file changed, 76 insertions(+) diff --git a/fs/dcache.c b/fs/dcache.c index b7318615979d..0dfe580c2d42 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -31,6 +31,7 @@ #include #include #include +#include #include "internal.h" #include "mount.h" @@ -3071,6 +3072,79 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode) } EXPORT_SYMBOL(d_tmpfile); +/* + * d_isolate() - Dentry isolation callback function. + * @s: The dentry cache. + * @v: Vector of pointers to the objects to isolate. + * @nr: Number of objects in @v. + * + * The slab allocator is holding off frees. We can safely examine + * the object without the danger of it vanishing from under us. 
+ */ +static void *d_isolate(struct kmem_cache *s, void **v, int nr) +{ + struct list_head *dispose; + struct dentry *dentry; + int i; + + dispose = kmalloc(sizeof(*dispose), GFP_KERNEL); + if (!dispose) + return NULL; + + INIT_LIST_HEAD(dispose); + + for (i = 0; i < nr; i++) { + dentry = v[i]; + spin_lock(>d_lock); + + if (dentry->d_lockref.count > 0 || + dentry->d_flags & DCACHE_SHRINK_LIST) { + spin_unlock(>d_lock); + continue; + } + + if (dentry->d_flags & DCACHE_LRU_LIST) + d_lru_del(dentry); + + d_shrink_add(dentry, dispose); + spin_unlock(>d_lock); + } + + return dispose; +} + +/* + * d_partial_shrink() - Dentry migration callback function. + * @s: The dentry cache. + * @_unused: We do not access the vector. + * @__unused: No need for length of vector. + * @___unused: We do not do any allocation. + * @private: list_head pointer representing the shrink list. + * + * Dispose of the shrink list created during isolation function. + * + * Dentry objects can _not_ be relocated and shrinking the whole dcache + * can be expensive. This is an effort to free dentry objects that are + * stopping slab pages from being free'd without clearing the whole dcache. + * + * This callback is called from the SLUB allocator object migration + * infrastructure in attempt to free up slab pages by freeing dentry + * objects from partially full slabs. + */ +static void d_partial_shrink(struct kmem_cache *s, void **_unused, int __unused, +int ___unused, void *private) +{ + struct list_head *dispose = private; + + if (!private) /* kmalloc error during isolate. 
*/ + return; + + if (!list_empty(dispose)) + shrink_dentry_list(dispose); + + kfree(private); +} + static __initdata unsigned long dhash_entries; static int __init set_dhash_entries(char *str) { @@ -3116,6 +3190,8 @@ static void __init dcache_init(void) sizeof_field(struct dentry, d_iname), dcache_ctor); + kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink); + /* Hash may have been set up in dcache_init_early */ if (!hashdist) return; -- 2.21.0
[RFC PATCH v5 13/16] slub: Enable balancing slabs across nodes
We have just implemented Slab Movable Objects (SMO). On NUMA systems slabs can become unbalanced i.e. many slabs on one node while other nodes have few slabs. Using SMO we can balance the slabs across all the nodes. The algorithm used is as follows: 1. Move all objects to node 0 (this has the effect of defragmenting the cache). 2. Calculate the desired number of slabs for each node (this is done using the approximation nr_slabs / nr_nodes). 3. Loop over the nodes moving the desired number of slabs from node 0 to the node. Feature is conditionally built in with CONFIG_SMO_NODE, this is because we need the full list (we enable SLUB_DEBUG to get this). Future version may separate final list out of SLUB_DEBUG. Expose this functionality to userspace via a sysfs entry. Add sysfs entry: /sysfs/kernel/slab//balance Write of '1' to this file triggers balance, no other value accepted. This feature relies on SMO being enable for the cache, this is done with a call to, after the isolate/migrate functions have been defined. kmem_cache_setup_mobility(s, isolate, migrate) Signed-off-by: Tobin C. Harding --- mm/slub.c | 120 ++ 1 file changed, 120 insertions(+) diff --git a/mm/slub.c b/mm/slub.c index 9582f2fc97d2..25b6d1e408e3 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4574,6 +4574,109 @@ static unsigned long kmem_cache_move_to_node(struct kmem_cache *s, int node) return left; } + +/* + * kmem_cache_move_slabs() - Attempt to move @num slabs to target_node, + * @s: The cache we are working on. + * @node: The node to move objects from. + * @target_node: The node to move objects to. + * @num: The number of slabs to move. + * + * Attempts to move @num slabs from @node to @target_node. This is done + * by migrating objects from slabs on the full_list. + * + * Return: The number of slabs moved or error code. 
+ */ +static long kmem_cache_move_slabs(struct kmem_cache *s, + int node, int target_node, long num) +{ + struct kmem_cache_node *n = get_node(s, node); + LIST_HEAD(move_list); + struct page *page, *page2; + unsigned long flags; + void **scratch; + long done = 0; + + if (node == target_node) + return -EINVAL; + + scratch = alloc_scratch(s); + if (!scratch) + return -ENOMEM; + + spin_lock_irqsave(>list_lock, flags); + list_for_each_entry_safe(page, page2, >full, lru) { + if (!slab_trylock(page)) + /* Busy slab. Get out of the way */ + continue; + + list_move(>lru, _list); + page->frozen = 1; + slab_unlock(page); + + if (++done >= num) + break; + } + spin_unlock_irqrestore(>list_lock, flags); + + list_for_each_entry(page, _list, lru) { + if (page->inuse) + move_slab_page(page, scratch, target_node); + } + kfree(scratch); + + /* Inspect results and dispose of pages */ + spin_lock_irqsave(>list_lock, flags); + list_for_each_entry_safe(page, page2, _list, lru) { + list_del(>lru); + slab_lock(page); + page->frozen = 0; + + if (page->inuse) { + /* +* This is best effort only, if slab still has +* objects just put it back on the partial list. +*/ + n->nr_partial++; + list_add_tail(>lru, >partial); + slab_unlock(page); + } else { + slab_unlock(page); + discard_slab(s, page); + } + } + spin_unlock_irqrestore(>list_lock, flags); + + return done; +} + +/* + * kmem_cache_balance_nodes() - Balance slabs across nodes. + * @s: The cache we are working on. 
+ */ +static void kmem_cache_balance_nodes(struct kmem_cache *s) +{ + struct kmem_cache_node *n = get_node(s, 0); + unsigned long desired_nr_slabs_per_node; + unsigned long nr_slabs; + int nr_nodes = 0; + int nid; + + (void)kmem_cache_move_to_node(s, 0); + + for_each_node_state(nid, N_NORMAL_MEMORY) + nr_nodes++; + + nr_slabs = atomic_long_read(>nr_slabs); + desired_nr_slabs_per_node = nr_slabs / nr_nodes; + + for_each_node_state(nid, N_NORMAL_MEMORY) { + if (nid == 0) + continue; + + kmem_cache_move_slabs(s, 0, nid, desired_nr_slabs_per_node); + } +} #endif /** @@ -5838,6 +5941,22 @@ static ssize_t move_store(struct kmem_cache *s, const char *buf, size_t length) return length; } SLAB_ATTR(move); + +static ssize_t balance_show(struct kmem_cache *s, char *buf) +{ + return 0; +} + +static
[RFC PATCH v5 16/16] dcache: Add CONFIG_DCACHE_SMO
In an attempt to make the SMO patchset as non-invasive as possible add a config option CONFIG_DCACHE_SMO (under "Memory Management options") for enabling SMO for the DCACHE. Whithout this option dcache constructor is used but no other code is built in, with this option enabled slab mobility is enabled and the isolate/migrate functions are built in. Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache via Slab Movable Objects infrastructure. Signed-off-by: Tobin C. Harding --- fs/dcache.c | 4 mm/Kconfig | 7 +++ 2 files changed, 11 insertions(+) diff --git a/fs/dcache.c b/fs/dcache.c index 0dfe580c2d42..96063e872366 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -3072,6 +3072,7 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode) } EXPORT_SYMBOL(d_tmpfile); +#ifdef CONFIG_DCACHE_SMO /* * d_isolate() - Dentry isolation callback function. * @s: The dentry cache. @@ -3144,6 +3145,7 @@ static void d_partial_shrink(struct kmem_cache *s, void **_unused, int __unused, kfree(private); } +#endif /* CONFIG_DCACHE_SMO */ static __initdata unsigned long dhash_entries; static int __init set_dhash_entries(char *str) @@ -3190,7 +3192,9 @@ static void __init dcache_init(void) sizeof_field(struct dentry, d_iname), dcache_ctor); +#ifdef CONFIG_DCACHE_SMO kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink); +#endif /* Hash may have been set up in dcache_init_early */ if (!hashdist) diff --git a/mm/Kconfig b/mm/Kconfig index aa8d60e69a01..7dcea76e5ecc 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -265,6 +265,13 @@ config SMO_NODE help On NUMA systems enable moving objects to and from a specified node. +config DCACHE_SMO + bool "Enable Slab Movable Objects for the dcache" + depends on SLUB + help + Under memory pressure we can try to free dentry slab cache objects from + the partial slab list if this is enabled. + config PHYS_ADDR_T_64BIT def_bool 64BIT -- 2.21.0
[RFC PATCH v5 11/16] tools/testing/slab: Add XArray movable objects tests
We just implemented movable objects for the XArray. Let's test it intree. Add test module for the XArray's movable objects implementation. Functionality of the XArray Slab Movable Object implementation can usually be seen by simply by using `slabinfo` on a running machine since the radix tree is typically in use on a running machine and will have partial slabs. For repeated testing we can use the test module to run to simulate a workload on the XArray then use `slabinfo` to test object migration is functioning. If testing on freshly spun up VM (low radix tree workload) it may be necessary to load/unload the module a number of times to create partial slabs. Example test session Relevant /proc/slabinfo column headers: name Prior to testing slabinfo report for radix_tree_node: # slabinfo radix_tree_node --report Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8352 ** Reclaim accounting active ** Defragmentation at 30% Sizes (bytes) Slabs DebugMemory Object : 576 Total : 497 Sanity Checks : On Total: 8142848 SlabObj: 912 Full : 473 Redzoning : On Used : 4810752 SlabSiz: 16384 Partial: 24 Poisoning : On Loss : 3332096 Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2806272 Align : 8 Objects: 17 Tracing : Off Lpadd: 437360 Here you can see the kernel was built with Slab Movable Objects enabled for the XArray (XArray uses the radix tree below the surface). 
After inserting the test module (note we have triggered allocation of a number of radix tree nodes increasing the object count but decreasing the number of partial slabs): # slabinfo radix_tree_node --report Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8442 ** Reclaim accounting active ** Defragmentation at 30% Sizes (bytes) Slabs DebugMemory Object : 576 Total : 499 Sanity Checks : On Total: 8175616 SlabObj: 912 Full : 484 Redzoning : On Used : 4862592 SlabSiz: 16384 Partial: 15 Poisoning : On Loss : 3313024 Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2836512 Align : 8 Objects: 17 Tracing : Off Lpadd: 439120 Now we can shrink the radix_tree_node cache: # slabinfo radix_tree_node --shrink # slabinfo radix_tree_node --report Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8515 ** Reclaim accounting active ** Defragmentation at 30% Sizes (bytes) Slabs DebugMemory Object : 576 Total : 501 Sanity Checks : On Total: 8208384 SlabObj: 912 Full : 500 Redzoning : On Used : 4904640 SlabSiz: 16384 Partial: 1 Poisoning : On Loss : 3303744 Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2861040 Align : 8 Objects: 17 Tracing : Off Lpadd: 440880 Note the single remaining partial slab. Signed-off-by: Tobin C. Harding --- tools/testing/slab/Makefile | 2 +- tools/testing/slab/slub_defrag_xarray.c | 211 2 files changed, 212 insertions(+), 1 deletion(-) create mode 100644 tools/testing/slab/slub_defrag_xarray.c diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile index 440c2e3e356f..44c18d9a4d52 100644 --- a/tools/testing/slab/Makefile +++ b/tools/testing/slab/Makefile @@ -1,4 +1,4 @@ -obj-m += slub_defrag.o +obj-m += slub_defrag.o slub_defrag_xarray.o KTREE=../../.. 
diff --git a/tools/testing/slab/slub_defrag_xarray.c b/tools/testing/slab/slub_defrag_xarray.c new file mode 100644 index ..41143f73256c --- /dev/null +++ b/tools/testing/slab/slub_defrag_xarray.c @@ -0,0 +1,211 @@ +// SPDX-License-Identifier: GPL-2.0+ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define SMOX_CACHE_NAME "smox_test" +static struct kmem_cache *cachep; + +/* + * Declare XArrays globally so we can clean them up on module unload. + */ + +/* Used by test_smo_xarray()*/ +DEFINE_XARRAY(things); + +/* Thing to store pointers to in the XArray */ +struct smox_thing { + long id; +}; + +/* It's up to the caller to ensure id is unique */ +static struct smox_thing *alloc_thing(int id) +{ + struct smox_thing *thing; + + thing = kmem_cache_alloc(cachep, GFP_KERNEL); + if (!thing) + return ERR_PTR(-ENOMEM); + + thing->id = id; + return thing; +} + +/** + * smox_object_ctor() - SMO object constructor function. + * @ptr: Pointer to memory where the object should be constructed. + */ +void
[RFC PATCH v5 14/16] dcache: Provide a dentry constructor
In order to support object migration on the dentry cache we need to have a determined object state at all times. Without a constructor the object would have a random state after allocation. Provide a dentry constructor. Signed-off-by: Tobin C. Harding --- fs/dcache.c | 30 +- 1 file changed, 21 insertions(+), 9 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 8136bda27a1f..b7318615979d 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1602,6 +1602,16 @@ void d_invalidate(struct dentry *dentry) } EXPORT_SYMBOL(d_invalidate); +static void dcache_ctor(void *p) +{ + struct dentry *dentry = p; + + /* Mimic lockref_mark_dead() */ + dentry->d_lockref.count = -128; + + spin_lock_init(>d_lock); +} + /** * __d_alloc - allocate a dcache entry * @sb: filesystem it will belong to @@ -1657,7 +1667,6 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) dentry->d_lockref.count = 1; dentry->d_flags = 0; - spin_lock_init(>d_lock); seqcount_init(>d_seq); dentry->d_inode = NULL; dentry->d_parent = dentry; @@ -3095,14 +3104,17 @@ static void __init dcache_init_early(void) static void __init dcache_init(void) { - /* -* A constructor could be added for stable state like the lists, -* but it is probably not worth it because of the cache nature -* of the dcache. -*/ - dentry_cache = KMEM_CACHE_USERCOPY(dentry, - SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT, - d_iname); + slab_flags_t flags = + SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | SLAB_ACCOUNT; + + dentry_cache = + kmem_cache_create_usercopy("dentry", + sizeof(struct dentry), + __alignof__(struct dentry), + flags, + offsetof(struct dentry, d_iname), + sizeof_field(struct dentry, d_iname), + dcache_ctor); /* Hash may have been set up in dcache_init_early */ if (!hashdist) -- 2.21.0
[RFC PATCH v5 05/16] tools/vm/slabinfo: Add remote node defrag ratio output
Add output line for NUMA remote node defrag ratio. Signed-off-by: Tobin C. Harding --- tools/vm/slabinfo.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c index cbfc56c44c2f..d2c22f9ee2d8 100644 --- a/tools/vm/slabinfo.c +++ b/tools/vm/slabinfo.c @@ -34,6 +34,7 @@ struct slabinfo { unsigned int sanity_checks, slab_size, store_user, trace; int order, poison, reclaim_account, red_zone; int movable, ctor; + int remote_node_defrag_ratio; unsigned long partial, objects, slabs, objects_partial, objects_total; unsigned long alloc_fastpath, alloc_slowpath; unsigned long free_fastpath, free_slowpath; @@ -377,6 +378,10 @@ static void slab_numa(struct slabinfo *s, int mode) if (skip_zero && !s->slabs) return; + if (mode) { + printf("\nNUMA remote node defrag ratio: %3d\n", + s->remote_node_defrag_ratio); + } if (!line) { printf("\n%-21s:", mode ? "NUMA nodes" : "Slab"); for(node = 0; node <= highest_node; node++) @@ -1272,6 +1277,8 @@ static void read_slab_dir(void) slab->cpu_partial_free = get_obj("cpu_partial_free"); slab->alloc_node_mismatch = get_obj("alloc_node_mismatch"); slab->deactivate_bypass = get_obj("deactivate_bypass"); + slab->remote_node_defrag_ratio = + get_obj("remote_node_defrag_ratio"); chdir(".."); if (read_slab_obj(slab, "ops")) { if (strstr(buffer, "ctor :")) -- 2.21.0
[RFC PATCH v5 08/16] tools/testing/slab: Add object migration test suite
We just added a module that enables testing the SLUB allocator's ability to defrag/shrink caches via movable objects. Tests are better when they are automated. Add automated testing via a python script for SLUB movable objects.

Example output:

  $ cd path/to/linux/tools/testing/slab
  $ ./slub_defrag.py
  Please run script as root

  $ sudo ./slub_defrag.py

  $ sudo ./slub_defrag.py --debug
  Loading module ...
  Slab cache smo_test created
  Objects per slab: 20
  Running sanity checks ...

  Running module stress test (see dmesg for additional test output) ...
  Removing module slub_defrag ...
  Loading module ...
  Slab cache smo_test created

  Running test non-movable ...
  testing slab 'smo_test' prior to enabling movable objects ...
  verified non-movable slabs are NOT shrinkable

  Running test movable ...
  testing slab 'smo_test' after enabling movable objects ...
  verified movable slabs are shrinkable

  Removing module slub_defrag ...

Signed-off-by: Tobin C. Harding
---
 tools/testing/slab/slub_defrag.c  |   1 +
 tools/testing/slab/slub_defrag.py | 451 ++++++++++++++++++++++++++++++
 2 files changed, 452 insertions(+)
 create mode 100755 tools/testing/slab/slub_defrag.py

diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
index 4a5c24394b96..8332e69ee868 100644
--- a/tools/testing/slab/slub_defrag.c
+++ b/tools/testing/slab/slub_defrag.c
@@ -337,6 +337,7 @@ static int smo_run_module_tests(int nr_objs, int keep)
 
 /*
  * struct functions() - Map command to a function pointer.
+ * If you update this please update the documentation in slub_defrag.py
  */
 struct functions {
 	char *fn_name;
diff --git a/tools/testing/slab/slub_defrag.py b/tools/testing/slab/slub_defrag.py
new file mode 100755
index 000000000000..41747c0db39b
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.py
@@ -0,0 +1,451 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import subprocess
+import sys
+from os import path
+
+# SLUB Movable Objects test suite.
+#
+# Requirements:
+#  - CONFIG_SLUB=y
+#  - CONFIG_SLUB_DEBUG=y
+#  - The slub_defrag module in this directory.
+
+# Test SMO using a kernel module that enables triggering arbitrary
+# kernel code from userspace via a debugfs file.
+#
+# Module code is in ./slub_defrag.c, basically the functionality is as
+# follows:
+#
+#  - Creates debugfs file /sys/kernel/debugfs/smo/callfn
+#  - Writes to 'callfn' are parsed as a command string and the function
+#    associated with the command is called.
+#  - Defines 4 commands (all commands operate on the smo_test cache):
+#     - 'test': Runs module stress tests.
+#     - 'alloc N': Allocates N slub objects
+#     - 'free N POS': Frees N objects starting at POS (see below)
+#     - 'enable': Enables SLUB Movable Objects
+#
+# The module maintains a list of allocated objects.  Allocation adds
+# objects to the tail of the list.  Free'ing frees from the head of the
+# list.  This has the effect of creating free slots in the slab.  For
+# finer grained control over where in the cache slots are free'd, the
+# POS (position) argument may be used.
+
+# The main() function is reasonably readable; the test suite does the
+# following:
+#
+# 1. Runs the module stress tests.
+# 2. Tests the cache without movable objects enabled.
+#    - Creates multiple partial slabs as explained above.
+#    - Verifies that partial slabs are _not_ removed by shrink (see below).
+# 3. Tests the cache with movable objects enabled.
+#    - Creates multiple partial slabs as explained above.
+#    - Verifies that partial slabs _are_ removed by shrink (see below).
+
+# The sysfs file /sys/kernel/slab/<cache>/shrink enables calling the
+# function kmem_cache_shrink() (see mm/slab_common.c and mm/slub.c).
+# Shrinking a cache attempts to consolidate all partial slabs by moving
+# objects if object migration is enabled for the cache, otherwise
+# shrinking a cache simply re-orders the partial list so that the most
+# densely populated slabs are at the head of the list.
+ +# Enable/disable debugging output (also enabled via -d | --debug). +debug = False + +# Used in debug messages and when running `insmod`. +MODULE_NAME = "slub_defrag" + +# Slab cache created by the test module. +CACHE_NAME = "smo_test" + +# Set by get_slab_config() +objects_per_slab = 0 +pages_per_slab = 0 +debugfs_mounted = False # Set to true if we mount debugfs. + + +def eprint(*args, **kwargs): +print(*args, file=sys.stderr, **kwargs) + + +def dprint(*args, **kwargs): +if debug: +print(*args, file=sys.stderr, **kwargs) + + +def run_shell(cmd): +return subprocess.call([cmd], shell=True) + + +def run_shell_get_stdout(cmd): +return subprocess.check_output([cmd], shell=True) + + +def assert_root(): +user = run_shell_get_stdout('whoami') +if user != b'root\n': +eprint("Please run script as root") +sys.exit(1) + + +def mount_debugfs(): +mounted = False + +# Check if debugfs is mounted at a known
[RFC PATCH v5 07/16] tools/testing/slab: Add object migration test module
We just implemented slab movable objects for the SLUB allocator. We should test that code. In order to do so we need to be able to do a number of things

 - Create a cache
 - Enable Slab Movable Objects for the cache
 - Allocate objects to the cache
 - Free objects from within specific slabs of the cache

We can do all this via a loadable module. Add a module that defines functions that can be triggered from userspace via a debugfs entry. From the source:

  /*
   * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
   *
   * This module is used for testing the SLUB allocator.  Enables
   * userspace to run kernel functions via a debugfs file.
   *
   * debugfs: /sys/kernel/debugfs/smo/callfn (write only)
   *
   * String written to `callfn` is parsed by the module and associated
   * function is called.  See fn_tab for mapping of strings to functions.
   */

References to allocated objects are kept by the module in a linked list so that userspace can control which object to free.

We introduce the following four functions via the function table

 "enable": Enables object migration for the test cache.
 "alloc X": Allocates X objects
 "free X [Y]": Frees X objects starting at list position Y (default Y==0)
 "test": Runs [stress] tests from within the module (see below).

 	{"enable", smo_enable_cache_mobility},
 	{"alloc", smo_alloc_objects},
 	{"free", smo_free_object},
 	{"test", smo_run_module_tests},

Freeing from the start of the list creates a hole in the slab being freed from (i.e. creates a partial slab). The results of running these commands can be seen using `slabinfo` (available in tools/vm/):

	make -o slabinfo tools/vm/slabinfo.c

Stress tests can be run from within the module. These tests are internal to the module because we verify that object references are still good after object migration. These are called 'stress' tests because it is intended that they create/free a lot of objects. Userspace can control the number of objects to create, default is 1000.
Example test session
--------------------

Relevant /proc/slabinfo column headers:

  name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>

 # mount -t debugfs none /sys/kernel/debug/
 $ cd path/to/linux/tools/testing/slab; make
 ...
 # insmod slub_defrag.ko
 # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
 smo_test           0      0    392   20    2

From this we can see that the module created cache 'smo_test' with 20 objects per slab and 2 pages per slab (and cache is currently empty).

We can play with the slab allocator manually:

 # insmod slub_defrag.ko
 # echo 'alloc 21' > callfn
 # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
 smo_test          21     40    392   20    2

We see here that 21 active objects have been allocated creating 2 slabs (40 total objects).

 # slabinfo smo_test --report

 Slabcache: smo_test             Aliases:  0 Order :  1 Objects: 21

 Sizes (bytes)     Slabs              Debug                Memory
 ------------------------------------------------------------------------
 Object :      56  Total  :       2   Sanity Checks : On   Total:   16384
 SlabObj:     392  Full   :       1   Redzoning     : On   Used :    1176
 SlabSiz:    8192  Partial:       1   Poisoning     : On   Loss :   15208
 Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig:    7056
 Align  :       8  Objects:      20   Tracing       : Off  Lpadd:     704

Now free an object from the first slot of the first slab

 # echo 'free 1' > callfn
 # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
 smo_test          20     40    392   20    2

 # slabinfo smo_test --report

 Slabcache: smo_test             Aliases:  0 Order :  1 Objects: 20

 Sizes (bytes)     Slabs              Debug                Memory
 ------------------------------------------------------------------------
 Object :      56  Total  :       2   Sanity Checks : On   Total:   16384
 SlabObj:     392  Full   :       0   Redzoning     : On   Used :    1120
 SlabSiz:    8192  Partial:       2   Poisoning     : On   Loss :   15264
 Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig:    6720
 Align  :       8  Objects:      20   Tracing       : Off  Lpadd:     704

Calling shrink now on the cache does nothing because object migration is not enabled (output omitted). If we enable object migration then shrink the cache we expect the object from the second slab to be moved to the first slot in the first slab and the second slab to be removed from the partial list.
 # echo 'enable' > callfn
 # slabinfo smo_test --shrink
 # slabinfo smo_test --report

 Slabcache: smo_test             Aliases:  0 Order :  1 Objects: 20
 ** Defragmentation at 30%

 Sizes (bytes)     Slabs              Debug                Memory
 ------------------------------------------------------------------------
 Object :      56  Total  :       1   Sanity Checks : On   Total:    8192
 SlabObj:     392  Full   :       1
[RFC PATCH v5 06/16] tools/vm/slabinfo: Add defrag_used_ratio output
Add output for the newly added defrag_used_ratio sysfs knob.

Signed-off-by: Tobin C. Harding
---
 tools/vm/slabinfo.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index d2c22f9ee2d8..ef4ff93df4cc 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
 	unsigned int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
 	int movable, ctor;
+	int defrag_used_ratio;
 	int remote_node_defrag_ratio;
 	unsigned long partial, objects, slabs, objects_partial, objects_total;
 	unsigned long alloc_fastpath, alloc_slowpath;
@@ -549,6 +550,8 @@ static void report(struct slabinfo *s)
 		printf("** Slabs are destroyed via RCU\n");
 	if (s->reclaim_account)
 		printf("** Reclaim accounting active\n");
+	if (s->movable)
+		printf("** Defragmentation at %d%%\n", s->defrag_used_ratio);
 
 	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
 	printf("------------------------------------------------------------------------\n");
@@ -1279,6 +1282,7 @@ static void read_slab_dir(void)
 		slab->deactivate_bypass = get_obj("deactivate_bypass");
 		slab->remote_node_defrag_ratio =
 			get_obj("remote_node_defrag_ratio");
+		slab->defrag_used_ratio = get_obj("defrag_used_ratio");
 		chdir("..");
 		if (read_slab_obj(slab, "ops")) {
 			if (strstr(buffer, "ctor :"))
-- 
2.21.0
[RFC PATCH v5 09/16] lib: Separate radix_tree_node and xa_node slab cache
Earlier, Slab Movable Objects (SMO) was implemented. The XArray is now able to take advantage of SMO in order to make xarray nodes movable (when using the SLUB allocator). Currently the radix tree uses the same slab cache as the XArray. Only XArray nodes are movable _not_ radix tree nodes. We can give the radix tree its own slab cache to overcome this. In preparation for implementing XArray object migration (xa_node objects) via Slab Movable Objects add a slab cache solely for XArray nodes and make the XArray use this slab cache instead of the radix_tree_node slab cache. Cc: Matthew Wilcox Signed-off-by: Tobin C. Harding --- include/linux/xarray.h | 3 +++ init/main.c| 2 ++ lib/radix-tree.c | 2 +- lib/xarray.c | 48 ++ 4 files changed, 45 insertions(+), 10 deletions(-) diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 0e01e6129145..773f91f8e1db 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -42,6 +42,9 @@ #define BITS_PER_XA_VALUE (BITS_PER_LONG - 1) +/* Called from init/main.c */ +void xarray_slabcache_init(void); + /** * xa_mk_value() - Create an XArray entry from an integer. * @v: Value to store in XArray. diff --git a/init/main.c b/init/main.c index 5a2c69b4d7b3..e89915ffbe26 100644 --- a/init/main.c +++ b/init/main.c @@ -106,6 +106,7 @@ static int kernel_init(void *); extern void init_IRQ(void); extern void radix_tree_init(void); +extern void xarray_slabcache_init(void); /* * Debug helper: via this flag we know that we are in 'early bootup code' @@ -621,6 +622,7 @@ asmlinkage __visible void __init start_kernel(void) "Interrupts were enabled *very* early, fixing it\n")) local_irq_disable(); radix_tree_init(); + xarray_slabcache_init(); /* * Set up housekeeping before setting up workqueues to allow the unbound diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 14d51548bea6..edbfb530ba73 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -44,7 +44,7 @@ /* * Radix tree node cache. 
*/ -struct kmem_cache *radix_tree_node_cachep; +static struct kmem_cache *radix_tree_node_cachep; /* * The radix tree is variable-height, so an insert operation not only has diff --git a/lib/xarray.c b/lib/xarray.c index 6be3acbb861f..a528a5277c9d 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -27,6 +27,8 @@ * @entry refers to something stored in a slot in the xarray */ +static struct kmem_cache *xa_node_cachep; + static inline unsigned int xa_lock_type(const struct xarray *xa) { return (__force unsigned int)xa->xa_flags & 3; @@ -244,9 +246,21 @@ void *xas_load(struct xa_state *xas) } EXPORT_SYMBOL_GPL(xas_load); -/* Move the radix tree node cache here */ -extern struct kmem_cache *radix_tree_node_cachep; -extern void radix_tree_node_rcu_free(struct rcu_head *head); +void xa_node_rcu_free(struct rcu_head *head) +{ + struct xa_node *node = container_of(head, struct xa_node, rcu_head); + + /* +* Must only free zeroed nodes into the slab. We can be left with +* non-NULL entries by radix_tree_free_nodes, so clear the entries +* and tags here. 
+*/ + memset(node->slots, 0, sizeof(node->slots)); + memset(node->tags, 0, sizeof(node->tags)); + INIT_LIST_HEAD(>private_list); + + kmem_cache_free(xa_node_cachep, node); +} #define XA_RCU_FREE((struct xarray *)1) @@ -254,7 +268,7 @@ static void xa_node_free(struct xa_node *node) { XA_NODE_BUG_ON(node, !list_empty(>private_list)); node->array = XA_RCU_FREE; - call_rcu(>rcu_head, radix_tree_node_rcu_free); + call_rcu(>rcu_head, xa_node_rcu_free); } /* @@ -270,7 +284,7 @@ static void xas_destroy(struct xa_state *xas) if (!node) return; XA_NODE_BUG_ON(node, !list_empty(>private_list)); - kmem_cache_free(radix_tree_node_cachep, node); + kmem_cache_free(xa_node_cachep, node); xas->xa_alloc = NULL; } @@ -298,7 +312,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp) xas_destroy(xas); return false; } - xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp); + xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp); if (!xas->xa_alloc) return false; XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(>xa_alloc->private_list)); @@ -327,10 +341,10 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp) } if (gfpflags_allow_blocking(gfp)) { xas_unlock_type(xas, lock_type); - xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp); + xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp); xas_lock_type(xas, lock_type); } else { - xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp); +
[RFC PATCH v5 10/16] xarray: Implement migration function for xa_node objects
Recently Slab Movable Objects (SMO) was implemented for the SLUB allocator. The XArray can take advantage of this and make the xa_node slab cache objects movable. Implement functions to migrate objects and activate SMO when we initialise the XArray slab cache. This is based on initial code by Matthew Wilcox and was modified to work with slab object migration. Cc: Matthew Wilcox Co-developed-by: Christoph Lameter Signed-off-by: Tobin C. Harding --- lib/xarray.c | 61 1 file changed, 61 insertions(+) diff --git a/lib/xarray.c b/lib/xarray.c index a528a5277c9d..c6b077f59e88 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -1993,12 +1993,73 @@ static void xa_node_ctor(void *arg) INIT_LIST_HEAD(>private_list); } +static void xa_object_migrate(struct xa_node *node, int numa_node) +{ + struct xarray *xa = READ_ONCE(node->array); + void __rcu **slot; + struct xa_node *new_node; + int i; + + /* Freed or not yet in tree then skip */ + if (!xa || xa == XA_RCU_FREE) + return; + + new_node = kmem_cache_alloc_node(xa_node_cachep, GFP_KERNEL, numa_node); + if (!new_node) { + pr_err("%s: slab cache allocation failed\n", __func__); + return; + } + + xa_lock_irq(xa); + + /* Check again. 
*/ + if (xa != node->array) { + node = new_node; + goto unlock; + } + + memcpy(new_node, node, sizeof(struct xa_node)); + + if (list_empty(>private_list)) + INIT_LIST_HEAD(_node->private_list); + else + list_replace(>private_list, _node->private_list); + + for (i = 0; i < XA_CHUNK_SIZE; i++) { + void *x = xa_entry_locked(xa, new_node, i); + + if (xa_is_node(x)) + rcu_assign_pointer(xa_to_node(x)->parent, new_node); + } + if (!new_node->parent) + slot = >xa_head; + else + slot = _parent_locked(xa, new_node)->slots[new_node->offset]; + rcu_assign_pointer(*slot, xa_mk_node(new_node)); + +unlock: + xa_unlock_irq(xa); + xa_node_free(node); + rcu_barrier(); +} + +static void xa_migrate(struct kmem_cache *s, void **objects, int nr, + int node, void *_unused) +{ + int i; + + for (i = 0; i < nr; i++) + xa_object_migrate(objects[i], node); +} + + void __init xarray_slabcache_init(void) { xa_node_cachep = kmem_cache_create("xarray_node", sizeof(struct xa_node), 0, SLAB_PANIC | SLAB_RECLAIM_ACCOUNT, xa_node_ctor); + kmem_cache_setup_mobility(xa_node_cachep, NULL, xa_migrate); } #ifdef XA_DEBUG -- 2.21.0
[RFC PATCH v5 03/16] slub: Sort slab cache list
It is advantageous to have all defragmentable slabs together at the beginning of the list of slabs so that there is no need to scan the complete list. Put defragmentable caches first when adding a slab cache and others last.

Co-developed-by: Christoph Lameter
Signed-off-by: Tobin C. Harding
---
 mm/slab_common.c | 2 +-
 mm/slub.c        | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..db5e9a0b1535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
 		goto out_free_cache;
 
 	s->refcount = 1;
-	list_add(&s->list, &slab_caches);
+	list_add_tail(&s->list, &slab_caches);
 	memcg_link_cache(s);
 out:
 	if (err)
diff --git a/mm/slub.c b/mm/slub.c
index 1c380a2bc78a..66d474397c0f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4333,6 +4333,8 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 		return;
 	}
 
+	mutex_lock(&slab_mutex);
+
 	s->isolate = isolate;
 	s->migrate = migrate;
 
@@ -4341,6 +4343,10 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 	 * to disable fast cmpxchg based processing.
 	 */
 	s->flags &= ~__CMPXCHG_DOUBLE;
+
+	list_move(&s->list, &slab_caches);	/* Move to top */
+
+	mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_mobility);
-- 
2.21.0
[RFC PATCH v5 04/16] slub: Slab defrag core
Internal fragmentation can occur within pages used by the slub allocator. Under some workloads large numbers of pages can be used by partial slab pages. This under-utilisation is bad simply because it wastes memory but also because if the system is under memory pressure higher order allocations may become difficult to satisfy. If we can defrag slab caches we can alleviate these problems. Implement Slab Movable Objects in order to defragment slab caches. Slab defragmentation may occur: 1. Unconditionally when __kmem_cache_shrink() is called on a slab cache by the kernel calling kmem_cache_shrink(). 2. Unconditionally through the use of the slabinfo command. slabinfo -s 3. Conditionally via the use of kmem_cache_defrag() - Use Slab Movable Objects when shrinking cache. Currently when the kernel calls kmem_cache_shrink() we curate the partial slabs list. If object migration is not enabled for the cache we still do this, if however, SMO is enabled we attempt to move objects in partially full slabs in order to defragment the cache. Shrink attempts to move all objects in order to reduce the cache to a single partial slab for each node. - Add conditional per node defrag via new function: kmem_defrag_slabs(int node). kmem_defrag_slabs() attempts to defragment all slab caches for node. Defragmentation is done conditionally dependent on MAX_PARTIAL _and_ defrag_used_ratio. Caches are only considered for defragmentation if the number of partial slabs exceeds MAX_PARTIAL (per node). Also, defragmentation only occurs if the usage ratio of the slab is lower than the configured percentage (sysfs field added in this patch). Fragmentation ratios are measured by calculating the percentage of objects in use compared to the total number of objects that the slab page can accommodate. The scanning of slab caches is optimized because the defragmentable slabs come first on the list. Thus we can terminate scans on the first slab encountered that does not support defragmentation. 
kmem_defrag_slabs() takes a node parameter. This can either be -1 if defragmentation should be performed on all nodes, or a node number. Defragmentation may be disabled by setting defrag ratio to 0 echo 0 > /sys/kernel/slab//defrag_used_ratio - Add a defrag ratio sysfs field and set it to 30% by default. A limit of 30% specifies that more than 3 out of 10 available slots for objects need to be in use otherwise slab defragmentation will be attempted on the remaining objects. In order for a cache to be defragmentable the cache must support object migration (SMO). Enabling SMO for a cache is done via a call to the recently added function: void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func, kmem_cache_migrate_func); Co-developed-by: Christoph Lameter Signed-off-by: Tobin C. Harding --- Documentation/ABI/testing/sysfs-kernel-slab | 14 + include/linux/slab.h| 1 + include/linux/slub_def.h| 7 + mm/slub.c | 385 4 files changed, 334 insertions(+), 73 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index 29601d93a1c2..c6f129af035a 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -180,6 +180,20 @@ Description: list. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. +What: /sys/kernel/slab/cache/defrag_used_ratio +Date: May 2019 +KernelVersion: 5.2 +Contact: Christoph Lameter + Pekka Enberg , +Description: + The defrag_used_ratio file allows the control of how aggressive + slab fragmentation reduction works at reclaiming objects from + sparsely populated slabs. This is a percentage. If a slab has + less than this percentage of objects allocated then reclaim will + attempt to reclaim objects so that the whole slab page can be + freed. 0% specifies no reclaim attempt (defrag disabled), 100% + specifies attempt to reclaim all pages. The default is 30%. 
+ What: /sys/kernel/slab/cache/deactivate_to_tail Date: February 2008 KernelVersion: 2.6.25 diff --git a/include/linux/slab.h b/include/linux/slab.h index 886fc130334d..4bf381b34829 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name, void (*ctor)(void *)); void kmem_cache_destroy(struct kmem_cache *); int kmem_cache_shrink(struct kmem_cache *); +unsigned long kmem_defrag_slabs(int node); void memcg_create_kmem_cache(struct
[RFC PATCH v5 02/16] tools/vm/slabinfo: Add support for -C and -M options
-C lists caches that use a ctor. -M lists caches that support object migration. Add command line options to show caches with a constructor and caches that are movable (i.e. have migrate function). Co-developed-by: Christoph Lameter Signed-off-by: Tobin C. Harding --- tools/vm/slabinfo.c | 40 1 file changed, 36 insertions(+), 4 deletions(-) diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c index 73818f1b2ef8..cbfc56c44c2f 100644 --- a/tools/vm/slabinfo.c +++ b/tools/vm/slabinfo.c @@ -33,6 +33,7 @@ struct slabinfo { unsigned int hwcache_align, object_size, objs_per_slab; unsigned int sanity_checks, slab_size, store_user, trace; int order, poison, reclaim_account, red_zone; + int movable, ctor; unsigned long partial, objects, slabs, objects_partial, objects_total; unsigned long alloc_fastpath, alloc_slowpath; unsigned long free_fastpath, free_slowpath; @@ -67,6 +68,8 @@ int show_report; int show_alias; int show_slab; int skip_zero = 1; +int show_movable; +int show_ctor; int show_numa; int show_track; int show_first_alias; @@ -109,11 +112,13 @@ static void fatal(const char *x, ...) static void usage(void) { - printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n" - "slabinfo [-aADefhilnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n" + printf("slabinfo 4/15/2017. 
(c) 2007 sgi/(c) 2011 Linux Foundation/(c) 2017 Jump Trading LLC.\n\n" + "slabinfo [-aACDefhilMnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n" + "-a|--aliases Show aliases\n" "-A|--activity Most active slabs first\n" "-B|--Bytes Show size in bytes\n" + "-C|--ctor Show slabs with ctors\n" "-D|--display-activeSwitch line format to activity\n" "-e|--empty Show empty slabs\n" "-f|--first-alias Show first alias\n" @@ -121,6 +126,7 @@ static void usage(void) "-i|--inverted Inverted list\n" "-l|--slabs Show slabs\n" "-L|--Loss Sort by loss\n" + "-M|--movable Show caches that support movable objects\n" "-n|--numa Show NUMA information\n" "-N|--lines=K Show the first K slabs\n" "-o|--ops Show kmem_cache_ops\n" @@ -588,6 +594,12 @@ static void slabcache(struct slabinfo *s) if (show_empty && s->slabs) return; + if (show_ctor && !s->ctor) + return; + + if (show_movable && !s->movable) + return; + if (sort_loss == 0) store_size(size_str, slab_size(s)); else @@ -602,6 +614,10 @@ static void slabcache(struct slabinfo *s) *p++ = '*'; if (s->cache_dma) *p++ = 'd'; + if (s->ctor) + *p++ = 'C'; + if (s->movable) + *p++ = 'M'; if (s->hwcache_align) *p++ = 'A'; if (s->poison) @@ -636,7 +652,8 @@ static void slabcache(struct slabinfo *s) printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n", s->name, s->objects, s->object_size, size_str, dist_str, s->objs_per_slab, s->order, - s->slabs ? (s->partial * 100) / s->slabs : 100, + s->slabs ? (s->partial * 100) / + (s->slabs * s->objs_per_slab) : 100, s->slabs ? 
(s->objects * s->object_size * 100) / (s->slabs * (page_size << s->order)) : 100, flags); @@ -1256,6 +1273,13 @@ static void read_slab_dir(void) slab->alloc_node_mismatch = get_obj("alloc_node_mismatch"); slab->deactivate_bypass = get_obj("deactivate_bypass"); chdir(".."); + if (read_slab_obj(slab, "ops")) { + if (strstr(buffer, "ctor :")) + slab->ctor = 1; + if (strstr(buffer, "migrate :")) + slab->movable = 1; + } + if (slab->name[0] == ':') alias_targets++; slab++; @@ -1332,6 +1356,8 @@ static void xtotals(void) } struct option opts[] = { + { "ctor", no_argument, NULL, 'C' }, + { "movable", no_argument, NULL, 'M' }, { "aliases", no_argument, NULL, 'a' }, { "activity", no_argument, NULL, 'A' }, { "debug", optional_argument, NULL, 'd' }, @@ -1367,7 +1393,7 @@ int main(int argc, char *argv[]) page_size = getpagesize(); -
[RFC PATCH v5 00/16] Slab Movable Objects (SMO)
Hi,

Another iteration of the SMO patch set, updates to this version are restricted to the XArray patches (#9 and #10, tested with the module implemented in #11). Applies on top of Linus' tree (tag: v5.2-rc1).

This is a patch set implementing movable objects within the SLUB allocator. This is work based on Christopher Lameter's patch set:

 https://lore.kernel.org/patchwork/project/lkml/list/?series=377335

The original code logic is from that set and implemented by Christopher. Clean up, refactoring, documentation, and additional features by myself. Responsibility for any bugs remaining falls solely with myself.

I am intending on sending a non-RFC version soon after this one (if the XArray stuff is ok). If anyone has any objections to SMO in general please yell at me now.

Changes to this version:

Patch XArray to use a separate slab cache. Currently the radix tree and XArray use the same slab cache. Radix tree nodes can not be moved but XArray nodes can.

Matthew, does this fit in ok with your plans for the XArray and radix tree? I don't really like the function names used here or the init function name (xarray_slabcache_init()). If there is a better way to do this please mercilessly correct me :)

Thanks for looking at this,
Tobin.

Tobin C. Harding (16):
  slub: Add isolate() and migrate() methods
  tools/vm/slabinfo: Add support for -C and -M options
  slub: Sort slab cache list
  slub: Slab defrag core
  tools/vm/slabinfo: Add remote node defrag ratio output
  tools/vm/slabinfo: Add defrag_used_ratio output
  tools/testing/slab: Add object migration test module
  tools/testing/slab: Add object migration test suite
  lib: Separate radix_tree_node and xa_node slab cache
  xarray: Implement migration function for xa_node objects
  tools/testing/slab: Add XArray movable objects tests
  slub: Enable moving objects to/from specific nodes
  slub: Enable balancing slabs across nodes
  dcache: Provide a dentry constructor
  dcache: Implement partial shrink via Slab Movable Objects
  dcache: Add CONFIG_DCACHE_SMO

 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 fs/dcache.c                                 | 110 ++-
 include/linux/slab.h                        |  71 ++
 include/linux/slub_def.h                    |  10 +
 include/linux/xarray.h                      |   3 +
 init/main.c                                 |   2 +
 lib/radix-tree.c                            |   2 +-
 lib/xarray.c                                | 109 ++-
 mm/Kconfig                                  |  14 +
 mm/slab_common.c                            |   2 +-
 mm/slub.c                                   | 819 ++++++++++++++++++--
 tools/testing/slab/Makefile                 |  10 +
 tools/testing/slab/slub_defrag.c            | 567 ++++++++++++++
 tools/testing/slab/slub_defrag.py           | 451 +++++++++++
 tools/testing/slab/slub_defrag_xarray.c     | 211 +++++
 tools/vm/slabinfo.c                         |  51 +-
 16 files changed, 2343 insertions(+), 103 deletions(-)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c
 create mode 100755 tools/testing/slab/slub_defrag.py
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

-- 
2.21.0
[RFC PATCH v5 01/16] slub: Add isolate() and migrate() methods
Add the two methods needed for moving objects and enable the display of the callbacks via the /sys/kernel/slab interface. Add documentation explaining the use of these methods and the prototypes for slab.h. Add functions to setup the callbacks method for a slab cache. Add empty functions for SLAB/SLOB. The API is generic so it could be theoretically implemented for these allocators as well. Change sysfs 'ctor' field to be 'ops' to contain all the callback operations defined for a slab cache. Display the existing 'ctor' callback in the ops fields contents along with 'isolate' and 'migrate' callbacks. Co-developed-by: Christoph Lameter Signed-off-by: Tobin C. Harding --- include/linux/slab.h | 70 include/linux/slub_def.h | 3 ++ mm/slub.c| 59 + 3 files changed, 126 insertions(+), 6 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 9449b19c5f10..886fc130334d 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -154,6 +154,76 @@ void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); void memcg_deactivate_kmem_caches(struct mem_cgroup *); void memcg_destroy_kmem_caches(struct mem_cgroup *); +/* + * Function prototypes passed to kmem_cache_setup_mobility() to enable + * mobile objects and targeted reclaim in slab caches. + */ + +/** + * typedef kmem_cache_isolate_func - Object migration callback function. + * @s: The cache we are working on. + * @ptr: Pointer to an array of pointers to the objects to isolate. + * @nr: Number of objects in @ptr array. + * + * The purpose of kmem_cache_isolate_func() is to pin each object so that + * they cannot be freed until kmem_cache_migrate_func() has processed + * them. This may be accomplished by increasing the refcount or setting + * a flag. + * + * The object pointer array passed is also passed to + * kmem_cache_migrate_func(). The function may remove objects from the + * array by setting pointers to %NULL. 
This is useful if we can + * determine that an object is being freed because + * kmem_cache_isolate_func() was called when the subsystem was calling + * kmem_cache_free(). In that case it is not necessary to increase the + * refcount or specially mark the object because the release of the slab + * lock will lead to the immediate freeing of the object. + * + * Context: Called with locks held so that the slab objects cannot be + * freed. We are in an atomic context and no slab operations + * may be performed. + * Return: A pointer that is passed to the migrate function. If any + * objects cannot be touched at this point then the pointer may + * indicate a failure and then the migration function can simply + * remove the references that were already obtained. The private + * data could be used to track the objects that were already pinned. + */ +typedef void *kmem_cache_isolate_func(struct kmem_cache *s, void **ptr, int nr); + +/** + * typedef kmem_cache_migrate_func - Object migration callback function. + * @s: The cache we are working on. + * @ptr: Pointer to an array of pointers to the objects to migrate. + * @nr: Number of objects in @ptr array. + * @node: The NUMA node where the object should be allocated. + * @private: The pointer returned by kmem_cache_isolate_func(). + * + * This function is responsible for migrating objects. Typically, for + * each object in the input array you will want to allocate an new + * object, copy the original object, update any pointers, and free the + * old object. + * + * After this function returns all pointers to the old object should now + * point to the new object. + * + * Context: Called with no locks held and interrupts enabled. Sleeping + * is possible. Any operation may be performed. + */ +typedef void kmem_cache_migrate_func(struct kmem_cache *s, void **ptr, +int nr, int node, void *private); + +/* + * kmem_cache_setup_mobility() is used to setup callbacks for a slab cache. 
+ */ +#ifdef CONFIG_SLUB +void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func, + kmem_cache_migrate_func); +#else +static inline void +kmem_cache_setup_mobility(struct kmem_cache *s, kmem_cache_isolate_func isolate, + kmem_cache_migrate_func migrate) {} +#endif + /* * Please use this macro to create slab caches. Simply specify the * name of the structure and maybe some flags that are listed above. diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index d2153789bd9f..2879a2f5f8eb 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -99,6 +99,9 @@ struct kmem_cache { gfp_t allocflags; /* gfp flags to use on each alloc */ int refcount; /* Refcount for slab cache destroy */ void (*ctor)(void *); + kmem_cache_isolate_func *isolate; +
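The isolate/migrate pair described in the kernel-doc above is a two-phase protocol: first pin every object under the slab lock, then move them with no locks held. The kernel API (`kmem_cache_setup_mobility()` and the SLUB internals) cannot run outside the kernel, so the sketch below is a user-space model of just the protocol; `struct obj`, `obj_isolate()` and `obj_migrate()` are illustrative names, not part of the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model of the two-phase isolate/migrate protocol. */
struct obj {
	int refcnt;	/* stands in for a real refcount or object flag */
	int data;
};

/* Phase 1: pin every object so it cannot be freed during migration.
 * Called with "locks held"; must not sleep or allocate in the real API. */
static void *obj_isolate(struct obj **ptr, int nr)
{
	for (int i = 0; i < nr; i++)
		ptr[i]->refcnt++;
	return NULL;	/* no private state needed in this simple model */
}

/* Phase 2: allocate replacements, copy, drop the pin, free originals.
 * Called with no locks held; sleeping allocation is allowed here. */
static void obj_migrate(struct obj **ptr, int nr, void *private)
{
	(void)private;
	for (int i = 0; i < nr; i++) {
		struct obj *new = malloc(sizeof(*new));
		*new = *ptr[i];
		new->refcnt--;	/* release the isolation pin */
		free(ptr[i]);
		ptr[i] = new;	/* all pointers now reach the new object */
	}
}
```

The model mirrors the documented contract: anything pinned in phase 1 must be unpinned in phase 2, and the pointer array is the shared hand-off between the two callbacks.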
[PATCH v1 5/9] perf diff: Use hists to manage basic blocks
The function hist__account_cycles() can account cycles per basic block. The basic block information are saved in a per-symbol cycles_hist structure. This patch processes each symbol, get basic blocks from cycles_hist and add the basic block entry to a hists. Using a hists is because we need to compare, sort and print the basic blocks. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 145 ++ 1 file changed, 145 insertions(+) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index e067ac6..09551fe 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -20,6 +20,7 @@ #include "util/data.h" #include "util/config.h" #include "util/time-utils.h" +#include "util/annotate.h" #include #include @@ -30,6 +31,9 @@ struct block_hists { struct histssym_hists; struct perf_hpp_listsym_list; struct perf_hpp_fmt sym_fmt; + struct histshists; + struct perf_hpp_listlist; + struct perf_hpp_fmt fmt; }; struct perf_diff { @@ -73,6 +77,8 @@ struct data__file { struct block_hists block_hists; }; +static struct addr_location dummy_al; + static struct data__file *data__files; static int data__files_cnt; @@ -378,6 +384,7 @@ static int diff__process_sample_event(struct perf_tool *tool, pr_warning("problem counting symbol for basic block\n"); goto out_put; } + hist__account_cycles(sample->branch_stack, , sample, false); } else { if (!hists__add_entry(hists, , NULL, NULL, NULL, sample, true)) { @@ -938,13 +945,149 @@ static int block_sym_hists_init(struct block_hists *hists) return 0; } +static int64_t block_cmp(struct perf_hpp_fmt *fmt __maybe_unused, +struct hist_entry *left, struct hist_entry *right) +{ + struct block_info *bi_l = left->block_info; + struct block_info *bi_r = right->block_info; + int cmp; + + if (!bi_l->sym || !bi_r->sym) { + if (!bi_l->sym && !bi_r->sym) + return 0; + else if (!bi_l->sym) + return -1; + else + return 1; + } + + if (bi_l->sym == bi_r->sym) { + if (bi_l->start == bi_r->start) { + if (bi_l->end == bi_r->end) + return 0; 
+ else + return (int64_t)(bi_r->end - bi_l->end); + } else + return (int64_t)(bi_r->start - bi_l->start); + } else { + cmp = strcmp(bi_l->sym->name, bi_r->sym->name); + return cmp; + } + + if (bi_l->sym->start != bi_r->sym->start) + return (int64_t)(bi_r->sym->start - bi_l->sym->start); + + return (int64_t)(bi_r->sym->end - bi_l->sym->end); +} + +static int block_hists_init(struct block_hists *hists) +{ + struct perf_hpp_fmt *fmt; + + __hists__init(>hists, >list); + perf_hpp_list__init(>list); + fmt = >fmt; + INIT_LIST_HEAD(>list); + INIT_LIST_HEAD(>sort_list); + + fmt->cmp = block_cmp; + + perf_hpp_list__register_sort_field(>list, fmt); + return 0; +} + +static void *block_he_zalloc(size_t size) +{ + return zalloc(size + sizeof(struct hist_entry)); +} + +static void block_he_free(void *he) +{ + struct block_info *bi = ((struct hist_entry *)he)->block_info; + + block_info__put(bi); + free(he); +} + +struct hist_entry_ops block_he_ops = { + .new= block_he_zalloc, + .free = block_he_free, +}; + +static void init_block_info(struct block_info *bi, struct symbol *sym, + struct cyc_hist *ch, int offset) +{ + bi->sym = sym; + bi->start = ch->start; + bi->end = offset; + bi->cycles = ch->cycles; + bi->cycles_aggr = ch->cycles_aggr; + bi->num = ch->num; + bi->num_aggr = ch->num_aggr; +} + +static int process_block_per_sym(struct data__file *d) +{ + struct hists *hists = >block_hists.sym_hists; + struct rb_root_cached *root = hists->entries_in; + struct rb_node *next = rb_first_cached(root); + struct annotation *notes; + struct cyc_hist *ch; + unsigned int i; + struct block_info *bi; + struct hist_entry *he, *he_block; + + while (next != NULL) { + he = rb_entry(next, struct hist_entry, rb_node_in); + next = rb_next(>rb_node_in); + + if (!he->ms.map || !he->ms.sym) + continue; + + notes = symbol__annotation(he->ms.sym); + if (!notes || !notes->src || !notes->src->cycles_hist) + continue; + + ch = notes->src->cycles_hist; +
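The `block_cmp()` in the patch keys basic blocks hierarchically on (symbol, start, end); note that every branch of its if/else chain returns, so the trailing `sym->start`/`sym->end` comparisons are unreachable. A minimal user-space model of the intended ordering, with an illustrative `struct block` standing in for `struct block_info`:

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>

/* Illustrative stand-in for struct block_info: only the keyed fields. */
struct block {
	const char *sym;	/* function name, may be NULL */
	uint64_t start, end;	/* block boundaries within the function */
};

/* Hierarchical comparison modeled on block_cmp(): NULL symbols sort
 * first, then symbol name, then start address, then end address. */
static int64_t block_cmp(const struct block *l, const struct block *r)
{
	if (!l->sym || !r->sym)
		return (l->sym ? 1 : 0) - (r->sym ? 1 : 0);

	int cmp = strcmp(l->sym, r->sym);
	if (cmp)
		return cmp;
	if (l->start != r->start)
		return (int64_t)(r->start - l->start);
	return (int64_t)(r->end - l->end);
}
```

Two blocks compare equal only when function, start and end all match, which is exactly the identity the later matching patch in this series relies on.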
[PATCH v1 4/9] perf diff: Get a list of symbols(functions)
We already have a function hist__account_cycles() which can be used to account cycles per basic block in symbol/function. But we also need to know what the symbols are, since we need to get basic blocks of all symbols(functions) before diff. This patch records the sorted symbols in sym_hists, which will be used in next patch for accounting cycles per basic block per function. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 81 ++- 1 file changed, 73 insertions(+), 8 deletions(-) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 6023f8c..e067ac6 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -26,6 +26,12 @@ #include #include +struct block_hists { + struct histssym_hists; + struct perf_hpp_listsym_list; + struct perf_hpp_fmt sym_fmt; +}; + struct perf_diff { struct perf_tool tool; const char *time_str; @@ -33,6 +39,8 @@ struct perf_diff { int range_size; int range_num; bool has_br_stack; + bool basic_block; + struct block_hists *block_hists; }; /* Diff command specific HPP columns. 
*/ @@ -62,6 +70,7 @@ struct data__file { int idx; struct hists*hists; struct diff_hpp_fmt fmt[PERF_HPP_DIFF__MAX_INDEX]; + struct block_hists block_hists; }; static struct data__file *data__files; @@ -363,9 +372,18 @@ static int diff__process_sample_event(struct perf_tool *tool, goto out_put; } - if (!hists__add_entry(hists, , NULL, NULL, NULL, sample, true)) { - pr_warning("problem incrementing symbol period, skipping event\n"); - goto out_put; + if (pdiff->has_br_stack && pdiff->basic_block) { + if (!hists__add_entry(>block_hists->sym_hists, , + NULL, NULL, NULL, sample, false)) { + pr_warning("problem counting symbol for basic block\n"); + goto out_put; + } + } else { + if (!hists__add_entry(hists, , NULL, NULL, NULL, + sample, true)) { + pr_warning("problem incrementing symbol period, skipping event\n"); + goto out_put; + } } /* @@ -899,6 +917,37 @@ static int check_file_brstack(void) return 0; } +static int64_t symbol_se_cmp(struct perf_hpp_fmt *fmt __maybe_unused, +struct hist_entry *a, struct hist_entry *b) +{ + return sort_sym.se_cmp(a, b); +} + +static int block_sym_hists_init(struct block_hists *hists) +{ + struct perf_hpp_fmt *fmt; + + __hists__init(>sym_hists, >sym_list); + perf_hpp_list__init(>sym_list); + fmt = >sym_fmt; + INIT_LIST_HEAD(>list); + INIT_LIST_HEAD(>sort_list); + + fmt->cmp = symbol_se_cmp; + perf_hpp_list__register_sort_field(>sym_list, fmt); + return 0; +} + +static void basic_block_process(void) +{ + struct data__file *d; + int i; + + data__for_each_file(i, d) { + hists__delete_entries(>block_hists.sym_hists); + } +} + static int __cmd_diff(void) { struct data__file *d; @@ -937,6 +986,16 @@ static int __cmd_diff(void) goto out_delete; } + if (pdiff.has_br_stack && pdiff.basic_block) { + ret = block_sym_hists_init(>block_hists); + if (ret) { + pr_err("Failed to initialize basic block hists\n"); + goto out_delete; + } + + pdiff.block_hists = >block_hists; + } + ret = perf_session__process_events(d->session); if (ret) { pr_err("Failed to 
process %s\n", d->data.path); @@ -949,7 +1008,10 @@ static int __cmd_diff(void) zfree(_range); } - data_process(); + if (pdiff.has_br_stack && pdiff.basic_block) + basic_block_process(); + else + data_process(); out_delete: data__for_each_file(i, d) { @@ -1019,6 +1081,8 @@ static const struct option options[] = { "only consider symbols in these pids"), OPT_STRING(0, "tid", _conf.tid_list_str, "tid[,tid...]", "only consider symbols in these tids"), + OPT_BOOLEAN(0, "basic-block", _block, + "display the differential program basic block"), OPT_END() }; @@ -1517,10 +1581,11 @@ int cmd_diff(int argc, const char **argv) if (data_init(argc, argv)
[PATCH v1 0/9] perf diff: diff cycles at basic block level
In some cases small changes in hot loops can show big differences. But it's difficult to identify these differences. perf diff currently can only diff symbols (functions). We can also expand it to diff cycles of individual programs blocks as reported by timed LBR. This would allow to identify changes in specific code accurately. With this patch set, for example, perf record -b ./div perf record -b ./div perf diff --basic-block # Cycles diff Basic block (start:end) # ... ... # -20 native_write_msr (7fff9a069900:7fff9a06990b) -3 __indirect_thunk_start (7fff9ac02ca0:7fff9ac02ca0) 1 __indirect_thunk_start (7fff9ac02cac:7fff9ac02cb0) 0 rand@plt (490:490) 0 rand (3af60:3af64) 0 rand (3af69:3af6d) 0 main (4e8:4ea) 0 main (4ef:500) 0 main (4ef:535) 0 compute_flag (640:644) 0 compute_flag (649:659) 0 __random_r (3ac40:3ac76) 0 __random_r (3ac40:3ac88) 0 __random_r (3ac90:3ac9c) 0 __random (3aac0:3aad2) 0 __random (3aae0:3aae7) 0 __random (3ab03:3ab0f) 0 __random (3ab14:3ab1b) 0 __random (3ab28:3ab2e) 0 __random (3ab4a:3ab53) The "basic-block" is a new perf-diff option, which enables the displaying of cycles difference of same program basic block amongst two or more perf.data. The program basic block is the code block between two branches in a function. Jin Yao (9): perf util: Create block_info structure perf util: Add block_info in hist_entry perf diff: Check if all data files with branch stacks perf diff: Get a list of symbols(functions) perf diff: Use hists to manage basic blocks perf diff: Link same basic blocks among different data files perf diff: Compute cycles diff of basic blocks perf diff: Print the basic block cycles diff perf diff: Documentation --basic-block option tools/perf/Documentation/perf-diff.txt | 5 + tools/perf/builtin-diff.c | 521 - tools/perf/util/hist.c | 24 +- tools/perf/util/hist.h | 6 + tools/perf/util/sort.h | 4 + tools/perf/util/symbol.c | 22 ++ tools/perf/util/symbol.h | 23 ++ 7 files changed, 594 insertions(+), 11 deletions(-) -- 2.7.4
[PATCH v1 3/9] perf diff: Check if all data files with branch stacks
We will expand perf diff to support diffing cycles of individual program blocks, which requires that all data files have branch stacks. This patch checks HEADER_BRANCH_STACK in the header, and only sets the has_br_stack flag when HEADER_BRANCH_STACK is set in all data files. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 6e79207..6023f8c 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -32,6 +32,7 @@ struct perf_diff { struct perf_time_interval *ptime_range; int range_size; int range_num; + bool has_br_stack; }; /* Diff command specific HPP columns. */ @@ -873,12 +874,41 @@ static int parse_time_str(struct data__file *d, char *abstime_ostr, return ret; } +static int check_file_brstack(void) +{ + struct data__file *d; + bool has_br_stack; + int i; + + data__for_each_file(i, d) { + d->session = perf_session__new(>data, false, ); + if (!d->session) { + pr_err("Failed to open %s\n", d->data.path); + return -1; + } + + has_br_stack = perf_header__has_feat(>session->header, +HEADER_BRANCH_STACK); + perf_session__delete(d->session); + if (!has_br_stack) + return 0; + } + + /* Set only if all files have branch stacks */ + pdiff.has_br_stack = true; + return 0; +} + static int __cmd_diff(void) { struct data__file *d; int ret, i; char *abstime_ostr, *abstime_tmp; + ret = check_file_brstack(); + if (ret) + return ret; + ret = abstime_str_dup(_ostr); if (ret) return ret; -- 2.7.4
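The check above is a simple all-or-nothing predicate: basic-block mode is enabled only if every input file recorded branch stacks. A hedged user-space model (the `struct file`/`all_have_brstack` names are illustrative, not the perf ones):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of check_file_brstack(): one plain perf.data disables the mode. */
struct file {
	bool has_branch_stack;	/* HEADER_BRANCH_STACK feature bit */
};

static bool all_have_brstack(const struct file *files, int n)
{
	for (int i = 0; i < n; i++)
		if (!files[i].has_branch_stack)
			return false;	/* bail out on the first file without LBR data */
	return true;		/* only now may has_br_stack be set */
}
```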
[PATCH v1 2/9] perf util: Add block_info in hist_entry
The block_info contains the program basic block information, i.e, contains the start address and the end address of this basic block and how much cycles it takes. We need to compare, sort and even print out the basic block by some orders, i.e. sort by cycles. For this purpose, we add block_info field to hist_entry. In order not to impact current interface, we creates a new function hists__add_entry_block. Signed-off-by: Jin Yao --- tools/perf/util/hist.c | 22 -- tools/perf/util/hist.h | 6 ++ tools/perf/util/sort.h | 1 + 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c index 7ace7a10..3810460 100644 --- a/tools/perf/util/hist.c +++ b/tools/perf/util/hist.c @@ -574,6 +574,8 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists, */ mem_info__zput(entry->mem_info); + block_info__zput(entry->block_info); + /* If the map of an existing hist_entry has * become out-of-date due to an exec() or * similar, update it. Otherwise we will @@ -645,6 +647,7 @@ __hists__add_entry(struct hists *hists, struct symbol *sym_parent, struct branch_info *bi, struct mem_info *mi, + struct block_info *block_info, struct perf_sample *sample, bool sample_self, struct hist_entry_ops *ops) @@ -677,6 +680,7 @@ __hists__add_entry(struct hists *hists, .hists = hists, .branch_info = bi, .mem_info = mi, + .block_info = block_info, .transaction = sample->transaction, .raw_data = sample->raw_data, .raw_size = sample->raw_size, @@ -699,7 +703,7 @@ struct hist_entry *hists__add_entry(struct hists *hists, struct perf_sample *sample, bool sample_self) { - return __hists__add_entry(hists, al, sym_parent, bi, mi, + return __hists__add_entry(hists, al, sym_parent, bi, mi, NULL, sample, sample_self, NULL); } @@ -712,10 +716,24 @@ struct hist_entry *hists__add_entry_ops(struct hists *hists, struct perf_sample *sample, bool sample_self) { - return __hists__add_entry(hists, al, sym_parent, bi, mi, + return __hists__add_entry(hists, al, 
sym_parent, bi, mi, NULL, sample, sample_self, ops); } +struct hist_entry *hists__add_entry_block(struct hists *hists, + struct hist_entry_ops *ops, + struct addr_location *al, + struct block_info *block_info) +{ + struct hist_entry entry = { + .ops = ops, + .block_info = block_info, + .hists = hists, + }, *he = hists__findnew_entry(hists, , al, false); + + return he; +} + static int iter_next_nop_entry(struct hist_entry_iter *iter __maybe_unused, struct addr_location *al __maybe_unused) diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h index 76ff6c6..c8f7d66 100644 --- a/tools/perf/util/hist.h +++ b/tools/perf/util/hist.h @@ -16,6 +16,7 @@ struct addr_location; struct map_symbol; struct mem_info; struct branch_info; +struct block_info; struct symbol; enum hist_filter { @@ -149,6 +150,11 @@ struct hist_entry *hists__add_entry_ops(struct hists *hists, struct perf_sample *sample, bool sample_self); +struct hist_entry *hists__add_entry_block(struct hists *hists, + struct hist_entry_ops *ops, + struct addr_location *al, + struct block_info *bi); + int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al, int max_stack_depth, void *arg); diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h index ce376a7..43623fa 100644 --- a/tools/perf/util/sort.h +++ b/tools/perf/util/sort.h @@ -144,6 +144,7 @@ struct hist_entry { longtime; struct hists*hists; struct mem_info *mem_info; + struct block_info *block_info; void*raw_data; u32 raw_size; int num_res; -- 2.7.4
[PATCH v1 1/9] perf util: Create block_info structure
perf diff currently can only diff symbols (functions). We should expand it to diff cycles of individual program blocks as reported by timed LBR. This would allow identifying changes in specific code accurately. We need a new structure to maintain the basic block information, such as the symbol (function), the start/end address of this block, and its cycles. This patch creates this structure along with some ops. Signed-off-by: Jin Yao --- tools/perf/util/symbol.c | 22 ++ tools/perf/util/symbol.h | 23 +++ 2 files changed, 45 insertions(+) diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c index 5cbad55..3a90e72 100644 --- a/tools/perf/util/symbol.c +++ b/tools/perf/util/symbol.c @@ -2262,3 +2262,25 @@ struct mem_info *mem_info__new(void) refcount_set(>refcnt, 1); return mi; } + +struct block_info *block_info__get(struct block_info *bi) +{ + if (bi) + refcount_inc(>refcnt); + return bi; +} + +void block_info__put(struct block_info *bi) +{ + if (bi && refcount_dec_and_test(>refcnt)) + free(bi); +} + +struct block_info *block_info__new(void) +{ + struct block_info *bi = zalloc(sizeof(*bi)); + + if (bi) + refcount_set(>refcnt, 1); + return bi; +} diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h index 9a8fe01..12755b4 100644 --- a/tools/perf/util/symbol.h +++ b/tools/perf/util/symbol.h @@ -131,6 +131,17 @@ struct mem_info { refcount_t refcnt; }; +struct block_info { + struct symbol *sym; + u64 start; + u64 end; + u64 cycles; + u64 cycles_aggr; + int num; + int num_aggr; + refcount_t refcnt; +}; + struct addr_location { struct machine *machine; struct thread *thread; @@ -332,4 +343,16 @@ static inline void __mem_info__zput(struct mem_info **mi) #define mem_info__zput(mi) __mem_info__zput() +struct block_info *block_info__new(void); +struct block_info *block_info__get(struct block_info *bi); +void block_info__put(struct block_info *bi); + +static inline void __block_info__zput(struct block_info **bi) +{ + block_info__put(*bi); + *bi = NULL; +} + +#define
block_info__zput(bi) __block_info__zput() + #endif /* __PERF_SYMBOL */ -- 2.7.4
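The new/get/put/zput lifetime rules above mirror the existing mem_info helpers. A user-space sketch of the same rules — the kernel tree uses `refcount_t`, so the plain `int` counter and the `int` return on put (added only so the sketch can be checked) are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model of the block_info reference counting above. */
struct block_info {
	int refcnt;
};

static struct block_info *block_info__new(void)
{
	struct block_info *bi = calloc(1, sizeof(*bi));
	if (bi)
		bi->refcnt = 1;		/* creator owns one reference */
	return bi;
}

static struct block_info *block_info__get(struct block_info *bi)
{
	if (bi)
		bi->refcnt++;
	return bi;
}

/* Returns 1 when the last reference was dropped (illustration only). */
static int block_info__put(struct block_info *bi)
{
	if (bi && --bi->refcnt == 0) {
		free(bi);
		return 1;
	}
	return 0;
}

/* zput: drop the reference and NULL the caller's pointer, as the macro does. */
static void block_info__zput(struct block_info **bi)
{
	block_info__put(*bi);
	*bi = NULL;
}
```

The zput form matters in the series: hists__findnew_entry() clears `entry->block_info` on the collapse path, so a stale pointer must never survive the put.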
[PATCH v1 9/9] perf diff: Documentation --basic-block option
Document the new option '--basic-block'. Signed-off-by: Jin Yao --- tools/perf/Documentation/perf-diff.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/tools/perf/Documentation/perf-diff.txt b/tools/perf/Documentation/perf-diff.txt index da7809b..b242af8 100644 --- a/tools/perf/Documentation/perf-diff.txt +++ b/tools/perf/Documentation/perf-diff.txt @@ -174,6 +174,11 @@ OPTIONS --tid=:: Only diff samples for given thread ID (comma separated list). +--basic-block:: + Display the cycles difference of same program basic block amongst + two or more perf.data. The program basic block is the code block + between two branches in a function. + COMPARISON -- The comparison is governed by the baseline file. The baseline perf.data -- 2.7.4
[PATCH v1 7/9] perf diff: Compute cycles diff of basic blocks
In previous patch, we have already linked up the same basic blocks. Now we compute the cycles diff value of basic blocks, in order to sort by diff cycles later. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 31 +++ tools/perf/util/sort.h| 2 ++ 2 files changed, 33 insertions(+) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 72c33ab..47e34a3 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -1132,6 +1132,31 @@ static void block_hists_match(struct hists *hists_base, } } +static void compute_block_hists_diff(struct block_hists *block_hists, +struct data__file *d) +{ + struct hists *hists = _hists->hists; + struct rb_root_cached *root = hists->entries_in; + struct rb_node *next = rb_first_cached(root); + + while (next != NULL) { + struct hist_entry *he = rb_entry(next, struct hist_entry, +rb_node_in); + struct hist_entry *pair = get_pair_data(he, d); + + next = rb_next(>rb_node_in); + + if (pair) { + pair->diff.computed = true; + if (pair->block_info->num && he->block_info->num) { + pair->diff.cycles_diff = + pair->block_info->cycles_aggr / pair->block_info->num_aggr - + he->block_info->cycles_aggr / he->block_info->num_aggr; + } + } + } +} + static void basic_block_process(void) { struct hists *hists_base = __files[0].block_hists.hists; @@ -1151,6 +1176,12 @@ static void basic_block_process(void) block_hists_match(hists_base, hists); } + data__for_each_file_new(i, d) { + hists = >block_hists.hists; + d->hists = hists; + compute_block_hists_diff(__files[0].block_hists, d); + } + data__for_each_file(i, d) { hists__delete_entries(>block_hists.sym_hists); hists__delete_entries(>block_hists.hists); diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h index 43623fa..de9e61a 100644 --- a/tools/perf/util/sort.h +++ b/tools/perf/util/sort.h @@ -79,6 +79,8 @@ struct hist_entry_diff { /* HISTC_WEIGHTED_DIFF */ s64 wdiff; + + s64 cycles_diff; }; }; -- 2.7.4
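The diff value computed above is the change in *average* cycles per execution of a block: aggregated cycles divided by the aggregated execution count, pair minus baseline. A small user-space model (`struct block_stats` and the guard on `num_aggr` are illustrative; the patch guards on `num`):

```c
#include <assert.h>
#include <stdint.h>

/* Model of compute_block_hists_diff(): diff of per-execution averages. */
struct block_stats {
	uint64_t cycles_aggr;	/* cycles summed over all executions */
	uint64_t num_aggr;	/* number of executions aggregated */
};

static int64_t block_cycles_diff(const struct block_stats *base,
				 const struct block_stats *pair)
{
	if (!base->num_aggr || !pair->num_aggr)
		return 0;	/* no samples on one side: nothing to compare */
	return (int64_t)(pair->cycles_aggr / pair->num_aggr) -
	       (int64_t)(base->cycles_aggr / base->num_aggr);
}
```

A negative result means the block got cheaper in the new file, matching entries like the "-20 native_write_msr" line in the cover letter's example output.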
[PATCH v1 8/9] perf diff: Print the basic block cycles diff
Currently we only support sorting by diff cycles. For example, perf record -b ./div perf record -b ./div perf diff --basic-block # Cycles diff Basic block (start:end) # ... ... # -20 native_write_msr (7fff9a069900:7fff9a06990b) -3 __indirect_thunk_start (7fff9ac02ca0:7fff9ac02ca0) 1 __indirect_thunk_start (7fff9ac02cac:7fff9ac02cb0) 0 rand@plt (490:490) 0 rand (3af60:3af64) 0 rand (3af69:3af6d) 0 main (4e8:4ea) 0 main (4ef:500) 0 main (4ef:535) 0 compute_flag (640:644) 0 compute_flag (649:659) 0 __random_r (3ac40:3ac76) 0 __random_r (3ac40:3ac88) 0 __random_r (3ac90:3ac9c) 0 __random (3aac0:3aad2) 0 __random (3aae0:3aae7) 0 __random (3ab03:3ab0f) 0 __random (3ab14:3ab1b) 0 __random (3ab28:3ab2e) 0 __random (3ab4a:3ab53) Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 168 ++ tools/perf/util/hist.c| 2 +- tools/perf/util/sort.h| 1 + 3 files changed, 170 insertions(+), 1 deletion(-) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 47e34a3..dbf242d 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -27,6 +27,12 @@ #include #include +struct block_hpp_fmt { + struct perf_hpp_fmt fmt; + struct data__file *file; + int width; +}; + struct block_hists { struct histssym_hists; struct perf_hpp_listsym_list; @@ -34,6 +40,8 @@ struct block_hists { struct histshists; struct perf_hpp_listlist; struct perf_hpp_fmt fmt; + struct block_hpp_fmtblock_fmt; + struct perf_hpp_fmt desc_fmt; }; struct perf_diff { @@ -1157,6 +1165,162 @@ static void compute_block_hists_diff(struct block_hists *block_hists, } } +static int64_t block_cycles_diff_cmp(struct perf_hpp_fmt *fmt, +struct hist_entry *left, +struct hist_entry *right) +{ + struct block_hpp_fmt *block_fmt = container_of(fmt, + struct block_hpp_fmt, + fmt); + struct data__file *d = block_fmt->file; + bool pairs_left = hist_entry__has_pairs(left); + bool pairs_right = hist_entry__has_pairs(right); + struct hist_entry *p_right, *p_left; + s64 l, r; + + if (!pairs_left && !pairs_right) + 
return 0; + + if (!pairs_left || !pairs_right) + return pairs_left ? -1 : 1; + + p_left = get_pair_data(left, d); + p_right = get_pair_data(right, d); + + if (!p_left && !p_right) + return 0; + + if (!p_left || !p_right) + return p_left ? -1 : 1; + + l = abs(p_left->diff.cycles_diff); + r = abs(p_right->diff.cycles_diff); + + return r - l; +} + +static int64_t block_diff_sort(struct perf_hpp_fmt *fmt, + struct hist_entry *left, struct hist_entry *right) +{ + return block_cycles_diff_cmp(fmt, right, left); +} + +static int block_diff_header(struct perf_hpp_fmt *fmt __maybe_unused, +struct perf_hpp *hpp, +struct hists *hists __maybe_unused, +int line __maybe_unused, +int *span __maybe_unused) +{ + return scnprintf(hpp->buf, hpp->size, "Cycles diff"); +} + +static int block_diff_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, + struct hist_entry *he) +{ + struct block_hpp_fmt *block_fmt = container_of(fmt, + struct block_hpp_fmt, + fmt); + struct data__file *d = block_fmt->file; + struct hist_entry *pair = get_pair_data(he, d); + + if (pair && pair->diff.computed) { + return scnprintf(hpp->buf, hpp->size, "%*ld", block_fmt->width, +pair->diff.cycles_diff); + } + + return scnprintf(hpp->buf, hpp->size, "%*s", block_fmt->width, " "); +} + +static int block_diff_width(struct perf_hpp_fmt *fmt, + struct perf_hpp *hpp __maybe_unused, + struct hists *hists __maybe_unused) +{ + struct block_hpp_fmt *block_fmt = + container_of(fmt, struct block_hpp_fmt, fmt); + + return block_fmt->width; +} + +static int block_sym_width(struct perf_hpp_fmt *fmt __maybe_unused, + struct perf_hpp *hpp __maybe_unused, + struct
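`block_cycles_diff_cmp()` above orders entries by the absolute value of the cycles diff (`r - l` on `abs()` values), so the biggest movers print first — as in the -20/-3/1/0… listing in the commit message. A hedged qsort-style model of that comparator:

```c
#include <assert.h>
#include <stdlib.h>

/* Model of the sort order: largest |cycles diff| first. */
static int cmp_abs_desc(const void *a, const void *b)
{
	long l = labs(*(const long *)a);
	long r = labs(*(const long *)b);
	/* avoids overflow a plain (r - l) subtraction could hit */
	return (r > l) - (r < l);
}
```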
[PATCH v1 6/9] perf diff: Link same basic blocks among different data files
The target is to compare the performance difference (cycles diff) for the same basic blocks in different data files. The same basic block means same function, same start address and same end address. This patch finds the same basic blocks from different data files and link them together. Signed-off-by: Jin Yao --- tools/perf/builtin-diff.c | 66 +++ 1 file changed, 66 insertions(+) diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c index 09551fe..72c33ab 100644 --- a/tools/perf/builtin-diff.c +++ b/tools/perf/builtin-diff.c @@ -1073,8 +1073,69 @@ static int process_block_per_sym(struct data__file *d) return 0; } +static int block_pair_cmp(struct hist_entry *a, struct hist_entry *b) +{ + struct block_info *bi_a = a->block_info; + struct block_info *bi_b = b->block_info; + int cmp; + + if (!bi_a->sym || !bi_b->sym) + return -1; + + if (bi_a->sym->name && bi_b->sym->name) { + cmp = strcmp(bi_a->sym->name, bi_b->sym->name); + if ((!cmp) && (bi_a->start == bi_b->start) && + (bi_a->end == bi_b->end)) { + return 0; + } + } + + return -1; +} + +static struct hist_entry *get_block_pair(struct hist_entry *he, +struct hists *hists_pair) +{ + struct rb_root_cached *root = hists_pair->entries_in; + struct rb_node *next = rb_first_cached(root); + int cmp; + + while (next != NULL) { + struct hist_entry *he_pair = rb_entry(next, struct hist_entry, + rb_node_in); + + next = rb_next(_pair->rb_node_in); + + cmp = block_pair_cmp(he_pair, he); + if (!cmp) + return he_pair; + } + + return NULL; +} + +static void block_hists_match(struct hists *hists_base, + struct hists *hists_pair) +{ + struct rb_root_cached *root = hists_base->entries_in; + struct rb_node *next = rb_first_cached(root); + + while (next != NULL) { + struct hist_entry *he = rb_entry(next, struct hist_entry, +rb_node_in); + struct hist_entry *pair = get_block_pair(he, hists_pair); + + next = rb_next(>rb_node_in); + + if (pair) + hist_entry__add_pair(pair, he); + } +} + static void basic_block_process(void) 
{ + struct hists *hists_base = __files[0].block_hists.hists; + struct hists *hists; struct data__file *d; int i; @@ -1085,6 +1146,11 @@ static void basic_block_process(void) process_block_per_sym(d); } + data__for_each_file_new(i, d) { + hists = >block_hists.hists; + block_hists_match(hists_base, hists); + } + data__for_each_file(i, d) { hists__delete_entries(>block_hists.sym_hists); hists__delete_entries(>block_hists.hists); -- 2.7.4
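`get_block_pair()`/`block_hists_match()` above pair a block from the baseline with a block from another file only when function name, start and end all agree, using a linear scan of the other file's entries (O(n·m) over the two lists). A user-space model with an illustrative `struct blk`:

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-in for the matched fields of struct block_info. */
struct blk {
	const char *sym;
	uint64_t start, end;
};

/* Model of get_block_pair(): exact (name, start, end) match or nothing. */
static const struct blk *find_pair(const struct blk *needle,
				   const struct blk *haystack, int n)
{
	for (int i = 0; i < n; i++) {
		const struct blk *b = &haystack[i];
		if (b->sym && needle->sym && !strcmp(b->sym, needle->sym) &&
		    b->start == needle->start && b->end == needle->end)
			return b;
	}
	return NULL;	/* block exists in only one perf.data: no diff shown */
}
```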
Re: [PATCH] mm/dev_pfn: Exclude MEMORY_DEVICE_PRIVATE while computing virtual address
On 05/18/2019 03:20 AM, Andrew Morton wrote: > On Fri, 17 May 2019 16:08:34 +0530 Anshuman Khandual > wrote: > >> The presence of struct page does not guarantee linear mapping for the pfn >> physical range. Device private memory which is non-coherent is excluded >> from linear mapping during devm_memremap_pages() though they will still >> have struct page coverage. Just check for device private memory before >> giving out virtual address for a given pfn. > > I was going to give my standard "what are the user-visible runtime > effects of this change?", but... > >> All these helper functions are all pfn_t related but could not figure out >> another way of determining a private pfn without looking into its struct >> page. pfn_t_to_virt() is not used anywhere in the mainline kernel. Is >> it used by out-of-tree drivers? Should we then drop it completely? > > Yeah, let's kill it. > > But first, let's fix it so that if someone brings it back, they bring > back a non-buggy version. Makes sense. > > So... what (would be) the user-visible runtime effects of this change? I am not very familiar with how users interact with the drivers which hotplug and manage ZONE_DEVICE memory in general. Hence I cannot comment on its user-visible runtime impact. I just figured this out from a code audit while testing ZONE_DEVICE on an arm64 platform. But the fix makes the function a bit more expensive as it now involves some additional memory references.
[PATCH V4 3/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
The arm64 page table dump code can race with concurrent modification of the kernel page tables. When leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. When intermediate levels of table are freed, the dump code will continue to use memory which has been freed and potentially reallocated for another purpose. In such cases, the dump code may dereference bogus addresses, leading to a number of potential problems. Intermediate levels of table may be freed during memory hot-remove, which will be enabled by a subsequent patch. To avoid racing with this, take the memory hotplug lock when walking the kernel page table. Acked-by: David Hildenbrand Signed-off-by: Anshuman Khandual --- arch/arm64/mm/ptdump_debugfs.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c index 064163f..80171d1 100644 --- a/arch/arm64/mm/ptdump_debugfs.c +++ b/arch/arm64/mm/ptdump_debugfs.c @@ -7,7 +7,10 @@ static int ptdump_show(struct seq_file *m, void *v) { struct ptdump_info *info = m->private; + + get_online_mems(); ptdump_walk_pgd(m, info); + put_online_mems(); return 0; } DEFINE_SHOW_ATTRIBUTE(ptdump); -- 2.7.4
[PATCH V4 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs entries between memory block and node. It first checks pfn validity with pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config (arm64 has this enabled) pfn_valid_within() calls pfn_valid(). pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID) which scans all mapped memblock regions with memblock_is_map_memory(). This creates a problem in memory hot remove path which has already removed given memory range from memory block with memblock_[remove|free] before arriving at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1 skipping subsequent sysfs_remove_link() calls leaving node <-> memory block sysfs entries as is. Subsequent memory add operation hits BUG_ON() because of existing sysfs entries. [ 62.007176] NUMA: Unknown node for memory at 0x68000, assuming node 0 [ 62.052517] [ cut here ] [ 62.053211] kernel BUG at mm/memory_hotplug.c:1143! [ 62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 62.054589] Modules linked in: [ 62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-4-g28cea40b2683 #41 [ 62.056274] Hardware name: linux,dummy-virt (DT) [ 62.057166] pstate: 4045 (nZcv daif +PAN -UAO) [ 62.058083] pc : add_memory_resource+0x1cc/0x1d8 [ 62.058961] lr : add_memory_resource+0x10c/0x1d8 [ 62.059842] sp : 168b3ce0 [ 62.060477] x29: 168b3ce0 x28: 8005db546c00 [ 62.061501] x27: x26: [ 62.062509] x25: 111ef000 x24: 111ef5d0 [ 62.063520] x23: x22: 0006bfff [ 62.064540] x21: ffef x20: 006c [ 62.065558] x19: 0068 x18: 0024 [ 62.066566] x17: x16: [ 62.067579] x15: x14: 8005e412e890 [ 62.068588] x13: 8005d6b105d8 x12: [ 62.069610] x11: 8005d6b10490 x10: 0040 [ 62.070615] x9 : 8005e412e898 x8 : 8005e412e890 [ 62.071631] x7 : 8005d6b105d8 x6 : 8005db546c00 [ 62.072640] x5 : 0001 x4 : 0002 [ 62.073654] x3 : 8005d7049480 x2 : 0002 [ 62.074666] x1 : 0003 x0 : ffef [ 62.075685] Process bash (pid: 3275, stack limit = 0xd754280f) 
[ 62.076930] Call trace: [ 62.077411] add_memory_resource+0x1cc/0x1d8 [ 62.078227] __add_memory+0x70/0xa8 [ 62.078901] probe_store+0xa4/0xc8 [ 62.079561] dev_attr_store+0x18/0x28 [ 62.080270] sysfs_kf_write+0x40/0x58 [ 62.080992] kernfs_fop_write+0xcc/0x1d8 [ 62.081744] __vfs_write+0x18/0x40 [ 62.082400] vfs_write+0xa4/0x1b0 [ 62.083037] ksys_write+0x5c/0xc0 [ 62.083681] __arm64_sys_write+0x18/0x20 [ 62.084432] el0_svc_handler+0x88/0x100 [ 62.085177] el0_svc+0x8/0xc Re-ordering arch_remove_memory() with memblock_[free|remove] solves the problem on arm64 as pfn_valid() behaves correctly and returns positive as memblock for the address range still exists. arch_remove_memory() removes applicable memory sections from zone with __remove_pages() and tears down kernel linear mapping. Removing memblock regions afterwards is safe because there is no other memblock (bootmem) allocator user that late. So nobody is going to allocate from the removed range just to blow up later. Also nobody should be using the bootmem allocated range else we wouldn't allow to remove it. So reordering is indeed safe. Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Reviewed-by: Oscar Salvador Signed-off-by: Anshuman Khandual --- mm/memory_hotplug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 328878b..1dbda48 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1850,10 +1850,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size) /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); + arch_remove_memory(nid, start, size, NULL); memblock_free(start, size); memblock_remove(start, size); - arch_remove_memory(nid, start, size, NULL); __release_memory_resource(start, size); try_offline_node(nid); -- 2.7.4
[PATCH V4 2/4] arm64/mm: Inhibit huge-vmap with ptdump
From: Mark Rutland

The arm64 ptdump code can race with concurrent modification of the kernel page tables. At the time this was added, this was sound as:

* Modifications to leaf entries could result in stale information being logged, but would not result in a functional problem.

* Boot time modifications to non-leaf entries (e.g. freeing of initmem) were performed when the ptdump code cannot be invoked.

* At runtime, modifications to non-leaf entries only occurred in the vmalloc region, and these were strictly additive, as intermediate entries were never freed.

However, since commit:

  commit 324420bf91f6 ("arm64: add support for ioremap() block mappings")

... it has been possible to create huge mappings in the vmalloc area at runtime, and as part of this, existing intermediate levels of table may be removed and freed.

It's possible for the ptdump code to race with this, and continue to walk tables which have been freed (and potentially poisoned or reallocated). As a result of this, the ptdump code may dereference bogus addresses, which could be fatal.

Since huge-vmap is a TLB and memory optimization, we can disable it when the runtime ptdump code is in use to avoid this problem.

Fixes: 324420bf91f60582 ("arm64: add support for ioremap() block mappings")
Acked-by: Ard Biesheuvel
Signed-off-by: Mark Rutland
Signed-off-by: Anshuman Khandual
Cc: Ard Biesheuvel
Cc: Catalin Marinas
Cc: Will Deacon
---
 arch/arm64/mm/mmu.c | 11 +++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a170c63..a1bfc44 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -955,13 +955,18 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys)

 int __init arch_ioremap_pud_supported(void)
 {
-	/* only 4k granule supports level 1 block mappings */
-	return IS_ENABLED(CONFIG_ARM64_4K_PAGES);
+	/*
+	 * Only 4k granule supports level 1 block mappings.
+	 * SW table walks can't handle removal of intermediate entries.
+	 */
+	return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
+	       !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
 }

 int __init arch_ioremap_pmd_supported(void)
 {
-	return 1;
+	/* See arch_ioremap_pud_supported() */
+	return !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
 }

 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
--
2.7.4
[PATCH V4 4/4] arm64/mm: Enable memory hot remove
The arch code for hot-remove must tear down portions of the linear map and vmemmap corresponding to memory being removed. In both cases the page tables mapping these regions must be freed, and when sparse vmemmap is in use the memory backing the vmemmap must also be freed.

This patch adds a new remove_pagetable() helper which can be used to tear down either region, and calls it from vmemmap_free() and ___remove_pgd_mapping(). The sparse_vmap argument determines whether the backing memory will be freed.

While freeing intermediate level page table pages, bail out if any of its entries are still valid. This can happen for a partially filled kernel page table, either from a previously failed memory hot add attempt, or while removing an address range which does not span the entire page table page range.

The vmemmap region may share levels of table with the vmalloc region. Take the kernel ptl so that we can safely free potentially-shared tables.

While here, update arch_add_memory() to handle __add_pages() failures by just unmapping the recently added kernel linear mapping. Now enable memory hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired by the kernel page table tear down procedure on the x86 architecture.
Signed-off-by: Anshuman Khandual --- arch/arm64/Kconfig | 3 + arch/arm64/mm/mmu.c | 212 +++- 2 files changed, 213 insertions(+), 2 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 4780eb7..ce24427 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -267,6 +267,9 @@ config HAVE_GENERIC_GUP config ARCH_ENABLE_MEMORY_HOTPLUG def_bool y +config ARCH_ENABLE_MEMORY_HOTREMOVE + def_bool y + config SMP def_bool y diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index a1bfc44..0cf0d41 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -733,6 +733,187 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(pte)); } + +#ifdef CONFIG_MEMORY_HOTPLUG +static void free_hotplug_page_range(struct page *page, ssize_t size) +{ + WARN_ON(PageReserved(page)); + free_pages((unsigned long)page_address(page), get_order(size)); +} + +static void free_hotplug_pgtable_page(struct page *page) +{ + free_hotplug_page_range(page, PAGE_SIZE); +} + +static void free_pte_table(pte_t *ptep, pmd_t *pmdp, unsigned long addr) +{ + struct page *page; + int i; + + for (i = 0; i < PTRS_PER_PTE; i++) { + if (!pte_none(ptep[i])) + return; + } + + page = pmd_page(READ_ONCE(*pmdp)); + pmd_clear(pmdp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr) +{ + struct page *page; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 2) + return; + + for (i = 0; i < PTRS_PER_PMD; i++) { + if (!pmd_none(pmdp[i])) + return; + } + + page = pud_page(READ_ONCE(*pudp)); + pud_clear(pudp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr) +{ + struct page *page; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 3) + return; + + for (i = 0; i < PTRS_PER_PUD; i++) { + if (!pud_none(pudp[i])) + return; + } + + page = pgd_page(READ_ONCE(*pgdp)); + pgd_clear(pgdp); + 
__flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void +remove_pte_table(pmd_t *pmdp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + struct page *page; + pte_t *ptep, pte; + unsigned long start = addr; + + for (; addr < end; addr += PAGE_SIZE) { + ptep = pte_offset_kernel(pmdp, addr); + pte = READ_ONCE(*ptep); + + if (pte_none(pte)) + continue; + + WARN_ON(!pte_present(pte)); + if (sparse_vmap) { + page = pte_page(pte); + free_hotplug_page_range(page, PAGE_SIZE); + } + pte_clear(_mm, addr, ptep); + } + flush_tlb_kernel_range(start, end); +} + +static void +remove_pmd_table(pud_t *pudp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + unsigned long next; + struct page *page; + pte_t *ptep_base; + pmd_t *pmdp, pmd; + + for (; addr < end; addr = next) { + next = pmd_addr_end(addr, end); + pmdp = pmd_offset(pudp, addr); + pmd = READ_ONCE(*pmdp); + + if (pmd_none(pmd)) + continue; + +
[PATCH V4 0/4] arm64/mm: Enable memory hot remove
This series enables memory hot remove on arm64 after fixing a memblock removal ordering problem in the generic __remove_memory() and two possible arm64 platform specific kernel page table race conditions. This series is based on the latest v5.2-rc1 tag.

Testing:

Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS combinations. It is only build-tested on non-arm64 platforms.

Changes in V4:

- Implemented most of the suggestions from Mark Rutland
- Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message
- Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table()
- Used READ_ONCE() in missing instances while accessing page table entries
- s/p???_present()/p???_none() for checking valid kernel page table entries
- WARN_ON() when an entry is !p???_none() and !p???_present() at the same time
- Updated memory hot-remove commit message with additional details as suggested
- Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)

- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

Anshuman Khandual (3):
  mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

Mark Rutland (1):
  arm64/mm: Inhibit huge-vmap with ptdump

 arch/arm64/Kconfig             | 3 +
 arch/arm64/mm/mmu.c            | 223 +++++++++++++++++++++++++++++-
 arch/arm64/mm/ptdump_debugfs.c | 3 +
 mm/memory_hotplug.c            | 2 +-
 4 files changed, 225 insertions(+), 6 deletions(-)

--
2.7.4
[PATCH] intel-svm: fix typos in code comments
fix acccess to access Signed-off-by: Weitao Hou --- include/linux/intel-svm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h index e3f76315ca4d..8dfead70699c 100644 --- a/include/linux/intel-svm.h +++ b/include/linux/intel-svm.h @@ -57,7 +57,7 @@ struct svm_dev_ops { /** * intel_svm_bind_mm() - Bind the current process to a PASID - * @dev: Device to be granted acccess + * @dev: Device to be granted access * @pasid: Address for allocated PASID * @flags: Flags. Later for requesting supervisor mode, etc. * @ops: Callbacks to device driver -- 2.18.0
Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest
Bharata B Rao's on May 20, 2019 2:25 pm: > On Mon, May 20, 2019 at 12:02:23PM +1000, Michael Ellerman wrote: >> Bharata B Rao writes: >> > On Thu, May 16, 2019 at 07:44:20PM +0530, srikanth wrote: >> >> Hello, >> >> >> >> On power9 host, performing memory hotunplug from ppc64le guest results in >> >> kernel oops. >> >> >> >> Kernel used : https://github.com/torvalds/linux/tree/v5.1 built using >> >> ppc64le_defconfig for host and ppc64le_guest_defconfig for guest. >> >> >> >> Recreation steps: >> >> >> >> 1. Boot a guest with below mem configuration: >> >> 33554432 >> >> 8388608 >> >> 4194304 >> >> >> >> >> >> >> >> >> >> >> >> >> >> 2. From host hotplug 8G memory -> verify memory hotadded succesfully -> >> >> now >> >> reboot guest -> once guest comes back try to unplug 8G memory >> >> >> >> mem.xml used: >> >> >> >> >> >> 8 >> >> 0 >> >> >> >> >> >> >> >> Memory attach and detach commands used: >> >> virsh attach-device vm1 ./mem.xml --live >> >> virsh detach-device vm1 ./mem.xml --live >> >> >> >> Trace seen inside guest after unplug, guest just hangs there forever: >> >> >> >> [ 21.962986] kernel BUG at arch/powerpc/mm/pgtable-frag.c:113! 
>> >> [ 21.963064] Oops: Exception in kernel mode, sig: 5 [#1] >> >> [ 21.963090] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA >> >> pSeries >> >> [ 21.963131] Modules linked in: xt_tcpudp iptable_filter squashfs fuse >> >> vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi >> >> scsi_transport_iscsi >> >> ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress >> >> lzo_compress >> >> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx >> >> xor raid6_pq multipath crc32c_vpmsum >> >> [ 21.963281] CPU: 11 PID: 316 Comm: kworker/u64:5 Kdump: loaded Not >> >> tainted 5.1.0-dirty #2 >> >> [ 21.963323] Workqueue: pseries hotplug workque pseries_hp_work_fn >> >> [ 21.963355] NIP: c0079e18 LR: c0c79308 CTR: >> >> 8000 >> >> [ 21.963392] REGS: c003f88034f0 TRAP: 0700 Not tainted >> >> (5.1.0-dirty) >> >> [ 21.963422] MSR: 8282b033 >> >> CR: >> >> 28002884 XER: 2004 >> >> [ 21.963470] CFAR: c0c79304 IRQMASK: 0 >> >> [ 21.963470] GPR00: c0c79308 c003f8803780 c1521000 >> >> 00fff8c0 >> >> [ 21.963470] GPR04: 0001 ffe30005 0005 >> >> 0020 >> >> [ 21.963470] GPR08: 0001 c00a00fff8e0 >> >> c16d21a0 >> >> [ 21.963470] GPR12: c16e7b90 c7ff2700 c00a00a0 >> >> c003ffe30100 >> >> [ 21.963470] GPR16: c003ffe3 c14aa4de c00a009f >> >> c16d21b0 >> >> [ 21.963470] GPR20: c14de588 0001 c16d21b8 >> >> c00a00a0 >> >> [ 21.963470] GPR24: c00a00a0 >> >> c003ffe96000 >> >> [ 21.963470] GPR28: c00a00a0 c00a00a0 c003fffec000 >> >> c00a00fff8c0 >> >> [ 21.963802] NIP [c0079e18] pte_fragment_free+0x48/0xd0 >> >> [ 21.963838] LR [c0c79308] remove_pagetable+0x49c/0x5b4 >> >> [ 21.963873] Call Trace: >> >> [ 21.963890] [c003f8803780] [c003ffe997f0] 0xc003ffe997f0 >> >> (unreliable) >> >> [ 21.963933] [c003f88037b0] [] (null) >> >> [ 21.963969] [c003f88038c0] [c006f038] >> >> vmemmap_free+0x218/0x2e0 >> >> [ 21.964006] [c003f8803940] [c036f100] >> >> sparse_remove_one_section+0xd0/0x138 >> >> [ 21.964050] [c003f8803980] [c0383a50] >> >> 
__remove_pages+0x410/0x560 >> >> [ 21.964093] [c003f8803a90] [c0c784d8] >> >> arch_remove_memory+0x68/0xdc >> >> [ 21.964136] [c003f8803ad0] [c0385d74] >> >> __remove_memory+0xc4/0x110 >> >> [ 21.964180] [c003f8803b10] [c00d44e4] >> >> dlpar_remove_lmb+0x94/0x140 >> >> [ 21.964223] [c003f8803b50] [c00d52b4] >> >> dlpar_memory+0x464/0xd00 >> >> [ 21.964259] [c003f8803be0] [c00cd5c0] >> >> handle_dlpar_errorlog+0xc0/0x190 >> >> [ 21.964303] [c003f8803c50] [c00cd6bc] >> >> pseries_hp_work_fn+0x2c/0x60 >> >> [ 21.964346] [c003f8803c80] [c013a4a0] >> >> process_one_work+0x2b0/0x5a0 >> >> [ 21.964388] [c003f8803d10] [c013a818] >> >> worker_thread+0x88/0x610 >> >> [ 21.964434] [c003f8803db0] [c0143884] kthread+0x1a4/0x1b0 >> >> [ 21.964468] [c003f8803e20] [c000bdc4] >> >> ret_from_kernel_thread+0x5c/0x78 >> >> [ 21.964506] Instruction dump: >> >> [ 21.964527] fbe1fff8 f821ffd1 78638502 78633664 ebe9 7fff1a14 >> >> 395f0020 813f0020 >> >> [ 21.964569] 7d2907b4 7d2900d0 79290fe0 69290001 <0b09> 7c0004ac >> >> 7d205028 3129 >> >> [ 21.964613] ---[ end trace aaa571aa1636fee6 ]--- >> >> [ 21.966349] >> >> [ 21.966383] Sending IPI to other CPUs >> >> [
[PATCH] fddi: fix typos in code comments
fix abord to abort Signed-off-by: Weitao Hou --- drivers/net/fddi/skfp/hwmtm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/fddi/skfp/hwmtm.c b/drivers/net/fddi/skfp/hwmtm.c index abbe309051d9..3d0f417e8586 100644 --- a/drivers/net/fddi/skfp/hwmtm.c +++ b/drivers/net/fddi/skfp/hwmtm.c @@ -1206,7 +1206,7 @@ void process_receive(struct s_smc *smc) } /* * SUPERNET 3 Bug: FORMAC delivers status words -* of aborded frames to the BMU +* of aborted frames to the BMU */ if (len <= 4) { DB_RX(2, "Frame length = 0"); @@ -1343,7 +1343,7 @@ void process_receive(struct s_smc *smc) break ; default : /* -* unknown FC abord the frame +* unknown FC abort the frame */ DB_RX(2, "unknown FC error"); smt_free_mbuf(smc,mb) ; -- 2.18.0
[PATCH] mm/failslab: By default, do not fail allocations with direct reclaim only
When failslab was originally written, the intention of the "ignore-gfp-wait" flag default value ("N") was to fail GFP_ATOMIC allocations. Those were defined as (__GFP_HIGH), and the code would test for __GFP_WAIT (0x10u). However, since then, __GFP_WAIT was replaced by __GFP_RECLAIM (___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM), and GFP_ATOMIC is now defined as (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM). This means that when the flag is false, almost no allocation ever fails (as even GFP_ATOMIC allocations contain __GFP_KSWAPD_RECLAIM). Restore the original intent of the code, by ignoring calls that directly reclaim only (___GFP_DIRECT_RECLAIM), and thus, failing GFP_ATOMIC calls again by default. Fixes: 71baba4b92dc1fa1 ("mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM") Signed-off-by: Nicolas Boichat --- mm/failslab.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/failslab.c b/mm/failslab.c index ec5aad211c5be97..33efcb60e633c0a 100644 --- a/mm/failslab.c +++ b/mm/failslab.c @@ -23,7 +23,8 @@ bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags) if (gfpflags & __GFP_NOFAIL) return false; - if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM)) + if (failslab.ignore_gfp_reclaim && + (gfpflags & ___GFP_DIRECT_RECLAIM)) return false; if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB)) -- 2.21.0.1020.gf2820cf01a-goog
Re: [PATCH] vfio: vfio_pci_nvlink2: use a vma helper function
On 17/05/2019 00:48, Alex Williamson wrote: > [Cc Alexey + kvm] > > On Thu, 16 May 2019 20:38:26 +0800 > "richard.p...@oppo.com" wrote: > >> Use a vma helper function to simply code. >> >> Signed-off-by: Peng Hao >> --- >> drivers/vfio/pci/vfio_pci_nvlink2.c | 3 +-- >> 1 file changed, 1 insertion(+), 2 deletions(-) >> >> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c >> b/drivers/vfio/pci/vfio_pci_nvlink2.c >> index 32f695ffe128..dc42aa0e47f6 100644 >> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c >> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c >> @@ -161,8 +161,7 @@ static int vfio_pci_nvgpu_mmap(struct vfio_pci_device >> *vdev, >> >> atomic_inc(>mm->mm_count); >> ret = (int) mm_iommu_newdev(data->mm, data->useraddr, >> -(vma->vm_end - vma->vm_start) >> PAGE_SHIFT, >> -data->gpu_hpa, >mem); >> +vma_pages(vma), data->gpu_hpa, >mem); I did not realize we have been having this mighty helper since 2005 :) Reviewed-by: Alexey Kardashevskiy >> >> trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr, >> vma->vm_end - vma->vm_start, ret); >> -- >> 2.20.1 -- Alexey
Re: [PATCH 6/8] PM / OPP: Support adjusting OPP voltages at runtime
On 16-05-19, 17:08, Andrew-sh.Cheng wrote:
> From: Stephen Boyd
>
> On some SoCs the Adaptive Voltage Scaling (AVS) technique is
> employed to optimize the operating voltage of a device. At a
> given frequency, the hardware monitors dynamic factors and either
> makes a suggestion for how much to adjust a voltage for the
> current frequency, or it automatically adjusts the voltage
> without software intervention. Add an API to the OPP library for
> the former case, so that AVS type devices can update the voltages
> for an OPP when the hardware determines the voltage should
> change. The assumption is that drivers like CPUfreq or devfreq
> will register for the OPP notifiers and adjust the voltage
> according to suggestions that AVS makes.
>
> This patch is derived from [1] submitted by Stephen.
> [1] https://lore.kernel.org/patchwork/patch/599279/
>
> Signed-off-by: Stephen Boyd
> Signed-off-by: Roger Lu
> ---
>  drivers/opp/core.c     | 78 ++
>  include/linux/pm_opp.h | 11 +++
>  2 files changed, 89 insertions(+)

This is an RCU-based implementation, which was removed from the OPP core long back. Please align this with the latest changes.

--
viresh
Re: [PATCH 4/8] dt-bindings: devfreq: add compatible for mt8183 cci devfreq
On 16-05-19, 17:08, Andrew-sh.Cheng wrote: > From: "Andrew-sh.Cheng" > > This adds dt-binding documentation of cci devfreq > for Mediatek MT8183 SoC platform. > > Signed-off-by: Andrew-sh.Cheng > --- > .../bindings/devfreq/mt8183-cci-devfreq.txt | 20 > > 1 file changed, 20 insertions(+) > create mode 100644 > Documentation/devicetree/bindings/devfreq/mt8183-cci-devfreq.txt > > diff --git a/Documentation/devicetree/bindings/devfreq/mt8183-cci-devfreq.txt > b/Documentation/devicetree/bindings/devfreq/mt8183-cci-devfreq.txt > new file mode 100644 > index ..3189902902e0 > --- /dev/null > +++ b/Documentation/devicetree/bindings/devfreq/mt8183-cci-devfreq.txt > @@ -0,0 +1,20 @@ > +* Mediatek Cache Coherent Interconnect(CCI) frequency device > + > +Required properties: > +- compatible: should contain "mediatek,mt8183-cci" for frequency scaling of > CCI Example doesn't have this compatible . > +- clocks: for frequency scaling of CCI > +- clock-names: for frequency scaling of CCI driver to reference > +- regulator: for voltage scaling of CCI > +- operating-points-v2: for frequency scaling of CCI opp table > + > +Example: > + cci: cci { > + compatible = "mediatek,cci"; > + clocks = < CLK_APMIXED_CCIPLL>; > + clock-names = "cci_clock"; > + operating-points-v2 = <_opp>; > + }; > + > + { > + proc-supply = <_vproc12_reg>; > + }; > \ No newline at end of file > -- > 2.12.5 -- viresh
Re: [EXT] Re: [PATCH 1/3] enetc: add hardware timestamping support
On Mon, May 20, 2019 at 03:20:23AM +, Y.b. Lu wrote: > > > +config FSL_ENETC_HW_TIMESTAMPING > > > + bool "ENETC hardware timestamping support" > > > + depends on FSL_ENETC || FSL_ENETC_VF > > > + help > > > + Enable hardware timestamping support on the Ethernet packets > > > + using the SO_TIMESTAMPING API. Because the RX BD ring dynamic > > > + allocation hasn't been supported and it's too expensive to use > > > > s/it's/it is/ > > [Y.b. Lu] Will modify it. BTW, may I know what's the purpose of dropping > single quote character? For searching, script checking, or something else? Simply because "it's" is informal speech, but the Kconfig help is formal technical documentation. (Or at least it should be!) Thanks, Richard
Re: [RFC 5/5] cpufreq: add driver for Raspbery Pi
On 17-05-19, 17:35, Nicolas Saenz Julienne wrote: > Raspberry Pi's firmware offers and interface though which update it's > performance requirements. It allows us to request for specific runtime > frequencies, which the firmware might or might not respect, depending on > the firmware configuration and thermals. > > As the maximum and minimum frequencies are configurable in the firmware > there is no way to know in advance their values. So the Raspberry Pi > cpufreq driver queries them, builds an opp frequency table to then > launch cpufreq-dt. > > Signed-off-by: Nicolas Saenz Julienne > --- > drivers/cpufreq/Kconfig.arm | 8 +++ > drivers/cpufreq/Makefile | 1 + > drivers/cpufreq/raspberrypi-cpufreq.c | 79 +++ > 3 files changed, 88 insertions(+) > create mode 100644 drivers/cpufreq/raspberrypi-cpufreq.c > > diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm > index 179a1d302f48..70e5f13f7632 100644 > --- a/drivers/cpufreq/Kconfig.arm > +++ b/drivers/cpufreq/Kconfig.arm > @@ -308,3 +308,11 @@ config ARM_PXA2xx_CPUFREQ > This add the CPUFreq driver support for Intel PXA2xx SOCs. > > If in doubt, say N. > + > +config ARM_RASPBERRYPI_CPUFREQ > + tristate "Raspberry Pi cpufreq support" > + depends on RASPBERRYPI_FIRMWARE || (RASPBERRYPI_FIRMWARE=n && > COMPILE_TEST) What about: depends on RASPBERRYPI_FIRMWARE || COMPILE_TEST > + help > + This adds the CPUFreq driver for Raspberry Pi > + > + If in doubt, say N. 
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile > index 689b26c6f949..02678e9b2ff5 100644 > --- a/drivers/cpufreq/Makefile > +++ b/drivers/cpufreq/Makefile > @@ -84,6 +84,7 @@ obj-$(CONFIG_ARM_TEGRA124_CPUFREQ) += tegra124-cpufreq.o > obj-$(CONFIG_ARM_TEGRA186_CPUFREQ) += tegra186-cpufreq.o > obj-$(CONFIG_ARM_TI_CPUFREQ) += ti-cpufreq.o > obj-$(CONFIG_ARM_VEXPRESS_SPC_CPUFREQ) += vexpress-spc-cpufreq.o > +obj-$(CONFIG_ARM_RASPBERRYPI_CPUFREQ)+= raspberrypi-cpufreq.o > > > > ## > diff --git a/drivers/cpufreq/raspberrypi-cpufreq.c > b/drivers/cpufreq/raspberrypi-cpufreq.c > new file mode 100644 > index ..53cb3e5a8457 > --- /dev/null > +++ b/drivers/cpufreq/raspberrypi-cpufreq.c > @@ -0,0 +1,79 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Raspberry Pi cpufreq driver > + * > + * Copyright (C) 2019, Nicolas Saenz Julienne > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +static const struct of_device_id machines[] __initconst = { > + { .compatible = "raspberrypi,3-model-b-plus" }, > + { .compatible = "raspberrypi,3-model-b" }, > + { /* sentinel */ } > +}; > + > +static int __init raspberrypi_cpufreq_driver_init(void) > +{ > + struct platform_device *pdev; > + struct cpumask shared_cpus; > + struct device *cpu_dev; > + struct clk *clk; > + long min, max; > + long rate; > + int ret; > + > + if (!of_match_node(machines, of_root)) > + return -ENODEV; > + > + cpu_dev = get_cpu_device(0); > + if (!cpu_dev) { > + pr_err("Cannot get CPU for cpufreq driver\n"); > + return -ENODEV; > + } > + > + clk = clk_get(cpu_dev, 0); > + if (IS_ERR(clk)) { > + dev_err(cpu_dev, "Cannot get clock for CPU0\n"); > + return PTR_ERR(clk); > + } You want to do a clk_put() somewhere ? 
> + > + /* > + * The max and min frequencies are configurable in the Raspberry Pi > + * firmware, so we query them at runtime > + */ > + min = clk_round_rate(clk, 0); > + max = clk_round_rate(clk, ULONG_MAX); > + > + for (rate = min; rate < max; rate += 1) { > + ret = dev_pm_opp_add(cpu_dev, rate, 0); > + if (ret) > + return ret; > + } > + > + ret = dev_pm_opp_add(cpu_dev, max, 0); > + if (ret) > + return ret; On errors, you should remove all previously added OPPs. > + > + cpumask_setall(_cpus); > + dev_pm_opp_set_sharing_cpus(cpu_dev, _cpus); Why are these required? This must be done by the cpufreq-dt driver anyway ? > + > + pdev = platform_device_register_data(NULL, "cpufreq-dt", -1, NULL, 0); > + ret = PTR_ERR_OR_ZERO(pdev); > + if (ret) > + dev_err(cpu_dev, "Failed to create platform device, %d\n", ret); > + > + return ret; > +} > + > +late_initcall(raspberrypi_cpufreq_driver_init); > + > +MODULE_AUTHOR("Nicolas Saenz Julienne +MODULE_DESCRIPTION("Raspberry Pi cpufreq driver"); > +MODULE_LICENSE("GPL v2"); > -- > 2.21.0 -- viresh
[PATCH] kvm: vmx: Fix -Wmissing-prototypes warnings
We get a warning when build kernel W=1: arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes] void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp) Add the missing declaration to fix this. Signed-off-by: Yi Wang --- arch/x86/kvm/vmx/vmx.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index f879529..9cd72de 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -314,6 +314,7 @@ struct kvm_vmx { void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu); struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr); void pt_update_intercept_for_msr(struct vcpu_vmx *vmx); +void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp); #define POSTED_INTR_ON 0 #define POSTED_INTR_SN 1 -- 1.8.3.1
Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest
On Mon, May 20, 2019 at 12:02:23PM +1000, Michael Ellerman wrote: > Bharata B Rao writes: > > On Thu, May 16, 2019 at 07:44:20PM +0530, srikanth wrote: > >> Hello, > >> > >> On power9 host, performing memory hotunplug from ppc64le guest results in > >> kernel oops. > >> > >> Kernel used : https://github.com/torvalds/linux/tree/v5.1 built using > >> ppc64le_defconfig for host and ppc64le_guest_defconfig for guest. > >> > >> Recreation steps: > >> > >> 1. Boot a guest with below mem configuration: > >> 33554432 > >> 8388608 > >> 4194304 > >> > >> > >> > >> > >> > >> > >> 2. From host hotplug 8G memory -> verify memory hotadded succesfully -> now > >> reboot guest -> once guest comes back try to unplug 8G memory > >> > >> mem.xml used: > >> > >> > >> 8 > >> 0 > >> > >> > >> > >> Memory attach and detach commands used: > >> virsh attach-device vm1 ./mem.xml --live > >> virsh detach-device vm1 ./mem.xml --live > >> > >> Trace seen inside guest after unplug, guest just hangs there forever: > >> > >> [ 21.962986] kernel BUG at arch/powerpc/mm/pgtable-frag.c:113! 
> >> [ 21.963064] Oops: Exception in kernel mode, sig: 5 [#1] > >> [ 21.963090] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA > >> pSeries > >> [ 21.963131] Modules linked in: xt_tcpudp iptable_filter squashfs fuse > >> vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi > >> scsi_transport_iscsi > >> ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress > >> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx > >> xor raid6_pq multipath crc32c_vpmsum > >> [ 21.963281] CPU: 11 PID: 316 Comm: kworker/u64:5 Kdump: loaded Not > >> tainted 5.1.0-dirty #2 > >> [ 21.963323] Workqueue: pseries hotplug workque pseries_hp_work_fn > >> [ 21.963355] NIP: c0079e18 LR: c0c79308 CTR: > >> 8000 > >> [ 21.963392] REGS: c003f88034f0 TRAP: 0700 Not tainted > >> (5.1.0-dirty) > >> [ 21.963422] MSR: 8282b033 > >> CR: > >> 28002884 XER: 2004 > >> [ 21.963470] CFAR: c0c79304 IRQMASK: 0 > >> [ 21.963470] GPR00: c0c79308 c003f8803780 c1521000 > >> 00fff8c0 > >> [ 21.963470] GPR04: 0001 ffe30005 0005 > >> 0020 > >> [ 21.963470] GPR08: 0001 c00a00fff8e0 > >> c16d21a0 > >> [ 21.963470] GPR12: c16e7b90 c7ff2700 c00a00a0 > >> c003ffe30100 > >> [ 21.963470] GPR16: c003ffe3 c14aa4de c00a009f > >> c16d21b0 > >> [ 21.963470] GPR20: c14de588 0001 c16d21b8 > >> c00a00a0 > >> [ 21.963470] GPR24: c00a00a0 > >> c003ffe96000 > >> [ 21.963470] GPR28: c00a00a0 c00a00a0 c003fffec000 > >> c00a00fff8c0 > >> [ 21.963802] NIP [c0079e18] pte_fragment_free+0x48/0xd0 > >> [ 21.963838] LR [c0c79308] remove_pagetable+0x49c/0x5b4 > >> [ 21.963873] Call Trace: > >> [ 21.963890] [c003f8803780] [c003ffe997f0] 0xc003ffe997f0 > >> (unreliable) > >> [ 21.963933] [c003f88037b0] [] (null) > >> [ 21.963969] [c003f88038c0] [c006f038] > >> vmemmap_free+0x218/0x2e0 > >> [ 21.964006] [c003f8803940] [c036f100] > >> sparse_remove_one_section+0xd0/0x138 > >> [ 21.964050] [c003f8803980] [c0383a50] > >> __remove_pages+0x410/0x560 > >> [ 21.964093] [c003f8803a90] [c0c784d8] > 
>> arch_remove_memory+0x68/0xdc > >> [ 21.964136] [c003f8803ad0] [c0385d74] > >> __remove_memory+0xc4/0x110 > >> [ 21.964180] [c003f8803b10] [c00d44e4] > >> dlpar_remove_lmb+0x94/0x140 > >> [ 21.964223] [c003f8803b50] [c00d52b4] > >> dlpar_memory+0x464/0xd00 > >> [ 21.964259] [c003f8803be0] [c00cd5c0] > >> handle_dlpar_errorlog+0xc0/0x190 > >> [ 21.964303] [c003f8803c50] [c00cd6bc] > >> pseries_hp_work_fn+0x2c/0x60 > >> [ 21.964346] [c003f8803c80] [c013a4a0] > >> process_one_work+0x2b0/0x5a0 > >> [ 21.964388] [c003f8803d10] [c013a818] > >> worker_thread+0x88/0x610 > >> [ 21.964434] [c003f8803db0] [c0143884] kthread+0x1a4/0x1b0 > >> [ 21.964468] [c003f8803e20] [c000bdc4] > >> ret_from_kernel_thread+0x5c/0x78 > >> [ 21.964506] Instruction dump: > >> [ 21.964527] fbe1fff8 f821ffd1 78638502 78633664 ebe9 7fff1a14 > >> 395f0020 813f0020 > >> [ 21.964569] 7d2907b4 7d2900d0 79290fe0 69290001 <0b09> 7c0004ac > >> 7d205028 3129 > >> [ 21.964613] ---[ end trace aaa571aa1636fee6 ]--- > >> [ 21.966349] > >> [ 21.966383] Sending IPI to other CPUs > >> [ 21.978335] IPI complete > >> [ 21.981354] kexec: Starting switchover sequence. > >> I'm in purgatory > > > > git bisect points to > > > > commit
Re: [PATCH v3 3/3] i2c-ocores: sifive: add polling mode workaround for FU540-C000 SoC
Hi Andrew, On Thu, May 16, 2019 at 6:37 PM Andrew Lunn wrote: > > On Thu, May 16, 2019 at 10:38:40AM +0530, Sagar Shrikant Kadam wrote: > > The i2c-ocore driver already has a polling mode interface.But it needs > > a workaround for FU540 Chipset on HiFive unleashed board (RevA00). > > There is an erratum in FU540 chip that prevents interrupt driven i2c > > transfers from working, and also the I2C controller's interrupt bit > > cannot be cleared if set, due to this the existing i2c polling mode > > interface added in mainline earlier doesn't work, and CPU stall's > > infinitely, when-ever i2c transfer is initiated. > > > > Ref:previous polling mode support in mainline > > > > commit 69c8c0c0efa8 ("i2c: ocores: add polling interface") > > > > The workaround / fix under OCORES_FLAG_BROKEN_IRQ is particularly for > > FU540-COOO SoC. > > > > Signed-off-by: Sagar Shrikant Kadam > > --- > > drivers/i2c/busses/i2c-ocores.c | 34 -- > > 1 file changed, 28 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/i2c/busses/i2c-ocores.c > > b/drivers/i2c/busses/i2c-ocores.c > > index aee1d86..00ee45c 100644 > > --- a/drivers/i2c/busses/i2c-ocores.c > > +++ b/drivers/i2c/busses/i2c-ocores.c > > @@ -27,6 +27,7 @@ > > #include > > > > #define OCORES_FLAG_POLL BIT(0) > > +#define OCORES_FLAG_BROKEN_IRQ BIT(2) /* Broken IRQ in HiFive Unleashed */ > > Hi Sigar > > BIT(1). Don't leave a gap. I will remove the gap and update this in V4. Thanks, Sagar > > Andrew
arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of function 'uaccess_kernel'; did you mean 'getname_kernel'?
Hi Matt, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: a188339ca5a396acc588e5851ed7e19f66b0ebd9 commit: 7df95299b94a63ec67a6389fc02dc25019a80ee8 staging: kpc2000: Add DMA driver date: 4 weeks ago config: xtensa-allmodconfig (attached as .config) compiler: xtensa-linux-gcc (GCC) 8.1.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 7df95299b94a63ec67a6389fc02dc25019a80ee8 # save the attached .config to linux build tree GCC_VERSION=8.1.0 make.cross ARCH=xtensa If you fix the issue, kindly add following tag Reported-by: kbuild test robot All errors (new ones prefixed by >>): In file included from drivers/staging/kpc2000/kpc_dma/fileops.c:11: arch/xtensa/include/asm/uaccess.h: In function 'clear_user': >> arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of >> function 'uaccess_kernel'; did you mean 'getname_kernel'? 
>> [-Werror=implicit-function-declaration] #define __kernel_ok (uaccess_kernel()) ^~ arch/xtensa/include/asm/uaccess.h:43:34: note: in expansion of macro '__kernel_ok' #define __access_ok(addr, size) (__kernel_ok || __user_ok((addr), (size))) ^~~ arch/xtensa/include/asm/uaccess.h:44:31: note: in expansion of macro '__access_ok' #define access_ok(addr, size) __access_ok((unsigned long)(addr), (size)) ^~~ arch/xtensa/include/asm/uaccess.h:271:6: note: in expansion of macro 'access_ok' if (access_ok(addr, size)) ^ In file included from include/linux/printk.h:330, from include/linux/kernel.h:15, from include/linux/list.h:9, from include/linux/module.h:9, from drivers/staging/kpc2000/kpc_dma/fileops.c:2: drivers/staging/kpc2000/kpc_dma/fileops.c: In function 'kpc_dma_transfer': drivers/staging/kpc2000/kpc_dma/fileops.c:58:35: warning: format '%ld' expects argument of type 'long int', but argument 7 has type 'size_t' {aka 'unsigned int'} [-Wformat=] dev_dbg(>ldev->pldev->dev, "kpc_dma_transfer(priv = [%p], kcb = [%p], iov_base = [%p], iov_len = %ld) ldev = [%p]\n", priv, kcb, (void*)iov_base, iov_len, ldev); ^ include/linux/dynamic_debug.h:118:15: note: in definition of macro '__dynamic_func_call' func(, ##__VA_ARGS__); \ ^~~ include/linux/dynamic_debug.h:150:2: note: in expansion of macro '_dynamic_func_call' _dynamic_func_call(fmt,__dynamic_dev_dbg, \ ^~ include/linux/device.h:1493:2: note: in expansion of macro 'dynamic_dev_dbg' dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__) ^~~ include/linux/device.h:1493:23: note: in expansion of macro 'dev_fmt' dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__) ^~~ drivers/staging/kpc2000/kpc_dma/fileops.c:58:2: note: in expansion of macro 'dev_dbg' dev_dbg(>ldev->pldev->dev, "kpc_dma_transfer(priv = [%p], kcb = [%p], iov_base = [%p], iov_len = %ld) ldev = [%p]\n", priv, kcb, (void*)iov_base, iov_len, ldev); ^~~ cc1: some warnings being treated as errors -- In file included from drivers/staging//kpc2000/kpc_dma/fileops.c:11: 
arch/xtensa/include/asm/uaccess.h: In function 'clear_user': >> arch/xtensa/include/asm/uaccess.h:40:22: error: implicit declaration of >> function 'uaccess_kernel'; did you mean 'getname_kernel'? >> [-Werror=implicit-function-declaration] #define __kernel_ok (uaccess_kernel()) ^~ arch/xtensa/include/asm/uaccess.h:43:34: note: in expansion of macro '__kernel_ok' #define __access_ok(addr, size) (__kernel_ok || __user_ok((addr), (size))) ^~~ arch/xtensa/include/asm/uaccess.h:44:31: note: in expansion of macro '__access_ok' #define access_ok(addr, size) __access_ok((unsigned long)(addr), (size)) ^~~ arch/xtensa/include/asm/uaccess.h:271:6: note: in expansion of macro 'access_ok' if (access_ok(addr, size)) ^ In file included from include/linux/printk.h:330, from include/linux/kernel.h:15, from include/linux/list.h:9, from include/linux/module.h:9, from drivers/staging//kpc2000/kpc_dma/fileops.c:2: drivers/staging//kpc2000/kpc_dma/fileops.c: In function 'kpc_dma_transfer': drivers/staging//kpc2000/kpc_dma/fileops.c:58:35: warning: format '%ld' expects argument of type 'long int', but
drivers//pinctrl/pinctrl-stmfx.c:652:17: error: 'struct gpio_chip' has no member named 'of_node'
Hi Amelie, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: a188339ca5a396acc588e5851ed7e19f66b0ebd9 commit: 1490d9f841b186664f9d3ca213dcfa4464a60680 pinctrl: Add STMFX GPIO expander Pinctrl/GPIO driver date: 10 days ago config: um-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: git checkout 1490d9f841b186664f9d3ca213dcfa4464a60680 # save the attached .config to linux build tree make ARCH=um If you fix the issue, kindly add following tag Reported-by: kbuild test robot All errors (new ones prefixed by >>): drivers//pinctrl/pinctrl-stmfx.c: In function 'stmfx_pinctrl_probe': >> drivers//pinctrl/pinctrl-stmfx.c:652:17: error: 'struct gpio_chip' has no >> member named 'of_node' pctl->gpio_chip.of_node = np; ^ vim +652 drivers//pinctrl/pinctrl-stmfx.c 588 589 static int stmfx_pinctrl_probe(struct platform_device *pdev) 590 { 591 struct stmfx *stmfx = dev_get_drvdata(pdev->dev.parent); 592 struct device_node *np = pdev->dev.of_node; 593 struct stmfx_pinctrl *pctl; 594 u32 n; 595 int irq, ret; 596 597 pctl = devm_kzalloc(stmfx->dev, sizeof(*pctl), GFP_KERNEL); 598 if (!pctl) 599 return -ENOMEM; 600 601 platform_set_drvdata(pdev, pctl); 602 603 pctl->dev = >dev; 604 pctl->stmfx = stmfx; 605 606 if (!of_find_property(np, "gpio-ranges", NULL)) { 607 dev_err(pctl->dev, "missing required gpio-ranges property\n"); 608 return -EINVAL; 609 } 610 611 irq = platform_get_irq(pdev, 0); 612 if (irq <= 0) { 613 dev_err(pctl->dev, "failed to get irq\n"); 614 return -ENXIO; 615 } 616 617 mutex_init(>lock); 618 619 /* Register pin controller */ 620 pctl->pctl_desc.name = "stmfx-pinctrl"; 621 pctl->pctl_desc.pctlops = _pinctrl_ops; 622 pctl->pctl_desc.confops = _pinconf_ops; 623 pctl->pctl_desc.pins = stmfx_pins; 624 pctl->pctl_desc.npins = ARRAY_SIZE(stmfx_pins); 625 pctl->pctl_desc.owner = THIS_MODULE; 626 627 ret = devm_pinctrl_register_and_init(pctl->dev, >pctl_desc, 628 
pctl, >pctl_dev); 629 if (ret) { 630 dev_err(pctl->dev, "pinctrl registration failed\n"); 631 return ret; 632 } 633 634 ret = pinctrl_enable(pctl->pctl_dev); 635 if (ret) { 636 dev_err(pctl->dev, "pinctrl enable failed\n"); 637 return ret; 638 } 639 640 /* Register gpio controller */ 641 pctl->gpio_chip.label = "stmfx-gpio"; 642 pctl->gpio_chip.parent = pctl->dev; 643 pctl->gpio_chip.get_direction = stmfx_gpio_get_direction; 644 pctl->gpio_chip.direction_input = stmfx_gpio_direction_input; 645 pctl->gpio_chip.direction_output = stmfx_gpio_direction_output; 646 pctl->gpio_chip.get = stmfx_gpio_get; 647 pctl->gpio_chip.set = stmfx_gpio_set; 648 pctl->gpio_chip.set_config = gpiochip_generic_config; 649 pctl->gpio_chip.base = -1; 650 pctl->gpio_chip.ngpio = pctl->pctl_desc.npins; 651 pctl->gpio_chip.can_sleep = true; > 652 pctl->gpio_chip.of_node = np; 653 pctl->gpio_chip.need_valid_mask = true; 654 655 ret = devm_gpiochip_add_data(pctl->dev, >gpio_chip, pctl); 656 if (ret) { 657 dev_err(pctl->dev, "gpio_chip registration failed\n"); 658 return ret; 659 } 660 661 ret = stmfx_pinctrl_gpio_function_enable(pctl); 662 if (ret) 663 return ret; 664 665 pctl->irq_chip.name = dev_name(pctl->dev); 666 pctl->irq_chip.irq_mask = stmfx_pinctrl_irq_mask; 667 pctl->irq_chip.irq_unmask = stmfx_pinctrl_irq_unmask; 668 pctl->irq_chip.irq_set_type = stmfx_pinctrl_irq_set_type; 669 pctl->irq_chip.irq_bus_lock = stmfx_pinctrl_irq_bus_lock; 670 pctl->irq_chip.irq_bus_sync_unlock = stmfx_pinctrl_irq_bus_sync_unlock; 671 for_each_clear_bit(n, >gpio_valid_mask, pctl->gpio_chip.ngpio) 672 clear_bit(n, pctl->gpio_chip.valid_mask); 673 674 ret = gpiochip_irqchip_add_nested(>gpio_chip, >irq_chip, 6750, handle_bad_irq, IRQ_TYPE_NONE); 676 if (ret) { 677 dev_err(pctl->dev, "cannot add
ERROR: "uio_unregister_device" [drivers/staging/kpc2000/kpc2000/kpc2000.ko] undefined!
tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: a188339ca5a396acc588e5851ed7e19f66b0ebd9 commit: 7dc7967fc39af81191558f63eeaf3d2b83899b1c staging: kpc2000: add initial set of Daktronics drivers date: 4 weeks ago config: x86_64-randconfig-m3-05200349 (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: git checkout 7dc7967fc39af81191558f63eeaf3d2b83899b1c # save the attached .config to linux build tree make ARCH=x86_64 If you fix the issue, kindly add following tag Reported-by: kbuild test robot All errors (new ones prefixed by >>): >> ERROR: "uio_unregister_device" [drivers/staging/kpc2000/kpc2000/kpc2000.ko] >> undefined! >> ERROR: "__uio_register_device" [drivers/staging/kpc2000/kpc2000/kpc2000.ko] >> undefined! --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
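These undefined references mean kpc2000.ko links against the UIO core without anything guaranteeing CONFIG_UIO is enabled in this randconfig. A sketch of the usual Kconfig-level fix (the option name, prompt text, and placement are assumptions, not taken from the tree):

```kconfig
# drivers/staging/kpc2000/Kconfig (sketch; names assumed)
config KPC2000
	tristate "Daktronics KPC PCIe device drivers"
	depends on PCI
	select UIO    # kpc2000.ko calls __uio_register_device()/uio_unregister_device()
	help
	  ...
```

`select UIO` forces the dependency on; alternatively `depends on UIO` would simply hide the driver when UIO is disabled.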
Re: [PATCH v2] pstore: Add boot loader log messages support
On Wed, 15 May 2019 15:04:09 -0700 Kees Cook wrote: > Hi! > > Thanks for the reminder to review this code. :) Sorry for the delay! > > On Thu, Feb 14, 2019 at 11:49 PM Yue Hu wrote: > > > > From: Yue Hu > > > > Sometimes we hope to check boot loader log messages (e.g. Android > > Verified Boot status) when kernel is coming up. Generally it does > > depend on serial device, but it will be removed for the hardware > > shipping to market by most of manufacturers. In that case better > > solder and proper serial cable for different interface (e.g. Type-C > > or microUSB) are needed. That is inconvenient and even wasting much > > time on it. > > Can you give some examples of how this would be used on a real device? > More notes below... > > > > > Therefore, let's add a logging support: PSTORE_TYPE_XBL. > > > > Signed-off-by: Yue Hu > > --- > > v2: mention info of interacting with boot loader > > > > fs/pstore/Kconfig | 10 +++ > > fs/pstore/platform.c | 16 ++ > > fs/pstore/ram.c| 81 > > -- > > include/linux/pstore.h | 21 + > > 4 files changed, 121 insertions(+), 7 deletions(-) > > > > diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig > > index 0d19d19..ef4a2dc 100644 > > --- a/fs/pstore/Kconfig > > +++ b/fs/pstore/Kconfig > > @@ -137,6 +137,16 @@ config PSTORE_FTRACE > > > > If unsure, say N. > > > > +config PSTORE_XBL > > + bool "Log bootloader messages" > > + depends on PSTORE > > + help > > + When the option is enabled, pstore will log boot loader > > + messages to /sys/fs/pstore/xbl-ramoops-[ID] after reboot. > > + Boot loader needs to support log buffer reserved. > > + > > + If unsure, say N. 
> > + > > config PSTORE_RAM > > tristate "Log panic/oops to a RAM buffer" > > depends on PSTORE > > diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c > > index 2d1066e..2e6c3f8f 100644 > > --- a/fs/pstore/platform.c > > +++ b/fs/pstore/platform.c > > @@ -65,6 +65,7 @@ > > "mce", > > "console", > > "ftrace", > > + "xbl", > > "rtas", > > "powerpc-ofw", > > "powerpc-common", > > @@ -530,6 +531,19 @@ static void pstore_register_console(void) {} > > static void pstore_unregister_console(void) {} > > #endif > > > > +#ifdef CONFIG_PSTORE_XBL > > +static void pstore_register_xbl(void) > > +{ > > + struct pstore_record record; > > + > > + pstore_record_init(, psinfo); > > + record.type = PSTORE_TYPE_XBL; > > + psinfo->write(); > > +} > > This seems like a very strange way to get the record: this is an > "empty" write that has a side-effect of reading the XBL region and > copying it into the prz area. I would expect this to all happen in > ramoops_pstore_read() instead. > > > +#else > > +static void pstore_register_xbl(void) {} > > +#endif > > + > > static int pstore_write_user_compat(struct pstore_record *record, > > const char __user *buf) > > { > > @@ -616,6 +630,8 @@ int pstore_register(struct pstore_info *psi) > > pstore_register_ftrace(); > > if (psi->flags & PSTORE_FLAGS_PMSG) > > pstore_register_pmsg(); > > + if (psi->flags & PSTORE_FLAGS_XBL) > > + pstore_register_xbl(); > > > > /* Start watching for new records, if desired. */ > > if (pstore_update_ms >= 0) { > > diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c > > index 1adb5e3..b114b1d 100644 > > --- a/fs/pstore/ram.c > > +++ b/fs/pstore/ram.c > > @@ -56,6 +56,27 @@ > > module_param_named(pmsg_size, ramoops_pmsg_size, ulong, 0400); > > MODULE_PARM_DESC(pmsg_size, "size of user space message log"); > > How is the base address of the XBL area specified? It looks currently > like it's up to the end user to do all the math correctly to line it > up with where it's expected? 
> > > > > +/* > > + * interact with boot loader > > + * = > > + * > > + * xbl memory layout: > > + * ++ > > + * |dst | > > + * || ++ > > + * |src > | header | > > + * ++ |log messages| > > + *++ > > + * > > + * As above, src memory is used to store header and log messages generated > > + * by boot loader, pstore will copy the log messages to dst memory which > > + * has same size as src. The header in src is to record log messages size > > + * written and make xbl cookie. > > Why is such a copy needed? The log is already present in memory; why > can't pstore just use what's already there? > > > + */ > > +static ulong ramoops_xbl_size = MIN_MEM_SIZE; > > +module_param_named(xbl_size, ramoops_xbl_size, ulong, 0400); > > +MODULE_PARM_DESC(xbl_size, "size of boot loader log"); > > + > > static unsigned long long mem_address; > > module_param_hw(mem_address, ullong, other, 0400); > > MODULE_PARM_DESC(mem_address, > > @@ -88,6
Re: [PATCH v3] kbuild: check uniqueness of module names
On Mon, May 20, 2019 at 8:52 AM Stephen Rothwell wrote: > > Hi Masahiro, > > On Sat, 18 May 2019 01:07:15 +0900 Masahiro Yamada > wrote: > > > > It checks not only real modules, but also built-in modules (i.e. > > controlled by tristate CONFIG option, but currently compiled with =y). > > Non-unique names for built-in modules also cause problems because > > /sys/modules/ would fall over. > > > > I tested allmodconfig on the latest kernel, and it detected the > > following: > > A powerpc ppc64_defconfig produces: > > warning: same basename if the following are built as modules: > arch/powerpc/platforms/powermac/nvram.ko > drivers/char/nvram.ko > > Which is a false positive since > arch/powerpc/platforms/powermac/Makefile has > > # CONFIG_NVRAM is an arch. independent tristate symbol, for pmac32 we really > # need this to be a bool. Cheat here and pretend CONFIG_NVRAM=m is really > # CONFIG_NVRAM=y > obj-$(CONFIG_NVRAM:m=y) += nvram.o > > Which means that this nvram.o will never be built as a module. > -- > Cheers, > Stephen Rothwell BTW, arm64 defconfig also produces a false positive: warning: same basename if the following are built as modules: arch/arm64/lib/crc32.ko lib/crc32.ko CONFIG_CRC32 is a tristate option, but ARM64 selects CRC32. So, CRC32 is always =y. We must stop checking modules.builtin soon. Sorry about noises. -- Best Regards Masahiro Yamada
[RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
A system could have a much faster swap device like zRAM. In that case, swapping is far cheaper than file I/O on the low-end storage. In this configuration, userspace could apply a different strategy to each kind of vma. IOW, it may want to reclaim anonymous pages with MADV_COLD while keeping file-backed pages in the inactive LRU with MADV_COOL, because file I/O is more expensive in this case, so the goal is to keep them in memory until memory pressure happens. To support such a strategy more easily, this patch introduces the MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options to madvise(2), similar to the filters /proc/<pid>/clear_refs already supports. They are filters that can be ORed with other existing hints, using the top two bits of (int behavior). Once either of them is set, the hint affects only the vmas of interest, either anonymous or file-backed. With that, a user could call the process_madvise syscall with an entire range (0x0 - 0x) plus either MADV_ANONYMOUS_FILTER or MADV_FILE_FILTER, so there is no need to call the syscall range by range.
* from v1r2 * use consistent check with clear_refs to identify anon/file vma - surenb * from v1r1 * use naming "filter" for new madvise option - dancol Signed-off-by: Minchan Kim --- include/uapi/asm-generic/mman-common.h | 5 + mm/madvise.c | 14 ++ 2 files changed, 19 insertions(+) diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index b8e230de84a6..be59a1b90284 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -66,6 +66,11 @@ #define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */ #define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */ +#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER)) + +#define MADV_ANONYMOUS_FILTER (1<<31) /* works for only anonymous vma */ +#define MADV_FILE_FILTER (1<<30) /* works for only file-backed vma */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/mm/madvise.c b/mm/madvise.c index f4f569dac2bd..116131243540 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, int write; size_t len; struct blk_plug plug; + bool anon_only, file_only; + anon_only = behavior & MADV_ANONYMOUS_FILTER; + file_only = behavior & MADV_FILE_FILTER; + + if (anon_only && file_only) + return error; + + behavior = behavior & MADV_BEHAVIOR_MASK; if (!madvise_behavior_valid(behavior)) return error; @@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, if (end < tmp) tmp = end; + if (anon_only && vma->vm_file) + goto next; + if (file_only && !vma->vm_file) + goto next; + /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ error = madvise_vma(tsk, vma, , start, tmp, behavior, ); if (error) goto out; *nr_pages += pages; +next: start = tmp; if (prev && start < prev->vm_end) start = prev->vm_end; -- 2.21.0.1020.gf2820cf01a-goog
[RFC 4/7] mm: factor out madvise's core functionality
This patch factor out madvise's core functionality so that upcoming patch can reuse it without duplication. It shouldn't change any behavior. Signed-off-by: Minchan Kim --- mm/madvise.c | 168 +++ 1 file changed, 89 insertions(+), 79 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 9a6698b56845..119e82e1f065 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, return 0; } -static long madvise_dontneed_free(struct vm_area_struct *vma, +static long madvise_dontneed_free(struct task_struct *tsk, + struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, int behavior) @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, if (!userfaultfd_remove(vma, start, end)) { *prev = NULL; /* mmap_sem has been dropped, prev is stale */ - down_read(>mm->mmap_sem); - vma = find_vma(current->mm, start); + down_read(>mm->mmap_sem); + vma = find_vma(tsk->mm, start); if (!vma) return -ENOMEM; if (start < vma->vm_start) { @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, * Application wants to free up the pages and associated backing store. * This is effectively punching a hole into the middle of a file. 
*/ -static long madvise_remove(struct vm_area_struct *vma, +static long madvise_remove(struct task_struct *tsk, + struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end) { @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma, get_file(f); if (userfaultfd_remove(vma, start, end)) { /* mmap_sem was not released by userfaultfd_remove() */ - up_read(>mm->mmap_sem); + up_read(>mm->mmap_sem); } error = vfs_fallocate(f, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, end - start); fput(f); - down_read(>mm->mmap_sem); + down_read(>mm->mmap_sem); return error; } @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, #endif static long -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, - unsigned long start, unsigned long end, int behavior) +madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end, int behavior) { switch (behavior) { case MADV_REMOVE: - return madvise_remove(vma, prev, start, end); + return madvise_remove(tsk, vma, prev, start, end); case MADV_WILLNEED: return madvise_willneed(vma, prev, start, end); case MADV_COOL: @@ -930,7 +933,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, return madvise_cold(vma, start, end); case MADV_FREE: case MADV_DONTNEED: - return madvise_dontneed_free(vma, prev, start, end, behavior); + return madvise_dontneed_free(tsk, vma, prev, start, + end, behavior); default: return madvise_behavior(vma, prev, start, end, behavior); } @@ -974,68 +978,8 @@ madvise_behavior_valid(int behavior) } } -/* - * The madvise(2) system call. - * - * Applications can use madvise() to advise the kernel how it should - * handle paging I/O in this VM area. The idea is to help the kernel - * use appropriate read-ahead and caching techniques. 
The information - * provided is advisory only, and can be safely disregarded by the - * kernel without affecting the correct operation of the application. - * - * behavior values: - * MADV_NORMAL - the default behavior is to read clusters. This - * results in some read-ahead and read-behind. - * MADV_RANDOM - the system should read the minimum amount of data - * on any access, since it is unlikely that the appli- - * cation will need more than what it asks for. - * MADV_SEQUENTIAL - pages in the given range will probably be accessed - * once, so they can be aggressively read ahead, and - * can be freed soon after they are accessed. - * MADV_WILLNEED - the application is notifying the system to read - * some pages ahead. - * MADV_DONTNEED - the application is finished with the given range, - * so the kernel can free resources
[RFC 3/7] mm: introduce MADV_COLD
When a process expects no accesses to a certain memory range for a long time, it can hint to the kernel that the pages can be reclaimed instantly but that their data should be preserved for future use. This could reduce working-set eviction and so end up increasing performance. This patch introduces the new MADV_COLD hint to the madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used for a long time. The hint can help the kernel decide which pages to evict proactively. Internally, it works by reclaiming memory in the context of the process that called the syscall. If a page is dirty and the backing storage is not a synchronous device, the written page will be rotated back to the LRU's tail once the write is done, so it will be reclaimed easily when memory pressure happens. If the backing storage is a synchronous device (e.g., zram), the page will be reclaimed instantly. Signed-off-by: Minchan Kim --- include/linux/swap.h | 1 + include/uapi/asm-generic/mman-common.h | 1 + mm/madvise.c | 123 + mm/vmscan.c| 74 +++ 4 files changed, 199 insertions(+) diff --git a/include/linux/swap.h b/include/linux/swap.h index 64795abea003..7f32a948fc6a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -365,6 +365,7 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages; +extern unsigned long reclaim_pages(struct list_head *page_list); #ifdef CONFIG_NUMA extern int node_reclaim_mode; extern int sysctl_min_unmapped_ratio; diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index f7a4a5d4b642..b9b51eeb8e1a 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -43,6 +43,7 @@ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_DONTNEED 4 /* don't need these pages */ #define MADV_COOL 5 /* deactivatie these pages */ +#define MADV_COLD 6 /* reclaim these pages */ /* common parameters: try to
keep these consistent across architectures */ #define MADV_FREE 8 /* free pages only if memory pressure */ diff --git a/mm/madvise.c b/mm/madvise.c index c05817fb570d..9a6698b56845 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior) case MADV_WILLNEED: case MADV_DONTNEED: case MADV_COOL: + case MADV_COLD: case MADV_FREE: return 0; default: @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma, return 0; } +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) +{ + pte_t *orig_pte, *pte, ptent; + spinlock_t *ptl; + LIST_HEAD(page_list); + struct page *page; + int isolated = 0; + struct vm_area_struct *vma = walk->vma; + unsigned long next; + + next = pmd_addr_end(addr, end); + if (pmd_trans_huge(*pmd)) { + spinlock_t *ptl; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (!ptl) + return 0; + + if (is_huge_zero_pmd(*pmd)) + goto huge_unlock; + + page = pmd_page(*pmd); + if (page_mapcount(page) > 1) + goto huge_unlock; + + if (next - addr != HPAGE_PMD_SIZE) { + int err; + + get_page(page); + spin_unlock(ptl); + lock_page(page); + err = split_huge_page(page); + unlock_page(page); + put_page(page); + if (!err) + goto regular_page; + return 0; + } + + if (isolate_lru_page(page)) + goto huge_unlock; + + list_add(>lru, _list); +huge_unlock: + spin_unlock(ptl); + reclaim_pages(_list); + return 0; + } + + if (pmd_trans_unstable(pmd)) + return 0; +regular_page: + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ); + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) { + ptent = *pte; + if (!pte_present(ptent)) + continue; + + page = vm_normal_page(vma, addr, ptent); + if (!page) + continue; + + if (page_mapcount(page) > 1) + continue; + + if (isolate_lru_page(page)) + continue; + + isolated++; + list_add(>lru, _list); +
[RFC 5/7] mm: introduce external memory hinting API
There are use cases where a centralized userspace daemon wants to give a memory hint like MADV_[COOL|COLD] to another process. Android's ActivityManagerService is one of them. It's similar in spirit to madvise(MADV_DONTNEED), but the information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon (ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall, process_madvise(2), which works based on a pidfd so it can give a hint to an external process. int process_madvise(int pidfd, void *addr, size_t length, int advise); All hints madvise provides can be supported in process_madvise, too. Since it can affect another process's address range, only a process with the right to ptrace the target, such as a privileged process (CAP_SYS_PTRACE) or one with the same UID, can use it successfully. Please suggest a better idea if you have other thoughts about the permission model.
* from v1r1 * use ptrace capability - surenb, dancol Signed-off-by: Minchan Kim --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/proc_fs.h| 1 + include/linux/syscalls.h | 2 ++ include/uapi/asm-generic/unistd.h | 2 ++ kernel/signal.c| 2 +- kernel/sys_ni.c| 1 + mm/madvise.c | 45 ++ 8 files changed, 54 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 4cd5f982b1e5..5b9dd55d6b57 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -438,3 +438,4 @@ 425i386io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup 426i386io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter 427i386io_uring_register sys_io_uring_register __ia32_sys_io_uring_register +428i386process_madvise sys_process_madvise __ia32_sys_process_madvise diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 64ca0d06259a..0e5ee78161c9 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -355,6 +355,7 @@ 425common io_uring_setup __x64_sys_io_uring_setup 426common io_uring_enter __x64_sys_io_uring_enter 427common io_uring_register __x64_sys_io_uring_register +428common process_madvise __x64_sys_process_madvise # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index 52a283ba0465..f8545d7c5218 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file) #endif /* CONFIG_PROC_FS */ +extern struct pid *pidfd_to_pid(const struct file *file); struct net; static inline struct proc_dir_entry *proc_net_mkdir( diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe1be5b..21c6c9a62006 100644 --- a/include/linux/syscalls.h +++ 
b/include/linux/syscalls.h @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void); asmlinkage long sys_mincore(unsigned long start, size_t len, unsigned char __user * vec); asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start, + size_t len, int behavior); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index dee7292e1df6..7ee82ce04620 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) #define __NR_io_uring_register 427 __SYSCALL(__NR_io_uring_register, sys_io_uring_register) +#define __NR_process_madvise 428 +__SYSCALL(__NR_process_madvise, sys_process_madvise) #undef __NR_syscalls #define __NR_syscalls 428 diff --git a/kernel/signal.c b/kernel/signal.c index 1c86b78a7597..04e75daab1f8 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) return copy_siginfo_from_user(kinfo, info); } -static struct pid *pidfd_to_pid(const struct file *file) +struct pid *pidfd_to_pid(const struct file *file) { if (file->f_op == _fops)
[RFC 6/7] mm: extend process_madvise syscall to support vector array
Currently, the process_madvise syscall works on only one address range, so the user has to call the syscall several times to give hints to multiple address ranges. This patch extends the process_madvise syscall to support multiple hints, address ranges and return values so the user can give all the hints at once. struct pr_madvise_param { int size; /* the size of this structure */ const struct iovec __user *vec; /* address range array */ } int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, struct pr_madvise_param *results, struct pr_madvise_param *ranges, unsigned long flags); - pidfd target process fd - nr_elem the number of elements of the behavior, results and ranges arrays - behavior hints for each address range in the remote process, so the user can give a different hint for each range - results array of buffers to receive the results of the associated remote address range actions - ranges array of buffers holding the remote process's address ranges to be processed - flags extra argument for the future. It should be zero for now.
Example)

struct pr_madvise_param {
	int size;
	const struct iovec *vec;
};

int main(int argc, char *argv[])
{
	struct pr_madvise_param retp, rangep;
	struct iovec result_vec[2], range_vec[2];
	int hints[2];
	long ret[2];
	void *addr[2];
	pid_t pid;
	char cmd[64] = {0,};

	addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
		       MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	if (MAP_FAILED == addr[0])
		return 1;
	addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
		       MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	if (MAP_FAILED == addr[1])
		return 1;

	hints[0] = MADV_COLD;
	range_vec[0].iov_base = addr[0];
	range_vec[0].iov_len = ALLOC_SIZE;
	result_vec[0].iov_base = &ret[0];
	result_vec[0].iov_len = sizeof(long);
	retp.vec = result_vec;
	retp.size = sizeof(struct pr_madvise_param);

	hints[1] = MADV_COOL;
	range_vec[1].iov_base = addr[1];
	range_vec[1].iov_len = ALLOC_SIZE;
	result_vec[1].iov_base = &ret[1];
	result_vec[1].iov_len = sizeof(long);
	rangep.vec = range_vec;
	rangep.size = sizeof(struct pr_madvise_param);

	pid = fork();
	if (!pid) {
		sleep(10);
	} else {
		snprintf(cmd, sizeof(cmd), "/proc/%d", pid);
		int pidfd = open(cmd, O_DIRECTORY | O_CLOEXEC);

		if (pidfd < 0)
			return 1;

		/* munmap to make pages private for the child */
		munmap(addr[0], ALLOC_SIZE);
		munmap(addr[1], ALLOC_SIZE);
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
		if (syscall(__NR_process_madvise, pidfd, 2, hints,
			    &retp, &rangep, 0))
			perror("process_madvise fail\n");
		system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
	}
	return 0;
}

Signed-off-by: Minchan Kim
---
 include/uapi/asm-generic/mman-common.h |   5 +
 mm/madvise.c                           | 184 ++++++++++++++++++++++++++---
 2 files changed, 166 insertions(+), 23 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index b9b51eeb8e1a..b8e230de84a6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -74,4 +74,9 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
				 PKEY_DISABLE_WRITE)

+struct pr_madvise_param {
+	int size;			/* the size of this structure */
+	const struct iovec __user *vec;	/* address range array */
+};
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index af02aa17e5c1..f4f569dac2bd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct vm_area_struct *vma = walk->vma;
 	unsigned long next;
+	long nr_pages = 0;

 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 		ptep_test_and_clear_young(vma, addr, pte);
 		deactivate_page(page);
+		nr_pages++;
+
 	}

 	pte_unmap_unlock(orig_pte, ptl);
+	*(long *)walk->private += nr_pages;
 	cond_resched();

 	return 0;
@@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,

 static void madvise_cool_page_range(struct mmu_gather *tlb,
			struct vm_area_struct *vma,
-			unsigned long addr,
[RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
The local variable references in shrink_page_list defaults to PAGEREF_RECLAIM_CLEAN in order to prevent reclaiming dirty pages when CMA tries to migrate pages. Strictly speaking, we don't need it because CMA doesn't allow write-out, via .may_writepage = 0 in reclaim_clean_pages_from_list. Moreover, it prevents anonymous pages from being swapped out even though force_reclaim = true in shrink_page_list in an upcoming patch. So this patch makes the default value of references PAGEREF_RECLAIM and renames force_reclaim to skip_reference_check to make it clearer. This is preparatory work for the next patch.

Signed-off-by: Minchan Kim
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d9c3e873eca6..a28e5d17b495 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
				      struct scan_control *sc,
				      enum ttu_flags ttu_flags,
				      struct reclaim_stat *stat,
-				      bool force_reclaim)
+				      bool skip_reference_check)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct address_space *mapping;
 		struct page *page;
 		int may_enter_fs;
-		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		enum page_references references = PAGEREF_RECLAIM;
 		bool dirty, writeback;

 		cond_resched();
@@ -1248,7 +1248,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}

-		if (!force_reclaim)
+		if (!skip_reference_check)
 			references = page_check_references(page, sc);

 		switch (references) {
-- 
2.21.0.1020.gf2820cf01a-goog
[RFC 1/7] mm: introduce MADV_COOL
When a process expects no accesses to a certain memory range, it can hint the kernel that the pages can be reclaimed when memory pressure happens, but that the data should be preserved for future use. This can reduce working-set eviction, so it ends up increasing performance.

This patch introduces the new MADV_COOL hint to the madvise(2) syscall. MADV_COOL can be used by a process to mark a memory range as not expected to be used in the near future. The hint helps the kernel decide which pages to evict early during memory pressure.

Internally, it works by deactivating memory from the active list to the head of the inactive list, so that when memory pressure happens, those pages will be reclaimed earlier than other active pages unless they are accessed again before then.

* v1r2
 * use clear_page_young in deactivate_page - joelaf

* v1r1
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim
---
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++++
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 112 +++++++++++++++++++++
 mm/swap.c                              |  43 ++++++++
 6 files changed, 173 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
 	SetPageYoung(page);
 }

+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }

+static void clear_page_young(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..64795abea003 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f7a4..f7a4a5d4b642 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -42,6 +42,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COOL	5		/* deactivate these pages */

 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..c05817fb570d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -8,6 +8,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -40,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COOL:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +309,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }

+static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		spinlock_t *ptl;
+
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if
[RFC 0/7] introduce memory hinting API for external process
- Background

The Android terminology for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster, so we are trying to make hot starts more likely than cold starts.

To increase hot starts, Android userspace manages the order in which apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd (the low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements, but those improvements will vary based on the memory requirements of individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidates for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed (we measured the performance of swapping out vs. zapping the memory by killing a process; unsurprisingly, zapping is 10x faster even though we use zram, which is much faster than real storage), so kills from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel's LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it gives the platform many chances to use its richer information to optimize memory efficiency.

To achieve the goal, the patchset introduces two new options for madvise. One is MADV_COOL, which will deactivate activated pages, and the other is MADV_COLD, which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in that it hints the kernel that the memory region is not currently needed and should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in that it hints the kernel that the memory region is not currently needed and should be reclaimed when memory pressure rises.

This approach is similar in spirit to madvise(MADV_DONTNEED), but the information required to make the reclaim decision is not known to the app. Instead, it is known to a centralized userspace daemon, and that daemon must be able to initiate reclaim on its own without any app involvement.
To address this concern, this patch introduces a new syscall:

struct pr_madvise_param {
	int size;
	const struct iovec *vec;
}

int process_madvise(int pidfd, ssize_t nr_elem,
		    int *behavior,
		    struct pr_madvise_param *results,
		    struct pr_madvise_param *ranges,
		    unsigned long flags);

The syscall gets a pidfd to give hints to an external process and provides a pair of results/ranges vector arguments so that it can give several hints for different address ranges all at once.

I guess others have different ideas about the naming of the syscall and options, so feel free to suggest better naming.

- Experiment

We did a bunch of testing with several hundred real users on Android, not an artificial benchmark. We saw about a 17% reduction in cold starts without any significant battery or app startup latency issues. And with an artificial benchmark which launches and switches apps, we saw on average a 7% app launching improvement, 18% fewer lmkd kills and good stats from vmstat.

A is vanilla and B is process_madvise.

                     A        B       delta    ratio(%)
allocstall_dma       0        0       0        0.00
allocstall_movable   1464457  -1007   -69.00
allocstall_normal    263210   190763  -72447
Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver
> > > > On 5/16/19 10:35 PM, Pankaj Gupta wrote: > > > Can I take it your reviewed/acked-by? or tested-by tag? for the virtio > > > patch :)I don't feel that I have enough expertise to give the reviewed-by > > > tag, but you can > > take my acked-by + tested-by. > > > > Acked-by: Jakub Staron > > Tested-by: Jakub Staron > > > > No kernel panics/stalls encountered during testing this patches (v9) with > > QEMU + xfstests. > > Thank you for testing and confirming the results. I will add your tested & > acked-by in v10. > > > Some CPU stalls encountered while testing with crosvm instead of QEMU with > > xfstests > > (test generic/464) but no repro for QEMU, so the fault may be on the side > > of > > crosvm. > > yes, looks like crosvm related as we did not see any of this in my and your > testing with Qemu. Also, they don't seem to be related with virtio-pmem. Thanks, Pankaj
[PATCH v2] tipc: Avoid copying bytes beyond the supplied data
TLV_SET is called with a data pointer and a len parameter that tells us how many bytes are pointed to by data. When invoking memcpy() we need to be careful to only copy len bytes. Previously we would copy TLV_LENGTH(len) bytes, which would copy an extra 4 bytes past the end of the data pointer, which newer GCC versions complain about:

  In file included from test.c:17:
  In function 'TLV_SET',
      inlined from 'test' at test.c:186:5:
  /usr/include/linux/tipc_config.h:317:3: warning: 'memcpy' forming offset [33, 36] is out of the bounds [0, 32] of object 'bearer_name' with type 'char[32]' [-Warray-bounds]
     memcpy(TLV_DATA(tlv_ptr), data, tlv_len);
     ^~~~
  test.c: In function 'test':
  test.c:161:10: note: 'bearer_name' declared here
     char bearer_name[TIPC_MAX_BEARER_NAME];
     ^~~

We still want to ensure any padding bytes at the end are initialised; do this with an explicit memset() rather than copying bytes past the end of data. Apply the same logic to TCM_SET.

Signed-off-by: Chris Packham
---
Changes in v2:
- Ensure padding bytes are initialised in both TLV_SET and TCM_SET

 include/uapi/linux/tipc_config.h | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/tipc_config.h b/include/uapi/linux/tipc_config.h
index 4b2c93b1934c..4955e1a9f1bc 100644
--- a/include/uapi/linux/tipc_config.h
+++ b/include/uapi/linux/tipc_config.h
@@ -307,8 +307,10 @@ static inline int TLV_SET(void *tlv, __u16 type, void *data, __u16 len)
 	tlv_ptr = (struct tlv_desc *)tlv;
 	tlv_ptr->tlv_type = htons(type);
 	tlv_ptr->tlv_len = htons(tlv_len);
-	if (len && data)
-		memcpy(TLV_DATA(tlv_ptr), data, tlv_len);
+	if (len && data) {
+		memcpy(TLV_DATA(tlv_ptr), data, len);
+		memset(TLV_DATA(tlv_ptr) + len, 0, TLV_SPACE(len) - tlv_len);
+	}
 	return TLV_SPACE(len);
 }

@@ -405,8 +407,10 @@ static inline int TCM_SET(void *msg, __u16 cmd, __u16 flags,
 	tcm_hdr->tcm_len = htonl(msg_len);
 	tcm_hdr->tcm_type = htons(cmd);
 	tcm_hdr->tcm_flags = htons(flags);
-	if (data_len && data)
+	if (data_len && data) {
 		memcpy(TCM_DATA(msg), data, data_len);
+		memset(TCM_DATA(msg) + data_len, 0, TCM_SPACE(data_len) - msg_len);
+	}
 	return TCM_SPACE(data_len);
 }
-- 
2.21.0
Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver
> On 5/16/19 10:35 PM, Pankaj Gupta wrote: > > Can I take it your reviewed/acked-by? or tested-by tag? for the virtio > > patch :)I don't feel that I have enough expertise to give the reviewed-by > > tag, but you can > take my acked-by + tested-by. > > Acked-by: Jakub Staron > Tested-by: Jakub Staron > > No kernel panics/stalls encountered during testing this patches (v9) with > QEMU + xfstests. Thank you for testing and confirming the results. I will add your tested & acked-by in v10. > Some CPU stalls encountered while testing with crosvm instead of QEMU with > xfstests > (test generic/464) but no repro for QEMU, so the fault may be on the side of > crosvm. yes, looks like crosvm related as we did not see any of this in my and your testing with Qemu. > > > The dump for the crosvm/xfstests stall: > [ 2504.175276] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: > [ 2504.176681] rcu: 0-...!: (1 GPs behind) idle=9b2/1/0x4000 > softirq=1089198/1089202 fqs=0 > [ 2504.178270] rcu: 2-...!: (1 ticks this GP) > idle=cfe/1/0x4002 softirq=1055108/1055110 fqs=0 > [ 2504.179802] rcu: 3-...!: (1 GPs behind) idle=1d6/1/0x4002 > softirq=1046798/1046802 fqs=0 > [ 2504.181215] rcu: 4-...!: (2 ticks this GP) > idle=522/1/0x4002 softirq=1249063/1249064 fqs=0 > [ 2504.182625] rcu: 5-...!: (1 GPs behind) idle=6da/1/0x4000 > softirq=1131036/1131047 fqs=0 > [ 2504.183955] (detected by 3, t=0 jiffies, g=1232529, q=1370) > [ 2504.184762] Sending NMI from CPU 3 to CPUs 0: > [ 2504.186400] NMI backtrace for cpu 0 > [ 2504.186401] CPU: 0 PID: 6670 Comm: 464 Not tainted 5.1.0+ #1 > [ 2504.186401] Hardware name: ChromiumOS crosvm, BIOS 0 > [ 2504.186402] RIP: 0010:queued_spin_lock_slowpath+0x1c/0x1e0 > [ 2504.186402] Code: e7 89 c8 f0 44 0f b1 07 39 c1 75 dc f3 c3 0f 1f 44 00 00 > ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3 c3 f3 90 ec > 81 fe 00 01 00 00 0f 84 ab 00 00 00 81 e6 00 ff ff ff 75 44 > [ 2504.186403] RSP: 0018:c9003ee8 EFLAGS: 0002 > [ 2504.186404] RAX: 0001 
RBX: 0246 RCX: > 00404044 > [ 2504.186404] RDX: 0001 RSI: 0001 RDI: > 8244a280 > [ 2504.186405] RBP: 8244a280 R08: 000f4200 R09: > 024709ed6c32 > [ 2504.186405] R10: R11: 0001 R12: > 8244a280 > [ 2504.186405] R13: 0009 R14: 0009 R15: > > [ 2504.186406] FS: () GS:8880cc60() > knlGS: > [ 2504.186406] CS: 0010 DS: ES: CR0: 80050033 > [ 2504.186406] CR2: 7efd6b0f15d8 CR3: 0260a006 CR4: > 00360ef0 > [ 2504.186407] DR0: DR1: DR2: > > [ 2504.186407] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 2504.186407] Call Trace: > [ 2504.186408] > [ 2504.186408] _raw_spin_lock_irqsave+0x1d/0x30 > [ 2504.186408] rcu_core+0x3b6/0x740 > [ 2504.186408] ? __hrtimer_run_queues+0x133/0x280 > [ 2504.186409] ? recalibrate_cpu_khz+0x10/0x10 > [ 2504.186409] __do_softirq+0xd8/0x2e4 > [ 2504.186409] irq_exit+0xa3/0xb0 > [ 2504.186410] smp_apic_timer_interrupt+0x67/0x120 > [ 2504.186410] apic_timer_interrupt+0xf/0x20 > [ 2504.186410] > [ 2504.186410] RIP: 0010:unmap_page_range+0x47a/0x9b0 > [ 2504.186411] Code: 0f 46 46 10 49 39 6e 18 49 89 46 10 48 89 e8 49 0f 43 46 > 18 41 80 4e 20 08 4d 85 c9 49 89 46 18 0f 84 68 ff ff ff 49 8b 51 08 <48> 8d > 42 ff 83 e2 01 49 0f 44 c1 f6 40 18 01 75 38 48 ba ff 0f 00 > [ 2504.186411] RSP: 0018:c900036cbcc8 EFLAGS: 0282 ORIG_RAX: > ff13 > [ 2504.186412] RAX: RBX: 80003751d045 RCX: > 0001 > [ 2504.186413] RDX: ea0002e09288 RSI: 0269b000 RDI: > 8880b6525e40 > [ 2504.186413] RBP: 0269c000 R08: R09: > eadd4740 > [ 2504.186413] R10: ea0001755700 R11: 8880cc62d120 R12: > 02794000 > [ 2504.186414] R13: 0269b000 R14: c900036cbdf0 R15: > 8880572434d8 > [ 2504.186414] ? unmap_page_range+0x420/0x9b0 > [ 2504.186414] ? 
release_pages+0x175/0x390 > [ 2504.186414] unmap_vmas+0x7c/0xe0 > [ 2504.186415] exit_mmap+0xa4/0x190 > [ 2504.186415] mmput+0x3b/0x100 > [ 2504.186415] do_exit+0x276/0xc10 > [ 2504.186415] do_group_exit+0x35/0xa0 > [ 2504.186415] __x64_sys_exit_group+0xf/0x10 > [ 2504.186416] do_syscall_64+0x43/0x120 > [ 2504.186416] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 2504.186416] RIP: 0033:0x7efd6ae10618 > [ 2504.186416] Code: Bad RIP value. > [ 2504.186417] RSP: 002b:7ffcac9bde38 EFLAGS: 0246 ORIG_RAX: > 00e7 > [ 2504.186417] RAX: ffda RBX: RCX: > 7efd6ae10618 > [ 2504.186418] RDX: RSI: 003c RDI: > > [ 2504.186418] RBP: 7efd6b0ed8e0 R08: 00e7 R09: >
linux-next: stats
Hi all,

As usual, the executive friendly graph is at
http://neuling.org/linux-next-size.html :-)

(No merge commits counted, next-20190507 was the first linux-next after
the merge window opened.)

Commits in v5.2-rc1 (relative to v5.1):         12064
Commits in next-20190507:                       11388
Commits with the same SHA1:                     10638
Commits with the same patch_id:                   350 (1)
Commits with the same subject line:                61 (1)

(1) not counting those in the lines above.

So commits in -rc1 that were in next-20190507:  11049  91%

Some breakdown of the list of extra commits (relative to next-20190507)
in -rc1:

Top ten first word of commit summary:
     58 net
     46 afs
     44 perf
     35 kvm
     35 drm
     34 x86
     34 tools
     33 documentation
     25 thermal
     23 drivers

Top ten authors:
     50 dhowe...@redhat.com
     28 yamada.masah...@socionext.com
     28 tstoya...@vmware.com
     27 changbin...@intel.com
     21 amit.kuche...@linaro.org
     20 jlay...@kernel.org
     16 h...@lst.de
     15 bgolaszew...@baylibre.com
     14 del...@gmx.de
     13 yuehaib...@huawei.com

Top ten commiters:
    105 da...@davemloft.net
     73 a...@redhat.com
     59 edubez...@gmail.com
     50 lee.jo...@linaro.org
     50 dhowe...@redhat.com
     42 pbonz...@redhat.com
     31 yamada.masah...@socionext.com
     31 j...@ziepe.ca
     31 cor...@lwn.net
     29 idryo...@gmail.com

There are also 339 commits in next-20190507 that didn't make it into
v5.2-rc1.

Top ten first word of commit summary:
     27 drm
     20 xtensa
     20 mm
     20 arm
     19 nvmem
     19 i2c
     15 selftests
     14 ntb
     11 nfc
     10 arm64

Top ten authors:
     18 jcmvb...@gmail.com
     17 a...@linux-foundation.org
     12 wsa+rene...@sang-engineering.com
     12 evan.q...@amd.com
     10 stefan.wah...@i2se.com
     10 li...@rasmusvillemoes.dk
      9 zo...@linux.ibm.com
      9 log...@deltatee.com
      7 jean-philippe.bruc...@arm.com
      6 tadeusz.st...@intel.com

Some of Andrew's patches are fixes for other patches in his tree (and
have been merged into those).

Top ten commiters:
    103 s...@canb.auug.org.au
     23 jcmvb...@gmail.com
     23 alexander.deuc...@amd.com
     22 srinivas.kandaga...@linaro.org
     15 w...@the-dreams.de
     15 jarkko.sakki...@linux.intel.com
     14 jdma...@kudzu.us
     12 stefan.wah...@i2se.com
     11 sa...@linux.intel.com
     11 m...@redhat.com

Those commits by me are from the quilt series (mainly Andrew's mmotm
tree).

-- 
Cheers,
Stephen Rothwell
[PATCH] MAINTAINERS: Add mailing list for csky architecture
From: Guo Ren

Add the newly created linux-c...@vger.kernel.org mailing list for patch
reviews and discussions.

Signed-off-by: Guo Ren
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5cfbea4..b5fadcc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3474,6 +3474,7 @@ F:	sound/pci/oxygen/

 C-SKY ARCHITECTURE
 M:	Guo Ren
+L:	linux-c...@vger.kernel.org
 T:	git https://github.com/c-sky/csky-linux.git
 S:	Supported
 F:	arch/csky/
-- 
2.7.4
RE: [PATCH 1/3] enetc: add hardware timestamping support
Hi, > -Original Message- > From: Claudiu Manoil > Sent: Thursday, May 16, 2019 11:31 PM > To: Richard Cochran ; Y.b. Lu > > Cc: net...@vger.kernel.org; David Miller ; Shawn > Guo ; Rob Herring ; > devicet...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org > Subject: RE: [PATCH 1/3] enetc: add hardware timestamping support > > > >-Original Message- > >From: Richard Cochran > >Sent: Thursday, May 16, 2019 5:33 PM > >To: Y.b. Lu > >Cc: net...@vger.kernel.org; David Miller ; Claudiu > >Manoil ; Shawn Guo ; > Rob > >Herring ; devicet...@vger.kernel.org; linux-arm- > >ker...@lists.infradead.org; linux-kernel@vger.kernel.org > >Subject: Re: [PATCH 1/3] enetc: add hardware timestamping support > > > >On Thu, May 16, 2019 at 09:59:08AM +, Y.b. Lu wrote: > > > [...] > > > >> static bool enetc_clean_tx_ring(struct enetc_bdr *tx_ring, int > >> napi_budget) { > >>struct net_device *ndev = tx_ring->ndev; > >> + struct enetc_ndev_priv *priv = netdev_priv(ndev); > >>int tx_frm_cnt = 0, tx_byte_cnt = 0; > >>struct enetc_tx_swbd *tx_swbd; > >> + union enetc_tx_bd *txbd; > >> + bool do_tstamp; > >>int i, bds_to_clean; > >> + u64 tstamp = 0; > > > >Please keep in reverse Christmas tree order as much as possible: > > For the xmass tree part, Yangbo, better move the priv and txbd declarations > inside the scope of the if() {} block where they are actually used, i.e.: > > if (unlikely(tx_swbd->check_wb)) { > struct enetc_ndev_priv *priv = netdev_priv(ndev); > union enetc_tx_bd *txbd; > [...] > } > [Y.b. Lu] Will do that. 
> >	union enetc_tx_bd *txbd;
> >	int i, bds_to_clean;
> >	bool do_tstamp;
> >	u64 tstamp = 0;
> >
> >>	i = tx_ring->next_to_clean;
> >>	tx_swbd = &tx_ring->tx_swbd[i];
> >>	bds_to_clean = enetc_bd_ready_count(tx_ring, i);
> >>
> >> +	do_tstamp = false;
> >> +
> >>	while (bds_to_clean && tx_frm_cnt < ENETC_DEFAULT_TX_WORK) {
> >>		bool is_eof = !!tx_swbd->skb;
> >>
> >> +		if (unlikely(tx_swbd->check_wb)) {
> >> +			txbd = ENETC_TXBD(*tx_ring, i);
> >> +
> >> +			if (!(txbd->flags & ENETC_TXBD_FLAGS_W))
> >> +				goto no_wb;
> >> +
> >> +			if (tx_swbd->do_tstamp) {
> >> +				enetc_get_tx_tstamp(&priv->si->hw, txbd,
> >> +						    &tstamp);
> >> +				do_tstamp = true;
> >> +			}
> >> +		}
> >> +no_wb:
> >
> >This goto seems strange and unnecessary. How about this instead?
> >
> >	if (txbd->flags & ENETC_TXBD_FLAGS_W &&
> >	    tx_swbd->do_tstamp) {
> >		enetc_get_tx_tstamp(&priv->si->hw, txbd, &tstamp);
> >		do_tstamp = true;
> >	}
> >
> 
> Absolutely, somehow I missed this. I guess the intention was to be able
> to support multiple if() blocks for the writeback case (W flag set) but
> the code is much better off without the goto.

[Y.b. Lu] Will use this to support the current single tstamp writeback case.

> >>	enetc_unmap_tx_buff(tx_ring, tx_swbd);
> >>	if (is_eof) {
> >> +		if (unlikely(do_tstamp)) {
> >> +			enetc_tstamp_tx(tx_swbd->skb, tstamp);
> >> +			do_tstamp = false;
> >> +		}
> >>		napi_consume_skb(tx_swbd->skb, napi_budget);
> >>		tx_swbd->skb = NULL;
> >>	}
> >> @@ -167,6 +169,11 @@ struct enetc_cls_rule {
> >>
> >> #define ENETC_MAX_BDR_INT	2 /* fixed to max # of available cpus */
> >>
> >> +enum enetc_hw_features {
> >
> >This is a poor choice of name. It sounds like it describes HW
> >capabilities, but you use it to track whether a feature is requested at
> >run time.
> > > >> + ENETC_F_RX_TSTAMP = BIT(0), > >> + ENETC_F_TX_TSTAMP = BIT(1), > >> +}; > >> + > >> struct enetc_ndev_priv { > >>struct net_device *ndev; > >>struct device *dev; /* dma-mapping device */ @@ -178,6 +185,7 @@ > >> struct enetc_ndev_priv { > >>u16 rx_bd_count, tx_bd_count; > >> > >>u16 msg_enable; > >> + int hw_features; > > > >This is also poorly named. How about "tstamp_request" instead? > > > > This ndev_priv variable was intended to gather flags for all the active h/w > related features, i.e. keeping count of what h/w offloads are enabled for the > current device (at least for those that don't have already a netdev_features_t > flag). > I wouldn't waste an int for 2 timestamp flags, I'd rather have a more generic > name. > Maybe active_offloads then? > > Anyway, the name can be changed later too, when other offloads will be > added. [Y.b. Lu] How about using active_offloads, and add TODO comments in enum enetc_active_offloads? > > Thanks, >
[PATCH] intel_th: msu: Fix unused variable warning on arm64 platform
Fix this compiler warning on the arm64 platform.

Cc: Alexander Shishkin
Cc: Greg Kroah-Hartman
Signed-off-by: Shaokun Zhang
---
 drivers/hwtracing/intel_th/msu.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c
index 81bb54fa3ce8..e15ed5c308e1 100644
--- a/drivers/hwtracing/intel_th/msu.c
+++ b/drivers/hwtracing/intel_th/msu.c
@@ -780,7 +780,7 @@ static int __msc_buffer_win_alloc(struct msc_window *win,
 static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks)
 {
 	struct msc_window *win;
-	int ret = -ENOMEM, i;
+	int ret = -ENOMEM;

 	if (!nr_blocks)
 		return 0;
@@ -812,7 +812,7 @@ static int msc_buffer_win_alloc(struct msc *msc, unsigned int nr_blocks)
 		goto err_nomem;

 #ifdef CONFIG_X86
-	for (i = 0; i < ret; i++)
+	for (int i = 0; i < ret; i++)
 		/* Set the page as uncached */
 		set_memory_uc((unsigned long)msc_win_block(win, i), 1);
 #endif
@@ -860,8 +860,6 @@ static void __msc_buffer_win_free(struct msc *msc, struct msc_window *win)
  */
 static void msc_buffer_win_free(struct msc *msc, struct msc_window *win)
 {
-	int i;
-
 	msc->nr_pages -= win->nr_blocks;

 	list_del(&win->entry);
@@ -871,7 +869,7 @@ static void msc_buffer_win_free(struct msc *msc, struct msc_window *win)
 	}

 #ifdef CONFIG_X86
-	for (i = 0; i < win->nr_blocks; i++)
+	for (int i = 0; i < win->nr_blocks; i++)
 		/* Reset the page to write-back */
 		set_memory_wb((unsigned long)msc_win_block(win, i), 1);
 #endif
-- 
2.7.4
[PATCH] scsi: fix typos in code comments
fix abord to abort

Signed-off-by: Weitao Hou
---
 drivers/scsi/pm8001/pm8001_sas.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index 88eef3b18e41..3de57c5a3299 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -1181,7 +1181,7 @@ int pm8001_query_task(struct sas_task *task)
 	return rc;
 }

-/* mandatory SAM-3, still need free task/ccb info, abord the specified task */
+/* mandatory SAM-3, still need free task/ccb info, abort the specified task */
 int pm8001_abort_task(struct sas_task *task)
 {
 	unsigned long flags;
-- 
2.18.0
RE: [EXT] Re: [PATCH 1/3] enetc: add hardware timestamping support
Hi, > -Original Message- > From: Richard Cochran > Sent: Thursday, May 16, 2019 10:33 PM > To: Y.b. Lu > Cc: net...@vger.kernel.org; David Miller ; Claudiu > Manoil ; Shawn Guo ; Rob > Herring ; devicet...@vger.kernel.org; > linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org > Subject: [EXT] Re: [PATCH 1/3] enetc: add hardware timestamping support > > > On Thu, May 16, 2019 at 09:59:08AM +, Y.b. Lu wrote: > > > +config FSL_ENETC_HW_TIMESTAMPING > > + bool "ENETC hardware timestamping support" > > + depends on FSL_ENETC || FSL_ENETC_VF > > + help > > + Enable hardware timestamping support on the Ethernet packets > > + using the SO_TIMESTAMPING API. Because the RX BD ring dynamic > > + allocation hasn't been supported and it's too expensive to use > > s/it's/it is/ [Y.b. Lu] Will modify it. BTW, may I know what's the purpose of dropping single quote character? For searching, script checking, or something else? If require to not use single quote character, I will also modify some other places in Kconfig messages. > > > + extended RX BDs if timestamping isn't used, the option was used > > + to control hardware timestamping/extended RX BDs to be enabled > > + or not. > > ..., this option enables extended RX BDs in order to support hardware > timestamping. [Y.b. Lu] Will rephrase it. 
> > static bool enetc_clean_tx_ring(struct enetc_bdr *tx_ring, int napi_budget)
> > {
> > 	struct net_device *ndev = tx_ring->ndev;
> > +	struct enetc_ndev_priv *priv = netdev_priv(ndev);
> > 	int tx_frm_cnt = 0, tx_byte_cnt = 0;
> > 	struct enetc_tx_swbd *tx_swbd;
> > +	union enetc_tx_bd *txbd;
> > +	bool do_tstamp;
> > 	int i, bds_to_clean;
> > +	u64 tstamp = 0;
> 
> Please keep in reverse Christmas tree order as much as possible:
> 
> 	union enetc_tx_bd *txbd;
> 	int i, bds_to_clean;
> 	bool do_tstamp;
> 	u64 tstamp = 0;
> 
> > 	i = tx_ring->next_to_clean;
> > 	tx_swbd = &tx_ring->tx_swbd[i];
> > 	bds_to_clean = enetc_bd_ready_count(tx_ring, i);
> >
> > +	do_tstamp = false;
> > +
> > 	while (bds_to_clean && tx_frm_cnt < ENETC_DEFAULT_TX_WORK) {
> > 		bool is_eof = !!tx_swbd->skb;
> >
> > +		if (unlikely(tx_swbd->check_wb)) {
> > +			txbd = ENETC_TXBD(*tx_ring, i);
> > +
> > +			if (!(txbd->flags & ENETC_TXBD_FLAGS_W))
> > +				goto no_wb;
> > +
> > +			if (tx_swbd->do_tstamp) {
> > +				enetc_get_tx_tstamp(&priv->si->hw, txbd,
> > +						    &tstamp);
> > +				do_tstamp = true;
> > +			}
> > +		}
> > +no_wb:
> 
> This goto seems strange and unnecessary. How about this instead?
> 
> 	if (txbd->flags & ENETC_TXBD_FLAGS_W &&
> 	    tx_swbd->do_tstamp) {
> 		enetc_get_tx_tstamp(&priv->si->hw, txbd, &tstamp);
> 		do_tstamp = true;
> 	}
> 
> > 	enetc_unmap_tx_buff(tx_ring, tx_swbd);
> > 	if (is_eof) {
> > +		if (unlikely(do_tstamp)) {
> > +			enetc_tstamp_tx(tx_swbd->skb, tstamp);
> > +			do_tstamp = false;
> > +		}
> > 		napi_consume_skb(tx_swbd->skb, napi_budget);
> > 		tx_swbd->skb = NULL;
> > 	}
> > @@ -167,6 +169,11 @@ struct enetc_cls_rule {
> >
> > #define ENETC_MAX_BDR_INT	2 /* fixed to max # of available cpus */
> >
> > +enum enetc_hw_features {
> 
> This is a poor choice of name. It sounds like it describes HW
> capabilities, but you use it to track whether a feature is requested at
> run time.
> 
> > +	ENETC_F_RX_TSTAMP = BIT(0),
> > +	ENETC_F_TX_TSTAMP = BIT(1),
> > +};
> > +
> > struct enetc_ndev_priv {
> > 	struct net_device *ndev;
> > 	struct device *dev; /* dma-mapping device */
> > @@ -178,6 +185,7 @@ struct enetc_ndev_priv {
> > 	u16 rx_bd_count, tx_bd_count;
> >
> > 	u16 msg_enable;
> > +	int hw_features;
> 
> This is also poorly named. How about "tstamp_request" instead?
> 
> >
> > 	struct enetc_bdr *tx_ring[16];
> > 	struct enetc_bdr *rx_ring[16];
> 
> Thanks,
> Richard
[v3 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
A few new fields were added to mmu_gather to make TLB flush smarter for huge pages by telling what level of the page table is changed. __tlb_reset_range() is used to reset all this page table state to unchanged; it is called by the TLB flush for parallel mapping changes on the same range under a non-exclusive lock (i.e. read mmap_sem).

Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update PTEs in parallel didn't remove page tables. But the aforementioned commit may do munmap() under read mmap_sem and free page tables. This may result in a program hang on aarch64, reported by Jan Stancek. The problem can be reproduced by his test program, slightly modified below.

---8<---

static int map_size = 4096;
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size,
				PROT_WRITE | PROT_READ,
				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program may bring in parallel execution like below:

        t1                                      t2
munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
    tlb_gather_mmu()
      inc_tlb_flush_pending(tlb->mm);
    free_pgtables()
      tlb->freed_tables = 1
      tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

    tlb_finish_mmu()
      if (mm_tlb_flush_nested(tlb->mm))
        __tlb_reset_range()

__tlb_reset_range() would reset the freed_tables and cleared_* bits, but this may cause inconsistency for munmap(), which does free page tables. As a result, some architectures, e.g. aarch64, may not flush the TLB completely as expected and may be left with stale TLB entries.

Use a fullmm flush, since it yields much better performance on aarch64 and non-fullmm doesn't yield a significant difference on x86.

The original proposed fix came from Jan Stancek, who mainly debugged this issue; I just wrapped everything up together.

Reported-by: Jan Stancek
Tested-by: Jan Stancek
Suggested-by: Will Deacon
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: "Aneesh Kumar K.V"
Cc: Nadav Amit
Cc: Minchan Kim
Cc: Mel Gorman
Cc: sta...@vger.kernel.org 4.20+
Signed-off-by: Yang Shi
Signed-off-by: Jan Stancek
---
v3: Adopted fullmm flush suggestion from Will
v2: Reworked the commit log per Peter and Will
    Adopted the suggestion from Peter

 mm/mmu_gather.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..289f8cf 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries.
+	 * So flush TLB forcefully if we detect parallel PTE batching
+	 * threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables;
+	 * this needs to force flush everything in the given range.
+	 * Otherwise this may result in having stale TLB entries for some
+	 * architectures, e.g. aarch64, that can specify which level of
+	 * the TLB to flush.
Re: [PATCH 1/3] kselftest/cgroup: fix unexcepted testing failure on test_memcontrol
> Hi Alex!
>
> Sorry for the late reply, somehow I missed your e-mails.
>
> The patchset looks good to me, except a typo in subjects/commit logs:
> you probably meant "unexpected failures".
>
> Please, fix and feel free to use:
> Reviewed-by: Roman Gushchin
> for the series.

Thanks Roman!

BTW, do you know of any other test cases for cgroup2 functionality or performance?

Thanks
Alex
[RFC 2/2] ARM: dts: imx6ull/z: add fusable-node property
Add fusable-node property for OCOTP Signed-off-by: Peng Fan --- arch/arm/boot/dts/imx6ull.dtsi | 7 +++ arch/arm/boot/dts/imx6ulz.dtsi | 6 ++ 2 files changed, 13 insertions(+) diff --git a/arch/arm/boot/dts/imx6ull.dtsi b/arch/arm/boot/dts/imx6ull.dtsi index 22e4a307fa59..b616ed6ee4bf 100644 --- a/arch/arm/boot/dts/imx6ull.dtsi +++ b/arch/arm/boot/dts/imx6ull.dtsi @@ -32,6 +32,13 @@ { compatible = "fsl,imx6ull-ocotp", "syscon"; + + fusable-node = <0xc 22 + 0xc 26 + 0xc 27 +0x10 4 +0x10 5 + >; }; { diff --git a/arch/arm/boot/dts/imx6ulz.dtsi b/arch/arm/boot/dts/imx6ulz.dtsi index 0b5f1a763567..8edd9008e38b 100644 --- a/arch/arm/boot/dts/imx6ulz.dtsi +++ b/arch/arm/boot/dts/imx6ulz.dtsi @@ -19,6 +19,12 @@ }; }; + { + fusable-node = < 0x10 4 +0x10 5 + >; +}; + /delete-node/ /delete-node/ /delete-node/ -- 2.16.4
[RFC 1/2] dt-bindings: imx-ocotp: Add fusable-node property
Introduce the fusable-node property for the i.MX OCOTP driver. The property will only be used by firmware (e.g. U-Boot) to runtime-disable the nodes.

Take i.MX6ULL for example: there are several parts that only have a limited set of modules enabled, controlled by OCOTP fuses. It is not flexible to provide several dts files for the several parts; instead we could provide one device tree and let firmware runtime-disable the device tree nodes for those modules that are disabled (fused).

Signed-off-by: Peng Fan
---
Currently the NXP vendor uses U-Boot to set status to disabled for devices that cannot function,
https://source.codeaurora.org/external/imx/uboot-imx/tree/arch/arm/mach-imx/mx6/module_fuse.c?h=imx_v2018.03_4.14.98_2.0.0_ga#n149
but this approach will not work if the kernel dts node path changes.

There are two approaches to resolve this:
1. This patch adds a fusable-node property; firmware will parse the property and read the fuse to decide whether to disable the nodes or keep them enabled.
2. Another approach is to add nvmem-cells for all nodes that could be disabled (fused). Then each Linux driver would use the nvmem API to detect whether it is fused or not, or common Linux driver code would check with the nvmem API whether the device is functional.

To make it easy to work, we choose [1] here. Please advise whether it is acceptable, because the property is not used by a Linux driver in approach [1]. Or, if you prefer [2], please advise, or suggest any better solution. Thanks.
Documentation/devicetree/bindings/nvmem/imx-ocotp.txt | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/devicetree/bindings/nvmem/imx-ocotp.txt b/Documentation/devicetree/bindings/nvmem/imx-ocotp.txt index 7a999a135e56..e9a998588dbd 100644 --- a/Documentation/devicetree/bindings/nvmem/imx-ocotp.txt +++ b/Documentation/devicetree/bindings/nvmem/imx-ocotp.txt @@ -21,6 +21,8 @@ Required properties: Optional properties: - read-only: disable write access +- fusable-node: array of phandles with reg base and bit offset, this + property is used by Firmware to runtime disable nodes. Optional Child nodes: @@ -42,4 +44,7 @@ Example: tempmon_temp_grade: temp-grade@20 { reg = <0x20 4>; }; + + fusable-node = < 0x10 4 +0x10 5>; }; -- 2.16.4
Re: [PATCH RESEND 1/5] ARM: dts: imx6qdl-sabresd: Assign corresponding power supply for LDOs
On Sun, May 12, 2019 at 09:57:20AM +, Anson Huang wrote: > On i.MX6Q/DL SabreSD board, vgen5 supplies vdd1p1/vdd2p5 LDO and > sw2 supplies vdd3p0 LDO, this patch assigns corresponding power > supply for vdd1p1/vdd2p5/vdd3p0 to avoid confusion by below log: > > vdd1p1: supplied by regulator-dummy > vdd3p0: supplied by regulator-dummy > vdd2p5: supplied by regulator-dummy > > With this patch, the power supply is more accurate: > > vdd1p1: supplied by VGEN5 > vdd3p0: supplied by SW2 > vdd2p5: supplied by VGEN5 > > Signed-off-by: Anson Huang Applied all, thanks.
[PATCH v2 2/3] kselftest/cgroup: fix unexpected testing failure on test_core
The cgroup testing relies on the root cgroup's subtree_control setting. If the 'memory' controller isn't set, some test cases will fail as follows:

$ sudo ./test_core
not ok 1 test_cgcore_internal_process_constraint
ok 2 test_cgcore_top_down_constraint_enable
not ok 3 test_cgcore_top_down_constraint_disable
...

To correct this unexpected failure, this patch writes 'memory' to the subtree_control of root to get the right result.

Signed-off-by: Alex Shi
Cc: Shuah Khan
Cc: Tejun Heo
Cc: Roman Gushchin
Cc: Claudio Zumbo
Cc: Claudio
Cc: linux-kselft...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Roman Gushchin
---
 tools/testing/selftests/cgroup/test_core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index be59f9c34ea2..b1a570d7c557 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -376,6 +376,11 @@ int main(int argc, char *argv[])
 	if (cg_find_unified_root(root, sizeof(root)))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+	if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+		if (cg_write(root, "cgroup.subtree_control", "+memory"))
+			ksft_exit_skip("Failed to set root memory controller\n");
+
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
 		switch (tests[i].fn(root)) {
 		case KSFT_PASS:
--
2.19.1.856.g8858448bb
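The C check added above has a simple shell equivalent: enable the controller only when subtree_control does not already list it. The sketch below models that logic on a temp file standing in for /sys/fs/cgroup/cgroup.subtree_control, so it runs without a cgroup2 mount (the file path and starting contents are made up for the example).

```shell
# Model of the new check: enable the 'memory' controller in the root's
# subtree_control only when it is not already listed.
subtree=$(mktemp)
echo "cpu io" > "$subtree"      # 'memory' not yet enabled

if ! grep -qw memory "$subtree"; then
	# On real cgroupfs this would be:
	#   echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
	# here we just append the token to the model file.
	sed -i 's/$/ memory/' "$subtree"
fi

cat "$subtree"
```

Running the snippet twice leaves the file unchanged the second time, matching the guarded cg_write() in the patch.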
[PATCH v2 3/3] kselftest/cgroup: fix incorrect test_core skip
test_core will skip the test_cgcore_no_internal_process_constraint_on_threads test case if the 'cpu' controller is missing from root's subtree_control. In fact we need to set 'cpu' in subtree_control to make the testing meaningful:

./test_core
...
ok 4 # skip test_cgcore_no_internal_process_constraint_on_threads
...

Signed-off-by: Alex Shi
Cc: Shuah Khan
Cc: Tejun Heo
Cc: Roman Gushchin
Cc: Claudio Zumbo
Cc: Claudio
Cc: linux-kselft...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Roman Gushchin
---
 tools/testing/selftests/cgroup/test_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index b1a570d7c557..2ffc3e07d98d 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -198,7 +198,7 @@ static int test_cgcore_no_internal_process_constraint_on_threads(const char *roo
 	char *parent = NULL, *child = NULL;

 	if (cg_read_strstr(root, "cgroup.controllers", "cpu") ||
-	    cg_read_strstr(root, "cgroup.subtree_control", "cpu")) {
+	    cg_write(root, "cgroup.subtree_control", "+cpu")) {
 		ret = KSFT_SKIP;
 		goto cleanup;
 	}
--
2.19.1.856.g8858448bb
[PATCH v2 1/3] kselftest/cgroup: fix unexpected testing failure on test_memcontrol
The cgroup testing relies on the root cgroup's subtree_control setting. If the 'memory' controller isn't set, all test cases will fail as follows:

$ sudo ./test_memcontrol
not ok 1 test_memcg_subtree_control
not ok 2 test_memcg_current
ok 3 # skip test_memcg_min
not ok 4 test_memcg_low
not ok 5 test_memcg_high
not ok 6 test_memcg_max
not ok 7 test_memcg_oom_events
ok 8 # skip test_memcg_swap_max
not ok 9 test_memcg_sock
not ok 10 test_memcg_oom_group_leaf_events
not ok 11 test_memcg_oom_group_parent_events
not ok 12 test_memcg_oom_group_score_events

To correct this unexpected failure, this patch writes 'memory' to the subtree_control of root to get the right result.

Signed-off-by: Alex Shi
Cc: Shuah Khan
Cc: Roman Gushchin
Cc: Tejun Heo
Cc: Mike Rapoport
Cc: Jay Kamat
Cc: linux-kselft...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Roman Gushchin
---
 tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 6f339882a6ca..73612d604a2a 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1205,6 +1205,10 @@ int main(int argc, char **argv)
 	if (cg_read_strstr(root, "cgroup.controllers", "memory"))
 		ksft_exit_skip("memory controller isn't available\n");

+	if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+		if (cg_write(root, "cgroup.subtree_control", "+memory"))
+			ksft_exit_skip("Failed to set root memory controller\n");
+
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
 		switch (tests[i].fn(root)) {
 		case KSFT_PASS:
--
2.19.1.856.g8858448bb
Re: [PATCH v2 4/4] vfio: vfio_iommu_type1: implement VFIO_IOMMU_INFO_CAPABILITIES
Hi Pierre, I love your patch! Perhaps something to improve: [auto build test WARNING on s390/features] [also build test WARNING on v5.1 next-20190517] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Pierre-Morel/s390-pci-Exporting-access-to-CLP-PCI-function-and-PCI-group/20190520-025155 base: https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git features reproduce: # apt-get install sparse make ARCH=x86_64 allmodconfig make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' If you fix the issue, kindly add following tag Reported-by: kbuild test robot sparse warnings: (new ones prefixed by >>) >> drivers/vfio/vfio_iommu_type1.c:1707:5: sparse: sparse: symbol >> 'vfio_iommu_type1_caps' was not declared. Should it be static? Please review and possibly fold the followup patch. --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
[RFC PATCH] vfio: vfio_iommu_type1: vfio_iommu_type1_caps() can be static
Fixes: f10b2b74bbea ("vfio: vfio_iommu_type1: implement VFIO_IOMMU_INFO_CAPABILITIES") Signed-off-by: kbuild test robot --- vfio_iommu_type1.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 9435647..46a4939 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1704,8 +1704,8 @@ static int vfio_iommu_type1_zpci_grp(struct iommu_domain *domain, return ret; } -int vfio_iommu_type1_caps(struct vfio_iommu *iommu, struct vfio_info_cap *caps, - size_t size) +static int vfio_iommu_type1_caps(struct vfio_iommu *iommu, struct vfio_info_cap *caps, +size_t size) { struct vfio_domain *d; unsigned long total_size, fn_size, grp_size;
Re: [PATCH RESEND] ARM: dts: imx6ul: add clock-frequency to CPU node
On Sun, May 12, 2019 at 08:57:16AM +, Anson Huang wrote: > Add clock-frequency property to CPU node. Avoids warnings like > "/cpus/cpu@0 missing clock-frequency property" for "arm,cortex-a7". > > Signed-off-by: Anson Huang Applied, thanks.
Re: [v2 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
On 5/16/19 11:29 PM, Jan Stancek wrote: - Original Message - On Mon, May 13, 2019 at 04:01:09PM -0700, Yang Shi wrote: On 5/13/19 9:38 AM, Will Deacon wrote: On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote: diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 99740e1..469492d 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb, { /* * If there are parallel threads are doing PTE changes on same range -* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB -* flush by batching, a thread has stable TLB entry can fail to flush -* the TLB by observing pte_none|!pte_dirty, for example so flush TLB -* forcefully if we detect parallel PTE batching threads. +* under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB +* flush by batching, one thread may end up seeing inconsistent PTEs +* and result in having stale TLB entries. So flush TLB forcefully +* if we detect parallel PTE batching threads. +* +* However, some syscalls, e.g. munmap(), may free page tables, this +* needs force flush everything in the given range. Otherwise this +* may result in having stale TLB entries for some architectures, +* e.g. aarch64, that could specify flush what level TLB. */ - if (mm_tlb_flush_nested(tlb->mm)) { - __tlb_reset_range(tlb); - __tlb_adjust_range(tlb, start, end - start); + if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) { + /* +* Since we can't tell what we actually should have +* flushed, flush everything in the given range. +*/ + tlb->freed_tables = 1; + tlb->cleared_ptes = 1; + tlb->cleared_pmds = 1; + tlb->cleared_puds = 1; + tlb->cleared_p4ds = 1; + + /* +* Some architectures, e.g. ARM, that have range invalidation +* and care about VM_EXEC for I-Cache invalidation, need force +* vma_exec set. 
+*/ + tlb->vma_exec = 1; + + /* Force vma_huge clear to guarantee safer flush */ + tlb->vma_huge = 0; + + tlb->start = start; + tlb->end = end; } Whilst I think this is correct, it would be interesting to see whether or not it's actually faster than just nuking the whole mm, as I mentioned before. At least in terms of getting a short-term fix, I'd prefer the diff below if it's not measurably worse. I did a quick test with ebizzy (96 threads with 5 iterations) on my x86 VM, it shows slightly slowdown on records/s but much more sys time spent with fullmm flush, the below is the data. nofullmm fullmm ops (records/s) 225606 225119 sys (s) 0.69 1.14 It looks the slight reduction of records/s is caused by the increase of sys time. That's not what I expected, and I'm unable to explain why moving to fullmm would /increase/ the system time. I would've thought the time spent doing the invalidation would decrease, with the downside that the TLB is cold when returning back to userspace. I tried ebizzy with various parameters (malloc vs mmap, ran it for hour), but performance was very similar for both patches. So, I was looking for workload that would demonstrate the largest difference. Inspired by python xml-rpc, which can handle each request in new thread, I tried following [1]: 16 threads, each looping 100k times over: mmap(16M) touch 1 page madvise(DONTNEED) munmap(16M) This yields quite significant difference for 2 patches when running on my 46 CPU arm host. I checked it twice - applied patch, recompiled, rebooted, but numbers stayed +- couple seconds the same. Thanks for the testing. I'm a little bit surprised by the significant difference. I did the same test on my x86 VM (24 cores), they yield almost same number. Given the significant improvement on arm64 with fullmm version, I'm going to respin the patch. Does it somewhat match your expectation? 
v2 patch
--------
real    2m33.460s
user    0m3.359s
sys     15m32.307s

real    2m33.895s
user    0m2.749s
sys     16m34.500s

real    2m35.666s
user    0m3.528s
sys     15m23.377s

real    2m32.898s
user    0m2.789s
sys     16m18.801s

real    2m33.087s
user    0m3.565s
sys     16m23.815s

fullmm version
--------------
real    0m46.811s
user    0m1.596s
sys     1m47.500s

real    0m47.322s
user    0m1.803s
sys     1m48.449s

real    0m46.668s
user    0m1.508s
sys     1m47.352s

real    0m46.742s
user    0m2.007s
sys     1m47.217s

real    0m46.948s
user    0m1.785s
sys     1m47.906s

[1]
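To put a single number on the difference, the snippet below averages the five 'real' wall-clock samples of each variant (converted to seconds by hand from the timings above) and prints the ratio — roughly a 3.3x improvement for the fullmm version.

```shell
# Average the five 'real' samples for each variant and print the ratio.
# Values are the wall-clock times above, converted to seconds.
speedup=$(awk 'BEGIN {
	split("153.460 153.895 155.666 152.898 153.087", v2, " ");
	split("46.811 47.322 46.668 46.742 46.948", fm, " ");
	for (i = 1; i <= 5; i++) { sv += v2[i]; sf += fm[i]; }
	printf "%.1fx", (sv / 5) / (sf / 5);
}')
echo "$speedup"
```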
Re: [PATCH V12 RESEND 1/3] ARM: imx_v6_v7_defconfig: Add TPM PWM support by default
On Sun, May 12, 2019 at 08:51:15AM +, Anson Huang wrote: > Select CONFIG_PWM_IMX_TPM by default to support i.MX7ULP > TPM PWM. > > Signed-off-by: Anson Huang Applied all, thanks.
[PATCH] kbuild: do not check name uniqueness of builtin modules
I just thought it was a good idea to scan builtin.modules in the name uniqueness checking, but Stephen reported a false positive.

ppc64_defconfig produces:

warning: same basename if the following are built as modules:
  arch/powerpc/platforms/powermac/nvram.ko
  drivers/char/nvram.ko

..., which is a false positive because the former is never built as a module, as you see in arch/powerpc/platforms/powermac/Makefile:

# CONFIG_NVRAM is an arch. independent tristate symbol, for pmac32 we really
# need this to be a bool.  Cheat here and pretend CONFIG_NVRAM=m is really
# CONFIG_NVRAM=y
obj-$(CONFIG_NVRAM:m=y)	+= nvram.o

Since we cannot predict how tricky Makefiles are written in the wild, builtin.modules may potentially contain false positives. I do not think it is a big deal as far as kmod is concerned, but false-positive warnings in the kernel build make people upset. It is better not to do it.

Even without checking builtin.modules, we have enough (and more solid) test coverage with allmodconfig.

While I touched this part, I replaced the sed code with a neater one provided by Stephen.
Link: https://lkml.org/lkml/2019/5/19/120 Link: https://lkml.org/lkml/2019/5/19/123 Fixes: 3a48a91901c5 ("kbuild: check uniqueness of module names") Reported-by: Stephen Rothwell Signed-off-by: Masahiro Yamada --- scripts/modules-check.sh | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/scripts/modules-check.sh b/scripts/modules-check.sh index 2f659530e1ec..39e8cb36ba19 100755 --- a/scripts/modules-check.sh +++ b/scripts/modules-check.sh @@ -6,10 +6,10 @@ set -e # Check uniqueness of module names check_same_name_modules() { - for m in $(sed 's:.*/::' modules.order modules.builtin | sort | uniq -d) + for m in $(sed 's:.*/::' modules.order | sort | uniq -d) do - echo "warning: same basename if the following are built as modules:" >&2 - sed "/\/$m/!d;s:^kernel/: :" modules.order modules.builtin >&2 + echo "warning: same module names found:" >&2 + sed -n "/\/$m/s:^kernel/: :p" modules.order >&2 done } -- 2.17.1
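The retained pipeline is easy to exercise outside the build system. The snippet below runs the same duplicate-basename detection against a synthetic modules.order (the module paths are made up for the example, modeled on the false positive Stephen reported).

```shell
# Build a fake modules.order and run the duplicate-basename detection
# that modules-check.sh now performs on modules.order alone.
order=$(mktemp)
cat > "$order" <<'EOF'
kernel/drivers/char/nvram.ko
kernel/arch/powerpc/platforms/powermac/nvram.ko
kernel/fs/ext4/ext4.ko
EOF

# Strip directories, then keep only basenames that occur more than once.
dup=$(sed 's:.*/::' "$order" | sort | uniq -d)

for m in $dup
do
	echo "warning: same module names found:" >&2
	sed -n "/\/$m/s:^kernel/:  :p" "$order" >&2
done
```

Only nvram.ko is flagged; ext4.ko, with a unique basename, produces no output.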
Re: [PATCH v2 3/3] arm64: dts: nxp: frwy-ls1046a: add support for micron nor flash
On Fri, May 10, 2019 at 01:00:24PM +, Pramod Kumar wrote: > add micron nor flash support for ls1046a frwy board. > > Signed-off-by: Ashish Kumar > Signed-off-by: Pramod Kumar Prefix 'arm64: dts: frwy-ls1046a: ...' would be good enough. > --- > .../boot/dts/freescale/fsl-ls1046a-frwy.dts | 17 + > 1 file changed, 17 insertions(+) > > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > b/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > index de0d19c02944..890f07122dd0 100644 > --- a/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > @@ -113,6 +113,23 @@ > > }; > > + Unnecessary newline. Shawn > + { > + num-cs = <1>; > + bus-num = <0>; > + status = "okay"; > + > + qflash0: flash@0 { > + compatible = "jedec,spi-nor"; > + #address-cells = <1>; > + #size-cells = <1>; > + spi-max-frequency = <5000>; > + reg = <0>; > + spi-rx-bus-width = <4>; > + spi-tx-bus-width = <4>; > + }; > +}; > + > #include "fsl-ls1046-post.dtsi" > > { > -- > 2.17.1 >
RE: [PATCH 1/3] enetc: add hardware timestamping support
> -Original Message- > From: Claudiu Manoil > Sent: Thursday, May 16, 2019 9:31 PM > To: Y.b. Lu ; net...@vger.kernel.org; Richard Cochran > ; David Miller ; Shawn > Guo ; Rob Herring > Cc: devicet...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org > Subject: RE: [PATCH 1/3] enetc: add hardware timestamping support > > >-Original Message- > >From: Y.b. Lu > [...] > >Subject: [PATCH 1/3] enetc: add hardware timestamping support > > > [...] > > Hi Yangbo, > > These enetc patches targeting net-next will have to be rebased on the latest > enetc net.git commits, otherwise there will be some merge conflicts for > enetc.c > and enetc_ethtool.c. > Thanks, > Claudiu > > see > 22fb43f36006 "enetc: Add missing link state info for ethtool" > f4a0be84d73e "enetc: Fix NULL dma address unmap for Tx BD extensions" [Y.b. Lu] Thanks Claudiu. I should have noticed that. Let me send v2 after net-next is open.
Re: [PATCH v2 2/3] arm64: dts: nxp: add ls1046a-frwy board support
On Fri, May 10, 2019 at 01:00:20PM +, Pramod Kumar wrote: > ls1046afrwy board is based on nxp ls1046a SoC. > Board support's 4GB ddr memory, i2c, microSD card, > serial console,qspi nor flash,ifc nand flash,qsgmii network interface, > usb 3.0 and serdes interface to support two x1gen3 pcie interface. > > Signed-off-by: Vabhav Sharma > Signed-off-by: Pramod Kumar > --- > arch/arm64/boot/dts/freescale/Makefile| 1 + > .../boot/dts/freescale/fsl-ls1046a-frwy.dts | 156 ++ > 2 files changed, 157 insertions(+) > create mode 100644 arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > > diff --git a/arch/arm64/boot/dts/freescale/Makefile > b/arch/arm64/boot/dts/freescale/Makefile > index 13604e558dc1..84ff6995b41e 100644 > --- a/arch/arm64/boot/dts/freescale/Makefile > +++ b/arch/arm64/boot/dts/freescale/Makefile > @@ -8,6 +8,7 @@ dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1028a-qds.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1028a-rdb.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1043a-qds.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1043a-rdb.dtb > +dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1046a-frwy.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1046a-qds.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1046a-rdb.dtb > dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls1088a-qds.dtb > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > b/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > new file mode 100644 > index ..de0d19c02944 > --- /dev/null > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1046a-frwy.dts > @@ -0,0 +1,156 @@ > +// SPDX-License-Identifier: (GPL-2.0+ OR MIT) > +/* > + * Device Tree Include file for Freescale Layerscape-1046A family SoC. > + * > + * Copyright 2019 NXP. 
> + * > + */ > + > +/dts-v1/; > + > +#include "fsl-ls1046a.dtsi" > + > +/ { > + model = "LS1046A FRWY Board"; > + compatible = "fsl,ls1046a-frwy", "fsl,ls1046a"; > + > + aliases { > + serial0 = > + serial1 = > + serial2 = > + serial3 = > + }; > + > + chosen { > + stdout-path = "serial0:115200n8"; > + }; > + > + sb_3v3: regulator-sb3v3 { > + compatible = "regulator-fixed"; > + regulator-name = "LT8642SEV-3.3V"; > + regulator-min-microvolt = <330>; > + regulator-max-microvolt = <330>; > + regulator-boot-on; > + regulator-always-on; > + }; > +}; > + > + { > + status = "okay"; > +}; > + > + { > + status = "okay"; > +}; > + > + { > + status = "okay"; > +}; > + > + { > + status = "okay"; > +}; > + > + { > + status = "okay"; > + > + i2c-mux@77 { > + compatible = "nxp,pca9546"; > + reg = <0x77>; > + #address-cells = <1>; > + #size-cells = <0>; > + i2c-mux-never-disable; Undocumented property? > + > + i2c@0 { > + #address-cells = <1>; > + #size-cells = <0>; > + reg = <0>; > + > + eeprom@52 { > + compatible = "atmel,24c512"; > + reg = <0x52>; > + }; > + > + eeprom@53 { > + compatible = "atmel,24c512"; > + reg = <0x53>; > + }; > + > + power-monitor@40 { Sort the nodes in unit-address. 
Shawn > + compatible = "ti,ina220"; > + reg = <0x40>; > + shunt-resistor = <1000>; > + }; > + > + rtc@51 { > + compatible = "nxp,pcf2129"; > + reg = <0x51>; > + }; > + > + temperature-sensor@4c { > + compatible = "nxp,sa56004"; > + reg = <0x4c>; > + vcc-supply = <_3v3>; > + }; > + > + }; > + }; > +}; > + > + { > + #address-cells = <2>; > + #size-cells = <1>; > + /* NAND Flash */ > + ranges = <0x0 0x0 0x0 0x7e80 0x0001>; > + status = "okay"; > + > + nand@0,0 { > + compatible = "fsl,ifc-nand"; > + #address-cells = <1>; > + #size-cells = <1>; > + reg = <0x0 0x0 0x1>; > + }; > + > +}; > + > +#include "fsl-ls1046-post.dtsi" > + > + { > + ethernet@e { > + phy-handle = <_phy4>; > + phy-connection-type = "qsgmii"; > + }; > + > + ethernet@e8000 { > + phy-handle = <_phy2>; > + phy-connection-type = "qsgmii"; > + }; > + > + ethernet@ea000 { > + phy-handle = <_phy1>; > + phy-connection-type = "qsgmii"; > + }; > + > + ethernet@f2000 { > + phy-handle =
Re: [PATCH v3] tipc: fix modprobe tipc failed after switch order of device registration
On 2019/5/19 19:48, Stephen Rothwell wrote: > Hi, > > On Sun, 19 May 2019 17:13:45 +0800 hujunwei wrote: >> >> Fixes: 7e27e8d6130c >> ("tipc: switch order of device registration to fix a crash") > > Please don't split Fixes tags over more than one line. It is OK if > they are too long. > Hi, Stephen Thanks for your suggestion, I will update it later. Regards, Junwei
Re: [PATCH v5 4/6] usb: roles: add API to get usb_role_switch by node
Hi, On Fri, 2019-05-17 at 16:05 +0300, Heikki Krogerus wrote: > Hi, > > On Fri, May 17, 2019 at 01:37:36PM +0300, Heikki Krogerus wrote: > > On Tue, May 14, 2019 at 04:47:21PM +0800, Chunfeng Yun wrote: > > > Add fwnode_usb_role_switch_get() to make easier to get > > > usb_role_switch by fwnode which register it. > > > It's useful when there is not device_connection registered > > > between two drivers and only knows the fwnode which register > > > usb_role_switch. > > > > > > Signed-off-by: Chunfeng Yun > > > Tested-by: Biju Das > > > > Acked-by: Heikki Krogerus > > Hold on. I just noticed Rob's comment on patch 2/6, where he points out > that you don't need to use device graph since the controller is the > parent of the connector. Doesn't that mean you don't really need this > API? No, I still need it. The change is about the way how to get fwnode; when use device graph, get fwnode by of_graph_get_remote_node(); but now will get fwnode by of_get_parent(); > > > > --- > > > v5 changes: > > > 1. remove linux/of.h suggested by Biju > > > 2. add tested by Biju > > > > > > Note: still depends on [1] > > > [1]: [v6,08/13] usb: roles: Introduce stubs for the exiting functions in > > > role.h > > > https://patchwork.kernel.org/patch/10909971/ > > > > > > v4 changes: > > > 1. use switch_fwnode_match() to find fwnode suggested by Heikki > > > 2. this patch now depends on [1] > > > > > > [1] [v6,08/13] usb: roles: Introduce stubs for the exiting functions in > > > role.h > > > https://patchwork.kernel.org/patch/10909971/ > > > > > > v3 changes: > > > 1. use fwnodes instead of node suggested by Andy > > > 2. 
rebuild the API suggested by Heikki > > > > > > v2 no changes > > > --- > > > drivers/usb/roles/class.c | 24 > > > include/linux/usb/role.h | 8 > > > 2 files changed, 32 insertions(+) > > > > > > diff --git a/drivers/usb/roles/class.c b/drivers/usb/roles/class.c > > > index f45d8df5cfb8..4a1f09a41ec0 100644 > > > --- a/drivers/usb/roles/class.c > > > +++ b/drivers/usb/roles/class.c > > > @@ -135,6 +135,30 @@ struct usb_role_switch *usb_role_switch_get(struct > > > device *dev) > > > } > > > EXPORT_SYMBOL_GPL(usb_role_switch_get); > > > > > > +/** > > > + * fwnode_usb_role_switch_get - Find USB role switch by it's parent > > > fwnode > > > + * @fwnode: The fwnode that register USB role switch > > > + * > > > + * Finds and returns role switch registered by @fwnode. The reference > > > count > > > + * for the found switch is incremented. > > > + */ > > > +struct usb_role_switch * > > > +fwnode_usb_role_switch_get(struct fwnode_handle *fwnode) > > > +{ > > > + struct usb_role_switch *sw; > > > + struct device *dev; > > > + > > > + dev = class_find_device(role_class, NULL, fwnode, switch_fwnode_match); > > > + if (!dev) > > > + return ERR_PTR(-EPROBE_DEFER); > > > + > > > + sw = to_role_switch(dev); > > > + WARN_ON(!try_module_get(sw->dev.parent->driver->owner)); > > > + > > > + return sw; > > > +} > > > +EXPORT_SYMBOL_GPL(fwnode_usb_role_switch_get); > > This function only basically converts the fwnode to usb_role_switch, > but I would actually prefer that we walked through the device graph > here instead of expecting the caller to do that. > > So this function should probable be called fwnode_to_usb_role_switch() > and not fwnode_usb_role_switch_get(), but I guess you don't need it > at all, right? > > > thanks, >
[PATCH v2] kernel: fix typos and some coding style in comments
fix lenght to length Signed-off-by: Weitao Hou --- Changes in v2: - fix space before tab warnings --- kernel/sysctl.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 943c89178e3d..f78f725f225e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -187,17 +187,17 @@ extern int no_unaligned_warning; * enum sysctl_writes_mode - supported sysctl write modes * * @SYSCTL_WRITES_LEGACY: each write syscall must fully contain the sysctl value - * to be written, and multiple writes on the same sysctl file descriptor - * will rewrite the sysctl value, regardless of file position. No warning - * is issued when the initial position is not 0. + * to be written, and multiple writes on the same sysctl file descriptor + * will rewrite the sysctl value, regardless of file position. No warning + * is issued when the initial position is not 0. * @SYSCTL_WRITES_WARN: same as above but warn when the initial file position is - * not 0. + * not 0. * @SYSCTL_WRITES_STRICT: writes to numeric sysctl entries must always be at - * file position 0 and the value must be fully contained in the buffer - * sent to the write syscall. If dealing with strings respect the file - * position, but restrict this to the max length of the buffer, anything - * passed the max lenght will be ignored. Multiple writes will append - * to the buffer. + * file position 0 and the value must be fully contained in the buffer + * sent to the write syscall. If dealing with strings respect the file + * position, but restrict this to the max length of the buffer, anything + * passed the max length will be ignored. Multiple writes will append + * to the buffer. * * These write modes control how current file position affects the behavior of * updating sysctl values through the proc interface on each write. -- 2.18.0
Re: [PATCH v2 02/11] ARM: dts: imx7s: Update coresight DT bindings
On Wed, May 08, 2019 at 10:18:53AM +0800, Leo Yan wrote:
> CoreSight DT bindings have been updated, thus the old compatible strings
> are obsolete and the drivers will report a warning if a DTS uses these
> obsolete strings.
>
> This patch switches to the new bindings for the CoreSight dynamic funnel
> and static replicator, so the warning during initialisation can be
> dismissed.
>
> Cc: Shawn Guo
> Cc: Chris Healy
> Cc: Andrew Lunn
> Cc: Fabio Estevam
> Cc: Mathieu Poirier
> Cc: Suzuki K Poulose
> Signed-off-by: Leo Yan

Applied, thanks.
Re: [v3 PATCH] dt-binding: usb: add usb-role-switch property
On Fri, 2019-05-17 at 16:27 +0300, Heikki Krogerus wrote:
> On Wed, May 08, 2019 at 05:17:44PM +0800, Chunfeng Yun wrote:
> > Add a property usb-role-switch to tell the driver to use the
> > USB Role Switch framework to handle role switching; it is useful
> > when the driver already supports other ways, such as the extcon
> > framework.
> >
> > Cc: Biju Das
> > Cc: Yu Chen
> > Signed-off-by: Chunfeng Yun
>
> Who is meant to pick this? Can you include this in your series where
> you introduce that USB Type-B GPIO connector driver?

Ok, I'll do it if needed.

> FWIW:
>
> Reviewed-by: Heikki Krogerus
>
> > ---
> > v3:
> >   add property type, modify description suggested by Heikki
> >
> > v2:
> >   describe it in terms of h/w functionality suggested by Rob
> >
> > v1:
> >   the property is discussed in:
> >   [v2,2/7] dt-bindings: usb: renesas_usb3: add usb-role-switch property
> >   https://patchwork.kernel.org/patch/10852497/
> >
> >   Mediatek and Hisilicon also try to use it:
> >   [v4,3/6] dt-bindings: usb: mtu3: add properties about USB Role Switch
> >   https://patchwork.kernel.org/patch/10918385/
> >   [v4,6/6] usb: mtu3: register a USB Role Switch for dual role mode
> >   https://patchwork.kernel.org/patch/10918367/
> >
> >   [v6,10/13] usb: dwc3: Registering a role switch in the DRD code
> >   https://patchwork.kernel.org/patch/10909981/
> > ---
> >  Documentation/devicetree/bindings/usb/generic.txt | 4
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/usb/generic.txt b/Documentation/devicetree/bindings/usb/generic.txt
> > index 0a74ab8dfdc2..cf5a1ad456e6 100644
> > --- a/Documentation/devicetree/bindings/usb/generic.txt
> > +++ b/Documentation/devicetree/bindings/usb/generic.txt
> > @@ -30,6 +30,10 @@ Optional properties:
> >  	optional for OTG device.
> >  - adp-disable: tells OTG controllers we want to disable OTG ADP, ADP is
> >  	optional for OTG device.
> > + - usb-role-switch: boolean, indicates that the device is capable of
> > +	assigning the USB data role (USB host or USB device) for a given
> > +	USB connector, such as Type-C, Type-B(micro).
> > +	see connector/usb-connector.txt.
> >
> >  This is an attribute to a USB controller such as:
> >
> > --
> > 2.21.0

thanks,
RE: [PATCH 2/6] Revert "arm64: dts: renesas: r8a7796: Enable DMA for SCIF2"
Hi Eugeniu-san, Geert-san,

> From: Eugeniu Rosca, Sent: Tuesday, May 7, 2019 4:43 AM
> > > [0] v5.0-rc6 commit 97f26702bc95b5 ("arm64: dts: renesas: r8a7796:
> > >     Enable DMA for SCIF2")
> > > [1] v4.14.106 commit 703db5d1b1759f ("arm64: dts: renesas: r8a7796:
> > >     Enable DMA for SCIF2")
> > > [2] scif (DEBUG) and rcar-dmac logs:
> > >     https://gist.github.com/erosca/132cce76a619724a9e4fa61d1db88c66
>
> Enabling DEBUG in drivers/dma/sh/rcar-dmac.c, I can notice that one of
> the symptoms is a NULL dst_addr revealed by:
>
> rcar-dmac e730.dma-controller: chan0: queue chunk (ptrval):
> 0@0x800639eb8090 -> 0x
>
> In working scenarios, dst_addr is never zero. Does it give any hints?

Thank you for the report! It's very helpful to me.
I think we should fix the sh-sci driver at least.
According to the [2] log above:

[4.379716] sh-sci e6e88000.serial: sci_dma_tx_work_fn: 800639b55000: 0...0, cookie 126

This "0...0" means that s->tx_dma_len in sci_dma_tx_work_fn() will be
zero. And:

> rcar-dmac e730.dma-controller: chan0: queue chunk (ptrval):
> 0@0x800639eb8090 -> 0x

This means the chunk->dst_addr is not set to the "dst_addr" for SCIF,
because the len in rcar_dmac_chan_prep_sg() is zero.

So, I'm thinking:
- we have to fix the sh-sci driver to avoid submitting a transfer with
  tx_dma_len = 0, and
- we also have to fix the rcar-dmac driver to avoid this issue, because
  the DMA Engine API guide doesn't prevent len = 0.

Eugeniu-san, Geert-san, what do you think?

Best regards,
Yoshihiro Shimoda

> > > Thanks!
> >
> > Likewise!
> >
> > Gr{oetje,eeting}s,
> >
> > Geert
> >
> > --
> > Geert Uytterhoeven -- There's lots of Linux beyond ia32 --
> > ge...@linux-m68k.org
> >
> > In personal conversations with technical people, I call myself a hacker. But
> > when I'm talking to journalists I just say "programmer" or something like
> > that.
> > -- Linus Torvalds
>
> --
> Best Regards,
> Eugeniu.
Re: [f2fs-dev] [PATCH v2 2/2] f2fs: relocate chksum_offset for large_nat_bitmap feature
On 2019/5/20 1:31, Ju Hyung Park wrote:
> Hi Jaegeuk and Chao,
>
> I was semi-forced today to use the new kernel and test f2fs.
>
> My Ubuntu initramfs got a bit wonky and I had to boot into a live CD and
> fix some stuff. The live CD was using the 4.15 kernel, and just mounting
> the f2fs partition there corrupted f2fs, and my 4.19 (with 5.1-rc1-4.19
> f2fs-stable merged) refused to mount with a "SIT is corrupted node"
> message.
>
> I used the latest f2fs-tools sent by Chao, including "fsck.f2fs: fix to
> repair cp_loads blocks at correct position".
>
> It spit out 140M worth of output, but at least I didn't have to run it
> twice. Everything returned "Ok" in the 2nd run.
> The new log is at http://arter97.com/f2fs/final
>
> After fixing the image, I used my 4.19 kernel with 5.2-rc1-4.19
> f2fs-stable merged and it mounted.
>
> But, I got this:
> [1.047791] F2FS-fs (nvme0n1p3): layout of large_nat_bitmap is
> deprecated, run fsck to repair, chksum_offset: 4092
> [1.081307] F2FS-fs (nvme0n1p3): Found nat_bits in checkpoint
> [1.161520] F2FS-fs (nvme0n1p3): recover fsync data on readonly fs
> [1.162418] F2FS-fs (nvme0n1p3): Mounted with checkpoint version = 761c7e00
>
> But after doing a reboot, the message is gone:
> [1.098423] F2FS-fs (nvme0n1p3): Found nat_bits in checkpoint
> [1.11] F2FS-fs (nvme0n1p3): recover fsync data on readonly fs
> [1.178365] F2FS-fs (nvme0n1p3): Mounted with checkpoint version = 761c7eda
>
> I'm not exactly sure why the kernel detected that I'm still using the
> old layout on the first boot. Maybe fsck didn't fix it properly, or
> the check from the kernel is improper.

The fsck should have fixed the layout problem; that is just an
unnecessary message, and we can avoid it by relocating the sanity check
of the layout. I've sent one patch for that, and tested with it; it acts
as expected. :)

> I also noticed that Jaegeuk sent v1 of this patch to upstream. (Maybe
> that's why the kernel detected the old layout?)
> Please send v2 to upstream soon, as running older fsck will cause much
> more headaches.

Since v1 was upstreamed in 5.2-rc1, I guess we need another patch for
that, maybe the below patch?

[PATCH] f2fs: fix to check layout on last valid checkpoint park

Thanks,

> Thanks.
>
> On Fri, Apr 26, 2019 at 11:26 AM Chao Yu wrote:
>>
>> For large_nat_bitmap feature, there is a design flaw:
>>
>> Previous:
>>
>> struct f2fs_checkpoint layout:
>> +--------------------------+ 0x
>> | checkpoint_ver           |
>> | ..                       |
>> | checksum_offset          |--+
>> | ..                       |  |
>> | sit_nat_version_bitmap[] |<-|---+
>> | ..                       |  |   |
>> | checksum_value           |<-+   |
>> +--------------------------+ 0x1000
>> |                          |      | nat_bitmap + sit_bitmap
>> | payload blocks           |      |
>> |                          |      |
>> +--------------------------|<-----+
>>
>> Obviously, if nat_bitmap size + sit_bitmap size is larger than
>> MAX_BITMAP_SIZE_IN_CKPT, nat_bitmap or sit_bitmap may overlap
>> checkpoint checksum's position, once checkpoint() is triggered
>> from kernel, nat or sit bitmap will be damaged by checksum field.
>>
>> In order to fix this, let's relocate checksum_value's position
>> to the head of sit_nat_version_bitmap as below, then nat/sit
>> bitmap and chksum value update will become safe.
>>
>> After:
>>
>> struct f2fs_checkpoint layout:
>> +--------------------------+ 0x
>> | checkpoint_ver           |
>> | ..                       |
>> | checksum_offset          |--+
>> | ..                       |  |
>> | sit_nat_version_bitmap[] |<-+
>> | ..                       |<-----+
>> |                          |      |
>> +--------------------------+ 0x1000
>> |                          |      | nat_bitmap + sit_bitmap
>> | payload blocks           |      |
>> |                          |      |
>> +--------------------------|<-----+
>>
>> Related report and discussion:
>>
>> https://sourceforge.net/p/linux-f2fs/mailman/message/36642346/
>>
>> Reported-by: Park Ju Hyung
>> Signed-off-by: Chao Yu
>> ---
>> v2:
>> - improve hint message suggested by Ju Hyung.
>> - move verification to f2fs_sanity_check_ckpt().
>>  fs/f2fs/f2fs.h  |  4 +++-
>>  fs/f2fs/super.c | 13 +
>>  2 files changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> index 119bc5a9783e..aa71c1aa9eaa 100644
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -1909,9 +1909,11 @@ static inline void *__bitmap_ptr(struct f2fs_sb_info *sbi, int flag)
>>  	int offset;
>>
>>  	if (is_set_ckpt_flags(sbi, CP_LARGE_NAT_BITMAP_FLAG)) {
>> +		unsigned int chksum_size = sizeof(__le32);
>> +
>>  		offset = (flag == SIT_BITMAP) ?
>>  			le32_to_cpu(ckpt->nat_ver_bitmap_bytesize) :