Re: [PATCH v2 7/7] ABI: sysfs-kernel-mm-cma: fix two cross-references
On 3/25/21 3:38 AM, Mauro Carvalho Chehab wrote:
> Change the text in order to generate cross-references for
> alloc_pages_success and alloc_pages_fail symbols.
>
> Signed-off-by: Mauro Carvalho Chehab
> ---
>  Documentation/ABI/testing/sysfs-kernel-mm-cma | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma b/Documentation/ABI/testing/sysfs-kernel-mm-cma
> index 02b2bb60c296..86e261185561 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-cma
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma
> @@ -5,12 +5,10 @@ Description:
>  		/sys/kernel/mm/cma/ contains a subdirectory for each CMA
>  		heap name (also sometimes called CMA areas).
>
> -		Each CMA heap subdirectory (that is, each
> -		/sys/kernel/mm/cma/<cma-heap-name> directory) contains the
> -		following items:
> +		Each CMA heap subdirectory contains the following items:
>
> -			alloc_pages_success
> -			alloc_pages_fail
> +		- /sys/kernel/mm/cma/<cma-heap-name>/alloc_pages_success
> +		- /sys/kernel/mm/cma/<cma-heap-name>/alloc_pages_fail

I agree that this is clearer and easier on the reader, who can now see
directly what the full path to each item is.

As for calling it a "fix", that seems a bit much. It's an upgrade. I'm not
sure how many people realize that this sort of change causes cross refs to
magically start working. I certainly didn't until now. But either way,
this improvement is nice to have, so:

Reviewed-by: John Hubbard

thanks,
--
John Hubbard
NVIDIA

> What:		/sys/kernel/mm/cma/<cma-heap-name>/alloc_pages_success
> Date:		Feb 2021
Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
On 3/30/21 8:56 PM, John Hubbard wrote:
> On 3/30/21 3:56 PM, Alistair Popple wrote:
> ...
>>> +1 for renaming "munlock*" items to "mlock*", where applicable. good
>>> grief.
>>
>> At least the situation was weird enough to prompt further
>> investigation :)
>>
>> Renaming to mlock* doesn't feel like the right solution to me either
>> though. I am not sure if you saw me responding to myself earlier but I
>> am thinking renaming try_to_munlock() -> page_mlocked() and
>> try_to_munlock_one() -> page_mlock_one() might be better. Thoughts?
>
> Quite confused by this naming idea. Because: try_to_munlock() returns
> void, so a boolean-style name such as "page_mlocked()" is already not a
> good fit.
>
> Even more important, though, is that try_to_munlock() is mlock-ing the
> page, right? Is there some subtle point I'm missing? It really is doing
> an mlock to the best of my knowledge here. Although the kerneldoc
> comment for try_to_munlock() seems questionable too:
>
> /**
>  * try_to_munlock - try to munlock a page
>  * @page: the page to be munlocked
>  *
>  * Called from munlock code. Checks all of the VMAs mapping the page
>  * to make sure nobody else has this page mlocked. The page will be
>  * returned with PG_mlocked cleared if no other vmas have it mlocked.
>  */
>
> ...because I don't see where, in *this* routine, it clears PG_mlocked!
>
> Obviously we agree that a routine should be named based on what it does,
> rather than on who calls it. So I think that still leads to:
>
> 	try_to_munlock() --> try_to_mlock()
> 	try_to_munlock_one() --> try_to_mlock_one()
>
> Sorry if I'm missing something really obvious.

Actually, re-reading your and Jason's earlier points in the thread, I see
that I'm *not* missing anything, and we are actually in agreement about
how the code operates. OK, good!

Also, as you point out above, maybe the "try_" prefix is not really
accurate either, given how this works. So maybe we have arrived at
something like:

	try_to_munlock() --> page_mlock() // or mlock_page()...
	try_to_munlock_one() --> page_mlock_one()

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
On 3/30/21 3:56 PM, Alistair Popple wrote:
...
>> +1 for renaming "munlock*" items to "mlock*", where applicable. good
>> grief.
>
> At least the situation was weird enough to prompt further
> investigation :)
>
> Renaming to mlock* doesn't feel like the right solution to me either
> though. I am not sure if you saw me responding to myself earlier but I
> am thinking renaming try_to_munlock() -> page_mlocked() and
> try_to_munlock_one() -> page_mlock_one() might be better. Thoughts?

Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing an
mlock to the best of my knowledge here. Although the kerneldoc comment for
try_to_munlock() seems questionable too:

/**
 * try_to_munlock - try to munlock a page
 * @page: the page to be munlocked
 *
 * Called from munlock code. Checks all of the VMAs mapping the page
 * to make sure nobody else has this page mlocked. The page will be
 * returned with PG_mlocked cleared if no other vmas have it mlocked.
 */

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

	try_to_munlock() --> try_to_mlock()
	try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.

> This is actually inspired by a suggestion in
> Documentation/vm/unevictable-lru.rst which warns about this problem:
>
>   try_to_munlock() Reverse Map Scan
>   ---------------------------------
>
>   .. warning::
>      [!] TODO/FIXME: a better name might be page_mlocked() - analogous
>      to the page_referenced() reverse map walker.

This is actually rather bad advice! page_referenced() returns an
int-that-is-really-a-boolean, whereas try_to_munlock(), at least as it
stands now, returns void.

Usually when I'm writing a TODO item, I'm in a hurry, and I think that's
what probably happened here, too. :)

>> Although, it seems reasonable to tack such renaming patches onto the
>> tail end of this series. But whatever works.
>
> Unless anyone objects strongly I will roll the rename into this patch as
> there is only one caller of try_to_munlock.
>
>  - Alistair

No objections here. :)

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
On 3/30/21 3:24 PM, Jason Gunthorpe wrote:
...
>>> As far as I can tell this has always been called try_to_munlock() even
>>> though it appears to do the opposite.
>>
>> Maybe we should change it then?
>>
>>> /**
>>>  * try_to_munlock - try to munlock a page
>>>  * @page: the page to be munlocked
>>>  *
>>>  * Called from munlock code. Checks all of the VMAs mapping the page
>>>  * to make sure nobody else has this page mlocked. The page will be
>>>  * returned with PG_mlocked cleared if no other vmas have it mlocked.
>>>  */
>>>
>>> In other words it sets PG_mlocked if one or more vmas has it mlocked.
>>> So try_to_mlock() might be a better name, except that seems to have
>>> the potential for confusion as well because it's only called from the
>>> munlock code path and never for mlock.
>>
>> That explanation makes more sense.. This function looks like it is 'set
>> PG_mlocked of the page if any vm->flags has VM_LOCKED'
>>
>> Maybe call it check_vm_locked or something then and reword the above
>> comment?
>>
>> (and why is it OK to read vm->flags for this without any locking?)
>>
>> Something needs attention here..
>
> I think the code is correct, but perhaps the naming could be better.
> Would be interested hearing any thoughts on renaming try_to_munlock() to
> try_to_mlock() as the current name appears based on the context it is
> called from (munlock) rather than what it does (mlock).

The point of this patch is to make it clearer, after all, so I'd change
something and maybe slightly clarify the comment.

I'd add that, after looking around the calling code, this is a really
unhappy pre-existing situation. Anyone reading this has to remember at
which point in the call stack the naming transitions from "do the
opposite of what the name says", to "do what the name says".

+1 for renaming "munlock*" items to "mlock*", where applicable. good
grief.

Although, it seems reasonable to tack such renaming patches onto the tail
end of this series. But whatever works.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH] mm: gup: remove FOLL_SPLIT
On 3/29/21 11:24 PM, kernel test robot wrote:
> Hi Yang,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on hnaz-linux-mm/master]
>
> url:      https://github.com/0day-ci/linux/commits/Yang-Shi/mm-gup-remove-FOLL_SPLIT/20210330-034042
> base:     https://github.com/hnaz/linux-mm master
> config:   s390-randconfig-r032-20210330 (attached as .config)
> compiler: s390-linux-gcc (GCC) 9.3.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/0day-ci/linux/commit/c8563a636718f98af86a3965d94e25b8f2cf2354
>         git remote add linux-review https://github.com/0day-ci/linux
>         git fetch --no-tags linux-review Yang-Shi/mm-gup-remove-FOLL_SPLIT/20210330-034042
>         git checkout c8563a636718f98af86a3965d94e25b8f2cf2354
>         # save the attached .config to linux build tree
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=s390
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot
>
> All errors (new ones prefixed by >>):
>
>    arch/s390/mm/gmap.c: In function 'thp_split_mm':
> >> arch/s390/mm/gmap.c:2498:27: error: 'FOLL_SPLIT' undeclared (first use in this function); did you mean 'FOLL_PIN'?
>     2498 |                 follow_page(vma, addr, FOLL_SPLIT);
>          |                                        ^~~~~~~~~~
>          |                                        FOLL_PIN
>    arch/s390/mm/gmap.c:2498:27: note: each undeclared identifier is reported only once for each function it appears in
>
> vim +2498 arch/s390/mm/gmap.c

There appears to be an imperfection in this 0day testing system, because
(just as the patch says), commit ba925fa35057a062ac98c3e8138b013ce4ce351c
("s390/gmap: improve THP splitting"), July 29, 2020, removes the above use
of FOLL_SPLIT. And "git grep", just to be sure, shows it is not there in
today's linux.git.

So I guess the https://github.com/0day-ci/linux repo needs a better way to
stay in sync?

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH] mm: gup: remove FOLL_SPLIT
On 3/29/21 12:38 PM, Yang Shi wrote:
> Since commit 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of
> FOLL_SPLIT") and commit ba925fa35057 ("s390/gmap: improve THP splitting")
> FOLL_SPLIT has not been used anymore. Remove the dead code.
>
> Signed-off-by: Yang Shi
> ---
>  include/linux/mm.h |  1 -
>  mm/gup.c           | 28 ++--------------------------
>  2 files changed, 2 insertions(+), 27 deletions(-)

Looks nice. As long as I'm running git grep here, there is one more search
hit that should also be fixed up, as part of a "remove FOLL_SPLIT" patch:

    git grep -nw FOLL_SPLIT
    Documentation/vm/transhuge.rst:57:follow_page, the FOLL_SPLIT bit can be specified as a parameter to

Reviewed-by: John Hubbard

thanks,
--
John Hubbard
NVIDIA

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8ba434287387..3568836841f9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2780,7 +2780,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  #define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
>  				 * and return without waiting upon it */
>  #define FOLL_POPULATE	0x40	/* fault in page */
> -#define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
>  #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
>  #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
>  #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..f3d45a8f18ae 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -435,18 +435,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		}
>  	}
>
> -	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> -		get_page(page);
> -		pte_unmap_unlock(ptep, ptl);
> -		lock_page(page);
> -		ret = split_huge_page(page);
> -		unlock_page(page);
> -		put_page(page);
> -		if (ret)
> -			return ERR_PTR(ret);
> -		goto retry;
> -	}
> -
>  	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
>  	if (unlikely(!try_grab_page(page, flags))) {
>  		page = ERR_PTR(-ENOMEM);
> @@ -591,7 +579,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>  	}
> -	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
> +	if (flags & FOLL_SPLIT_PMD) {
>  		int ret;
>  		page = pmd_page(*pmd);
>  		if (is_huge_zero_page(page)) {
> @@ -600,19 +588,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  			split_huge_pmd(vma, pmd, address);
>  			if (pmd_trans_unstable(pmd))
>  				ret = -EBUSY;
> -		} else if (flags & FOLL_SPLIT) {
> -			if (unlikely(!try_get_page(page))) {
> -				spin_unlock(ptl);
> -				return ERR_PTR(-ENOMEM);
> -			}
> -			spin_unlock(ptl);
> -			lock_page(page);
> -			ret = split_huge_page(page);
> -			unlock_page(page);
> -			put_page(page);
> -			if (pmd_none(*pmd))
> -				return no_page_table(vma, flags);
> -		} else {	/* flags & FOLL_SPLIT_PMD */
> +		} else {
>  			spin_unlock(ptl);
>  			split_huge_pmd(vma, pmd, address);
>  			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
Re: [PATCH v3] kernel/resource: Fix locking in request_free_mem_region
On 3/29/21 9:59 PM, Alistair Popple wrote:
...
>>>  	res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
>>> +	if (dev) {
>>> +		dr->parent = &iomem_resource;
>>> +		dr->start = addr;
>>> +		dr->n = size;
>>> +		devres_add(dev, dr);
>>> +	}
>>> +
>>> +	write_unlock(&resource_lock);
>>> +	revoke_iomem(res);
>>
>> This is new, and not mentioned in the commit log, and therefore quite
>> surprising. It seems like the right thing to do but it also seems like
>> a different fix! I'm not saying that it should be a separate patch, but
>> it does seem worth loudly mentioning in the commit log, yes?
>
> This isn't a different fix though, it is just about maintaining the
> original behaviour, which called revoke_iomem() after dropping the lock.
> I inadvertently switched this around in the initial patch such that
> revoke_iomem() got called with the lock held, leading to deadlock on x86
> with CONFIG_IO_STRICT_DEVMEM=y.
>
> This does change the order of revoke_iomem() and devres_add() slightly,
> but as far as I can tell that shouldn't be an issue. Can call that out
> in the commit log.

Maybe that's why it looked like a change to me. I do think it's worth
mentioning.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v3] kernel/resource: Fix locking in request_free_mem_region
> e *iomem_get_mapping(void)
>  	return smp_load_acquire(&iomem_inode)->i_mapping;
>  }
>
> -/**
> - * __request_region - create a new busy resource region
> - * @parent: parent resource descriptor
> - * @start: resource start address
> - * @n: resource region size
> - * @name: reserving caller's ID string
> - * @flags: IO resource flags
> - */
> -struct resource * __request_region(struct resource *parent,
> -				   resource_size_t start, resource_size_t n,
> -				   const char *name, int flags)
> +static bool request_region_locked(struct resource *parent,
> +				  struct resource *res, resource_size_t start,
> +				  resource_size_t n, const char *name, int flags)
>  {
>  	DECLARE_WAITQUEUE(wait, current);
> -	struct resource *res = alloc_resource(GFP_KERNEL);
> -	struct resource *orig_parent = parent;
> -
> -	if (!res)
> -		return NULL;
>
>  	res->name = name;
>  	res->start = start;
>  	res->end = start + n - 1;
>
> -	write_lock(&resource_lock);
> -
>  	for (;;) {
>  		struct resource *conflict;
>
> @@ -1230,14 +1224,37 @@ struct resource * __request_region(struct resource *parent,
>  			write_lock(&resource_lock);
>  			continue;
>  		}
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +/**
> + * __request_region - create a new busy resource region
> + * @parent: parent resource descriptor
> + * @start: resource start address
> + * @n: resource region size
> + * @name: reserving caller's ID string
> + * @flags: IO resource flags
> + */
> +struct resource *__request_region(struct resource *parent,
> +				  resource_size_t start, resource_size_t n,
> +				  const char *name, int flags)
> +{
> +	struct resource *res = alloc_resource(GFP_KERNEL);
> +
> +	if (!res)
> +		return NULL;
> +
> +	write_lock(&resource_lock);
> +	if (!request_region_locked(parent, res, start, n, name, flags)) {
>  		/* Uhhuh, that didn't work out.. */
>  		free_resource(res);
>  		res = NULL;
> -		break;
>  	}
>  	write_unlock(&resource_lock);
> -
> -	if (res && orig_parent == &iomem_resource)
> +	if (res && parent == &iomem_resource)
>  		revoke_iomem(res);
>
>  	return res;
> @@ -1779,26 +1796,51 @@ static struct resource *__request_free_mem_region(struct device *dev,
>  {
>  	resource_size_t end, addr;
>  	struct resource *res;
> +	struct region_devres *dr = NULL;
> +
> +	res = alloc_resource(GFP_KERNEL);
> +	if (!res)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (dev) {
> +		dr = devres_alloc(devm_region_release, sizeof(struct region_devres),
> +				  GFP_KERNEL);
> +		if (!dr) {
> +			free_resource(res);
> +			return ERR_PTR(-ENOMEM);
> +		}
> +	}
>
>  	size = ALIGN(size, 1UL << PA_SECTION_SHIFT);
>  	end = min_t(unsigned long, base->end, (1UL << MAX_PHYSMEM_BITS) - 1);
>  	addr = end - size + 1UL;
>
> +	write_lock(&resource_lock);
>  	for (; addr > size && addr >= base->start; addr -= size) {
> -		if (region_intersects(addr, size, 0, IORES_DESC_NONE) !=
> +		if (__region_intersects(addr, size, 0, IORES_DESC_NONE) !=
>  		    REGION_DISJOINT)
>  			continue;
>
> -		if (dev)
> -			res = devm_request_mem_region(dev, addr, size, name);
> -		else
> -			res = request_mem_region(addr, size, name);
> -		if (!res)
> -			return ERR_PTR(-ENOMEM);
> +		if (!request_region_locked(&iomem_resource, res, addr,
> +					   size, name, 0))
> +			break;
> +
>  		res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
> +		if (dev) {
> +			dr->parent = &iomem_resource;
> +			dr->start = addr;
> +			dr->n = size;
> +			devres_add(dev, dr);
> +		}
> +
> +		write_unlock(&resource_lock);
> +		revoke_iomem(res);

This is new, and not mentioned in the commit log, and therefore quite
surprising. It seems like the right thing to do but it also seems like a
different fix! I'm not saying that it should be a separate patch, but it
does seem worth loudly mentioning in the commit log, yes?

>  		return res;
>  	}
>
> +	write_unlock(&resource_lock);
> +	free_resource(res);
> +	return ERR_PTR(-ERANGE);
>  }

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v7] mm: cma: support sysfs
On 3/24/21 3:11 PM, Dmitry Osipenko wrote:
> 25.03.2021 01:01, John Hubbard wrote:
>> On 3/24/21 2:31 PM, Dmitry Osipenko wrote:
>> ...
>>>> +#include <linux/kobject.h>
>>>> +
>>>> +struct cma_kobject {
>>>> +	struct cma *cma;
>>>> +	struct kobject kobj;
>>>
>>> If you'll place the kobj as the first member of the struct, then
>>> container_of will be a no-op.
>>
>> However, *this does not matter*. Let's not get carried away. If
>> container_of() ends up as a compile-time addition of +8, instead of +0,
>> there is not going to be a visible effect in the world. Or do you have
>> some perf data to the contrary?
>>
>> Sometimes these kinds of things matter. But other times, they are just
>> pointless to fret about, and this is one such case.
>
> Performance is out of the question here, my main point is about
> maintaining

In that case, there is even less reason to harass people about the order
of members of a struct.

> a good coding style. Otherwise there is no point in not embedding kobj
> into the cma struct as well, IMO.

We really don't need to worry about the order of members in a struct, from
a "coding style" point of view. It is a solid step too far.

Sorry if that sounds a little too direct. But this review has tended to go
quite too far into nitpicks that are normally left as-is, and I've merely
picked one that is particularly questionable. I realize that other coding
communities have their own standards. Here, I'm explaining what I have
observed about linux-mm and linux-kernel, which needs to be respected.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v7] mm: cma: support sysfs
On 3/24/21 2:31 PM, Dmitry Osipenko wrote:
...
>> +#include <linux/kobject.h>
>> +
>> +struct cma_kobject {
>> +	struct cma *cma;
>> +	struct kobject kobj;
>
> If you'll place the kobj as the first member of the struct, then
> container_of will be a no-op.

However, *this does not matter*. Let's not get carried away. If
container_of() ends up as a compile-time addition of +8, instead of +0,
there is not going to be a visible effect in the world. Or do you have
some perf data to the contrary?

Sometimes these kinds of things matter. But other times, they are just
pointless to fret about, and this is one such case.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH] mm: cma: fix corruption cma_sysfs_alloc_pages_count
On 3/24/21 12:20 PM, Minchan Kim wrote:
> struct cma_stat's lifespan for cma_sysfs is different from that of
> struct cma, because the kobject for sysfs requires a dynamic object
> while CMA is a static object[1]. When CMA is initialized, it couldn't
> use the slab to allocate cma_stat since the slab was not initialized
> yet. Thus, it allocates the dynamic object in subsys_initcall. However,
> a cma allocation can happen before subsys_initcall, and then it crashes.
> Dmitry reported[2]:
>
> ..
> [    1.226190] [] (cma_sysfs_alloc_pages_count) from [] (cma_alloc+0x153/0x274)
> [    1.226720] [] (cma_alloc) from [] (__alloc_from_contiguous+0x37/0x8c)
> [    1.227272] [] (__alloc_from_contiguous) from [] (atomic_pool_init+0x7b/0x126)
> [    1.233596] [] (atomic_pool_init) from [] (do_one_initcall+0x45/0x1e4)
> [    1.234188] [] (do_one_initcall) from [] (kernel_init_freeable+0x157/0x1a6)
> [    1.234741] [] (kernel_init_freeable) from [] (kernel_init+0xd/0xe0)
> [    1.235289] [] (kernel_init) from [] (ret_from_fork+0x11/0x1c)
>
> This patch moves those statistic fields of cma_stat into struct cma and
> introduces a cma_kobject wrapper to follow kobject's rules. At the same
> time, it fixes other routines based on suggestions[3][4].
>
> [1] https://lore.kernel.org/linux-mm/ycoamxqt6dzkc...@kroah.com/
> [2] https://lore.kernel.org/linux-mm/fead70a2-4330-79ff-e79a-d8511eab1...@gmail.com/
> [3] https://lore.kernel.org/linux-mm/20210323195050.2577017-1-minc...@kernel.org/
> [4] https://lore.kernel.org/linux-mm/20210324010547.4134370-1-minc...@kernel.org/
>
> Reported-by: Dmitry Osipenko
> Tested-by: Dmitry Osipenko
> Suggested-by: Dmitry Osipenko
> Suggested-by: John Hubbard
> Suggested-by: Matthew Wilcox
> Signed-off-by: Minchan Kim
> ---
> I believe it's worth having a separate patch rather than replacing the
> original patch. It will also help to merge without conflict since we
> already filed other patches based on it.
>
> Strictly speaking, separating the fix part and the readability part of
> this patch would be better, but it's gray to separate them since most
> code in this patch was done while we were fixing the bug. Since we
> haven't released it yet, I hope it will work. Otherwise, I can send a
> replacement patch including all of the changes that have happened until
> now, with gathered SoBs.

If we still have a choice, we should not merge a patch that has a known
serious problem, such as a crash. That's only done if the broken patch has
already been committed to a tree that doesn't allow rebasing, such as of
course the main linux.git. Here, I *think* it's just in linux-next and
mmotm, so we still are allowed to fix the original patch.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v6] mm: cma: support sysfs
On 3/23/21 11:57 PM, Minchan Kim wrote:
...
>> , how about approximately this:
>>
>> struct cma_kobject_wrapper {
>> 	struct cma *parent;
>> 	struct kobject kobj;
>> };
>>
>> struct cma {
>> 	...
>> 	struct cma_kobject_wrapper *cma_kobj_wrapper;
>> };
>>
>> ...thus allowing readers of cma_sysfs.c to read that file more easily.
>
> I agree cma->kobj->kobj is awkward, but personally I don't like the
> naming: cma_kobject_wrapper, with a parent pointer. cma_kobject is
> already a wrapper, so it sounds redundant to me, and it's not a parent
> in the same hierarchy. Since the kobj->kobj is just one line in the code
> (I don't imagine it could grow in cma_sysfs in the future), I don't
> think it would be a problem.
>
> If we really want to make it more clear, maybe?
>
> 	cma->cma_kobj->kobj
>
> It would be consistent with other variables in cma_sysfs_init.

OK, that's at least better than it was.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v6] mm: cma: support sysfs
On 3/23/21 10:44 PM, Minchan Kim wrote:
> On Tue, Mar 23, 2021 at 09:47:27PM -0700, John Hubbard wrote:
>> On 3/23/21 8:27 PM, Minchan Kim wrote:
>> ...
>>>>> +static int __init cma_sysfs_init(void)
>>>>> +{
>>>>> +	unsigned int i;
>>>>> +
>>>>> +	cma_kobj_root = kobject_create_and_add("cma", mm_kobj);
>>>>> +	if (!cma_kobj_root)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>> +	for (i = 0; i < cma_area_count; i++) {
>>>>> +		int err;
>>>>> +		struct cma *cma;
>>>>> +		struct cma_kobject *cma_kobj;
>>>>> +
>>>>> +		cma_kobj = kzalloc(sizeof(*cma_kobj), GFP_KERNEL);
>>>>> +		if (!cma_kobj) {
>>>>> +			kobject_put(cma_kobj_root);
>>>>> +			return -ENOMEM;
>>>>
>>>> This leaks little cma_kobj's all over the floor. :)
>>>
>>> I thought kobject_put(cma_kobj_root) should deal with it. No?
>>
>> If this fails when i > 0, there will be cma_kobj instances that were
>> stashed in the cma_areas[] array. But this code only deletes the most
>> recently allocated cma_kobj, not anything allocated on previous
>> iterations of the loop.
>
> Oh, I misunderstood that destroying the root kobject would release
> children recursively. Seems not true. Going back to the old version.
>
> index 16c81c9cb9b7..418951a3f138 100644
> --- a/mm/cma_sysfs.c
> +++ b/mm/cma_sysfs.c
> @@ -80,20 +80,19 @@ static struct kobj_type cma_ktype = {
>  static int __init cma_sysfs_init(void)
>  {
>  	unsigned int i;
> +	int err;
> +	struct cma *cma;
> +	struct cma_kobject *cma_kobj;
>
>  	cma_kobj_root = kobject_create_and_add("cma", mm_kobj);
>  	if (!cma_kobj_root)
>  		return -ENOMEM;
>
>  	for (i = 0; i < cma_area_count; i++) {
> -		int err;
> -		struct cma *cma;
> -		struct cma_kobject *cma_kobj;
> -
>  		cma_kobj = kzalloc(sizeof(*cma_kobj), GFP_KERNEL);
>  		if (!cma_kobj) {
> -			kobject_put(cma_kobj_root);
> -			return -ENOMEM;
> +			err = -ENOMEM;
> +			goto out;
>  		}
>
>  		cma = &cma_areas[i];
> @@ -103,11 +102,21 @@ static int __init cma_sysfs_init(void)
>  				cma_kobj_root, "%s", cma->name);
>  		if (err) {
>  			kobject_put(&cma_kobj->kobj);
> -			kobject_put(cma_kobj_root);
> -			return err;
> +			goto out;
>  		}
>  	}
>
>  	return 0;
> +out:
> +	while (--i >= 0) {
> +		cma = &cma_areas[i];
> +
> +		kobject_put(&cma->kobj->kobj);

OK.

As long as you are spinning a new version, let's fix up the naming to be a
little better, too. In this case, with a mildly dizzying mix of cma's and
kobjects, it actually makes a real difference. I wouldn't have asked, but
the above cma->kobj->kobj chain really made it obvious for me just now.

So instead of this (in cma.h):

struct cma_kobject {
	struct cma *cma;
	struct kobject kobj;
};

struct cma {
	...
	struct cma_kobject *kobj;
};

, how about approximately this:

struct cma_kobject_wrapper {
	struct cma *parent;
	struct kobject kobj;
};

struct cma {
	...
	struct cma_kobject_wrapper *cma_kobj_wrapper;
};

...thus allowing readers of cma_sysfs.c to read that file more easily.

> +		kfree(cma->kobj);
> +		cma->kobj = NULL;
> +	}
> +	kobject_put(cma_kobj_root);
> +
> +	return err;
>  }
>  subsys_initcall(cma_sysfs_init);

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v6] mm: cma: support sysfs
On 3/23/21 8:27 PM, Minchan Kim wrote:
...
>>> +static int __init cma_sysfs_init(void)
>>> +{
>>> +	unsigned int i;
>>> +
>>> +	cma_kobj_root = kobject_create_and_add("cma", mm_kobj);
>>> +	if (!cma_kobj_root)
>>> +		return -ENOMEM;
>>> +
>>> +	for (i = 0; i < cma_area_count; i++) {
>>> +		int err;
>>> +		struct cma *cma;
>>> +		struct cma_kobject *cma_kobj;
>>> +
>>> +		cma_kobj = kzalloc(sizeof(*cma_kobj), GFP_KERNEL);
>>> +		if (!cma_kobj) {
>>> +			kobject_put(cma_kobj_root);
>>> +			return -ENOMEM;
>>
>> This leaks little cma_kobj's all over the floor. :)
>
> I thought kobject_put(cma_kobj_root) should deal with it. No?

If this fails when i > 0, there will be cma_kobj instances that were
stashed in the cma_areas[] array. But this code only deletes the most
recently allocated cma_kobj, not anything allocated on previous iterations
of the loop.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v6] mm: cma: support sysfs
> 			kobject_put(&cma_kobj->kobj);
> +			kobject_put(cma_kobj_root);
> +			return err;

Hopefully this little bit of logic could also go into the cleanup routine.

> +		}
> +	}
> +
> +	return 0;
> +}
> +subsys_initcall(cma_sysfs_init);

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH] mm: cma: Fix potential null dereference on pointer cma
On 3/16/21 3:04 AM, Colin King wrote:
> From: Colin Ian King
>
> At the start of the function there is a null pointer check on cma and
> then a branch to the error handling label 'out'. The subsequent call to
> cma_sysfs_fail_pages_count dereferences cma, hence there is a potential
> null pointer dereference issue. Fix this by only calling

As far as I can tell, it's not possible to actually cause that null
failure with the existing kernel code paths. *Might* be worth mentioning
that here (unless I'm wrong), but either way, looks good, so:

Reviewed-by: John Hubbard

thanks,
--
John Hubbard
NVIDIA

> cma_sysfs_fail_pages_count if cma is not null.
>
> Addresses-Coverity: ("Dereference after null check")
> Fixes: dc4764f7e9ac ("mm: cma: support sysfs")
> Signed-off-by: Colin Ian King
> ---
>  mm/cma.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/cma.c b/mm/cma.c
> index 6d4dbafdb318..90e27458ddb7 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -512,7 +512,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  		cma_sysfs_alloc_pages_count(cma, count);
>  	} else {
>  		count_vm_event(CMA_ALLOC_FAIL);
> -		cma_sysfs_fail_pages_count(cma, count);
> +		if (cma)
> +			cma_sysfs_fail_pages_count(cma, count);
>  	}
>
>  	return page;
Re: [PATCH v2] mm: vmstat: add cma statistics
On 3/2/21 10:33, Minchan Kim wrote:
> Since CMA is used more widely, it's worth having CMA allocation
> statistics in vmstat. With it, we can know how aggressively the system
> uses cma allocation and how often it fails.
>
> Signed-off-by: Minchan Kim
> ---
> * from v1 - https://lore.kernel.org/linux-mm/20210217170025.512704-1-minc...@kernel.org/
>   * change alloc_attempt with alloc_success - jhubbard
>   * item per line for vm_event_item - jhubbard
>
>  include/linux/vm_event_item.h |  4 ++++
>  mm/cma.c                      | 12 +++++++++---
>  mm/vmstat.c                   |  4 ++++
>  3 files changed, 17 insertions(+), 3 deletions(-)

Seems reasonable, and the diffs look good.

Reviewed-by: John Hubbard

thanks,
--
John Hubbard
NVIDIA

> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 18e75974d4e3..21d7c7f72f1c 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -70,6 +70,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #endif
>  #ifdef CONFIG_HUGETLB_PAGE
>  	HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> +#endif
> +#ifdef CONFIG_CMA
> +	CMA_ALLOC_SUCCESS,
> +	CMA_ALLOC_FAIL,
>  #endif
>  	UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
>  	UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
> diff --git a/mm/cma.c b/mm/cma.c
> index 23d4a97c834a..04ca863d1807 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -435,13 +435,13 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	int ret = -ENOMEM;
>
>  	if (!cma || !cma->count || !cma->bitmap)
> -		return NULL;
> +		goto out;
>
>  	pr_debug("%s(cma %p, count %zu, align %d)\n", __func__, (void *)cma,
>  		 count, align);
>
>  	if (!count)
> -		return NULL;
> +		goto out;
>
>  	mask = cma_bitmap_aligned_mask(cma, align);
>  	offset = cma_bitmap_aligned_offset(cma, align);
> @@ -449,7 +449,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>
>  	if (bitmap_count > bitmap_maxno)
> -		return NULL;
> +		goto out;
>
>  	for (;;) {
>  		mutex_lock(&cma->lock);
> @@ -506,6 +506,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	}
>
>  	pr_debug("%s(): returned %p\n", __func__, page);
> +out:
> +	if (page)
> +		count_vm_event(CMA_ALLOC_SUCCESS);
> +	else
> +		count_vm_event(CMA_ALLOC_FAIL);
> +
>  	return page;
>  }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 97fc32a53320..d8c32a33208d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1305,6 +1305,10 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> +#endif
> +#ifdef CONFIG_CMA
> +	"cma_alloc_success",
> +	"cma_alloc_fail",
>  #endif
>  	"unevictable_pgs_culled",
>  	"unevictable_pgs_scanned",
Re: [PATCH] mm: vmstat: add cma statistics
On 2/17/21 9:00 AM, Minchan Kim wrote:
> Since CMA is used more widely, it's worth having CMA allocation
> statistics in vmstat. With it, we can know how aggressively the system
> uses cma allocation and how often it fails.
>
> Signed-off-by: Minchan Kim
> ---
>  include/linux/vm_event_item.h |  3 +++
>  mm/cma.c                      | 12 +++++++++---
>  mm/vmstat.c                   |  4 ++++
>  3 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 18e75974d4e3..0c567014ce82 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -70,6 +70,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #endif
>  #ifdef CONFIG_HUGETLB_PAGE
>  	HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> +#endif
> +#ifdef CONFIG_CMA
> +	CMA_ALLOC, CMA_ALLOC_FAIL,

This seems wrong: here it's called "alloc", but in the output it's called
"alloc success", and in the implementation it's clearly "alloc attempt"
that is being counted. Once these are all made consistent, then the bug
should naturally go away as part of that.

nit: I think the multiple-items-per-line approach is a weak idea at best,
even though it's used here already. Each item is important and needs to be
visually compared to its output item later. So one item per line might
have helped avoid mismatches, and I think we should change to that to
encourage that trend.

thanks,
--
John Hubbard
NVIDIA

>  #endif
>  	UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
>  	UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
> diff --git a/mm/cma.c b/mm/cma.c
> index 23d4a97c834a..ea1e39559526 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -434,14 +434,16 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	struct page *page = NULL;
>  	int ret = -ENOMEM;
>
> +	count_vm_event(CMA_ALLOC);
> +
>  	if (!cma || !cma->count || !cma->bitmap)
> -		return NULL;
> +		goto out;
>
>  	pr_debug("%s(cma %p, count %zu, align %d)\n", __func__, (void *)cma,
>  		 count, align);
>
>  	if (!count)
> -		return NULL;
> +		goto out;
>
>  	mask = cma_bitmap_aligned_mask(cma, align);
>  	offset = cma_bitmap_aligned_offset(cma, align);
> @@ -449,7 +451,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>
>  	if (bitmap_count > bitmap_maxno)
> -		return NULL;
> +		goto out;
>
>  	for (;;) {
>  		mutex_lock(&cma->lock);
> @@ -506,6 +508,10 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  	}
>
>  	pr_debug("%s(): returned %p\n", __func__, page);
> +out:
> +	if (!page)
> +		count_vm_event(CMA_ALLOC_FAIL);
> +
>  	return page;
>  }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 97fc32a53320..d8c32a33208d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1305,6 +1305,10 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> +#endif
> +#ifdef CONFIG_CMA
> +	"cma_alloc_success",
> +	"cma_alloc_fail",
>  #endif
>  	"unevictable_pgs_culled",
>  	"unevictable_pgs_scanned",
Re: [PATCH 0/9] Add support for SVM atomics in Nouveau
On 2/10/21 4:59 AM, Daniel Vetter wrote: ... GPU atomic operations to sysmem are hard to categorize, because application programmers could easily write programs that do a long series of atomic operations. Such a program would be a little weird, but it's hard to rule out. Yeah, but we can forcefully break this whenever we feel like, by revoking the page, moving it, and then reinstating the gpu pte again and letting it continue. Oh yes, that's true. If that's not possible, then what we need here instead is an mlock() type of thing, I think. No need for that, then. thanks, -- John Hubbard NVIDIA
Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
On 2/5/21 12:41 PM, Joao Martins wrote: Add a unpin_user_page_range_dirty_lock() API which takes a starting page and how many consecutive pages we want to unpin and optionally dirty. To that end, define another iterator for_each_compound_range() that operates in page ranges as opposed to page array. For users (like RDMA mr_dereg) where each sg represents a contiguous set of pages, we're able to more efficiently unpin pages without having to supply an array of pages much of what happens today with unpin_user_pages(). Suggested-by: Jason Gunthorpe Signed-off-by: Joao Martins --- include/linux/mm.h | 2 ++ mm/gup.c | 62 ++ 2 files changed, 64 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index a608feb0d42e..b76063f7f18a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, + bool make_dirty); void unpin_user_pages(struct page **pages, unsigned long npages); /** diff --git a/mm/gup.c b/mm/gup.c index 467a11df216d..938964d31494 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -215,6 +215,32 @@ void unpin_user_page(struct page *page) } EXPORT_SYMBOL(unpin_user_page); +static inline void compound_range_next(unsigned long i, unsigned long npages, + struct page **list, struct page **head, + unsigned int *ntails) Yes, the new names look good, and I have failed to find any logic errors, so: Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA +{ + struct page *next, *page; + unsigned int nr = 1; + + if (i >= npages) + return; + + next = *list + i; + page = compound_head(next); + if (PageCompound(page) && compound_order(page) >= 1) + nr = min_t(unsigned int, + page + compound_nr(page) - next, npages - i); + + *head = page; + *ntails = nr; +} + +#define for_each_compound_range(__i, 
__list, __npages, __head, __ntails) \ + for (__i = 0, \ +compound_range_next(__i, __npages, __list, &(__head), &(__ntails)); \ +__i < __npages; __i += __ntails, \ +compound_range_next(__i, __npages, __list, &(__head), &(__ntails))) + static inline void compound_next(unsigned long i, unsigned long npages, struct page **list, struct page **head, unsigned int *ntails) @@ -303,6 +329,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, } EXPORT_SYMBOL(unpin_user_pages_dirty_lock); +/** + * unpin_user_page_range_dirty_lock() - release and optionally dirty + * gup-pinned page range + * + * @page: the starting page of a range maybe marked dirty, and definitely released. + * @npages: number of consecutive pages to release. + * @make_dirty: whether to mark the pages dirty + * + * "gup-pinned page range" refers to a range of pages that has had one of the + * get_user_pages() variants called on that page. + * + * For the page ranges defined by [page .. page+npages], make that range (or + * its head pages, if a compound page) dirty, if @make_dirty is true, and if the + * page range was previously listed as clean. + * + * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is + * required, then the caller should a) verify that this is really correct, + * because _lock() is usually required, and b) hand code it: + * set_page_dirty_lock(), unpin_user_page(). + * + */ +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, + bool make_dirty) +{ + unsigned long index; + struct page *head; + unsigned int ntails; + + for_each_compound_range(index, &page, npages, head, ntails) { + if (make_dirty && !PageDirty(head)) + set_page_dirty_lock(head); + put_compound_head(head, ntails, FOLL_PIN); + } +} +EXPORT_SYMBOL(unpin_user_page_range_dirty_lock); + /** * unpin_user_pages() - release an array of gup-pinned pages. * @pages: array of pages to be marked dirty and released.
Re: [PATCH v3] mm: cma: support sysfs
On 2/9/21 11:55 PM, Minchan Kim wrote: Since CMA is getting used more widely, it's more important to keep monitoring CMA statistics for system health since it's directly related to user experience. This patch introduces sysfs statistics for CMA, in order to provide some basic monitoring of the CMA allocator. * the number of CMA page allocation attempts * the number of CMA page allocation failures These two values allow the user to calcuate the allocation failure rate for each CMA area. e.g.) /sys/kernel/mm/cma/WIFI/cma_alloc_pages_[attempts|fails] /sys/kernel/mm/cma/SENSOR/cma_alloc_pages_[attempts|fails] /sys/kernel/mm/cma/BLUETOOTH/cma_alloc_pages_[attempts|fails] Signed-off-by: Minchan Kim --- Looks good. Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA From v2 - https://lore.kernel.org/linux-mm/20210208180142.2765456-1-minc...@kernel.org/ * sysfs doc and description modification - jhubbard From v1 - https://lore.kernel.org/linux-mm/20210203155001.4121868-1-minc...@kernel.org/ * fix sysfs build and refactoring - willy * rename and drop some attributes - jhubbard Documentation/ABI/testing/sysfs-kernel-mm-cma | 25 mm/Kconfig| 7 ++ mm/Makefile | 1 + mm/cma.c | 6 +- mm/cma.h | 18 +++ mm/cma_sysfs.c| 114 ++ 6 files changed, 170 insertions(+), 1 deletion(-) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-cma create mode 100644 mm/cma_sysfs.c diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma b/Documentation/ABI/testing/sysfs-kernel-mm-cma new file mode 100644 index ..f518af819cee --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma @@ -0,0 +1,25 @@ +What: /sys/kernel/mm/cma/ +Date: Feb 2021 +Contact: Minchan Kim +Description: + /sys/kernel/mm/cma/ contains a subdirectory for each CMA + heap name (also sometimes called CMA areas). 
+ + Each CMA heap subdirectory (that is, each + /sys/kernel/mm/cma/ directory) contains the + following items: + + cma_alloc_pages_attempts + cma_alloc_pages_fails + +What: /sys/kernel/mm/cma//cma_alloc_pages_attempts +Date: Feb 2021 +Contact: Minchan Kim +Description: + the number of pages CMA API tried to allocate + +What: /sys/kernel/mm/cma//cma_alloc_pages_fails +Date: Feb 2021 +Contact: Minchan Kim +Description: + the number of pages CMA API failed to allocate diff --git a/mm/Kconfig b/mm/Kconfig index ec35bf406439..ad7e9c065657 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -513,6 +513,13 @@ config CMA_DEBUGFS help Turns on the DebugFS interface for CMA. +config CMA_SYSFS + bool "CMA information through sysfs interface" + depends on CMA && SYSFS + help + This option exposes some sysfs attributes to get information + from CMA. + config CMA_AREAS int "Maximum count of the CMA areas" depends on CMA diff --git a/mm/Makefile b/mm/Makefile index b2a564eec27f..e10c14300191 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -109,6 +109,7 @@ obj-$(CONFIG_CMA) += cma.o obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o +obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o diff --git a/mm/cma.c b/mm/cma.c index 23d4a97c834a..b42ccafc71e5 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -447,9 +447,10 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, offset = cma_bitmap_aligned_offset(cma, align); bitmap_maxno = cma_bitmap_maxno(cma); bitmap_count = cma_bitmap_pages_to_bits(cma, count); + cma_sysfs_alloc_count(cma, count); if (bitmap_count > bitmap_maxno) - return NULL; + goto out; for (;;) { mutex_lock(&cma->lock); @@ -504,6 +505,9 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, __func__, count, ret); 
cma_debug_show_areas(cma); } +out: + if (!page) + cma_sysfs_fail_count(cma, count); pr_debug("%s(): returned %p\n", __func__, page); return page; diff --git a/mm/cma.h b/mm/cma.h index 42ae082cb067..24a1d61eabc7 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -3,6 +3,14 @@ #define __MM_CMA_H__ #include +#include + +struct cma_stat { + spinlock_t lock; + unsigned long pages_
Re: [PATCH v2] mm: cma: support sysfs
On 2/9/21 11:26 PM, Greg KH wrote: ... I just am not especially happy about the inability to do natural, efficient things here, such as use a statically allocated set of things with sysfs. And I remain convinced that the above is not "improper"; it's a reasonable step, given the limitations of the current sysfs design. I just wanted to say that out loud, as my proposal sinks to the bottom of the trench here. haha :) What is "odd" is that you are creating an object in the kernel that you _never_ free. That's not normal at all in the kernel, and so, your wish to have a kobject that you never free represent this object also is not normal :) OK, thanks for taking the time to explain that, much appreciated! thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm: cma: support sysfs
On 2/9/21 11:12 PM, Minchan Kim wrote: ... Agreed. How about this for the warning part? + +/* + * note: kobj_type should provide a release function to free a dynamically + * allocated object, since the kobject is responsible for controlling the + * lifespan of the object. However, cma_area is a static object, so technically + * it doesn't need a release function. It's a very exceptional case, so please + * do not follow this model. + */ static struct kobj_type cma_ktype = { .sysfs_ops = &kobj_sysfs_ops, .default_groups = cma_groups + .release = NULL, /* do not follow. See above */ }; No, please no. Just do it the correct way, what is the objection to creating a few dynamic kobjects from the heap? How many of these are you going to have that it will somehow be "wasteful"? Please do it properly. Oh, I misunderstood your words "don't provide a release function for the kobject", so I thought you agreed with John. If you didn't, we are stuck again: IIUC, the objection from John was that the cma_stat lifetime should be tied to the parent object, which is reasonable and makes the code simple. Frankly speaking, I don't have a strong opinion about either approach. John? We should do it as Greg requests, now that it's quite clear that he's insisting on this. Not a big deal. I just am not especially happy about the inability to do natural, efficient things here, such as use a statically allocated set of things with sysfs. And I remain convinced that the above is not "improper"; it's a reasonable step, given the limitations of the current sysfs design. I just wanted to say that out loud, as my proposal sinks to the bottom of the trench here. haha :) thanks, -- John Hubbard NVIDIA
Re: [PATCH 0/9] Add support for SVM atomics in Nouveau
On 2/9/21 5:37 AM, Daniel Vetter wrote: On Tue, Feb 9, 2021 at 1:57 PM Alistair Popple wrote: On Tuesday, 9 February 2021 9:27:05 PM AEDT Daniel Vetter wrote: Recent changes to pin_user_pages() prevent the creation of pinned pages in ZONE_MOVABLE. This series allows pinned pages to be created in ZONE_MOVABLE, as attempts to migrate may fail, which would be fatal to userspace. In this case migration of the pinned page is unnecessary, as the page can be unpinned at any time by having the driver revoke atomic permission, as it does for the migrate_to_ram() callback. However, a method of calling this when memory needs to be moved has yet to be resolved, so any discussion is welcome. Why do we need to pin for gpu atomics? You still have the callback for cpu faults, so you can move the page as needed, and hence a long-term pin sounds like the wrong approach. Technically a real long term unmoveable pin isn't required, because as you say the page can be moved as needed at any time. However, I needed some way of stopping the CPU page from being freed once the userspace mappings for it had been removed. Obviously I could have just used get_page(), but from the perspective of page migration the result is much the same as a pin - a page which can't be moved because of the extra refcount. Long term pin vs short term page reference semantics aren't fully fleshed out. But the rule more or less is: - short term page reference: _must_ get released in finite time for migration and other things, either because you have a callback, or because it's just for direct I/O, which will complete. This means short term pins will delay migration, but not foul it completely. GPU atomic operations to sysmem are hard to categorize, because application programmers could easily write programs that do a long series of atomic operations. Such a program would be a little weird, but it's hard to rule out. - long term pin: the page cannot be moved, all migration must fail. 
Also this will have an impact on COW behaviour for fork (but not sure where those patches are, John Hubbard will know). That would be Jason's commit 57efa1fe59576 ("mm/gup: prevent gup_fast from racing with COW during fork"), which is in linux-next 20201216. So I think for your use case here you want a) a short term page reference to make sure it doesn't disappear, plus b) a callback to make sure migrate isn't blocked. Breaking ZONE_MOVEABLE, by either allowing long term pins or by failing migrations because you don't release your short term page reference, isn't good. The normal solution of registering an MMU notifier to unpin the page when it needs to be moved also doesn't work, as the CPU page tables now point to the device-private page and hence the migration code won't call any invalidate notifiers for the CPU page. Yeah, you need some other callback for migration on the page directly. It's a bit awkward, since there is one already for struct address_space, but that's owned by the address_space/page cache, not HMM. So I think we need something else, maybe something for each ZONE_DEVICE? This direction sounds at least...possible. Using MMU notifiers instead of pins is definitely appealing. I'm not quite clear on the callback idea above, but overall it seems like taking advantage of the ZONE_DEVICE tracking of pages (without having to put anything additional in each struct page) could work. Additional notes or ideas here are definitely welcome. thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm: cma: support sysfs
On 2/9/21 9:49 AM, Greg KH wrote: That's fine if you want to add it to the parent. If so, then the kobject controls the lifetime of the structure, nothing else can. The problem was that the parent object (i.e., struct cma cma_areas) is a static array, so the kobj->release function will be NULL or just a dummy. Is it okay? I thought it was one of the things you wanted to avoid. No, that is not ok. Either is fine with me, what is "forbidden" is having a kobject and somehow thinking that it does not control the lifetime of the structure. Since the parent object is a static array, there is no need to control the lifetime, so I am curious if the parent object approach is okay from a kobject handling point of view. So the array is _NEVER_ freed? If not, fine, don't provide a release function for the kobject, but ick, just make a dynamic kobject. I don't see the problem for something so tiny and not very many... Yeah, I wasn't trying to generate so much discussion; I initially thought it would be a minor comment: "just use an embedded struct and avoid some extra code". I worry that any static kobject might be copied/pasted, as someone might think this is an ok thing to do. And it's not an ok thing to do. Overall, then, we're seeing that there is a small design hole: in order to use sysfs most naturally, you either must provide a dynamically allocated item for it, or you must use a static kobject, and the second one sets a bad example. I think we should just use a static kobject, with a cautionary comment to would-be copy-pasters that explains the design constraints above. That way, we still get a nice, less-code implementation, a safe design, and it only really costs us a single carefully written comment. thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm: cma: support sysfs
On 2/8/21 10:27 PM, John Hubbard wrote: On 2/8/21 10:13 PM, Greg KH wrote: On Mon, Feb 08, 2021 at 05:57:17PM -0800, John Hubbard wrote: On 2/8/21 3:36 PM, Minchan Kim wrote: ... char name[CMA_MAX_NAME]; +#ifdef CONFIG_CMA_SYSFS + struct cma_stat *stat; This should not be a pointer. By making it a pointer, you've added a bunch of pointless extra code to the implementation. Originally, I went with the object lifetime with struct cma as you suggested to make code simple. However, Greg KH wanted to have release for kobj_type since it is consistent with other kboject handling. Are you talking about the kobj in your new struct cma_stat? That seems like circular logic if so. I'm guessing Greg just wanted kobj methods to be used *if* you are dealing with kobjects. That's a narrower point. I can't imagine that he would have insisted on having additional allocations just so that kobj freeing methods could be used. :) Um, yes, I was :) You can not add a kobject to a structure and then somehow think you can just ignore the reference counting issues involved. If a kobject is part of a structure then the kobject is responsible for controling the lifespan of the memory, nothing else can be. So by making the kobject dynamic, you properly handle that memory lifespan of the object, instead of having to worry about the lifespan of the larger object (which the original patch was not doing.) Does that make sense? That part makes sense, yes, thanks. The part that I'm trying to straighten out is, why was kobject even added to the struct cma_stat in the first place? Why not just leave .stat as a static member variable, without a kobject in it, and done? Sorry, I think I get it now: this is in order to allow a separate lifetime for the .stat member. I was sort of implicitly assuming that the "right" way to do it is just have the whole object use one lifetime management, but as you say, there is no kobject being added to the parent. 
I still feel odd about the allocation and freeing of something that seems to be logically the same lifetime (other than perhaps a few, briefly pending sysfs reads, at the end of life). So I'd still think that the kobject should be added to the parent... thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm: cma: support sysfs
On 2/8/21 10:13 PM, Greg KH wrote: On Mon, Feb 08, 2021 at 05:57:17PM -0800, John Hubbard wrote: On 2/8/21 3:36 PM, Minchan Kim wrote: ... char name[CMA_MAX_NAME]; +#ifdef CONFIG_CMA_SYSFS + struct cma_stat *stat; This should not be a pointer. By making it a pointer, you've added a bunch of pointless extra code to the implementation. Originally, I went with the object lifetime with struct cma as you suggested to make code simple. However, Greg KH wanted to have release for kobj_type since it is consistent with other kboject handling. Are you talking about the kobj in your new struct cma_stat? That seems like circular logic if so. I'm guessing Greg just wanted kobj methods to be used *if* you are dealing with kobjects. That's a narrower point. I can't imagine that he would have insisted on having additional allocations just so that kobj freeing methods could be used. :) Um, yes, I was :) You can not add a kobject to a structure and then somehow think you can just ignore the reference counting issues involved. If a kobject is part of a structure then the kobject is responsible for controling the lifespan of the memory, nothing else can be. So by making the kobject dynamic, you properly handle that memory lifespan of the object, instead of having to worry about the lifespan of the larger object (which the original patch was not doing.) Does that make sense? That part makes sense, yes, thanks. The part that I'm trying to straighten out is, why was kobject even added to the struct cma_stat in the first place? Why not just leave .stat as a static member variable, without a kobject in it, and done? thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm: cma: support sysfs
On 2/8/21 9:18 PM, John Hubbard wrote: On 2/8/21 8:19 PM, Minchan Kim wrote: On Mon, Feb 08, 2021 at 05:57:17PM -0800, John Hubbard wrote: On 2/8/21 3:36 PM, Minchan Kim wrote: ... char name[CMA_MAX_NAME]; +#ifdef CONFIG_CMA_SYSFS + struct cma_stat *stat; This should not be a pointer. By making it a pointer, you've added a bunch of pointless extra code to the implementation. Originally, I went with the object lifetime with struct cma as you suggested to make code simple. However, Greg KH wanted to have release for kobj_type since it is consistent with other kboject handling. Are you talking about the kobj in your new struct cma_stat? That seems like circular logic if so. I'm guessing Greg just wanted kobj methods to be used *if* you are dealing with kobjects. That's a narrower point. I can't imagine that he would have insisted on having additional allocations just so that kobj freeing methods could be used. :) I have no objection if Greg agree static kobject is okay in this case. Greg? What I meant is, no kobject at all in the struct cma_stat member variable. The lifetime of the cma_stat member is the same as the containing struct, so no point in putting a kobject into it. ...unless...are you actually *wanting* to keep the lifetimes separate? Hmmm, given the short nature of sysfs reads, though, I'd be inclined to just let the parent object own the lifetime. But maybe I'm missing some design point here? thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm: cma: support sysfs
On 2/8/21 8:19 PM, Minchan Kim wrote: On Mon, Feb 08, 2021 at 05:57:17PM -0800, John Hubbard wrote: On 2/8/21 3:36 PM, Minchan Kim wrote: ... char name[CMA_MAX_NAME]; +#ifdef CONFIG_CMA_SYSFS + struct cma_stat *stat; This should not be a pointer. By making it a pointer, you've added a bunch of pointless extra code to the implementation. Originally, I went with the object lifetime with struct cma as you suggested to make code simple. However, Greg KH wanted to have release for kobj_type since it is consistent with other kboject handling. Are you talking about the kobj in your new struct cma_stat? That seems like circular logic if so. I'm guessing Greg just wanted kobj methods to be used *if* you are dealing with kobjects. That's a narrower point. I can't imagine that he would have insisted on having additional allocations just so that kobj freeing methods could be used. :) I have no objection if Greg agree static kobject is okay in this case. Greg? What I meant is, no kobject at all in the struct cma_stat member variable. The lifetime of the cma_stat member is the same as the containing struct, so no point in putting a kobject into it. thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm: cma: support sysfs
On 2/8/21 3:36 PM, Minchan Kim wrote: ... char name[CMA_MAX_NAME]; +#ifdef CONFIG_CMA_SYSFS + struct cma_stat *stat; This should not be a pointer. By making it a pointer, you've added a bunch of pointless extra code to the implementation. Originally, I went with the object lifetime with struct cma as you suggested to make code simple. However, Greg KH wanted to have release for kobj_type since it is consistent with other kboject handling. Are you talking about the kobj in your new struct cma_stat? That seems like circular logic if so. I'm guessing Greg just wanted kobj methods to be used *if* you are dealing with kobjects. That's a narrower point. I can't imagine that he would have insisted on having additional allocations just so that kobj freeing methods could be used. :) thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm: cma: support sysfs
+ spin_lock_init(&cma->stat.lock); + if (kobject_init_and_add(&cma->stat.kobj, &cma_ktype, cma_kobj, "%s", cma->name)) { - kobject_put(&cma->stat->kobj); + kobject_put(&cma->stat.kobj); goto out; } - } while (++i < cma_area_count) + } while (++i < cma_area_count); return 0; out: while (--i >= 0) { cma = &cma_areas[i]; - kobject_put(&cma->stat->kobj); + kobject_put(&cma->stat.kobj); } - kfree(cma_stats); kobject_put(cma_kobj); return -ENOMEM; +#endif }; extern struct cma cma_areas[MAX_CMA_AREAS]; @@ -26,4 +37,11 @@ static inline unsigned long cma_bitmap_maxno(struct cma *cma) return cma->count >> cma->order_per_bit; } +#ifdef CONFIG_CMA_SYSFS +void cma_sysfs_alloc_count(struct cma *cma, size_t count); +void cma_sysfs_fail(struct cma *cma, size_t count); +#else +static inline void cma_sysfs_alloc_count(struct cma *cma, size_t count) {}; +static inline void cma_sysfs_fail(struct cma *cma, size_t count) {}; +#endif #endif diff --git a/mm/cma_sysfs.c b/mm/cma_sysfs.c new file mode 100644 index ..1f6b9f785825 --- /dev/null +++ b/mm/cma_sysfs.c @@ -0,0 +1,114 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * CMA SysFS Interface + * + * Copyright (c) 2021 Minchan Kim + */ + +#include +#include +#include + +#include "cma.h" + +static struct cma_stat *cma_stats; I don't know what that's for but it definitely is not needed if you make cma.stat not a pointer, and not in any other case either. 
+ +void cma_sysfs_alloc_count(struct cma *cma, size_t count) +{ + spin_lock(&cma->stat->lock); + cma->stat->pages_attempt += count; + spin_unlock(&cma->stat->lock); +} + +void cma_sysfs_fail(struct cma *cma, size_t count) +{ + spin_lock(&cma->stat->lock); + cma->stat->pages_fail += count; + spin_unlock(&cma->stat->lock); +} + +#define CMA_ATTR_RO(_name) \ + static struct kobj_attribute _name##_attr = __ATTR_RO(_name) + +static struct kobject *cma_kobj; + +static ssize_t cma_alloc_pages_attempt_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct cma_stat *stat = container_of(kobj, struct cma_stat, kobj); + + return sysfs_emit(buf, "%lu\n", stat->pages_attempt); +} +CMA_ATTR_RO(cma_alloc_pages_attempt); + +static ssize_t cma_alloc_pages_fail_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct cma_stat *stat = container_of(kobj, struct cma_stat, kobj); + + return sysfs_emit(buf, "%lu\n", stat->pages_fail); +} +CMA_ATTR_RO(cma_alloc_pages_fail); + +static void cma_kobj_release(struct kobject *kobj) +{ + struct cma_stat *stat = container_of(kobj, struct cma_stat, kobj); + + kfree(stat); +} + +static struct attribute *cma_attrs[] = { + &cma_alloc_pages_attempt_attr.attr, + &cma_alloc_pages_fail_attr.attr, + NULL, +}; +ATTRIBUTE_GROUPS(cma); + +static struct kobj_type cma_ktype = { + .release = cma_kobj_release, + .sysfs_ops = &kobj_sysfs_ops, + .default_groups = cma_groups +}; + +static int __init cma_sysfs_init(void) +{ + int i = 0; + struct cma *cma; + + cma_kobj = kobject_create_and_add("cma", mm_kobj); + if (!cma_kobj) { + pr_err("failed to create cma kobject\n"); + return -ENOMEM; + } + + cma_stats = kzalloc(array_size(sizeof(struct cma_stat), + cma_area_count), GFP_KERNEL); + if (!cma_stats) { + pr_err("failed to create cma_stats\n"); + goto out; + } + + do { + cma = &cma_areas[i]; + cma->stat = &cma_stats[i]; + spin_lock_init(&cma->stat->lock); + if (kobject_init_and_add(&cma->stat->kobj, &cma_ktype, + 
cma_kobj, "%s", cma->name)) { + kobject_put(&cma->stat->kobj); + goto out; + } + } while (++i < cma_area_count) This clearly is not going to compile! Don't forget to build and test the patches. + + return 0; +out: + while (--i >= 0) { + cma = &cma_areas[i]; + kobject_put(&cma->stat->kobj); + } + + kfree(cma_stats); + kobject_put(cma_kobj); + + return -ENOMEM; +} +subsys_initcall(cma_sysfs_init); thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/6/21 9:08 AM, Pintu Agarwal wrote: ... # cat meminfo | grep -i cma CmaTotal:1048576 kB CmaFree: 1046872 kB This CMA info was added by me way back in 2014. At that time I even thought about adding this cma alloc/fail counter in vmstat. At that time I also had an internal patch about it, but later dropped it (never released to mainline). If required, I can re-work this part again. And I have a few more points to add here. 1) In the past I have faced this cma allocation failure (which could be common in small embedded devices). Earlier, it was even difficult to figure out whether a cma failure actually happened. Thus I added this simple patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.11-rc6&id=5984af1082f3b115082178ed88c47033d43b924d 2) IMO just reporting CMA alloc/fail may not be enough (at times). The main point is: for which client/dev is this failure happening? Sometimes just tuning the size or fixing the alignment can resolve the failure, if we know the client. For global CMA it will be just NULL (dev). 3) IMO having 2 options, SYSFS and DEBUGFS, may confuse the developer/user (personal experience). So are we going to remove the DEBUGFS or merge it? It is confusing to have a whole bunch of different places to find data about a system. Here, I think the key is to add to the Documentation/ directory. So far, the documentation I see is: admin-guide/mm/cma_debugfs.rst: only covers debugfs admin-guide/kernel-parameters.txt:602: for CMA kernel parameters If we add sysfs items then we will want a new .rst document that covers that, and lists all the places to look. So anyway, the underlying guidelines for which fs to use are approximately: * sysfs: used for monitoring. One value per item, stable ABI, production. * debugfs: *theoretically* not a stable ABI (we hope no one locks us in by doing obnoxious things that break userspace if the debugfs APIs change). The intention is that developers can put *anything* in there. 
debugfs has a firm place in debugging, and is probably a keeper here. I originally thought we should combine CMA's sysfs and debugfs items, but Minchan listed an example that seems to show a pretty clear need for monitoring of CMA areas, in production systems, and that's what sysfs is for. So it looks like we want both debugfs and sysfs for CMA, plus a new overall CMA documentation that points to everything. 4) Sometimes CMA (or DMA) allocation failure can happen very early, in boot time itself. At that time these counters are anyway not useful. Some systems may not proceed if CMA/DMA allocation is failing (again, personal experience). ==> Anyway, this is altogether a different case... 5) The default max CMA areas are defined to be 7, but the user can configure it to any number, maybe 20 or 50 (???) 6) Thus I would like to propose another thing here. Just like we have /proc/vmallocinfo, /proc/slabinfo, etc., we can also have: /proc/cmainfo Here in /proc/cmainfo we can capture more detailed information: Global CMA Data: Alloc/Free Client specific data: name, size, success, fail, flags, align, etc (just a random example). ==> This can dynamically grow as large as possible ==> It can also be enabled/disabled based on the CMA config itself (no additional config required) Any feedback on point (6) specifically? Well, /proc these days is for process-specific items. And CMA areas are system-wide. So that's an argument against it. However...if there is any process-specific CMA allocation info to report, then /proc is just the right place for it. thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/5/21 1:28 PM, Minchan Kim wrote: On Fri, Feb 05, 2021 at 12:25:52PM -0800, John Hubbard wrote: On 2/5/21 8:15 AM, Minchan Kim wrote: ... OK. But...what *is* your goal, and why is this useless (that's what orthogonal really means here) for your goal? As I mentioned, the goal is to monitor the failure from each of CMA since they have each own purpose. Let's have an example. System has 5 CMA areas and each CMA is associated with each user scenario. They have exclusive CMA area to avoid fragmentation problem. CMA-1 depends on bluetooth CMA-2 depends on WIFI CMA-3 depends on sensor-A CMA-4 depends on sensor-B CMA-5 depends on sensor-C aha, finally. I had no idea that sort of use case was happening. This would be good to put in the patch commit description. With this, we could catch which module was affected but with global failure, I couldn't find who was affected. Also, would you be willing to try out something simple first, such as providing indication that cma is active and its overall success rate, like this: /proc/vmstat: cma_alloc_success 125 cma_alloc_failure 25 ...or is the only way to provide the more detailed items, complete with per-CMA details, in a non-debugfs location? ...and then, to see if more is needed, some questions: a) Do you know of an upper bound on how many cma areas there can be (I think Matthew also asked that)? There is no upper bound since it's configurable. OK, thanks, so that pretty much rules out putting per-cma details into anything other than a directory or something like it. b) Is tracking the cma area really as valuable as other possibilities? We can put "a few" to "several" items here, so really want to get your very favorite bits of information in. If, for example, there can be *lots* of cma areas, then maybe tracking At this moment, allocation/failure for each CMA area since they have particular own usecase, which makes me easy to keep which module will be affected. 
I think it is very useful per-CMA statistics as minimum code change so I want to enable it by default under CONFIG_CMA && CONFIG_SYSFS. by a range of allocation sizes is better... I takes your suggestion something like this. [alloc_range] could be order or range by interval /sys/kernel/mm/cma/cma-A/[alloc_range]/success /sys/kernel/mm/cma/cma-A/[alloc_range]/fail .. .. /sys/kernel/mm/cma/cma-Z/[alloc_range]/success /sys/kernel/mm/cma/cma-Z/[alloc_range]/fail Actually, I meant, "ranges instead of cma areas", like this: / Understand your point but it would make hard to find who was affected by the failure. That's why I suggested to have your suggestion under additional config since per-cma metric with simple success/failure are enough. I agree it would be also useful but I'd like to enable it under CONFIG_CMA_SYSFS_ALLOC_RANGE as separate patchset. I will stop harassing you very soon, just want to bottom out on understanding the real goals first. :) I hope my example makes the goal more clear for you. Yes it does. Based on the (rather surprising) use of cma-area-per-device, it seems clear that you will need this, so I'll drop my objections to putting it in sysfs. I still think the "number of allocation failures" needs refining, probably via a range-based thing, as we've discussed. But the number of pages failed per cma looks OK now. thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/5/21 1:52 PM, Suren Baghdasaryan wrote: I takes your suggestion something like this. [alloc_range] could be order or range by interval /sys/kernel/mm/cma/cma-A/[alloc_range]/success /sys/kernel/mm/cma/cma-A/[alloc_range]/fail .. .. /sys/kernel/mm/cma/cma-Z/[alloc_range]/success /sys/kernel/mm/cma/cma-Z/[alloc_range]/fail The interface above seems to me the most useful actually, if by [alloc_range] you mean the different allocation orders. This would cover Minchan's per-CMA failure tracking and would also allow us to understand what kind of allocations are failing and therefore if the problem is caused by pinning/fragmentation or by over-utilization. I agree. That seems about right, now that we've established that cma areas are a must-have. thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/5/21 8:15 AM, Minchan Kim wrote: ... Yes, approximately. I was wondering if this would suffice at least as a baseline: cma_alloc_success 125 cma_alloc_failure 25 IMO, regardless of the my patch, it would be good to have such statistics in that CMA was born to replace carved out memory with dynamic allocation ideally for memory efficiency ideally so failure should regard critical so admin could notice it how the system is hurt. Right. So CMA failures are useful for the admin to see, understood. Anyway, it's not enough for me and orthogonal with my goal. OK. But...what *is* your goal, and why is this useless (that's what orthogonal really means here) for your goal? Also, would you be willing to try out something simple first, such as providing indication that cma is active and its overall success rate, like this: /proc/vmstat: cma_alloc_success 125 cma_alloc_failure 25 ...or is the only way to provide the more detailed items, complete with per-CMA details, in a non-debugfs location? ...and then, to see if more is needed, some questions: a) Do you know of an upper bound on how many cma areas there can be (I think Matthew also asked that)? There is no upper bound since it's configurable. OK, thanks, so that pretty much rules out putting per-cma details into anything other than a directory or something like it. b) Is tracking the cma area really as valuable as other possibilities? We can put "a few" to "several" items here, so really want to get your very favorite bits of information in. If, for example, there can be *lots* of cma areas, then maybe tracking At this moment, allocation/failure for each CMA area since they have particular own usecase, which makes me easy to keep which module will be affected. I think it is very useful per-CMA statistics as minimum code change so I want to enable it by default under CONFIG_CMA && CONFIG_SYSFS. by a range of allocation sizes is better... I takes your suggestion something like this. 
[alloc_range] could be order or range by interval /sys/kernel/mm/cma/cma-A/[alloc_range]/success /sys/kernel/mm/cma/cma-A/[alloc_range]/fail .. .. /sys/kernel/mm/cma/cma-Z/[alloc_range]/success /sys/kernel/mm/cma/cma-Z/[alloc_range]/fail Actually, I meant, "ranges instead of cma areas", like this: / I agree it would be also useful but I'd like to enable it under CONFIG_CMA_SYSFS_ALLOC_RANGE as separate patchset. I will stop harassing you very soon, just want to bottom out on understanding the real goals first. :) thanks, -- John Hubbard NVIDIA
Re: [PATCH] selftests/vm: rename file run_vmtests to run_vmtests.sh
On 2/5/21 12:55 AM, Rong Chen wrote: Commit c2aa8afc36fa has renamed run_vmtests in Makefile, but the file still uses the old name. The kernel test robot reported the following issue: # selftests: vm: run_vmtests.sh # Warning: file run_vmtests.sh is missing! not ok 1 selftests: vm: run_vmtests.sh I don't know exactly what is going on here, but there was originally a mistake on my part in renaming run_vmtests to run_vmtests.sh (I was trying to set the executable bit, which is not always supported by the patch flow), and that caused some churn in Andrew's tree. And so maybe that rename got lost/dropped along the way. Reported-by: kernel test robot Fixes: c2aa8afc36fa (selftests/vm: rename run_vmtests --> run_vmtests.sh) Signed-off-by: Rong Chen --- tools/testing/selftests/vm/{run_vmtests => run_vmtests.sh} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename tools/testing/selftests/vm/{run_vmtests => run_vmtests.sh} (100%) diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests.sh similarity index 100% rename from tools/testing/selftests/vm/run_vmtests rename to tools/testing/selftests/vm/run_vmtests.sh So I guess this is OK, given that I see "run_vmtests" in both -next and main. Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA
Re: [PATCH v2 1/4] mm/gup: add compound page list iterator
On 2/5/21 2:46 AM, Joao Martins wrote: ...>> If instead you keep npages constant like it naturally wants to be, you could just do a "(*ntails)++" in the loop, to take care of *ntails. I didn't do it as such as I would need to deref @ntails per iteration, so it felt more efficient to do as above. On a second thought, I could alternatively do the following below, thoughts? diff --git a/mm/gup.c b/mm/gup.c index d68bcb482b11..8defe4f670d5 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -215,6 +215,32 @@ void unpin_user_page(struct page *page) } EXPORT_SYMBOL(unpin_user_page); +static inline void compound_next(unsigned long i, unsigned long npages, +struct page **list, struct page **head, +unsigned int *ntails) +{ + struct page *page; + unsigned int nr; + + if (i >= npages) + return; + + page = compound_head(list[i]); + for (nr = i + 1; nr < npages; nr++) { + if (compound_head(list[nr]) != page) + break; + } + + *head = page; + *ntails = nr - i; +} + Yes, this is cleaner and quite a bit easier to verify that it is correct. However, given that the patch is correct and works as-is, the above is really just an optional idea, so please feel free to add: Reviewed-by: John Hubbard Thanks! Hopefully I can retain that if the snippet above is preferred? Joao Yes. Still looks good. thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/4/21 10:24 PM, Minchan Kim wrote: On Thu, Feb 04, 2021 at 09:49:54PM -0800, John Hubbard wrote: On 2/4/21 9:17 PM, Minchan Kim wrote: ... # cat vmstat | grep -i cma nr_free_cma 261718 # cat meminfo | grep -i cma CmaTotal: 1048576 kB CmaFree: 1046872 kB OK, given that CMA is already in those two locations, maybe we should put this information in one or both of those, yes? Do you suggest something like this, for example? cat vmstat | grep -i cma cma_a_success 125 cma_a_fail 25 cma_b_success 130 cma_b_fail 156 .. cma_f_fail xxx Yes, approximately. I was wondering if this would suffice at least as a baseline: cma_alloc_success 125 cma_alloc_failure 25 ...and then, to see if more is needed, some questions: a) Do you know of an upper bound on how many cma areas there can be (I think Matthew also asked that)? b) Is tracking the cma area really as valuable as other possibilities? We can put "a few" to "several" items here, so really want to get your very favorite bits of information in. If, for example, there can be *lots* of cma areas, then maybe tracking by a range of allocation sizes is better... thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/4/21 9:17 PM, Minchan Kim wrote: ... Presumably, having the source code, you can easily deduce that a bluetooth allocation failure goes directly to a CMA allocation failure, right? Still wondering about this... It would work if we have full source code and stack are not complicated for every usecases. Having said, having a good central place automatically popped up is also beneficial for not to add similar statistics for each call sites. Why do we have too many item in slab sysfs instead of creating each call site inventing on each own? I'm not sure I understand that question fully, but I don't think we need to invent anything unique here. So far we've discussed debugfs, sysfs, and /proc, none of which are new mechanisms. ... It's actually easier to monitor one or two simpler items than it is to monitor a larger number of complicated items. And I get the impression that this is sort of a top-level, production software indicator. Let me clarify one more time. What I'd like to get ultimately is per-CMA statistics instead of global vmstat for the usecase at this moment. Global vmstat could help the decision whether I should go deeper but it ends up needing per-CMA statistics. And I'd like to keep them in sysfs, not debugfs since it should be stable as a telemetric. What points do you disagree in this view? No huge disagreements, I just want to get us down to the true essential elements of what is required--and find a good home for the data. Initial debugging always has excesses, and those should not end up in the more carefully vetted production code. If I were doing this, I'd probably consider HugeTLB pages as an example to follow, because they have a lot in common with CMA: it's another memory allocation pool, and people also want to monitor it. 
HugeTLB pages and THP pages are monitored in /proc: /proc/meminfo and /proc/vmstat: # cat meminfo | grep -i huge AnonHugePages: 88064 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 500 HugePages_Free: 500 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 1024000 kB # cat vmstat | grep -i huge nr_shmem_hugepages 0 nr_file_hugepages 0 nr_anon_transparent_hugepages 43 numa_huge_pte_updates 0 ...aha, so is CMA: # cat vmstat | grep -i cma nr_free_cma 261718 # cat meminfo | grep -i cma CmaTotal: 1048576 kB CmaFree: 1046872 kB OK, given that CMA is already in those two locations, maybe we should put this information in one or both of those, yes? thanks, -- John Hubbard NVIDIA
Re: [PATCH v2 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
marked dirty and released. Didn't spot any actual problems with how this works. thanks, -- John Hubbard NVIDIA
Re: [PATCH v2 1/4] mm/gup: add compound page list iterator
On 2/4/21 12:24 PM, Joao Martins wrote: Add a helper that iterates over head pages in a list of pages. It essentially counts the tails until the next page to process has a different head than the current. This is going to be used by unpin_user_pages() family of functions, to batch the head page refcount updates once for all passed consecutive tail pages. Suggested-by: Jason Gunthorpe Signed-off-by: Joao Martins --- mm/gup.c | 29 + 1 file changed, 29 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index d68bcb482b11..d1549c61c2f6 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -215,6 +215,35 @@ void unpin_user_page(struct page *page) } EXPORT_SYMBOL(unpin_user_page); +static inline void compound_next(unsigned long i, unsigned long npages, +struct page **list, struct page **head, +unsigned int *ntails) +{ + struct page *page; + unsigned int nr; + + if (i >= npages) + return; + + list += i; + npages -= i; It is worth noting that this is slightly more complex to read than it needs to be. You are changing both endpoints of a loop at once. That's hard to read for a human. And you're only doing it in order to gain the small benefit of being able to use nr directly at the end of the routine. If instead you keep npages constant like it naturally wants to be, you could just do a "(*ntails)++" in the loop, to take care of *ntails. 
However, given that the patch is correct and works as-is, the above is really just an optional idea, so please feel free to add: Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA + page = compound_head(*list); + + for (nr = 1; nr < npages; nr++) { + if (compound_head(list[nr]) != page) + break; + } + + *head = page; + *ntails = nr; +} + +#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \ + for (__i = 0, \ +compound_next(__i, __npages, __list, &(__head), &(__ntails)); \ +__i < __npages; __i += __ntails, \ +compound_next(__i, __npages, __list, &(__head), &(__ntails))) + /** * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages * @pages: array of pages to be maybe marked dirty, and definitely released.
Re: [PATCH] xfs: fix unused variable build warning in xfs_log.c
On 2/4/21 7:30 PM, Darrick J. Wong wrote: On Thu, Feb 04, 2021 at 07:18:14PM -0800, John Hubbard wrote: Delete the unused "log" variable in xfs_log_cover(). Fixes: 303591a0a9473 ("xfs: cover the log during log quiesce") Cc: Brian Foster Cc: Christoph Hellwig Cc: Darrick J. Wong Cc: Allison Henderson Signed-off-by: John Hubbard --- Hi, I just ran into this on today's linux-next, so here you go! Thanks for the tipoff, I just realized with horror that I got the git push wrong and never actually updated xfs-linux.git#for-next. This (and all the other gcc warnings) are fixed in "xfs-for-next" which ... is not for-next. Sigh. so much for trying to get things in for testing. :( Well, if it's any consolation, this is the *only* warning that fired during my particular build, in linux-next. :) thanks, -- John Hubbard NVIDIA
[PATCH] xfs: fix unused variable build warning in xfs_log.c
Delete the unused "log" variable in xfs_log_cover(). Fixes: 303591a0a9473 ("xfs: cover the log during log quiesce") Cc: Brian Foster Cc: Christoph Hellwig Cc: Darrick J. Wong Cc: Allison Henderson Signed-off-by: John Hubbard --- Hi, I just ran into this on today's linux-next, so here you go! thanks, John Hubbard NVIDIA fs/xfs/xfs_log.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 58699881c100..5a9cca3f7cbf 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1108,7 +1108,6 @@ static int xfs_log_cover( struct xfs_mount *mp) { - struct xlog *log = mp->m_log; int error = 0; bool need_covered; base-commit: 0e2c50f40b7ffb73a039157f7c38495c6d99e86f -- 2.30.0
Re: [PATCH] mm: cma: support sysfs
On 2/4/21 5:44 PM, Minchan Kim wrote: On Thu, Feb 04, 2021 at 04:24:20PM -0800, John Hubbard wrote: On 2/4/21 4:12 PM, Minchan Kim wrote: ... Then, how to know how often CMA API failed? Why would you even need to know that, *in addition* to knowing specific page allocation numbers that failed? Again, there is no real-world motivation cited yet, just "this is good data". Need more stories and support here. Let me give an example. Let's assume we use memory buffer allocation via CMA for bluetooth enable of device. If user clicks the bluetooth button in the phone but fail to allocate the memory from CMA, user will still see bluetooth button gray. User would think his touch was not enough powerful so he try clicking again and fortunately CMA allocation was successful this time and they will see bluetooth button enabled and could listen the music. Here, product team needs to monitor how often CMA alloc failed so if the failure ratio is steadily increased than the bar, it means engineers need to go investigation. Make sense? Yes, except that it raises more questions: 1) Isn't this just standard allocation failure? Don't you already have a way to track that? Presumably, having the source code, you can easily deduce that a bluetooth allocation failure goes directly to a CMA allocation failure, right? Still wondering about this... Anyway, even though the above is still a little murky, I expect you're right that it's good to have *some* indication, somewhere about CMA behavior... Thinking about this some more, I wonder if this is really /proc/vmstat sort of data that we're talking about. It seems to fit right in there, yes? Thing is CMA instance are multiple, cma-A, cma-B, cma-C and each of CMA heap has own specific scenario. /proc/vmstat could be bloated a lot while CMA instance will be increased. Yes, that would not fit in /proc/vmstat...assuming that you really require knowing--at this point--which CMA heap is involved. And that's worth poking at. 
If you get an overall indication in vmstat that CMA is having trouble, then maybe that's all you need to start digging further. It's actually easier to monitor one or two simpler items than it is to monitor a larger number of complicated items. And I get the impression that this is sort of a top-level, production software indicator. thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/4/21 4:25 PM, John Hubbard wrote: On 2/4/21 3:45 PM, Suren Baghdasaryan wrote: ... 2) The overall CMA allocation attempts/failures (first two items above) seem an odd pair of things to track. Maybe that is what was easy to track, but I'd vote for just omitting them. Then, how to know how often CMA API failed? Why would you even need to know that, *in addition* to knowing specific page allocation numbers that failed? Again, there is no real-world motivation cited yet, just "this is good data". Need more stories and support here. IMHO it would be very useful to see whether there are multiple small-order allocation failures or a few large-order ones, especially for CMA where large allocations are not unusual. For that I believe both alloc_pages_attempt and alloc_pages_fail would be required. Sorry, I meant to say "both cma_alloc_fail and alloc_pages_fail would be required". So if you want to know that, the existing items are still a little too indirect to really get it right. You can only know the average allocation size, by dividing. Instead, we should provide the allocation size, for each count. The limited interface makes this a little awkward, but using zones/ranges could work: "for this range of allocation sizes, there were the following stats". Or, some other technique that I haven't thought of (maybe two items per file?) would be better. On the other hand, there's an argument for keeping this minimal and simple. That would probably lead us to putting in a couple of items into /proc/vmstat, as I just mentioned in my other response, and calling it good. ...and remember: if we keep it nice and minimal and clean, we can put it into /proc/vmstat and monitor it. And then if a problem shows up, the more complex and advanced debugging data can go into debugfs's CMA area. And you're all set. 
If Android made up some policy not to use debugfs, then: a) that probably won't prevent engineers from using it anyway, for advanced debugging, and b) If (a) somehow falls short, then we need to talk about what Android's plans are to fill the need. And "fill up sysfs with debugfs items, possibly duplicating some of them, and generally making an unnecessary mess, to compensate for not using debugfs" is not my first choice. :) thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/4/21 3:45 PM, Suren Baghdasaryan wrote: ... 2) The overall CMA allocation attempts/failures (first two items above) seem an odd pair of things to track. Maybe that is what was easy to track, but I'd vote for just omitting them. Then, how to know how often CMA API failed? Why would you even need to know that, *in addition* to knowing specific page allocation numbers that failed? Again, there is no real-world motivation cited yet, just "this is good data". Need more stories and support here. IMHO it would be very useful to see whether there are multiple small-order allocation failures or a few large-order ones, especially for CMA where large allocations are not unusual. For that I believe both alloc_pages_attempt and alloc_pages_fail would be required. Sorry, I meant to say "both cma_alloc_fail and alloc_pages_fail would be required". So if you want to know that, the existing items are still a little too indirect to really get it right. You can only know the average allocation size, by dividing. Instead, we should provide the allocation size, for each count. The limited interface makes this a little awkward, but using zones/ranges could work: "for this range of allocation sizes, there were the following stats". Or, some other technique that I haven't thought of (maybe two items per file?) would be better. On the other hand, there's an argument for keeping this minimal and simple. That would probably lead us to putting in a couple of items into /proc/vmstat, as I just mentioned in my other response, and calling it good. thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/4/21 4:12 PM, Minchan Kim wrote: ... Then, how to know how often CMA API failed? Why would you even need to know that, *in addition* to knowing specific page allocation numbers that failed? Again, there is no real-world motivation cited yet, just "this is good data". Need more stories and support here. Let me give an example. Let's assume we use memory buffer allocation via CMA for bluetooth enable of device. If user clicks the bluetooth button in the phone but fail to allocate the memory from CMA, user will still see bluetooth button gray. User would think his touch was not enough powerful so he try clicking again and fortunately CMA allocation was successful this time and they will see bluetooth button enabled and could listen the music. Here, product team needs to monitor how often CMA alloc failed so if the failure ratio is steadily increased than the bar, it means engineers need to go investigation. Make sense? Yes, except that it raises more questions: 1) Isn't this just standard allocation failure? Don't you already have a way to track that? Presumably, having the source code, you can easily deduce that a bluetooth allocation failure goes directly to a CMA allocation failure, right? Anyway, even though the above is still a little murky, I expect you're right that it's good to have *some* indication, somewhere about CMA behavior... Thinking about this some more, I wonder if this is really /proc/vmstat sort of data that we're talking about. It seems to fit right in there, yes? thanks, -- John Hubbard NVIDIA
Re: [PATCH 1/4] mm/gup: add compound page list iterator
On 2/4/21 11:53 AM, Jason Gunthorpe wrote: On Wed, Feb 03, 2021 at 03:00:01PM -0800, John Hubbard wrote: +static inline void compound_next(unsigned long i, unsigned long npages, +struct page **list, struct page **head, +unsigned int *ntails) +{ + if (i >= npages) + return; + + *ntails = count_ntails(list + i, npages - i); + *head = compound_head(list[i]); +} + +#define for_each_compound_head(i, list, npages, head, ntails) \ When using macros, which are dangerous in general, you have to worry about things like name collisions. I really dislike that C has forced this unsafe pattern upon us, but of course we are stuck with it, for iterator helpers. Given that we're stuck, you should probably use names such as __i, __list, etc, in the the above #define. Otherwise you could stomp on existing variables. Not this macro; after cpp gets through with it, all the macro names vanish, so it can't collide with variables. Yes, I guess it does just vaporize, because it turns all the args into their substituted values. I was just having flashbacks from similar cases I guess. The usual worry is you might collide with other #defines, but we don't seem to worry about that in the kernel Well, I worry about it a little anyway. haha :) thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm: cma: support sysfs
On 2/4/21 12:07 PM, Minchan Kim wrote: On Thu, Feb 04, 2021 at 12:50:58AM -0800, John Hubbard wrote: On 2/3/21 7:50 AM, Minchan Kim wrote: Since CMA is getting used more widely, it's more important to keep monitoring CMA statistics for system health since it's directly related to user experience. This patch introduces sysfs for the CMA and exposes stats below to keep monitor for telemetric in the system. * the number of CMA allocation attempts * the number of CMA allocation failures * the number of CMA page allocation attempts * the number of CMA page allocation failures The desire to report CMA data is understandable, but there are a few odd things here: 1) First of all, this has significant overlap with /sys/kernel/debug/cma items. I suspect that all of these items could instead go into At this moment, I don't see any overlap with item from cma_debugfs. Could you specify what item you are mentioning? Just the fact that there would be two systems under /sys, both of which are doing very very similar things: providing information that is intended to help diagnose CMA. /sys/kernel/debug/cma, right? Anyway, thing is I need an stable interface for that and need to use it in Android production build, too(Unfortunately, Android deprecated the debugfs https://source.android.com/setup/start/android-11-release#debugfs ) That's the closest hint to a "why this is needed" that we've seen yet. But it's only a hint. What should be in debugfs and in sysfs? What are the criteria? Well, it's a gray area. "Debugging support" goes into debugfs, and "production-level monitoring and control" goes into sysfs, roughly speaking. And here you have items that could be classified as either. Some statistic could be considered about debugging aid or telemetric depending on viewpoint and usecase. And here, I want to use it for telemetric, get an stable interface and use it in production build of Android. 
In this chance, I'd like to get concrete guideline what should be in sysfs and debugfs so that pointing out this thread whenever folks dump their stat into sysfs to avoid waste of time for others in future. :) 2) The overall CMA allocation attempts/failures (first two items above) seem an odd pair of things to track. Maybe that is what was easy to track, but I'd vote for just omitting them. Then, how to know how often CMA API failed? Why would you even need to know that, *in addition* to knowing specific page allocation numbers that failed? Again, there is no real-world motivation cited yet, just "this is good data". Need more stories and support here. thanks, -- John Hubbard NVIDIA There are various size allocation request for a CMA so only page allocation stat are not enough to know it. Signed-off-by: Minchan Kim --- Documentation/ABI/testing/sysfs-kernel-mm-cma | 39 + include/linux/cma.h | 1 + mm/Makefile | 1 + mm/cma.c | 6 +- mm/cma.h | 20 +++ mm/cma_sysfs.c| 143 ++ 6 files changed, 209 insertions(+), 1 deletion(-) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-cma create mode 100644 mm/cma_sysfs.c diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma b/Documentation/ABI/testing/sysfs-kernel-mm-cma new file mode 100644 index ..2a43c0aacc39 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma @@ -0,0 +1,39 @@ +What: /sys/kernel/mm/cma/ +Date: Feb 2021 +Contact: Minchan Kim +Description: + /sys/kernel/mm/cma/ contains a number of subdirectories by + cma-heap name. The subdirectory contains a number of files + to represent cma allocation statistics. Somewhere, maybe here, there should be a mention of the closely related /sys/kernel/debug/cma files. + + There are number of files under + /sys/kernel/mm/cma/ directory + + - cma_alloc_attempt + - cma_alloc_fail Are these really useful? They're a summary of the alloc_pages items, really. 
+ - alloc_pages_attempt + - alloc_pages_fail This should also have "cma" in the name, really: cma_alloc_pages_*. No problem. + +What: /sys/kernel/mm/cma//cma_alloc_attempt +Date: Feb 2021 +Contact: Minchan Kim +Description: + the number of cma_alloc API attempted + +What: /sys/kernel/mm/cma//cma_alloc_fail +Date: Feb 2021 +Contact: Minchan Kim +Description: + the number of CMA_alloc API failed + +What: /sys/kernel/mm/cma//alloc_pages_attempt +Date: Feb 2021 +Contact: Minchan Kim
Re: [PATCH] mm: cma: support sysfs
On 2/3/21 7:50 AM, Minchan Kim wrote: Since CMA is getting used more widely, it's more important to keep monitoring CMA statistics for system health since it's directly related to user experience. This patch introduces sysfs for the CMA and exposes stats below to keep monitor for telemetric in the system. * the number of CMA allocation attempts * the number of CMA allocation failures * the number of CMA page allocation attempts * the number of CMA page allocation failures The desire to report CMA data is understandable, but there are a few odd things here: 1) First of all, this has significant overlap with /sys/kernel/debug/cma items. I suspect that all of these items could instead go into /sys/kernel/debug/cma, right? 2) The overall CMA allocation attempts/failures (first two items above) seem an odd pair of things to track. Maybe that is what was easy to track, but I'd vote for just omitting them. Signed-off-by: Minchan Kim --- Documentation/ABI/testing/sysfs-kernel-mm-cma | 39 + include/linux/cma.h | 1 + mm/Makefile | 1 + mm/cma.c | 6 +- mm/cma.h | 20 +++ mm/cma_sysfs.c| 143 ++ 6 files changed, 209 insertions(+), 1 deletion(-) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-cma create mode 100644 mm/cma_sysfs.c diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma b/Documentation/ABI/testing/sysfs-kernel-mm-cma new file mode 100644 index ..2a43c0aacc39 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma @@ -0,0 +1,39 @@ +What: /sys/kernel/mm/cma/ +Date: Feb 2021 +Contact: Minchan Kim +Description: + /sys/kernel/mm/cma/ contains a number of subdirectories by + cma-heap name. The subdirectory contains a number of files + to represent cma allocation statistics. Somewhere, maybe here, there should be a mention of the closely related /sys/kernel/debug/cma files. + + There are number of files under + /sys/kernel/mm/cma/ directory + + - cma_alloc_attempt + - cma_alloc_fail Are these really useful? 
They're a summary of the alloc_pages items, really. + - alloc_pages_attempt + - alloc_pages_fail This should also have "cma" in the name, really: cma_alloc_pages_*. + +What: /sys/kernel/mm/cma//cma_alloc_attempt +Date: Feb 2021 +Contact: Minchan Kim +Description: + the number of cma_alloc API attempted + +What: /sys/kernel/mm/cma//cma_alloc_fail +Date: Feb 2021 +Contact: Minchan Kim +Description: + the number of CMA_alloc API failed + +What: /sys/kernel/mm/cma//alloc_pages_attempt +Date: Feb 2021 +Contact: Minchan Kim +Description: + the number of pages CMA API tried to allocate + +What: /sys/kernel/mm/cma//alloc_pages_fail +Date: Feb 2021 +Contact: Minchan Kim +Description: + the number of pages CMA API failed to allocate diff --git a/include/linux/cma.h b/include/linux/cma.h index 217999c8a762..71a28a5bb54e 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -49,4 +49,5 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count); extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data); + A single additional blank line seems to be the only change to this file. :) thanks, -- John Hubbard NVIDIA
Re: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()
On 2/3/21 2:00 PM, Joao Martins wrote: Use the newly added unpin_user_page_range_dirty_lock() for more quickly unpinning a consecutive range of pages represented as compound pages. This will also calculate number of pages to unpin (for the tail pages which matching head page) and thus batch the refcount update. Running a test program which calls mr reg/unreg on a 1G in size and measures cost of both operations together (in a guest using rxe) with THP and hugetlbfs: In the patch subject line: s/__ib_mem_release/__ib_umem_release/ Before: 590 rounds in 5.003 sec: 8480.335 usec / round 6898 rounds in 60.001 sec: 8698.367 usec / round After: 2631 rounds in 5.001 sec: 1900.618 usec / round 31625 rounds in 60.001 sec: 1897.267 usec / round Signed-off-by: Joao Martins --- drivers/infiniband/core/umem.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 2dde99a9ba07..ea4ebb3261d9 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -47,17 +47,17 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) { - struct sg_page_iter sg_iter; - struct page *page; + bool make_dirty = umem->writable && dirty; + struct scatterlist *sg; + int i; Maybe unsigned int is better, so as to perfectly match the scatterlist.length. if (umem->nmap > 0) ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents, DMA_BIDIRECTIONAL); - for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) { - page = sg_page_iter_page(&sg_iter); - unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty); - } + for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) The change from umem->sg_nents to umem->nmap looks OK, although we should get IB people to verify that there is not some odd bug or reason to leave it as is. 
+ unpin_user_page_range_dirty_lock(sg_page(sg), + DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty); Is it really OK to refer directly to sg->length? The scatterlist library goes to some effort to avoid having callers directly access the struct member variables. Actually, the for_each_sg() code and its behavior with sg->length and sg_page(sg) confuses me because I'm new to it, and I don't quite understand how this works. Especially with SG_CHAIN. I'm assuming that you've monitored /proc/vmstat for nr_foll_pin* ? sg_free_table(&umem->sg_head); } thanks, -- John Hubbard NVIDIA
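As a side note, the page arithmetic in that call is simple to model in user space: each scatterlist entry contributes DIV_ROUND_UP(length, PAGE_SIZE) pages to the batched unpin, so a partial trailing page still counts as a whole page. A minimal sketch (the helper name is illustrative, not kernel code):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Pages spanned by one sg entry of `length` bytes, as computed in the
 * patch's call to unpin_user_page_range_dirty_lock(). Any partial
 * trailing page rounds up to a whole page. */
static unsigned long sg_entry_npages(unsigned long length)
{
	return DIV_ROUND_UP(length, PAGE_SIZE);
}
```
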
Re: [PATCH 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
On 2/3/21 2:00 PM, Joao Martins wrote: ... +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, + bool make_dirty) +{ + unsigned long index; + struct page *head; + unsigned int ntails; + + for_each_compound_range(index, &page, npages, head, ntails) { + if (make_dirty && !PageDirty(head)) + set_page_dirty_lock(head); + put_compound_head(head, ntails, FOLL_PIN); + } +} +EXPORT_SYMBOL(unpin_user_page_range_dirty_lock); + Also, looking at this again while trying to verify the sg diffs in patch #4, I noticed that the name is tricky. Usually a "range" would not have a single struct page* as the argument. So in this case, perhaps a comment above the function would help, something approximately like this: /* * The "range" in the function name refers to the fact that the page may be a * compound page. If so, then that compound page will be treated as one or more * ranges of contiguous tail pages. */ ...I guess. Or maybe the name is just wrong (a comment block explaining a name is always a bad sign). Perhaps we should rename it to something like: unpin_user_compound_page_dirty_lock() ? thanks, -- John Hubbard NVIDIA
Re: [PATCH 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()
On 2/3/21 2:00 PM, Joao Martins wrote: Add a unpin_user_page_range() API which takes a starting page and how many consecutive pages we want to dirty. Given that we won't be iterating on a list of changes, change compound_next() to receive a bool, whether to calculate from the starting page, or walk the page array. Finally add a separate iterator, A bool arg is sometimes, but not always, a hint that you really just want a separate set of routines. Below... for_each_compound_range() that just operate in page ranges as opposed to page array. For users (like RDMA mr_dereg) where each sg represents a contiguous set of pages, we're able to more efficiently unpin pages without having to supply an array of pages much of what happens today with unpin_user_pages(). Suggested-by: Jason Gunthorpe Signed-off-by: Joao Martins --- include/linux/mm.h | 2 ++ mm/gup.c | 48 ++ 2 files changed, 42 insertions(+), 8 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index a608feb0d42e..b76063f7f18a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, + bool make_dirty); void unpin_user_pages(struct page **pages, unsigned long npages); /** diff --git a/mm/gup.c b/mm/gup.c index 971a24b4b73f..1b57355d5033 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -215,11 +215,16 @@ void unpin_user_page(struct page *page) } EXPORT_SYMBOL(unpin_user_page); -static inline unsigned int count_ntails(struct page **pages, unsigned long npages) +static inline unsigned int count_ntails(struct page **pages, + unsigned long npages, bool range) { - struct page *head = compound_head(pages[0]); + struct page *page = pages[0], *head = compound_head(page); unsigned int ntails; + if (range) + return (!PageCompound(head) || 
compound_order(head) <= 1) ? 1 : + min_t(unsigned int, (head + compound_nr(head) - page), npages); Here, you clearly should use a separate set of _range routines. Because you're basically creating two different routines here! Keep it simple. Once you're in a separate routine, you might feel more comfortable expanding that to a more readable form, too: if (!PageCompound(head) || compound_order(head) <= 1) return 1; return min_t(unsigned int, (head + compound_nr(head) - page), npages); thanks, -- John Hubbard NVIDIA + for (ntails = 1; ntails < npages; ntails++) { if (compound_head(pages[ntails]) != head) break; @@ -229,20 +234,32 @@ static inline unsigned int count_ntails(struct page **pages, unsigned long npage } static inline void compound_next(unsigned long i, unsigned long npages, -struct page **list, struct page **head, -unsigned int *ntails) +struct page **list, bool range, +struct page **head, unsigned int *ntails) { + struct page *p, **next = &p; + if (i >= npages) return; - *ntails = count_ntails(list + i, npages - i); - *head = compound_head(list[i]); + if (range) + *next = *list + i; + else + next = list + i; + + *ntails = count_ntails(next, npages - i, range); + *head = compound_head(*next); } +#define for_each_compound_range(i, list, npages, head, ntails) \ + for (i = 0, compound_next(i, npages, list, true, &head, &ntails); \ +i < npages; i += ntails, \ +compound_next(i, npages, list, true, &head, &ntails)) + #define for_each_compound_head(i, list, npages, head, ntails) \ - for (i = 0, compound_next(i, npages, list, &head, &ntails); \ + for (i = 0, compound_next(i, npages, list, false, &head, &ntails); \ i < npages; i += ntails, \ -compound_next(i, npages, list, &head, &ntails)) +compound_next(i, npages, list, false, &head, &ntails)) /** * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages @@ -306,6 +323,21 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, } 
EXPORT_SYMBOL(unpin_user_pages_dirty_lock); +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, + bool make_dirty) +{ + unsigned long index; + struct page *head; +
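The suggested restructuring of the range case can be checked with a user-space model: pages become indices, and a compound page of order `order` covers `1 << order` consecutive indices. This is only a sketch of the logic, with illustrative names rather than the kernel's:

```c
#include <assert.h>

/* Model of the suggested _range variant of count_ntails(): how many of
 * the next `npages` indices still belong to the compound page whose
 * head is at index `head`. Order 0 and order 1 pages are not batched. */
static unsigned int range_ntails(unsigned long page, unsigned long head,
				 unsigned int order, unsigned int npages)
{
	unsigned long remaining;

	if (order <= 1)
		return 1;

	/* pages left in this compound page, starting at `page` */
	remaining = head + (1UL << order) - page;
	return remaining < npages ? (unsigned int)remaining : npages;
}
```
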
Re: [PATCH 2/4] mm/gup: decrement head page once for group of subpages
On 2/3/21 2:00 PM, Joao Martins wrote: Rather than decrementing the head page refcount one by one, we walk the page array and checking which belong to the same compound_head. Later on we decrement the calculated amount of references in a single write to the head page. To that end switch to for_each_compound_head() does most of the work. set_page_dirty() needs no adjustment as it's a nop for non-dirty head pages and it doesn't operate on tail pages. This considerably improves unpinning of pages with THP and hugetlbfs: - THP gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us - 16G with 1G huge page size gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us Signed-off-by: Joao Martins --- mm/gup.c | 29 +++-- 1 file changed, 11 insertions(+), 18 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 4f88dcef39f2..971a24b4b73f 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -270,20 +270,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty) { unsigned long index; - - /* -* TODO: this can be optimized for huge pages: if a series of pages is -* physically contiguous and part of the same compound page, then a -* single operation to the head page should suffice. -*/ Great to see this TODO (and the related one below) finally done! Everything looks correct here. Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA + struct page *head; + unsigned int ntails; if (!make_dirty) { unpin_user_pages(pages, npages); return; } - for (index = 0; index < npages; index++) { - struct page *page = compound_head(pages[index]); + for_each_compound_head(index, pages, npages, head, ntails) { /* * Checking PageDirty at this point may race with * clear_page_dirty_for_io(), but that's OK. 
Two key @@ -304,9 +299,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, * written back, so it gets written back again in the * next writeback cycle. This is harmless. */ - if (!PageDirty(page)) - set_page_dirty_lock(page); - unpin_user_page(page); + if (!PageDirty(head)) + set_page_dirty_lock(head); + put_compound_head(head, ntails, FOLL_PIN); } } EXPORT_SYMBOL(unpin_user_pages_dirty_lock); @@ -323,6 +318,8 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock); void unpin_user_pages(struct page **pages, unsigned long npages) { unsigned long index; + struct page *head; + unsigned int ntails; /* * If this WARN_ON() fires, then the system *might* be leaking pages (by @@ -331,13 +328,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages) */ if (WARN_ON(IS_ERR_VALUE(npages))) return; - /* -* TODO: this can be optimized for huge pages: if a series of pages is -* physically contiguous and part of the same compound page, then a -* single operation to the head page should suffice. -*/ - for (index = 0; index < npages; index++) - unpin_user_page(pages[index]); + + for_each_compound_head(index, pages, npages, head, ntails) + put_compound_head(head, ntails, FOLL_PIN); } EXPORT_SYMBOL(unpin_user_pages);
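The batching this patch introduces — walk the page array, group consecutive entries that share a compound head, then release the whole group with one put_compound_head() — reduces to a single counting step, modeled here with head indices standing in for struct page pointers:

```c
#include <assert.h>

/* heads[i] is the compound head of pages[i]. Returns how many leading
 * entries share heads[0], i.e. the size of the next batch whose
 * references can be dropped with a single head-page update. */
static unsigned int count_batch(const unsigned long *heads,
				unsigned long npages)
{
	unsigned int n;

	for (n = 1; n < npages; n++)
		if (heads[n] != heads[0])
			break;
	return n;
}
```
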
Re: [PATCH 1/4] mm/gup: add compound page list iterator
On 2/3/21 2:00 PM, Joao Martins wrote: Add an helper that iterates over head pages in a list of pages. It essentially counts the tails until the next page to process has a different head that the current. This is going to be used by unpin_user_pages() family of functions, to batch the head page refcount updates once for all passed consecutive tail pages. Suggested-by: Jason Gunthorpe Signed-off-by: Joao Martins --- mm/gup.c | 29 + 1 file changed, 29 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index d68bcb482b11..4f88dcef39f2 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -215,6 +215,35 @@ void unpin_user_page(struct page *page) } EXPORT_SYMBOL(unpin_user_page); +static inline unsigned int count_ntails(struct page **pages, unsigned long npages) Silly naming nit: could we please name this function count_pagetails()? count_ntails is a bit redundant, plus slightly less clear. +{ + struct page *head = compound_head(pages[0]); + unsigned int ntails; + + for (ntails = 1; ntails < npages; ntails++) { + if (compound_head(pages[ntails]) != head) + break; + } + + return ntails; +} + +static inline void compound_next(unsigned long i, unsigned long npages, +struct page **list, struct page **head, +unsigned int *ntails) +{ + if (i >= npages) + return; + + *ntails = count_ntails(list + i, npages - i); + *head = compound_head(list[i]); +} + +#define for_each_compound_head(i, list, npages, head, ntails) \ When using macros, which are dangerous in general, you have to worry about things like name collisions. I really dislike that C has forced this unsafe pattern upon us, but of course we are stuck with it, for iterator helpers. Given that we're stuck, you should probably use names such as __i, __list, etc, in the above #define. Otherwise you could stomp on existing variables.
+ for (i = 0, compound_next(i, npages, list, &head, &ntails); \ +i < npages; i += ntails, \ +compound_next(i, npages, list, &head, &ntails)) + /** * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages * @pages: array of pages to be maybe marked dirty, and definitely released. thanks, -- John Hubbard NVIDIA
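The collision hazard described above is easy to demonstrate with any macro that declares its own local; the conventional fix is exactly the `__`-prefixed naming suggested in the review. A minimal illustration (not kernel code):

```c
#include <assert.h>

/* A naive version spelled `int tmp = (a);` would silently break for any
 * caller whose own variable is also named `tmp` (the macro's local
 * shadows it). The __-prefixed local makes that collision unlikely. */
#define swap_int(a, b) \
	do { int __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

static int swap_demo(void)
{
	int tmp = 1, x = 2;	/* caller happens to reuse the name "tmp" */

	swap_int(tmp, x);	/* still correct, thanks to __tmp */
	return tmp == 2 && x == 1;
}
```
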
Re: [PATCH v2 net-next 3/4] net: introduce common dev_page_is_reserved()
On 1/30/21 11:45 AM, Alexander Lobakin wrote: From: Jakub Kicinski Date: Sat, 30 Jan 2021 11:07:07 -0800 On Sat, 30 Jan 2021 15:42:29 + Alexander Lobakin wrote: On Wed, 27 Jan 2021 20:11:23 + Alexander Lobakin wrote: + * dev_page_is_reserved - check whether a page can be reused for network Rx + * @page: the page to test + * + * A page shouldn't be considered for reusing/recycling if it was allocated + * under memory pressure or at a distant memory node. + * + * Returns true if this page should be returned to page allocator, false + * otherwise. + */ +static inline bool dev_page_is_reserved(const struct page *page) Am I the only one who feels like "reusable" is a better term than "reserved". I thought about it, but this will need to inverse the conditions in most of the drivers. I decided to keep it as it is. I can redo if "reusable" is preferred. Naming is hard. As long as the condition is not a double negative it reads fine to me, but that's probably personal preference. The thing that doesn't sit well is the fact that there is nothing "reserved" about a page from another NUMA node.. But again, if nobody +1s this it's whatever... Agree on NUMA and naming. I'm a bit surprised that 95% of drivers have this helper called "reserved" (one of the reasons why I finished with this variant). Let's say, if anybody else will vote for "reusable", I'll pick it for v3. Definitely "reusable" seems better to me, and especially anything *other* than "reserved" is a good idea, IMHO. thanks, -- John Hubbard NVIDIA
Re: [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages
On 1/24/21 3:18 PM, John Hubbard wrote: On 1/21/21 7:37 PM, Pavel Tatashin wrote: When pages are pinned they can be faulted in userland and migrated, and they can be faulted right in kernel without migration. In either case, the pinned pages must end-up being pinnable (not movable). Add a new test to gup_test, to help verify that the gup/pup (get_user_pages() / pin_user_pages()) behavior with respect to pinnable and movable pages is reasonable and correct. Specifically, provide a way to: 1) Verify that only "pinnable" pages are pinned. This is checked automatically for you. 2) Verify that gup/pup performance is reasonable. This requires comparing benchmarks between doing gup/pup on pages that have been pre-faulted in from user space, vs. doing gup/pup on pages that are not faulted in until gup/pup time (via FOLL_TOUCH). This decision is controlled with the new -z command line option. Signed-off-by: Pavel Tatashin --- mm/gup_test.c | 6 ++ tools/testing/selftests/vm/gup_test.c | 23 +++ 2 files changed, 25 insertions(+), 4 deletions(-) This also looks good. I do see the WARN_ON_ONCE firing in internal_get_user_pages_fast(), when running with *only* the new -z option. I'll poke around the rest of the patchset and see if that is expected and normal, but either way the test code itself looks correct and seems The warning that is firing in internal_get_user_pages_fast() is: if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_FORCE | FOLL_PIN | FOLL_GET | FOLL_FAST_ONLY))) return -EINVAL; ...OK, so this is because "./gup_test -z" invokes get_user_pages_fast(), which so far does not allow passing in FOLL_TOUCH. Probably because there is nothing "fast" about touching and possibly faulting in pages. :) So, again, the test code still looks correct, even though it's possible to pass in options that run into things that are rejected by gup.c thanks, -- John Hubbard NVIDIA
Re: [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages
On 1/21/21 7:37 PM, Pavel Tatashin wrote: When pages are pinned they can be faulted in userland and migrated, and they can be faulted right in kernel without migration. In either case, the pinned pages must end-up being pinnable (not movable). Add a new test to gup_test, to help verify that the gup/pup (get_user_pages() / pin_user_pages()) behavior with respect to pinnable and movable pages is reasonable and correct. Specifically, provide a way to: 1) Verify that only "pinnable" pages are pinned. This is checked automatically for you. 2) Verify that gup/pup performance is reasonable. This requires comparing benchmarks between doing gup/pup on pages that have been pre-faulted in from user space, vs. doing gup/pup on pages that are not faulted in until gup/pup time (via FOLL_TOUCH). This decision is controlled with the new -z command line option. Signed-off-by: Pavel Tatashin --- mm/gup_test.c | 6 ++ tools/testing/selftests/vm/gup_test.c | 23 +++ 2 files changed, 25 insertions(+), 4 deletions(-) This also looks good. I do see the WARN_ON_ONCE firing in internal_get_user_pages_fast(), when running with *only* the new -z option. 
I'll poke around the rest of the patchset and see if that is expected and normal, but either way the test code itself looks correct and seems to be passing my set of "run a bunch of different gup_test options" here, so feel free to add: Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA diff --git a/mm/gup_test.c b/mm/gup_test.c index a6ed1c877679..d974dec19e1c 100644 --- a/mm/gup_test.c +++ b/mm/gup_test.c @@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages, dump_page(page, "gup_test failure"); break; + } else if (cmd == PIN_LONGTERM_BENCHMARK && + WARN(!is_pinnable_page(page), +"pages[%lu] is NOT pinnable but pinned\n", +i)) { + dump_page(page, "gup_test failure"); + break; } } break; diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c index 943cc2608dc2..1e662d59c502 100644 --- a/tools/testing/selftests/vm/gup_test.c +++ b/tools/testing/selftests/vm/gup_test.c @@ -13,6 +13,7 @@ /* Just the flags we need, copied from mm.h: */ #define FOLL_WRITE0x01/* check pte is writable */ +#define FOLL_TOUCH 0x02/* mark page accessed */ static char *cmd_to_str(unsigned long cmd) { @@ -39,11 +40,11 @@ int main(int argc, char **argv) unsigned long size = 128 * MB; int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1; unsigned long cmd = GUP_FAST_BENCHMARK; - int flags = MAP_PRIVATE; + int flags = MAP_PRIVATE, touch = 0; char *file = "/dev/zero"; char *p; - while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) { + while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) { switch (opt) { case 'a': cmd = PIN_FAST_BENCHMARK; @@ -110,6 +111,10 @@ int main(int argc, char **argv) case 'H': flags |= (MAP_HUGETLB | MAP_ANONYMOUS); break; + case 'z': + /* fault pages in gup, do not fault in userland */ + touch = 1; + break; default: return -1; } @@ -167,8 +172,18 @@ int main(int argc, char **argv) else if (thp == 0) madvise(p, size, MADV_NOHUGEPAGE); - for (; 
(unsigned long)p < gup.addr + size; p += PAGE_SIZE) - p[0] = 0; + /* +* FOLL_TOUCH, in gup_test, is used as an either/or case: either +* fault pages in from the kernel via FOLL_TOUCH, or fault them +* in here, from user space. This allows comparison of performance +* between those two cases. +*/ + if (touch) { + gup.gup_flags |= FOLL_TOUCH; + } else { + for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE) + p[0] = 0; + } /* Only report timing information on the *_BENCHMARK commands: */ if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
Re: [PATCH v7 13/14] selftests/vm: test flag is broken
On 1/21/21 7:37 PM, Pavel Tatashin wrote: In gup_test both gup_flags and test_flags use the same flags field. This is broken. Farther, in the actual gup_test.c all the passed gup_flags are erased and unconditionally replaced with FOLL_WRITE. Which means that test_flags are ignored, and code like this always performs pin dump test: 155 if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN) 156 nr = pin_user_pages(addr, nr, gup->flags, 157 pages + i, NULL); 158 else 159 nr = get_user_pages(addr, nr, gup->flags, 160 pages + i, NULL); 161 break; Add a new test_flags field, to allow raw gup_flags to work. Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test should be performed. Remove unconditional overwriting of gup_flags via FOLL_WRITE. But, preserve the previous behaviour where FOLL_WRITE was the default flag, and add a new option "-W" to unset FOLL_WRITE. Rename flags with gup_flags. Hi Pavel, Thanks again for fixing this up! Looks good, with a tiny point about the subject line: 1) To follow convention, the subject line should say what you're doing, not what the previous condition was. Also, there are several tests in that directory, so we should say which one. 
So more like this: "selftests/vm: gup_test: fix test flag" That is just a minor documentation point, so either way, feel free to add: Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA With the fix, dump works like this: root@virtme:/# gup_test -c page #0, starting from user virt addr: 0x7f8acb9e4000 page:d3d2ee27 refcount:2 mapcount:1 mapping: index:0x0 pfn:0x100bcf anon flags: 0x3080016(referenced|uptodate|lru|swapbacked) raw: 03080016 d0e204021608 d0e208df2e88 8ea04243ec61 raw: 0002 page dumped because: gup_test: dump_pages() test DUMP_USER_PAGES_TEST: done root@virtme:/# gup_test -c -p page #0, starting from user virt addr: 0x7fd19701b000 page:baed3c7d refcount:1025 mapcount:1 mapping: index:0x0 pfn:0x108008 anon flags: 0x3080014(uptodate|lru|swapbacked) raw: 03080014 d0e204200188 d0e205e09088 8ea04243ee71 raw: 0401 page dumped because: gup_test: dump_pages() test DUMP_USER_PAGES_TEST: done Refcount shows the difference between pin vs no-pin case. Also change type of nr from int to long, as it counts number of pages. 
Signed-off-by: Pavel Tatashin --- mm/gup_test.c | 23 ++- mm/gup_test.h | 3 ++- tools/testing/selftests/vm/gup_test.c | 15 +++ 3 files changed, 23 insertions(+), 18 deletions(-) diff --git a/mm/gup_test.c b/mm/gup_test.c index e3cf78e5873e..a6ed1c877679 100644 --- a/mm/gup_test.c +++ b/mm/gup_test.c @@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd, { ktime_t start_time, end_time; unsigned long i, nr_pages, addr, next; - int nr; + long nr; struct page **pages; int ret = 0; bool needs_mmap_lock = @@ -126,37 +126,34 @@ static int __gup_test_ioctl(unsigned int cmd, nr = (next - addr) / PAGE_SIZE; } - /* Filter out most gup flags: only allow a tiny subset here: */ - gup->flags &= FOLL_WRITE; - switch (cmd) { case GUP_FAST_BENCHMARK: - nr = get_user_pages_fast(addr, nr, gup->flags, + nr = get_user_pages_fast(addr, nr, gup->gup_flags, pages + i); break; case GUP_BASIC_TEST: - nr = get_user_pages(addr, nr, gup->flags, pages + i, + nr = get_user_pages(addr, nr, gup->gup_flags, pages + i, NULL); break; case PIN_FAST_BENCHMARK: - nr = pin_user_pages_fast(addr, nr, gup->flags, + nr = pin_user_pages_fast(addr, nr, gup->gup_flags, pages + i); break; case PIN_BASIC_TEST: - nr = pin_user_pages(addr, nr, gup->flags, pages + i, + nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i, NULL); break; case PIN_LONGTERM_BENCHMARK: nr = pin_user_pages(addr, nr, - gup-&g
Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
On 1/15/21 11:46 AM, David Hildenbrand wrote: 7) There is no easy way to detect if a page really was pinned: we might have false positives. Further, there is no way to distinguish if it was pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking we most probably would need more counters, which we cannot fit into struct page. (AFAIU, for huge pages it's easier). I think this is the real issue. We can only store so much information, so we have to decide which things work and which things are broken. So far someone hasn't presented a way to record everything at least.. I do wonder how many (especially long-term) GUP readers/writers we have to expect, and especially, support for a single base page. Do we have a rough estimate? With RDMA, I would assume we only need a single one (e.g., once RDMA device; I'm pretty sure I'm wrong, sounds too easy). With VFIO I guess we need one for each VFIO container (~ in the worst case one for each passthrough device). With direct I/O, vmsplice and other GUP users ?? No idea. If we could somehow put a limit on the #GUP we support, and fail further GUP (e.g., -EAGAIN?) once a limit is reached, we could partition the refcount into something like (assume max #15 GUP READ and #15 GUP R/W, which is most probably a horribly bad choice) [ GUP READ ][ GUP R/W ] [ ordinary ] 31 ... 28 27 ... 24 23 0 But due to saturate handling in "ordinary", we would lose further 2 bits (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea how many bits we actually need in practice. Maybe we need less for GUP READ, because most users want GUP R/W? No idea. Just wild ideas. Most probably that has already been discussed, and most probably people figured that it's impossible :) I proposed this exact idea a few days ago [1]. It's remarkable that we both picked nearly identical values for the layout! :) But as the responses show, security problems prevent pursuing that approach. 
[1] https://lore.kernel.org/r/45806a5a-65c2-67ce-fc92-dc8c2144d...@nvidia.com thanks, -- John Hubbard NVIDIA
Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
On 1/10/21 11:30 AM, Linus Torvalds wrote: On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds wrote: Just having a bit in the page flags for "I already made this exclusive, and fork() will keep it so" is I feel the best option. In a way, "page is writable" right now _is_ that bit. By definition, if you have a writable page in an anonymous mapping, that is an exclusive user. But because "writable" has these interactions with other operations, it would be better if it was a harder bit than that "maybe_pinned()", though. It would be lovely if a regular non-pinning write-GUP just always set it, for example. "maybe_pinned()" is good enough for the fork() case, which is the one that matters for long-term pinning. But it's admittedly not perfect. There is at least one way to improve this part of it--maybe. IMHO, a lot of the bits in page _refcount are still being wasted (even after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that there are many callers of gup/pup per page. If anyone points out that that is wrong, then the rest of this falls apart, but...if we were to make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever simultaneously allowed on a given page", then several things become possible: 1) "maybe dma pinned" could become "very likely dma-pinned", by allowing perhaps 23 bits for normal page refcounts (leaving just 8 bits for dma pins), thus all but ensuring no aliasing between normal refcounts and dma pins. The name change below to "likely" is not actually necessary, I'm just using it to make the point clear: diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..28f7f64e888f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1241,7 +1241,7 @@ static inline void put_page(struct page *page) * get_user_pages and page_mkclean and other calls that race to set up page * table entries. 
*/ -#define GUP_PIN_COUNTING_BIAS (1U << 10) +#define GUP_PIN_COUNTING_BIAS (1U << 23) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, @@ -1274,7 +1274,7 @@ void unpin_user_pages(struct page **pages, unsigned long npages); * @Return:True, if it is likely that the page has been "dma-pinned". * False, if the page is definitely not dma-pinned. */ -static inline bool page_maybe_dma_pinned(struct page *page) +static inline bool page_likely_dma_pinned(struct page *page) { if (hpage_pincount_available(page)) return compound_pincount(page) > 0; 2) Doing (1) allows, arguably, failing mprotect calls if page_likely_dma_pinned() returns true, because the level of confidence is much higher now. 3) An additional counter, for normal gup (FOLL_GET) is possible: divide up the top 8 bits into two separate counters of just 4 bits each. Only allow either FOLL_GET or FOLL_PIN (and btw, I'm now mentally equating FOLL_PIN with FOLL_LONGTERM), not both, on a given page. Like this: diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..ce5af27e3057 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1241,7 +1241,8 @@ static inline void put_page(struct page *page) * get_user_pages and page_mkclean and other calls that race to set up page * table entries. */ -#define GUP_PIN_COUNTING_BIAS (1U << 10) +#define DMA_PIN_COUNTING_BIAS (1U << 23) +#define GUP_PIN_COUNTING_BIAS (1U << 27) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, So: FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount. These are long term pins for dma. FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount. These are not long term pins. Doing (3) would require yet another release call variant: unpin_user_pages() would need to take a flag to say which refcount to release: FOLL_GET or FOLL_PIN. 
However, I think that's relatively easy (compared to the original effort of teasing apart generic put_page() calls, into unpin_user_pages() calls). We already have all the unpin_user_pages() calls in place, and so it's "merely" a matter of adding a flag to 74 call sites. The real question is whether we can get away with supporting only a very low count of FOLL_PIN and FOLL_GET pages. Can we? thanks, -- John Hubbard NVIDIA
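The partitioning sketched above can be modeled directly: with 4 bits per pin class, a refcount value decomposes into ordinary references plus two small pin counters. All names and bit positions below come from the proposal in this thread, not merged kernel code, and the model deliberately ignores the overflow/aliasing that makes any such scheme only a heuristic:

```c
#include <assert.h>

#define DMA_PIN_COUNTING_BIAS	(1U << 23)	/* proposed: FOLL_PIN (long-term) */
#define GUP_PIN_COUNTING_BIAS	(1U << 27)	/* proposed: FOLL_GET (transient) */

static unsigned int dma_pins(unsigned int refcount)
{
	return (refcount >> 23) & 0xf;		/* bits 23-26: up to 15 pins */
}

static unsigned int gup_pins(unsigned int refcount)
{
	return (refcount >> 27) & 0xf;		/* bits 27-30: up to 15 pins */
}

static unsigned int ordinary_refs(unsigned int refcount)
{
	return refcount & ((1U << 23) - 1);	/* bits 0-22 */
}
```
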
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On 1/7/21 2:24 PM, Linus Torvalds wrote: On Thu, Jan 7, 2021 at 2:20 PM Linus Torvalds wrote: Hmm, add a check for the page being PageAnon(), perhaps? If it's a shared vma, then the page can be pinned shared with multiple mappings, I agree. Or check the vma directly for whether it's a COW vma. That's probably a more obvious test, but would have to be done outside of page_maybe_dma_pinned(). For example, in copy_present_page(), we've already done that COW-vma test, so if we want to strengthen just _that_ test, then it would be sufficient to just add a /* This cannot be a pinned page if it has more than one mapping */ if (page_mappings(page) != 1) return 1; to copy_present_page() along with the existing page_maybe_dma_pinned() test. No? Linus That approach makes me a lot happier, yes. Because it doesn't add constraints to the RDMA cases, but still does what we need here. thanks, -- John Hubbard NVIDIA
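The strengthened fork-time test can be sketched as a toy model. The struct, names, and return convention here are hypothetical stand-ins for copy_present_page() and its helpers, kept only to show how the two checks compose:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page with just the two fields the test above consults. */
struct toy_page {
	int mapcount;		/* page_mappings()/page_mapcount() stand-in */
	bool maybe_dma_pinned;	/* page_maybe_dma_pinned() stand-in */
};

/* Model of the proposed early-copy decision: a page mapped into more
 * than one address space cannot be an exclusively pinned page, so only
 * a singly mapped page that may be DMA-pinned must be copied up front
 * at fork time instead of being left for lazy COW. */
static bool must_copy_at_fork(const struct toy_page *page)
{
	/* This cannot be a pinned page if it has more than one mapping. */
	if (page->mapcount != 1)
		return false;
	return page->maybe_dma_pinned;
}
```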
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On 1/7/21 2:00 PM, Linus Torvalds wrote: On Thu, Jan 7, 2021 at 1:53 PM John Hubbard wrote: Now, I do agree that from a QoI standpoint, it would be really lovely if we actually enforced it. I'm not entirely sure we can, but maybe it would be reasonable to use that mm->has_pinned && page_maybe_dma_pinned(page) at least as the beginning of a heuristic. In fact, I do think that "page_maybe_dma_pinned()" could possibly be made stronger than it is. Because at *THAT* point, we might say "we What exactly did you have in mind, to make it stronger? I think the answer is in this email but I don't quite see it yet... Literally just adding a " && page_mapcount(page) == 1" in there (probably best done inside page_maybe_dma_pinned() itself) Well, that means that pages that are used for pinned DMA like this, can not be shared with other processes. Is that an acceptable limitation for the RDMA users? It seems a bit constraining, at first glance anyway. Direct IO pins, on the other hand, are more transient. We can probably live without tagging Direct IO pages as FOLL_PIN. I think. Yes. I think direct-IO writes should be able to just do a transient GUP, and if it causes a COW fault that isn't coherent, that's the correct semantics, I think (ie the direct-IO will see the original data, the COW faulter will get it's own private copy to make changes to). I think pinning should be primarily limited to things that _require_ coherency (ie you pin because you're going to do some active two-way communication using that page) Does that match your thinking? Yes, perfectly. I'm going to update Documentation/core-api/pin_user_pages.rst accordingly, once the dust settles on these discussions. thanks, -- John Hubbard NVIDIA
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On 1/7/21 1:29 PM, Linus Torvalds wrote: On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli wrote: The problem is it's not even possible to detect reliably if there's really a long term GUP pin because of speculative pagecache lookups. So none of the normal code _needs_ that any more these days, which is what I think is so nice. Any pinning will do the COW, and then we have the logic to make sure it stays writable, and that keeps everything nicely coherent and is all fairly simple. And yes, it does mean that if somebody then explicitly write-protects a page, it may end up being COW'ed after all, but if you first pinned it, and then started playing with the protections of that page, why should you be surprised? So to me, this sounds like a "don't do that then" situation. Anybody who does page pinning and wants coherency should NOT TOUCH THE MAPPING IT PINNED. (And if you do touch it, it's your own fault, and you get to keep both of the broken pieces) Now, I do agree that from a QoI standpoint, it would be really lovely if we actually enforced it. I'm not entirely sure we can, but maybe it would be reasonable to use that mm->has_pinned && page_maybe_dma_pinned(page) at least as the beginning of a heuristic. In fact, I do think that "page_maybe_dma_pinned()" could possibly be made stronger than it is. Because at *THAT* point, we might say "we What exactly did you have in mind, to make it stronger? I think the answer is in this email but I don't quite see it yet... Also, now seems to be a good time to mention that I've been thinking about a number of pup/gup pinning cases (Direct IO, GPU/NIC, NVMe/storage peer to peer with GUP/NIC, and HMM support for atomic operations from a device). And it seems like the following approach would help: * Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here (thanks to Jason for this point) means "user space owns the lifetime". 
We might even end up deleting either FOLL_PIN or FOLL_LONGTERM, because this would make them mean the same thing. The idea is that there are no "short term" pins of this kind of memory. * Continue to use FOLL_GET (only) for Direct IO. That's a big change of plans, because several of us had thought that Direct IO needs FOLL_PIN. However, this recent conversation, plus my list of cases above, seems to indicate otherwise. That's because we only have one refcount approach for marking pages in this way, and we should spend it on the long-term pinned pages. Those are both hard to identify otherwise, and actionable once we identify them. Direct IO pins, on the other hand, are more transient. We can probably live without tagging Direct IO pages as FOLL_PIN. I think. This is all assuming that we make progress in the area of "if it's not a page_maybe_dma_pinned() page, then we can wait for it or otherwise do reasonable things about the refcount". So we end up with a clear (-ish) difference between pages that can be waited for, and pages that should not be waited for in the kernel. I hope this helps, but if it's too much of a side-track, please disregard. thanks, -- John Hubbard NVIDIA
Re: [PATCH v4 10/10] selftests/vm: test faulting in kernel, and verify pinnable pages
OK, now it's clear what to do. My (other) mistake with -F was that I mis-predicted what the next future feature would need: I thought we'd be adding sub-flags for the dump pages test, but instead we are (at least so far) adding flags for the gup/pup calls. So let's adapt to that. Some of this contradicts what I proposed earlier, but now it forms a consistent API as a whole. How's this sound: * Change the meaning of the -F option, slightly: it now means "raw flags to pass into .gup_flags and thus to gup/pup, and these will override other options". * Use -F 0x2 (the value of FOLL_TOUCH) instead of the -z option. * We can remain backward compatible with -w at the command line level, with a caveat: passing in "-F 0" or "-F 2" (anything that doesn't set FOLL_WRITE) will override -w. The rule would be that raw -F flags override other settings, just for a consistent way to think about it. * No need, obviously, for any option to undo -w. -F can do that. * Do *not* provide direct access to .test_control_flags. Just set those according to a new command line option. Because these aren't (yet) exploding. If this is too much fooling around, let me know and I'll send some diffs or even a patch. I don't want to take away too much time from the main patch, and I feel some obligation to get gup_test straightened out. :) static char *cmd_to_str(unsigned long cmd) { @@ -39,11 +40,11 @@ int main(int argc, char **argv) unsigned long size = 128 * MB; int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1; unsigned long cmd = GUP_FAST_BENCHMARK; - int flags = MAP_PRIVATE; + int flags = MAP_PRIVATE, touch = 0; Silly nit, can we put it on its own line? This pre-existing mess of declarations makes it hard to read everything. One item per line is easier on the reader, who is often just looking for a single item at a time. Actually why not rename it slightly while we're here (see below), maybe to this: int use_foll_touch = 0; Sure, I will move it. 
I also prefer one initialization per-line, but tried to keep the current style. I can also correct the other initializations in this function. char *file = "/dev/zero"; char *p; - while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) { + while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) { Yes, this seems worth its own command line option. switch (opt) { case 'a': cmd = PIN_FAST_BENCHMARK; @@ -110,6 +111,10 @@ int main(int argc, char **argv) case 'H': flags |= (MAP_HUGETLB | MAP_ANONYMOUS); break; + case 'z': + /* fault pages in gup, do not fault in userland */ How about: /* * Use gup/pup(FOLL_TOUCH), *instead* of faulting * pages in from user space. */ use_foll_touch = 1; Sure. + touch = 1; + break; default: return -1; } @@ -167,8 +172,12 @@ int main(int argc, char **argv) else if (thp == 0) madvise(p, size, MADV_NOHUGEPAGE); - for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE) - p[0] = 0; + if (touch) { + gup.flags |= FOLL_TOUCH; + } else { + for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE) + p[0] = 0; + } Here, the test would change to "if (gup.gup_flags & FOLL_TOUCH)". And we need a comment somewhere (maybe right above that) that says something like: /* * FOLL_TOUCH, in gup_test, is used as an either/or case: either * fault pages in from the kernel via FOLL_TOUCH, or fault them * in here, from user space. This allows comparison of performance * between those two cases. */ ...OR, you know what? We can do even better. Because the above is too quirky and weird. Instead, let's leave FOLL_TOUCH "pure": it's just a gup flag. And then, add an option (maybe -z, after all) that says, "skip faulting pages in from user space". That's a lot clearer! And you can tell it's better, because we don't have to write a chunk of documentation explaining the odd quirks. Ha, it feels better! What do you think? (Again, if you want me to send over some diffs that put this all together, let me know. 
I'm kind of embarrassed at all the typing required here.) thanks, -- John Hubbard NVIDIA
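The cleaner split proposed above — FOLL_TOUCH stays a pure gup flag passed via -F, while -z only says "skip faulting pages in from user space" — can be sketched as a small stand-in for the getopt loop in gup_test.c. The option letters match the discussion; the FOLL_TOUCH value and struct are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define FOLL_TOUCH 0x02		/* illustrative flag value */

struct gup_opts {
	unsigned long gup_flags;	/* raw flags handed to gup/pup via -F */
	bool skip_user_fault;		/* -z: don't fault pages in from user space */
};

/* Minimal stand-in for the option loop in gup_test.c: -F takes raw
 * flags (hex accepted, via base 0), -z is an independent toggle. */
static void parse_opts(struct gup_opts *o, int argc, char **argv)
{
	for (int i = 1; i < argc; i++) {
		if (!strcmp(argv[i], "-z"))
			o->skip_user_fault = true;
		else if (!strcmp(argv[i], "-F") && i + 1 < argc)
			o->gup_flags = strtoul(argv[++i], NULL, 0);
	}
}
```

Keeping the two concerns in separate fields is the point: no quirky "FOLL_TOUCH means something extra here" documentation is needed.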
Re: [PATCH v4 10/10] selftests/vm: test faulting in kernel, and verify pinnable pages
/* * Use gup/pup(FOLL_TOUCH), *instead* of faulting * pages in from user space. */ use_foll_touch = 1; + touch = 1; + break; default: return -1; } @@ -167,8 +172,12 @@ int main(int argc, char **argv) else if (thp == 0) madvise(p, size, MADV_NOHUGEPAGE); - for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE) - p[0] = 0; + if (touch) { + gup.flags |= FOLL_TOUCH; + } else { + for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE) + p[0] = 0; + } OK. /* Only report timing information on the *_BENCHMARK commands: */ if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) || thanks, -- John Hubbard NVIDIA
Re: [PATCH v4 09/10] selftests/vm: test flag is broken
On 12/18/20 1:06 AM, John Hubbard wrote: Add a new test_flags field, to allow raw gup_flags to work. I think .test_control_flags field would be a good name, to make it very clear that it's not destined for gup_flags. Just .test_flags is not quite as clear a distinction from .gup_flags, as .test_control_flags is, IMHO. And maybe renaming .flags to .gup_flags, just to make it really clear. thanks, -- John Hubbard NVIDIA
Re: [PATCH v4 09/10] selftests/vm: test flag is broken
amp;= FOLL_WRITE; - switch (cmd) { case GUP_FAST_BENCHMARK: nr = get_user_pages_fast(addr, nr, gup->flags, @@ -152,7 +149,7 @@ static int __gup_test_ioctl(unsigned int cmd, pages + i, NULL); break; case DUMP_USER_PAGES_TEST: - if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN) + if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN) nr = pin_user_pages(addr, nr, gup->flags, pages + i, NULL); else @@ -187,7 +184,7 @@ static int __gup_test_ioctl(unsigned int cmd, start_time = ktime_get(); - put_back_pages(cmd, pages, nr_pages, gup->flags); + put_back_pages(cmd, pages, nr_pages, gup->test_flags); end_time = ktime_get(); gup->put_delta_usec = ktime_us_delta(end_time, start_time); diff --git a/mm/gup_test.h b/mm/gup_test.h index 90a6713d50eb..b1b376b5d3b2 100644 --- a/mm/gup_test.h +++ b/mm/gup_test.h @@ -22,6 +22,7 @@ struct gup_test { __u64 size; __u32 nr_pages_per_call; __u32 flags; + __u32 test_flags; /* * Each non-zero entry is the number of the page (1-based: first page is * page 1, so that zero entries mean "do nothing") from the .addr base. diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c index 6c6336dd3b7f..42c71483729f 100644 --- a/tools/testing/selftests/vm/gup_test.c +++ b/tools/testing/selftests/vm/gup_test.c @@ -37,13 +37,13 @@ int main(int argc, char **argv) { struct gup_test gup = { 0 }; unsigned long size = 128 * MB; - int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0; + int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1; Let's leave that alone, as noted above. unsigned long cmd = GUP_FAST_BENCHMARK; int flags = MAP_PRIVATE; char *file = "/dev/zero"; char *p; - while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) { + while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) { And also as noted above I don't think we need this or any of the remaining hunks below. 
switch (opt) { case 'a': cmd = PIN_FAST_BENCHMARK; @@ -65,6 +65,10 @@ int main(int argc, char **argv) */ gup.which_pages[0] = 1; break; + case 'p': + /* works only with DUMP_USER_PAGES_TEST */ + gup.test_flags |= GUP_TEST_FLAG_DUMP_PAGES_USE_PIN; + break; case 'F': /* strtol, so you can pass flags in hex form */ gup.flags = strtol(optarg, 0, 0); @@ -93,6 +97,9 @@ int main(int argc, char **argv) case 'w': write = 1; break; + case 'W': + write = 0; + break; case 'f': file = optarg; break; thanks, -- John Hubbard NVIDIA
Re: [PATCH 18/25] btrfs: Use readahead_batch_length
On 12/17/20 5:42 AM, Matthew Wilcox wrote: On Thu, Dec 17, 2020 at 12:12:46PM +, Matthew Wilcox wrote: ehh ... probably an off-by-one. Does subtracting 1 from contig_end fix it? I'll spool up a test VM shortly and try it out. Yes, this fixed it: - u64 contig_end = contig_start + readahead_batch_length(rac); + u64 contig_end = contig_start + readahead_batch_length(rac) - 1; Yes, confirmed on my end, too. thanks, -- John Hubbard NVIDIA
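The off-by-one is easy to see with concrete numbers. The arithmetic below (4 KiB pages, a two-page batch starting at offset 0) checks that the new length-based inclusive end matches the old page_offset-based one only when the "- 1" is present:

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL

/* Old expression: inclusive end derived from the last page's offset. */
static unsigned long old_contig_end(unsigned long last_page_offset)
{
	return last_page_offset + TOY_PAGE_SIZE - 1;
}

/* New expression: same inclusive end from start + batch length - 1.
 * Without the "- 1" it would point one byte past the range. */
static unsigned long new_contig_end(unsigned long contig_start,
				    unsigned long batch_length)
{
	return contig_start + batch_length - 1;
}
```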
Re: [PATCH 18/25] btrfs: Use readahead_batch_length
On 12/16/20 10:23 AM, Matthew Wilcox (Oracle) wrote: Implement readahead_batch_length() to determine the number of bytes in the current batch of readahead pages and use it in btrfs. Signed-off-by: Matthew Wilcox (Oracle) --- fs/btrfs/extent_io.c| 6 ++ include/linux/pagemap.h | 9 + 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 6e3b72e63e42..42936a83a91b 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4436,10 +4436,8 @@ void extent_readahead(struct readahead_control *rac) int nr; while ((nr = readahead_page_batch(rac, pagepool))) { - u64 contig_start = page_offset(pagepool[0]); - u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1; - - ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end); + u64 contig_start = readahead_pos(rac); + u64 contig_end = contig_start + readahead_batch_length(rac); Something in this tiny change is breaking btrfs: it hangs my Fedora 33 test system (which changed over to btrfs) on boot. I haven't quite figured out what's really wrong, but git bisect lands here, *and* turning the whole extent_readahead() function into a no-op (on top of the whole series) allows everything to work once again. Sorry for not actually solving the root cause, but I figured you'd be able to jump straight to the answer, with the above information, so I'm sending it out early. thanks, -- John Hubbard NVIDIA
Re: [PATCH v14 10/10] secretmem: test: add basic selftest for memfd_secret(2)
On 12/2/20 10:29 PM, Mike Rapoport wrote: From: Mike Rapoport ... +#include "../kselftest.h" + +#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__) +#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__) +#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__) + +#ifdef __NR_memfd_secret + +#include <linux/secretmem.h> Hi Mike, Say, when I tried this out from today's linux-next, I had to delete the above line. In other words, the following was required in order to build: diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c index 79578dfd13e6..c878c2b841fc 100644 --- a/tools/testing/selftests/vm/memfd_secret.c +++ b/tools/testing/selftests/vm/memfd_secret.c @@ -29,8 +29,6 @@ #ifdef __NR_memfd_secret -#include <linux/secretmem.h> - #define PATTERN 0x55 static const int prot = PROT_READ | PROT_WRITE; ...and that makes sense to me, because: a) secretmem.h is not in the uapi, which this selftests/vm build system expects (it runs "make headers_install" for us, which is *not* going to pick up items in the kernel include dirs), and b) There is nothing in secretmem.h that this test uses, anyway! Just these: bool vma_is_secretmem(struct vm_area_struct *vma); bool page_is_secretmem(struct page *page); bool secretmem_active(void); ...or am I just Doing It Wrong? :) thanks, -- John Hubbard NVIDIA
Re: [PATCH v3 5/6] mm/gup: migrate pinned pages out of movable zone
On 12/11/20 3:00 PM, Pavel Tatashin wrote: I guess revert what we did (unpin) and return an error. The interesting question is what can make migration/isolation fail. OK. I will make the necessary changes. Let's handle errors properly. Whatever the cause of the error, we will know it when it happens and when an error is returned. I think I will add a 10-time retry instead of the infinite retry that we currently have. The 10-time retry is what we currently have in the hot-remove path. It occurs to me that maybe the pre-existing infinite loop shouldn't be there at all? Would it be better to let the callers retry? Obviously that would be a separate patch and I'm not sure it's safe to make that change, but the current loop seems buried maybe too far down. Thoughts, anyone? thanks, -- John Hubbard NVIDIA
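A bounded retry of the kind discussed here might look like the following sketch. The retry count, function names, and the simulated transient failures are all hypothetical; the point is only the shape: cap the attempts and hand the error back to the caller instead of looping forever:

```c
#include <assert.h>

/* Hedged sketch: attempt isolation/migration a fixed number of times
 * and return the error to the caller instead of looping forever. */
#define MIGRATE_MAX_RETRIES 10

/* Stand-in for the real isolation/migration step: pretend the first
 * few attempts fail transiently, then succeed. */
static int try_isolate_and_migrate(int attempt)
{
	return attempt < 3 ? -1 : 0;
}

static int migrate_with_retries(void)
{
	for (int i = 0; i < MIGRATE_MAX_RETRIES; i++) {
		if (try_isolate_and_migrate(i) == 0)
			return 0;
	}
	return -1;	/* caller unpins and reports the error */
}
```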
Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations
On 12/3/20 7:06 AM, Pavel Tatashin wrote: ... diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 611799c72da5..7a6d86d0bc5f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask) return alloc_flags; } -static inline unsigned int current_alloc_flags(gfp_t gfp_mask, - unsigned int alloc_flags) +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask, +unsigned int alloc_flags) Actually, maybe the original name should be left intact. This handles current alloc flags, which right now happen to only cover CMA flags, so the original name seems accurate, right? The reason I re-named it is because we do not access current context anymore, only use gfp_mask to get cma flag. - unsigned int pflags = current->flags; So, keeping "current" in the function name makes its intent misleading. OK, I see. That sounds OK then. thanks, -- John Hubbard NVIDIA
Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone
On 12/1/20 9:23 PM, Pavel Tatashin wrote: We do not allocate pinned pages in ZONE_MOVABLE, but if pages were already allocated before pinning they need to be migrated to a different zone. Currently, we migrate movable CMA pages only. Generalize the function that migrates CMA pages to migrate all movable pages. Signed-off-by: Pavel Tatashin --- include/linux/migrate.h| 1 + include/trace/events/migrate.h | 3 +- mm/gup.c | 56 +- 3 files changed, 24 insertions(+), 36 deletions(-) I like the cleanup so far, even at this point it's a relief to finally see the nested ifdefs get removed. One naming nit/idea below, but this looks fine as is, so: Reviewed-by: John Hubbard diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 0f8d1583fa8e..00bab23d1ee5 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -27,6 +27,7 @@ enum migrate_reason { MR_MEMPOLICY_MBIND, MR_NUMA_MISPLACED, MR_CONTIG_RANGE, + MR_LONGTERM_PIN, MR_TYPES }; diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h index 4d434398d64d..363b54ce104c 100644 --- a/include/trace/events/migrate.h +++ b/include/trace/events/migrate.h @@ -20,7 +20,8 @@ EM( MR_SYSCALL, "syscall_or_cpuset") \ EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")\ EM( MR_NUMA_MISPLACED, "numa_misplaced") \ - EMe(MR_CONTIG_RANGE,"contig_range") + EM( MR_CONTIG_RANGE,"contig_range") \ + EMe(MR_LONGTERM_PIN,"longterm_pin") /* * First define the enums in the above macros to be exported to userspace diff --git a/mm/gup.c b/mm/gup.c index 724d8a65e1df..1d511f65f8a7 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) } #endif -#ifdef CONFIG_CMA -static long check_and_migrate_cma_pages(struct mm_struct *mm, - unsigned long start, - unsigned long nr_pages, - struct page **pages, - struct vm_area_struct **vmas, - unsigned int gup_flags) +static long check_and_migrate_movable_pages(struct mm_struct *mm, + unsigned long start, + 
unsigned long nr_pages, + struct page **pages, + struct vm_area_struct **vmas, + unsigned int gup_flags) { unsigned long i; unsigned long step; bool drain_allow = true; bool migrate_allow = true; - LIST_HEAD(cma_page_list); + LIST_HEAD(page_list); Maybe naming it "movable_page_list", would be a nice touch. thanks, -- John Hubbard NVIDIA long ret = nr_pages; struct migration_target_control mtc = { .nid = NUMA_NO_NODE, @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm, */ step = compound_nr(head) - (pages[i] - head); /* -* If we get a page from the CMA zone, since we are going to -* be pinning these entries, we might as well move them out -* of the CMA zone if possible. +* If we get a movable page, since we are going to be pinning +* these entries, try to move them out if possible. */ - if (is_migrate_cma_page(head)) { + if (is_migrate_movable(get_pageblock_migratetype(head))) { if (PageHuge(head)) - isolate_huge_page(head, &cma_page_list); + isolate_huge_page(head, &page_list); else { if (!PageLRU(head) && drain_allow) { lru_add_drain_all(); @@ -1637,7 +1635,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm, } if (!isolate_lru_page(head)) { - list_add_tail(&head->lru, &cma_page_list); + list_add_tail(&head->lru, &page_list); mod_node_page_state(page_pgdat(head), NR_ISOLATED_ANON + page_is_file_lru(head), @@ -1649,7 +1647,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations
On 12/1/20 9:23 PM, Pavel Tatashin wrote: PF_MEMALLOC_NOMOVABLE is only honored for CMA allocations, extend this flag to work for any allocations by removing __GFP_MOVABLE from gfp_mask when this flag is passed in the current context, thus prohibiting allocations from ZONE_MOVABLE. Signed-off-by: Pavel Tatashin --- mm/hugetlb.c| 2 +- mm/page_alloc.c | 26 -- 2 files changed, 17 insertions(+), 11 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 02213c74ed6b..00e786201d8b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1036,7 +1036,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid) bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE); list_for_each_entry(page, &h->hugepage_freelists[nid], lru) { - if (nomovable && is_migrate_cma_page(page)) + if (nomovable && is_migrate_movable(get_pageblock_migratetype(page))) I wonder if we should add a helper, like is_migrate_cma_page(), that avoids having to call get_pageblock_migratetype() at all of the callsites? continue; if (PageHWPoison(page)) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 611799c72da5..7a6d86d0bc5f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask) return alloc_flags; } -static inline unsigned int current_alloc_flags(gfp_t gfp_mask, - unsigned int alloc_flags) +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask, + unsigned int alloc_flags) Actually, maybe the original name should be left intact. This handles current alloc flags, which right now happen to only cover CMA flags, so the original name seems accurate, right? 
thanks, John Hubbard NVIDIA { #ifdef CONFIG_CMA - unsigned int pflags = current->flags; - - if (!(pflags & PF_MEMALLOC_NOMOVABLE) && - gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE) + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE) alloc_flags |= ALLOC_CMA; - #endif return alloc_flags; } +static inline gfp_t current_gfp_checkmovable(gfp_t gfp_mask) +{ + unsigned int pflags = current->flags; + + if ((pflags & PF_MEMALLOC_NOMOVABLE)) + return gfp_mask & ~__GFP_MOVABLE; + return gfp_mask; +} + /* * get_page_from_freelist goes through the zonelist trying to allocate * a page. @@ -4423,7 +4428,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask) } else if (unlikely(rt_task(current)) && !in_interrupt()) alloc_flags |= ALLOC_HARDER; - alloc_flags = current_alloc_flags(gfp_mask, alloc_flags); + alloc_flags = cma_alloc_flags(gfp_mask, alloc_flags); return alloc_flags; } @@ -4725,7 +4730,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, reserve_flags = __gfp_pfmemalloc_flags(gfp_mask); if (reserve_flags) - alloc_flags = current_alloc_flags(gfp_mask, reserve_flags); + alloc_flags = cma_alloc_flags(gfp_mask, reserve_flags); /* * Reset the nodemask and zonelist iterators if memory policies can be @@ -4894,7 +4899,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, if (should_fail_alloc_page(gfp_mask, order)) return false; - *alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags); + *alloc_flags = cma_alloc_flags(gfp_mask, *alloc_flags); /* Dirty zone balancing only done in the fast path */ ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE); @@ -4932,6 +4937,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, } gfp_mask &= gfp_allowed_mask; + gfp_mask = current_gfp_checkmovable(gfp_mask); alloc_mask = gfp_mask; if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags)) return NULL;
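By analogy with is_migrate_cma_page(), the helper floated above might look like the sketch below. The helper's name is hypothetical, and the page and migratetype types are stubbed out so the sketch is self-contained:

```c
#include <assert.h>
#include <stdbool.h>

enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_CMA };

struct toy_page { enum migratetype mt; };

/* Stand-in for the real pageblock lookup. */
static enum migratetype get_pageblock_migratetype(const struct toy_page *page)
{
	return page->mt;
}

/* Mirrors the existing predicate: CMA pageblocks are movable too. */
static bool is_migrate_movable(enum migratetype mt)
{
	return mt == MIGRATE_CMA || mt == MIGRATE_MOVABLE;
}

/* The proposed helper: callers no longer open-code the lookup. */
static bool is_migrate_movable_page(const struct toy_page *page)
{
	return is_migrate_movable(get_pageblock_migratetype(page));
}
```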
Re: [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE
On 12/1/20 9:23 PM, Pavel Tatashin wrote: PF_MEMALLOC_NOCMA is used for longterm pinning and has an effect of clearing _GFP_MOVABLE or prohibiting allocations from ZONE_MOVABLE. We will prohibit allocating any pages that are getting longterm pinned from ZONE_MOVABLE, and we would want to unify and re-use this flag. So, rename it to generic PF_MEMALLOC_NOMOVABLE. Also re-name: memalloc_nocma_save()/memalloc_nocma_restore to memalloc_nomovable_save()/memalloc_nomovable_restore() and make the new functions common. Signed-off-by: Pavel Tatashin Looks accurate, and grep didn't find any lingering rename candidates after this series is applied. And it's a good rename. Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA --- include/linux/sched.h| 2 +- include/linux/sched/mm.h | 21 + mm/gup.c | 4 ++-- mm/hugetlb.c | 4 ++-- mm/page_alloc.c | 4 ++-- 5 files changed, 12 insertions(+), 23 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 76cd21fa5501..f1bf05f5f5fa 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1548,7 +1548,7 @@ extern struct pid *cad_pid; #define PF_SWAPWRITE 0x0080 /* Allowed to write to swap */ #define PF_NO_SETAFFINITY 0x0400 /* Userland is not allowed to meddle with cpus_mask */ #define PF_MCE_EARLY 0x0800 /* Early kill for mce process policy */ -#define PF_MEMALLOC_NOCMA 0x1000 /* All allocation request will have _GFP_MOVABLE cleared */ +#define PF_MEMALLOC_NOMOVABLE 0x1000 /* All allocation request will have _GFP_MOVABLE cleared */ #define PF_FREEZER_SKIP 0x4000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x8000 /* This thread called freeze_processes() and should not be frozen */ diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index d5ece7a9a403..5bb9a6b69479 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -254,29 +254,18 @@ static inline void memalloc_noreclaim_restore(unsigned int flags) current->flags = (current->flags & 
~PF_MEMALLOC) | flags; } -#ifdef CONFIG_CMA -static inline unsigned int memalloc_nocma_save(void) +static inline unsigned int memalloc_nomovable_save(void) { - unsigned int flags = current->flags & PF_MEMALLOC_NOCMA; + unsigned int flags = current->flags & PF_MEMALLOC_NOMOVABLE; - current->flags |= PF_MEMALLOC_NOCMA; + current->flags |= PF_MEMALLOC_NOMOVABLE; return flags; } -static inline void memalloc_nocma_restore(unsigned int flags) +static inline void memalloc_nomovable_restore(unsigned int flags) { - current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags; + current->flags = (current->flags & ~PF_MEMALLOC_NOMOVABLE) | flags; } -#else -static inline unsigned int memalloc_nocma_save(void) -{ - return 0; -} - -static inline void memalloc_nocma_restore(unsigned int flags) -{ -} -#endif #ifdef CONFIG_MEMCG DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg); diff --git a/mm/gup.c b/mm/gup.c index 0e2de888a8b0..724d8a65e1df 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1726,7 +1726,7 @@ static long __gup_longterm_locked(struct mm_struct *mm, if (!vmas_tmp) return -ENOMEM; } - flags = memalloc_nocma_save(); + flags = memalloc_nomovable_save(); } rc = __get_user_pages_locked(mm, start, nr_pages, pages, @@ -1749,7 +1749,7 @@ static long __gup_longterm_locked(struct mm_struct *mm, rc = check_and_migrate_cma_pages(mm, start, rc, pages, vmas_tmp, gup_flags); out: - memalloc_nocma_restore(flags); + memalloc_nomovable_restore(flags); } if (vmas_tmp != vmas) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 37f15c3c24dc..02213c74ed6b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1033,10 +1033,10 @@ static void enqueue_huge_page(struct hstate *h, struct page *page) static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid) { struct page *page; - bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA); + bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE); list_for_each_entry(page, &h->hugepage_freelists[nid], lru) { - if (nocma && 
is_migrate_cma_page(page)) + if (nomovable && is_migrate_cma_page(page)) continue; if (PageHWPoison(page)) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eaa227a479e4..611799c72da5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3772,8 +3772,8 @@ static inline unsigned int current_alloc
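The save/restore pair being renamed here follows the usual task-flag idiom: save() returns the bit's previous state so that nested sections restore correctly. A self-contained model, with a plain global standing in for current->flags:

```c
#include <assert.h>

#define PF_MEMALLOC_NOMOVABLE 0x1000

/* Stand-in for current->flags. */
static unsigned int task_flags;

/* Set the flag and return its previous state, so callers can nest. */
static unsigned int memalloc_nomovable_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOMOVABLE;

	task_flags |= PF_MEMALLOC_NOMOVABLE;
	return old;
}

/* Restore exactly the saved state, clearing only this flag's bit. */
static void memalloc_nomovable_restore(unsigned int flags)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOMOVABLE) | flags;
}
```

The inner restore below leaves the flag set because the outer section still holds it, which is exactly why the old state is returned rather than the flag being blindly cleared.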
Re: [PATCH 3/6] mm/gup: make __gup_longterm_locked common
On 12/1/20 9:23 PM, Pavel Tatashin wrote: __gup_longterm_locked() has CMA || FS_DAX version and a common stub version. In the preparation of prohibiting longterm pinning of pages from movable zone make the CMA || FS_DAX version common, and delete the stub version. Signed-off-by: Pavel Tatashin --- mm/gup.c | 13 - 1 file changed, 13 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 3a76c005a3e2..0e2de888a8b0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1567,7 +1567,6 @@ struct page *get_dump_page(unsigned long addr) } #endif /* CONFIG_ELF_CORE */ -#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA) #ifdef CONFIG_FS_DAX static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) { @@ -1757,18 +1756,6 @@ static long __gup_longterm_locked(struct mm_struct *mm, kfree(vmas_tmp); return rc; } -#else /* !CONFIG_FS_DAX && !CONFIG_CMA */ -static __always_inline long __gup_longterm_locked(struct mm_struct *mm, - unsigned long start, - unsigned long nr_pages, - struct page **pages, - struct vm_area_struct **vmas, - unsigned int flags) -{ - return __get_user_pages_locked(mm, start, nr_pages, pages, vmas, - NULL, flags); -} -#endif /* CONFIG_FS_DAX || CONFIG_CMA */ static bool is_valid_gup_flags(unsigned int gup_flags) { At last some simplification here, yea! Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA
Re: [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone
On 12/1/20 9:23 PM, Pavel Tatashin wrote: In order not to fragment CMA the pinned pages are migrated. However, they are migrated to ZONE_MOVABLE, which also should not have pinned pages. Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning is allowed. Signed-off-by: Pavel Tatashin --- mm/gup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index cdb8b9eeb016..3a76c005a3e2 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1610,7 +1610,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm, long ret = nr_pages; struct migration_target_control mtc = { .nid = NUMA_NO_NODE, - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN, + .gfp_mask = GFP_USER | __GFP_NOWARN, }; check_again: Reviewed-by: John Hubbard ...while I was here, I noticed this, which is outside the scope of your patchset, but I thought I'd just mention it anyway in case anyone cares: static inline bool is_migrate_movable(int mt) { return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE; } ...that really should take an "enum migratetype mt" instead of an int, I think. thanks, -- John Hubbard NVIDIA
Re: [PATCH 1/6] mm/gup: perform check_dax_vmas only when FS_DAX is enabled
On 12/1/20 9:23 PM, Pavel Tatashin wrote: There is no need to check_dax_vmas() and run through the npage loop of pinned pages if FS_DAX is not enabled. Add a stub check_dax_vmas() function for no-FS_DAX case. Signed-off-by: Pavel Tatashin --- mm/gup.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index 98eb8e6d2609..cdb8b9eeb016 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1568,6 +1568,7 @@ struct page *get_dump_page(unsigned long addr) #endif /* CONFIG_ELF_CORE */ #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA) +#ifdef CONFIG_FS_DAX static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) { long i; @@ -1586,6 +1587,12 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) } return false; } +#else +static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) +{ + return false; +} +#endif #ifdef CONFIG_CMA static long check_and_migrate_cma_pages(struct mm_struct *mm, Looks obviously correct, and the follow-up simplication is very nice. Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA
Re: Pinning ZONE_MOVABLE pages
On 11/20/20 12:27 PM, Pavel Tatashin wrote: Recently, I encountered a hang that is happening during memory hot remove operation. It turns out that the hang is caused by pinned user pages in ZONE_MOVABLE. Kernel expects that all pages in ZONE_MOVABLE can be migrated, but this is not the case if user applications such as the dpdk libraries pinned them via vfio dma map. Kernel keeps trying to hot-remove them, but refcnt never gets to zero, so we are looping until the hardware watchdog kicks in. We cannot do dma unmaps before hot-remove, because hot-remove is a slow operation, and we have thousands of network flows handled by dpdk that we just cannot suspend for the duration of hot-remove operation. The solution is for dpdk to allocate pages from a zone below ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible. There is no user interface that we have that allows applications to select what zone the memory should come from. I've spoken with Stephen Hemminger, and he said that DPDK is moving in the direction of using transparent huge pages instead of HugeTLBs, which means that we need to allow at least anonymous, and anonymous transparent huge pages to come from non-movable zones on demand. Here is what I am proposing: 1. Add a new flag that is passed through pin_user_pages_* down to fault handlers, and allow the fault handler to allocate from a non-movable zone. I like where the discussion so far (in the other threads) has taken this. And the current plan also implies, I think, that you can probably avoid any new flags at all: just check that both FOLL_LONGTERM and FOLL_PIN are set, and if they are, then make your attempt to migrate away from ZONE_MOVABLE. 
Sample function stacks through which this info needs to be passed is this: pin_user_pages_remote(gup_flags) __get_user_pages_remote(gup_flags) __gup_longterm_locked(gup_flags) __get_user_pages_locked(gup_flags) __get_user_pages(gup_flags) faultin_page(gup_flags) Convert gup_flags into fault_flags handle_mm_fault(fault_flags) I'm pleased that the gup_flags have pretty much been plumbed through all the main places that they were missing, so there shouldn't be too much required at this point. From handle_mm_fault(), the stack diverges into various faults, examples include: Transparent Huge Page handle_mm_fault(fault_flags) __handle_mm_fault(fault_flags) Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask create_huge_pmd(vmf); do_huge_pmd_anonymous_page(vmf); mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from vmf.gfp_mask should be passed as well. There are several other similar paths in a transparent huge page, also there is a named path where allocation is based on filesystems, and the flag should be honored there as well, but it does not have to be added at the same time. Regular Pages handle_mm_fault(fault_flags) __handle_mm_fault(fault_flags) Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask handle_pte_fault(vmf) do_anonymous_page(vmf); page = alloc_zeroed_user_highpage_movable(vma, vmf->address); -> replace change this call according to gfp_mask. The above only take care of the case if user application faults on the page during pinning time, but there are also cases where pages already exist. 2. Add an internal move_pages_zone() similar to move_pages() syscall but instead of migrating to a different NUMA node, migrate pages from ZONE_MOVABLE to another zone. Call move_pages_zone() on demand prior to pinning pages from vfio_pin_map_dma() for instance. 3. Perhaps, it also makes sense to add madvise() flag, to allocate pages from non-movable zone. 
When a user application knows that it will do DMA mapping, and pin pages for a long time, the memory that it allocates should never be migrated or hot-removed, so make sure that it comes from the appropriate place. The benefit of adding a madvise() flag is that we won't have to deal with slow page migration during pin time, but the disadvantage is that we would need to change the user interface. Before I start working on the above approaches, I would like to get an opinion from the community on an appropriate path forward for this problem: whether what I described sounds reasonable, or whether there are other ideas on how to address the problem that I am seeing. I'm also in favor of avoiding (3) for now and maybe forever, depending on how it goes. Good luck... :) thanks, -- John Hubbard NVIDIA
Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression
On 11/18/20 10:17 AM, Dan Williams wrote: On Wed, Nov 18, 2020 at 5:51 AM Jan Kara wrote: On Mon 16-11-20 19:35:31, John Hubbard wrote: On 11/16/20 6:48 PM, kernel test robot wrote: Greeting, FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s due to commit: That's a huge slowdown... commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages") https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master ...but that commit happened in April, 2020. Surely if this were a serious issue we would have some other indication...is this worth following up on?? I'm inclined to ignore it, honestly. Why this was detected so late is a fair question although it doesn't quite invalidate the report... I don't know what specifically happened in this case, perhaps someone from the lkp team can comment? However, the myth / contention that "surely someone else would have noticed by now" is why the lkp project was launched. Kernels regressed without much complaint and it wasn't until much later in the process, around the time enterprise distros rebased to new kernels, did end users start filing performance loss regression reports. Given -stable kernel releases, 6-7 months is still faster than many end user upgrade cycles to new kernel baselines. I see, thanks for explaining. I'll take a peek, then. thanks, -- John Hubbard NVIDIA
Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression
On 11/16/20 6:48 PM, kernel test robot wrote: Greeting, FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s due to commit: That's a huge slowdown... commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages") https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master ...but that commit happened in April, 2020. Surely if this were a serious issue we would have some other indication...is this worth following up on?? I'm inclined to ignore it, honestly. thanks, -- John Hubbard NVIDIA in testcase: phoronix-test-suite on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory with following parameters: test: npb-1.3.1 option_a: FT.A cpufreq_governor: performance ucode: 0x5002f01 test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added. 
test-url: http://www.phoronix-test-suite.com/ If you fix the issue, kindly add following tag Reported-by: kernel test robot Details are as below: --> To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp install job.yaml # job file is attached in this email bin/lkp run job.yaml = compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase/ucode: gcc-9/performance/x86_64-rhel-8.3/FT.A/debian-x86_64-phoronix/lkp-csl-2sp8/npb-1.3.1/phoronix-test-suite/0x5002f01 commit: 3faa52c03f ("mm/gup: track FOLL_PIN pages") 47e29d32af ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")
Key comparison, 3faa52c03f440d1b vs 47e29d32afba11b13efb51f0315:
fail:runs 1:4 (-25% reproduction) kmsg.Spurious_LAPIC_timer_interrupt_on_cpu
4585 ± 2% -45.0% 2522 phoronix-test-suite.npb.FT.A.total_mop_s
1223 ± 4% +40.2% 1714 phoronix-test-suite.time.percent_of_cpu_this_job_got
[ASCII bisect chart of phoronix-test-suite.npb.FT.A.total_mop_s omitted: bisect-good samples [*] cluster roughly between 4500 and 6200 mop/s, bisect-bad samples [O] roughly between 2200 and 2600 mop/s.]
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Thanks, Oliver Sang
Re: [PATCH v3 1/7] compiler-clang: add build check for clang 10.0.1
Hi, I just ran into this and it's a real pain to figure out, because even with the very latest Fedora 33 on my test machine, which provides clang version 11.0.0: $ clang --version clang version 11.0.0 (Fedora 11.0.0-2.fc33) Target: x86_64-unknown-linux-gnu ...the bpftrace program still chokes on some, but not all commands, in ways that are invisible to normal debugging. For example: $ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_vmsplice { @[kstack()] = count(); }' /lib/modules/5.10.0-rc4-hubbard-github+/source/include/linux/compiler-clang.h:12:3: error: Sorry, your version of Clang is too old - please use 10.0.1 or newer. But Jarkko's recommended fix works! In other words, applying the diff below fixes it for me. So I'm replying in order to note that the problem is real and hoping that the fix is applied soon. diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h index dd7233c48bf3..c2228b957fd7 100644 --- a/include/linux/compiler-clang.h +++ b/include/linux/compiler-clang.h @@ -7,9 +7,11 @@ + __clang_minor__ * 100\ + __clang_patchlevel__) +#ifndef __BPF_TRACING__ #if CLANG_VERSION < 11 # error Sorry, your version of Clang is too old - please use 10.0.1 or newer. #endif +#endif /* Compiler specific definitions for Clang compiler */ thanks, -- John Hubbard NVIDIA
Re: [PATCH v2] mm/gup_test: GUP_TEST depends on DEBUG_FS
On 11/8/20 12:37 AM, Barry Song wrote: Without DEBUG_FS, all the code in gup_test becomes meaningless. For sure kernel provides debugfs stub while DEBUG_FS is disabled, but the point here is that GUP_TEST can do nothing without DEBUG_FS. Cc: John Hubbard Cc: Ralph Campbell Cc: Randy Dunlap Suggested-by: John Garry Signed-off-by: Barry Song --- -v2: add comment as a prompt to users as commented by John and Randy, thanks! mm/Kconfig | 4 1 file changed, 4 insertions(+) Thanks for suffering through a lot of discussion about this! Reviewed-by: John Hubbard thanks, -- John Hubbard NVIDIA diff --git a/mm/Kconfig b/mm/Kconfig index 01b0ae0cd9d3..a7ff0d31afd5 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -836,6 +836,7 @@ config PERCPU_STATS config GUP_TEST bool "Enable infrastructure for get_user_pages()-related unit tests" + depends on DEBUG_FS help Provides /sys/kernel/debug/gup_test, which in turn provides a way to make ioctl calls that can launch kernel-based unit tests for @@ -853,6 +854,9 @@ config GUP_TEST See tools/testing/selftests/vm/gup_test.c +comment "GUP_TEST needs to have DEBUG_FS enabled" + depends on !GUP_TEST && !DEBUG_FS + config GUP_GET_PTE_LOW_HIGH bool
Re: [PATCH 2/2] tomoyo: Fixed typo in documentation
On 11/8/20 7:41 PM, Souptick Joarder wrote: On Sat, Nov 7, 2020 at 2:27 PM John Hubbard wrote: On 11/7/20 12:24 AM, Souptick Joarder wrote: Fixed typo s/Poiner/Pointer Fixes: 5b636857fee6 ("TOMOYO: Allow using argv[]/envp[] of execve() as conditions.") Signed-off-by: Souptick Joarder Cc: John Hubbard --- security/tomoyo/domain.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c index bd748be..7b2babe 100644 --- a/security/tomoyo/domain.c +++ b/security/tomoyo/domain.c @@ -891,7 +891,7 @@ int tomoyo_find_next_domain(struct linux_binprm *bprm) * * @bprm: Pointer to "struct linux_binprm". * @pos: Location to dump. - * @dump: Poiner to "struct tomoyo_page_dump". + * @dump: Pointer to "struct tomoyo_page_dump". Not worth a separate patch, especially since the original comment is merely copying the C sources, and as such, does not add any value. I'd either a) craft a new documentation line that adds some value, or b) just merge this patch into the previous one, and make a note in the commit description to the effect that you've included a trivial typo fix as long as you're there. John, as patch[1/2] is dropped, can we take this patch forward with some more updates in documentations ? That's really up to the folks who work on this code. Personally I would rarely post a patch *just* for this, but on the other hand it is a correction. Either way is fine with me of course. thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS
On 11/7/20 11:35 PM, Song Bao Hua (Barry Song) wrote: Or do you want this ? (Code B) diff --git a/mm/Kconfig b/mm/Kconfig index 01b0ae0cd9d3..a7ff0d31afd5 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -836,6 +836,7 @@ config PERCPU_STATS config GUP_TEST bool "Enable infrastructure for get_user_pages()-related unit tests" + depends on DEBUG_FS help Provides /sys/kernel/debug/gup_test, which in turn provides a way to make ioctl calls that can launch kernel-based unit tests for @@ -853,6 +854,9 @@ config GUP_TEST See tools/testing/selftests/vm/gup_test.c +comment "GUP_TEST needs to have DEBUG_FS enabled" + depends on !GUP_TEST && !DEBUG_FS + config GUP_GET_PTE_LOW_HIGH bool The above, please. It's your original patch, plus a way to display something if the "depends on" is not met. To be honest, I am not a big fan of both of code A and B. I think "depends on" has clearly said everything the redundant comment wants to say. It's not really just a "comment", in the sense of commenting the Kconfig sources. It actually has an effect, which is that it displays something in "make menuconfig". So it's not redundant, because it provides that output *instead* of hiding the option entirely, when !DEBUG_FS. Try it out, it's nice. │ Symbol: GUP_TEST [=] │ Type : bool │ Defined at mm/Kconfig:837 │ Prompt: Enable infrastructure for get_user_pages()-related unit tests │ Depends on: DEBUG_FS [=n] │ Location: │ (1) -> Memory Management option Menuconfig shows GUP_TEST depends on DEBUG_FS and right now DEBUG_FS is "n". so it is impossible to enable GUP_TEST. "comment" is a good thing, but it is more likely to be used for a menu or a group of configurations to extend a bundle of things. On the other hand, If this particular case needs this comment, so do countless other configurations in hundreds of Kconfig files. Well, maybe, yes. I personally find it quite difficult, having options appear and disappear on me, in this system. 
If they all had this "comment" behavior by default, to show up as a placeholder, I think it would be a better user experience. thanks, -- John Hubbard NVIDIA
Re: [PATCH 1/2] tomoyo: Convert get_user_pages*() to pin_user_pages*()
On 11/7/20 8:12 PM, Tetsuo Handa wrote: On 2020/11/08 11:17, John Hubbard wrote: Excuse me, but Documentation/core-api/pin_user_pages.rst says "CASE 5: Pinning in order to _write_ to the data within the page" while tomoyo_dump_page() is for "_read_ the data within the page". Do we want to convert to pin_user_pages_remote() or lock_page() ? Sorry, I missed the direction here, was too focused on the Case 5 aspect. Yes. Case 5 (which, again, I think we're about to re-document) is only about *writing* to data within the page. So in this case, where it is just reading from the page, I think it's already from a gup vs pup point of view. btw, it's not clear to me whether the current code is susceptible to any sort of problem involving something writing to the page while it is being dumped (I am curious). But changing from gup to pup wouldn't fix that, if it were a problem. It a separate question from this patch. The "struct page" tomoyo_dump_page() accesses is argv/envp arguments passed to execve() syscall. Therefore, these pages are not visible from threads except current thread, and thus there is no possibility that these pages are modified by other threads while current thread is reading. Perfect. So since I accidentally left out the word "correct" above (I meant to write, "it's already correct"), let me be extra clear: Souptick, we should just drop this patch. thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS
On 11/7/20 8:11 PM, Randy Dunlap wrote: ... Look at kconfig-language.rst instead. aha, yes. One thing that could be done (and is done in a few places for other reasons) is to add a Kconfig comment if DEBUG_FS is not enabled: comment "GUP_TEST needs to have DEBUG_FS enabled" depends on !GUP_TEST && !DEBUG_FS Sweet--I just applied that here, and it does exactly what I wanted: puts a nice clear message on the "make menuconfig" screen. No more hidden item. Brilliant! Let's go with that, shall we? thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS
On 11/7/20 7:14 PM, John Hubbard wrote: On 11/7/20 6:58 PM, Song Bao Hua (Barry Song) wrote: On 11/7/20 2:20 PM, Randy Dunlap wrote: On 11/7/20 11:16 AM, John Hubbard wrote: On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote: From: John Hubbard [mailto:jhubb...@nvidia.com] ... But if you really disagree, then I'd go with, just drop the patch entirely, because it doesn't really make things better as written...IMHO anyway. :) Just imagine a case, we don't enable DEBUG_FS but we enable GUP_TEST, we will get an image with totally useless code section since GUP_TEST depends on debugfs entry to perform any useful functionality. Looking at the choices, from the user's (usually kernel developer's) experience: a) The user enables GUP_TEST, then boots up, runs, and is briefly surprised by a runtime failure. But it's a very quick diagnosis: "open: No such file or directory", when trying to make that ioctl call. The path indicates that it's a debug fs path, so the solution is pretty clear, at least for the main audience. b) The other choice: the user *never even sees* GUP_TEST as a choice. This especially bothers me because sometimes you find things by poking around in the menu, although of course "you should already know about it"...but there's a lot to "already know" in a large kernel. From a user experience, it's way better to simply see what you want, and select it in the menu. Or, at least get some prompt that you need to pre-select something else. ...and again, this is all fallout from Kconfig. I might be missing some advanced feature, because it seems surprising to only be allowed to choose between missing dependencies (which this patch would correct), or a poorer user experience (which I argue that this patch would also provide). Ideally, we'd just set up the dependencies, and then have some options for visibility, but I'm not aware of any way to do that...and after a quick peek at Documentation/kbuild/kconfig-macro-language.rst it looks pretty bare bones. 
thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS
On 11/7/20 6:58 PM, Song Bao Hua (Barry Song) wrote: On 11/7/20 2:20 PM, Randy Dunlap wrote: On 11/7/20 11:16 AM, John Hubbard wrote: On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote: From: John Hubbard [mailto:jhubb...@nvidia.com] ... But if you really disagree, then I'd go with, just drop the patch entirely, because it doesn't really make things better as written...IMHO anyway. :) Just imagine a case, we don't enable DEBUG_FS but we enable GUP_TEST, we will get an image with totally useless code section since GUP_TEST depends on debugfs entry to perform any useful functionality. Looking at the choices, from the user's (usually kernel developer's) experience: a) The user enables GUP_TEST, then boots up, runs, and is briefly surprised by a runtime failure. But it's a very quick diagnosis: "open: No such file or directory", when trying to make that ioctl call. The path indicates that it's a debug fs path, so the solution is pretty clear, at least for the main audience. b) The other choice: the user *never even sees* GUP_TEST as a choice. This especially bothers me because sometimes you find things by poking around in the menu, although of course "you should already know about it"...but there's a lot to "already know" in a large kernel. From a user experience, it's way better to simply see what you want, and select it in the menu. Or, at least get some prompt that you need to pre-select something else. The difference between "depends on" and "select" for this case is like: depends on: if we want to use GUP_TEST, we have to enable DEBUG_FS first; select: if we enable GUP_TEST, Kconfig will enable DEBUG_FS automatically. To me, I am 60% inclined to "depends on" as I think "DEBUG_FS" is more of a pre-condition of GUP_TEST than an internal part of GUP_TEST. So people should realize the pre-condition must be met before using GUP_TEST and Right, but first of course they must read every single line of the test code carefully. 
And while it is true that you often *do* end up reading most or all of the test code, there are situations in which you don't need to. We'd be taking away some of those situations. :) they must manually enable it if they haven't. That's why I think this patch is making things better. ...which makes things a little bit worse. thanks, -- John Hubbard NVIDIA
Re: [PATCH 1/2] tomoyo: Convert get_user_pages*() to pin_user_pages*()
On 11/7/20 5:13 PM, Tetsuo Handa wrote: On 2020/11/08 4:17, John Hubbard wrote: On 11/7/20 1:04 AM, John Hubbard wrote: On 11/7/20 12:24 AM, Souptick Joarder wrote: In 2019, we introduced pin_user_pages*() and now we are converting get_user_pages*() to the new API as appropriate. [1] & [2] could be referred for more information. This is case 5 as per document [1]. It turns out that Case 5 can be implemented via a better pattern, as long as we're just dealing with a page at a time, briefly: lock_page() write to page's data unlock_page() ...which neatly synchronizes with writeback and other fs activities. Ahem, I left out a key step: set_page_dirty()! lock_page() write to page's data set_page_dirty() unlock_page() Excuse me, but Documentation/core-api/pin_user_pages.rst says "CASE 5: Pinning in order to _write_ to the data within the page" while tomoyo_dump_page() is for "_read_ the data within the page". Do we want to convert to pin_user_pages_remote() or lock_page() ? Sorry, I missed the direction here, was too focused on the Case 5 aspect. Yes. Case 5 (which, again, I think we're about to re-document) is only about *writing* to data within the page. So in this case, where it is just reading from the page, I think it's already from a gup vs pup point of view. btw, it's not clear to me whether the current code is susceptible to any sort of problem involving something writing to the page while it is being dumped (I am curious). But changing from gup to pup wouldn't fix that, if it were a problem. It a separate question from this patch. (Souptick, if you're interested, the Case 5 documentation change and callsite retrofit is all yours if you want it. Otherwise it's on my list.) thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS
On 11/7/20 4:24 PM, Randy Dunlap wrote: On 11/7/20 4:03 PM, John Hubbard wrote: On 11/7/20 2:20 PM, Randy Dunlap wrote: On 11/7/20 11:16 AM, John Hubbard wrote: On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote: ... OK, thanks, I see how you get that list now. JFTR, those are not 42 independent users of DEBUG_FS. There are lots of &&s and a few ||s in that list. You are right, and that means that I have to withdraw my earlier claim that "42 committers" made such a decision. xconfig shows this for DEBUG_FS: (13 selects for x86_64 allmodconfig) ...maybe only 13 committers and dependency chain, then. :) thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS
On 11/7/20 2:20 PM, Randy Dunlap wrote: On 11/7/20 11:16 AM, John Hubbard wrote: On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote: -Original Message- From: John Hubbard [mailto:jhubb...@nvidia.com] ... config GUP_BENCHMARK bool "Enable infrastructure for get_user_pages() and related calls benchmarking" + depends on DEBUG_FS I think "select DEBUG_FS" is better here. "depends on" has the obnoxious behavior of hiding the choice from you, if the dependencies aren't already met. Whereas what the developer *really* wants is a no-nonsense activation of the choice: "enable GUP_BENCHMARK and the debug fs that it requires". To some extent, I agree with you. But I still think here it is better to use "depends on". According to https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt select should be used with care. select will force a symbol to a value without visiting the dependencies. By abusing select you are able to select a symbol FOO even if FOO depends on BAR that is not set. In general use select only for non-visible symbols (no prompts anywhere) and for symbols with no dependencies. That will limit the usefulness but on the other hand avoid the illegal configurations all over. On the other hand, in kernel there are 78 "depends on DEBUG_FS" and only 14 "select DEBUG_FS". You're not looking at the best statistics. Go look at what *already* selects DEBUG_FS, and you'll find about 50 items. Sorry, I'm not following you. I see the same 14 "select DEBUG_FS" as Barry. I ran make menuconfig, and looked at it. Because I want to see the true end result, and I didn't trust my grep use, given that the system has interlocking dependencies, and I think one select could end up activating others (yes?). 
And sure enough, there are 42 items listed, here they are, cleaned up so that there is one per line: ZSMALLOC_STAT [=n] ZSMALLOC [=m] BCACHE_CLOSURES_DEBUG [=n] MD [=y] BCACHE [=n] DVB_C8SECTPFE [=n] MEDIA_SUPPORT [=m] MEDIA_PLATFORM_SUPPORT [=y] DVB_PLATFORM_DRIVERS [=n] PINCT DRM_I915_DEBUG [=n] HAS_IOMEM [=y] EXPERT [=y] DRM_I915 [=m] EDAC_DEBUG [=n] EDAC [=y] SUNRPC_DEBUG [=n] NETWORK_FILESYSTEMS [=y] SUNRPC [=m] SYSCTL [=y] PAGE_OWNER [=n] DEBUG_KERNEL [=y] STACKTRACE_SUPPORT [=y] DEBUG_KMEMLEAK [=n] DEBUG_KERNEL [=y] HAVE_DEBUG_KMEMLEAK [=y] BLK_DEV_IO_TRACE [=n] TRACING_SUPPORT [=y] FTRACE [=y] SYSFS [=y] BLOCK [=y] PUNIT_ATOM_DEBUG [=n] PCI [=y] NOTIFIER_ERROR_INJECTION [=n] DEBUG_KERNEL [=y] FAIL_FUTEX [=n] FAULT_INJECTION [=n] FUTEX [=y] KCOV [=n] ARCH_HAS_KCOV [=y] CC_HAS_SANCOV_TRACE_PC [=y] GCC_PLUGINS In general we don't want any one large "feature" (or subsystem) to be enabled by one driver. If someone has gone to the trouble to disable DEBUG_FS (or whatever), then a different Kconfig symbol shouldn't undo that. I agree with the "in general" point, yes. And my complaint is really 80% due to the very unhappy situation with Kconfig, where we seem to get a choice between *hiding* a feature, or forcing a dependency break. What we really want is a way to indicate a dependency that doesn't hide entire features, unless we want that. (Maybe I should attempt to get into the implementation, although I suspect it's harder than I realize.) But the other 20% of my complaint is, given what we have, I think the appropriate adaptation for GUP_BENCHMARK's relationship to DEBUG_FS *in particular*, is: select. And 42 other committers have chosen the same thing, for their relationship to DEBUG_FS. I'm in good company. But if you really disagree, then I'd go with, just drop the patch entirely, because it doesn't really make things better as written...IMHO anyway. :) thanks, -- John Hubbard NVIDIA
Re: [PATCH 1/2] tomoyo: Convert get_user_pages*() to pin_user_pages*()
On 11/7/20 1:04 AM, John Hubbard wrote: On 11/7/20 12:24 AM, Souptick Joarder wrote: In 2019, we introduced pin_user_pages*() and now we are converting get_user_pages*() to the new API as appropriate. [1] & [2] could be referred for more information. This is case 5 as per document [1]. It turns out that Case 5 can be implemented via a better pattern, as long as we're just dealing with a page at a time, briefly: lock_page() write to page's data unlock_page() ...which neatly synchronizes with writeback and other fs activities. Ahem, I left out a key step: set_page_dirty()! lock_page() write to page's data set_page_dirty() unlock_page() thanks, -- John Hubbard NVIDIA
Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS
On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote: -Original Message- From: John Hubbard [mailto:jhubb...@nvidia.com] ... config GUP_BENCHMARK bool "Enable infrastructure for get_user_pages() and related calls benchmarking" + depends on DEBUG_FS I think "select DEBUG_FS" is better here. "depends on" has the obnoxious behavior of hiding the choice from you, if the dependencies aren't already met. Whereas what the developer *really* wants is a no-nonsense activation of the choice: "enable GUP_BENCHMARK and the debug fs that it requires". To some extent, I agree with you. But I still think here it is better to use "depends on". According to https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt select should be used with care. select will force a symbol to a value without visiting the dependencies. By abusing select you are able to select a symbol FOO even if FOO depends on BAR that is not set. In general use select only for non-visible symbols (no prompts anywhere) and for symbols with no dependencies. That will limit the usefulness but on the other hand avoid the illegal configurations all over. On the other hand, in kernel there are 78 "depends on DEBUG_FS" and only 14 "select DEBUG_FS". You're not looking at the best statistics. Go look at what *already* selects DEBUG_FS, and you'll find about 50 items. thanks, -- John Hubbard NVIDIA
Re: [PATCH 1/2] tomoyo: Convert get_user_pages*() to pin_user_pages*()
On 11/7/20 12:24 AM, Souptick Joarder wrote: In 2019, we introduced pin_user_pages*() and now we are converting get_user_pages*() to the new API as appropriate. [1] & [2] could be referred for more information. This is case 5 as per document [1]. It turns out that Case 5 can be implemented via a better pattern, as long as we're just dealing with a page at a time, briefly: lock_page() write to page's data unlock_page() ...which neatly synchronizes with writeback and other fs activities. I was going to track down the Case 5's and do that [1]. +CC Jan and Matthew, to keep us on the straight and narrow, just in case I'm misunderstanding something. [1] https://lore.kernel.org/r/e78fb7af-627b-ce80-275e-51f97f1f3...@nvidia.com thanks, -- John Hubbard NVIDIA [1] Documentation/core-api/pin_user_pages.rst [2] "Explicit pinning of user-space pages": https://lwn.net/Articles/807108/ Signed-off-by: Souptick Joarder Cc: John Hubbard --- security/tomoyo/domain.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c index dc4ecc0..bd748be 100644 --- a/security/tomoyo/domain.c +++ b/security/tomoyo/domain.c @@ -914,7 +914,7 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned long pos, * (represented by bprm). 'current' is the process doing * the execve(). */ - if (get_user_pages_remote(bprm->mm, pos, 1, + if (pin_user_pages_remote(bprm->mm, pos, 1, FOLL_FORCE, &page, NULL, NULL) <= 0) return false; #else @@ -936,7 +936,7 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned long pos, } /* Same with put_arg_page(page) in fs/exec.c */ #ifdef CONFIG_MMU - put_page(page); + unpin_user_page(page); #endif return true; }
Re: [PATCH 2/2] tomoyo: Fixed typo in documentation
On 11/7/20 12:24 AM, Souptick Joarder wrote:
> Fixed typo s/Poiner/Pointer
>
> Fixes: 5b636857fee6 ("TOMOYO: Allow using argv[]/envp[] of execve() as conditions.")
> Signed-off-by: Souptick Joarder
> Cc: John Hubbard
> ---
>  security/tomoyo/domain.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
> index bd748be..7b2babe 100644
> --- a/security/tomoyo/domain.c
> +++ b/security/tomoyo/domain.c
> @@ -891,7 +891,7 @@ int tomoyo_find_next_domain(struct linux_binprm *bprm)
>   *
>   * @bprm: Pointer to "struct linux_binprm".
>   * @pos:  Location to dump.
> - * @dump: Poiner to "struct tomoyo_page_dump".
> + * @dump: Pointer to "struct tomoyo_page_dump".

Not worth a separate patch, especially since the original comment is merely
copying the C sources, and as such, does not add any value. I'd either
a) craft a new documentation line that adds some value, or b) just merge
this patch into the previous one, and make a note in the commit description
to the effect that you've included a trivial typo fix as long as you're
there.

thanks,
--
John Hubbard
NVIDIA

>   *
>   * Returns true on success, false otherwise.
>   */
Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS
On 11/4/20 2:05 AM, Barry Song wrote:
> Without DEBUG_FS, all the code in gup_benchmark becomes meaningless.
> For sure kernel provides debugfs stub while DEBUG_FS is disabled,
> but the point here is that GUP_BENCHMARK can do nothing without
> DEBUG_FS.
>
> Cc: John Hubbard
> Cc: Ralph Campbell
> Inspired-by: John Garry
> Signed-off-by: Barry Song
> ---
>  * inspired by John's comment in this patch:
>    https://lore.kernel.org/linux-iommu/184797b8-512e-e3da-fae7-25c7d6626...@huawei.com/
>
>  mm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d42423f..91fa923 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -836,6 +836,7 @@ config PERCPU_STATS
>
>  config GUP_BENCHMARK
>  	bool "Enable infrastructure for get_user_pages() and related calls benchmarking"
> +	depends on DEBUG_FS

I think "select DEBUG_FS" is better here. "depends on" has the obnoxious
behavior of hiding the choice from you, if the dependencies aren't already
met. Whereas what the developer *really* wants is a no-nonsense activation
of the choice: "enable GUP_BENCHMARK and the debug fs that it requires".

"depends on" really is better for things that you just can't control, such
as the CPU arch you're on, etc.

Also note that this will have some minor merge conflict with mmotm, due to
renaming to GUP_TEST. No big deal though.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
>> /*
>>  * If we can't determine whether or not a pte is special, then fail immediately
>>  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
>>  * to be special.
>>  *
>>  * For a futex to be placed on a THP tail page, get_futex_key requires a
>>  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>>  * useful to have gup_huge_pmd even if we can't operate on ptes.
>>  */
>>
>> We support hugepage faults in gpu drivers since recently, and I'm not
>> seeing a pud_mkhugespecial anywhere. So not sure this works, but
>> probably just me missing something again.
>
> It means ioremap can't create an IO page PUD, it has to be broken up.
>
> Does ioremap even create anything larger than PTEs?

From my reading, yes. See ioremap_try_huge_pmd().

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
On 11/4/20 10:17 AM, Jason Gunthorpe wrote:
> On Wed, Nov 04, 2020 at 04:41:19PM +, Christoph Hellwig wrote:
>> On Wed, Nov 04, 2020 at 04:37:58PM +, Christoph Hellwig wrote:
>>> On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
>>>> What we're discussing is whether gup_fast and pup_fast also obey
>>>> this, or fall over and can give you the struct page that's backing
>>>> the dma_mmap_* memory. Since the _fast variant doesn't check for
>>>> vma->vm_flags, and afaict that's the only thing which closes this
>>>> gap. And like you restate, that would be a bit a problem. So where's
>>>> that check which Jason&me aren't spotting?
>>>
>>> remap_pte_range uses pte_mkspecial to set up the PTEs, and
>>> gup_pte_range errors out on pte_special. Of course this only works
>>> for the CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we
>>> do have a real problem.
>>
>> Except that we don't really support pte-level gup-fast without
>> CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
>> HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.
>
> Mm, I thought it was probably the special flag.. Knowing that
> CONFIG_HAVE_FAST_GUP can't be set without CONFIG_ARCH_HAS_PTE_SPECIAL
> is pretty insightful, can we put that in the Kconfig?
>
> config HAVE_FAST_GUP
>         depends on MMU
>         depends on ARCH_HAS_PTE_SPECIAL
>         bool

Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
gup-fast is not *completely* unavailable there, so I don't think you want
to shut it off like that:

/*
 * If we can't determine whether or not a pte is special, then fail immediately
 * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
 * to be special.
 *
 * For a futex to be placed on a THP tail page, get_futex_key requires a
 * get_user_pages_fast_only implementation that can pin pages. Thus it's still
 * useful to have gup_huge_pmd even if we can't operate on ptes.
 */

thanks,
--
John Hubbard
NVIDIA