On Fri, May 16, 2025 at 11:13 AM Liam R. Howlett <liam.howl...@oracle.com> wrote: > > * Nico Pache <npa...@redhat.com> [250514 23:23]: > > The khugepaged daemon and madvise_collapse have two different > > implementations that do almost the same thing. > > > > Create khugepaged_collapse_single_pmd to increase code > > reuse and create an entry point for future khugepaged changes. > > > > Refactor madvise_collapse and khugepaged_scan_mm_slot to use > > the new khugepaged_collapse_single_pmd function. > > > > Reviewed-by: Baolin Wang <baolin.w...@linux.alibaba.com> > > Signed-off-by: Nico Pache <npa...@redhat.com> > > --- > > mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------ > > 1 file changed, 49 insertions(+), 47 deletions(-) > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index 806bcd8c5185..5457571d505a 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -2353,6 +2353,48 @@ static int khugepaged_scan_file(struct mm_struct > > *mm, unsigned long addr, > > return result; > > } > > > > +/* > > + * Try to collapse a single PMD starting at a PMD aligned addr, and return > > + * the results. > > + */ > > +static int khugepaged_collapse_single_pmd(unsigned long addr, > > + struct vm_area_struct *vma, bool > > *mmap_locked, > > + struct collapse_control *cc) > > +{ > > + int result = SCAN_FAIL; > > + struct mm_struct *mm = vma->vm_mm; > > + > > + if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) { > > why IS_ENABLED(CONFIG_SHMEM) here, it seems new? Fixed in the next version. It was a mishandled rebase conflict. > > > + struct file *file = get_file(vma->vm_file); > > + pgoff_t pgoff = linear_page_index(vma, addr); > > + > > + mmap_read_unlock(mm); > > + *mmap_locked = false; > > + result = khugepaged_scan_file(mm, addr, file, pgoff, cc); > > + fput(file); > > + if (result == SCAN_PTE_MAPPED_HUGEPAGE) { > > + mmap_read_lock(mm); > > + *mmap_locked = true; > > + if (khugepaged_test_exit_or_disable(mm)) { > > + result = SCAN_ANY_PROCESS; > > + goto end; > > + } > > + result = collapse_pte_mapped_thp(mm, addr, > > + !cc->is_khugepaged); > > + if (result == SCAN_PMD_MAPPED) > > + result = SCAN_SUCCEED; > > + mmap_read_unlock(mm); > > + *mmap_locked = false; > > + } > > + } else { > > + result = khugepaged_scan_pmd(mm, vma, addr, mmap_locked, cc); > > + } > > + if (cc->is_khugepaged && result == SCAN_SUCCEED) > > + ++khugepaged_pages_collapsed; > > +end: > > + return result; > > This function can return with mmap_read_locked or unlocked.. > > > +} > > + > > static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int > > *result, > > struct collapse_control *cc) > > __releases(&khugepaged_mm_lock) > > @@ -2427,34 +2469,12 @@ static unsigned int > > khugepaged_scan_mm_slot(unsigned int pages, int *result, > > VM_BUG_ON(khugepaged_scan.address < hstart || > > khugepaged_scan.address + HPAGE_PMD_SIZE > > > hend); > > - if (!vma_is_anonymous(vma)) { > > - struct file *file = get_file(vma->vm_file); > > - pgoff_t pgoff = linear_page_index(vma, > > - khugepaged_scan.address); > > - > > - mmap_read_unlock(mm); > > - mmap_locked = false; > > - *result = hpage_collapse_scan_file(mm, > > - khugepaged_scan.address, file, pgoff, > > cc); > > - fput(file); > > - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) { > > - mmap_read_lock(mm); > > - if > > (hpage_collapse_test_exit_or_disable(mm)) > > - goto breakouterloop; > > - *result = collapse_pte_mapped_thp(mm, > > - khugepaged_scan.address, > > false); > > - if (*result == SCAN_PMD_MAPPED) > > - *result = SCAN_SUCCEED; > > - mmap_read_unlock(mm); > > - } > > - } else { > > - *result = hpage_collapse_scan_pmd(mm, vma, > > - khugepaged_scan.address, > > &mmap_locked, cc); > > - } > > - > > - if (*result == SCAN_SUCCEED) > > - ++khugepaged_pages_collapsed; > > > > + *ngle_pmd(khugepaged_scan.address, > > + vma, &mmap_locked, cc); > > + /* If we return SCAN_ANY_PROCESS we are holding the > > mmap_lock */ > > But this comment makes it obvious that you know that.. > > > + if (*result == SCAN_ANY_PROCESS) > > + goto breakouterloop; > > But later.. > > breakouterloop: > mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */ > breakouterloop_mmap_lock: > > > So if you return with SCAN_ANY_PROCESS, we are holding the lock and go > immediately and drop it. This seems unnecessarily complicated and > involves a lock. SCAN_ANY_PROCESS indicates that the process we are working on has either exited, or THPs have been disabled mid-scan. So we have to drop the lock regardless. > > That would leave just the khugepaged_scan_pmd() path with the > unfortunate locking mess - which is a static function and called in one > location.. > > Looking at what happens after the return seems to indicate we could > clean that up as well, sometime later. I see your point, all other instances handle the unlock within their own function and this one should too. instead of handling the unlock in the parent function I should just return with it unlocked and have the already established if(!mmap_locked) do the cleanup. > > > /* move to next address */ > > khugepaged_scan.address += HPAGE_PMD_SIZE; > > progress += HPAGE_PMD_NR; > > @@ -2773,36 +2793,18 @@ int madvise_collapse(struct vm_area_struct *vma, > > struct vm_area_struct **prev, > > mmap_assert_locked(mm); > > memset(cc->node_load, 0, sizeof(cc->node_load)); > > nodes_clear(cc->alloc_nmask); > > - if (!vma_is_anonymous(vma)) { > > - struct file *file = get_file(vma->vm_file); > > - pgoff_t pgoff = linear_page_index(vma, addr); > > > > - mmap_read_unlock(mm); > > - mmap_locked = false; > > - result = hpage_collapse_scan_file(mm, addr, file, > > pgoff, > > - cc); > > - fput(file); > > - } else { > > - result = hpage_collapse_scan_pmd(mm, vma, addr, > > - &mmap_locked, cc); > > - } > > + result = khugepaged_collapse_single_pmd(addr, vma, > > &mmap_locked, cc); > > + > > if (!mmap_locked) > > *prev = NULL; /* Tell caller we dropped mmap_lock */ > > > > -handle_result: > > switch (result) { > > case SCAN_SUCCEED: > > case SCAN_PMD_MAPPED: > > ++thps; > > break; > > case SCAN_PTE_MAPPED_HUGEPAGE: > > - BUG_ON(mmap_locked); > > - BUG_ON(*prev); > > - mmap_read_lock(mm); > > - result = collapse_pte_mapped_thp(mm, addr, true); > > - mmap_read_unlock(mm); > > - goto handle_result; > > All of the above should probably be replaced with a BUG_ON(1) since it's > not expected now? Or at least WARN_ON_ONCE(), but it should be safe to > continue if that's the case. I dont think we should warn as this is the return value for indicating that we are trying to collapse to a mTHP that is smaller than the already established folio (see __collapse_huge_page_isolate), but continuing should be ok. > > It looks like the mmap_locked boolean is used to ensure that *prev is > safe, but we are now dropping the lock and re-acquiring it (and > potentially returning here) with it set to true, so perv will not be set > to NULL like it should. Luckily Lorenzo just cleaned this up with the madvise code changes he made, but yes you are correct. > > I think you can handle this by ensuring that > khugepaged_collapse_single_pmd() returns with mmap_locked false in the > SCAN_ANY_PROCESS case. > > > - /* Whitelisted set of results where continuing OK */ > > This seems worth keeping? I'll add that back, thanks. > > > case SCAN_PMD_NULL: > > case SCAN_PTE_NON_PRESENT: > > case SCAN_PTE_UFFD_WP: > > I guess SCAN_ANY_PROCESS should be handled by the default case > statement? It should probably be added to the switch? I believe it should be handled by the default case, since we dont want to continue, so we break out as intended. > > That is to say, before your change the result would come from either > hpage_collapse_scan_file(), then lead to collapse_pte_mapped_thp() > above. In the khugepaged case we do the following check (khugepaged_test_exit_or_disable) before calling pte_mapped_thp, but we weren't doing it in the madvise_collapse case. seems like we had a bug lingering or unnecessary code in the original implementation (its been that way since day 1). I can note the slight difference in the commit log. I believe having the same check for both is wise, although now I have to ask why we arent using the revalidate function like all other callers do when they drop the lock. I will note this small difference in the commit log, and will invest some time in the future into cleaning up this madness. I think unifying these two callers into one, as I'm trying to do here, will make these behavioral deviations harder in the future, and we can have sanity knowing there is *mostly* one way to call the collapse. > > Now, you can have khugepaged_test_exit_or_disable() happen to return > SCAN_ANY_PROCESS and it will fall through to the default in this switch > statement, which seems like new behaviour? > > At the very least, this information should be added to the git log on > what this patch does - if it's expected? Will do, thanks for the thought provoking review, I had to do some digging to verify this one :)
-- Nico > > Thanks, > Liam >