Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity
On Wed, Jan 2, 2019 at 2:43 PM Hugh Dickins wrote:
>
> Wrong. Without heavier locking that would add unwelcome overhead to
> common paths, we shall "always" need the retry logic. It does not
> come into play very often, but here are two examples of why it's
> needed (if I thought longer, I might find more). And in practice,
> yes, I sometimes saw 1 retry needed.
>
Understood. Sorry, I missed these corner cases.

> I don't use frontswap myself, and haven't paid any attention to the
> frontswap partial swapoff case (though notice now that shmem_unuse()
> lacks the plumbing needed for it - that needs fixing); but doubt it
> would be a good idea to refactor it out as a separate case.
>
I shall rework the shmem side to take care of the frontswap and retain
the retry logic in a simplified manner.

Thanks again for all the comments and insights.

~Vineeth
Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity
On Wed, 2 Jan 2019, Vineeth Pillai wrote:
>
> After reading the code again, I feel like we can make the retry logic
> simpler and avoid the use of oldi. If my understanding is correct,
> except for frontswap case, we reach try_to_unuse() only after we
> disable the swap device. So I think, we would not be seeing any more
> swap usage on the disabled swap device, after we loop through all the
> process and swapin the pages on that device. In that case, we would
> not need the retry logic right?

Wrong. Without heavier locking that would add unwelcome overhead to
common paths, we shall "always" need the retry logic. It does not
come into play very often, but here are two examples of why it's
needed (if I thought longer, I might find more). And in practice,
yes, I sometimes saw 1 retry needed.

One, the issue already discussed, of a multiply-mapped page which is
swapped out, one pte swapped off, but swapped back in by concurrent
fault before the last pte has been swapped off and the page finally
deleted from swap cache. That swapin still references the disabled
swapfile, and will need a retry to unuse (and that retry might need
another). We may fix this later with an rmap walk while still holding
page locked for the first pte; but even if we do, I'd still want to
retain the retry logic, to avoid dependence on corner-case-free
reliable rmap walks.

Two, get_swap_page() allocated a swap entry for shmem file or vma just
before the swapoff started, but the swapper did not reach the point of
inserting that swap entry before try_to_unuse() scanned the shmem file
or vma in question.

> For frontswap case, the patch was missing a check for pages_to_unuse.
> We would still need the retry logic, but as you mentioned, I can
> easily remove the oldi logic and make it simpler. Or probably,
> refactor the frontswap code out as a special case if pages_to_unuse
> is still not zero after the initial loop.
I don't use frontswap myself, and haven't paid any attention to the
frontswap partial swapoff case (though notice now that shmem_unuse()
lacks the plumbing needed for it - that needs fixing); but doubt it
would be a good idea to refactor it out as a separate case.

Hugh
Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity
On Tue, Jan 1, 2019 at 11:16 PM Hugh Dickins wrote:
>
> One more fix on top of what I sent yesterday: once I delved into
> the retries, I found that the major cause of exceeding MAX_RETRIES
> was the way the retry code neatly avoided retrying the last part of
> its work. With this fix in, I have not yet seen retries go above 1:
> no doubt it could, but at present I have no actual evidence that
> the MAX_RETRIES-or-livelock issue needs to be dealt with urgently.
> Fix sent for completeness, but it reinforces the point that the
> structure of try_to_unuse() should be reworked, and oldi gone.
>
Thanks for the fix and suggestions Hugh!

After reading the code again, I feel like we can make the retry logic
simpler and avoid the use of oldi. If my understanding is correct,
except for frontswap case, we reach try_to_unuse() only after we
disable the swap device. So I think, we would not be seeing any more
swap usage on the disabled swap device, after we loop through all the
process and swapin the pages on that device. In that case, we would
not need the retry logic right?

For frontswap case, the patch was missing a check for pages_to_unuse.
We would still need the retry logic, but as you mentioned, I can
easily remove the oldi logic and make it simpler. Or probably,
refactor the frontswap code out as a special case if pages_to_unuse is
still not zero after the initial loop.

Thanks,
Vineeth
Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity
On Tue, 1 Jan 2019, Vineeth Pillai wrote:
> Thanks a lot for the fixes and detailed explanation Hugh! I shall
> fold all the changes from you and Huang in the next iteration.
>
> Thanks for all the suggestions and comments as well. I am looking
> into all those and will include all the changes in the next version.
> Will discuss over mail in case of any clarifications.

One more fix on top of what I sent yesterday: once I delved into
the retries, I found that the major cause of exceeding MAX_RETRIES
was the way the retry code neatly avoided retrying the last part of
its work. With this fix in, I have not yet seen retries go above 1:
no doubt it could, but at present I have no actual evidence that
the MAX_RETRIES-or-livelock issue needs to be dealt with urgently.
Fix sent for completeness, but it reinforces the point that the
structure of try_to_unuse() should be reworked, and oldi gone.

Hugh

---
 mm/swapfile.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- mmotm/mm/swapfile.c	2018-12-31 12:30:55.822407154 -0800
+++ linux/mm/swapfile.c	2019-01-01 19:50:34.377277830 -0800
@@ -2107,8 +2107,8 @@ int try_to_unuse(unsigned int type, bool
 	struct swap_info_struct *si = swap_info[type];
 	struct page *page;
 	swp_entry_t entry;
-	unsigned int i = 0;
-	unsigned int oldi = 0;
+	unsigned int i;
+	unsigned int oldi;
 	int retries = 0;
 
 	if (!frontswap)
@@ -2154,6 +2154,7 @@ retry:
 		goto out;
 	}
 
+	i = oldi = 0;
	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
 		/*
 		 * Under global memory pressure, swap entries
Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity
On Mon, 3 Dec 2018, Vineeth Remanan Pillai wrote:
> This patch was initially posted by Kelley(kelley...@gmail.com).
> Reposting the patch with all review comments addressed and with minor
> modifications and optimizations. Tests were rerun and commit message
> updated with new results.
>
> The function try_to_unuse() is of quadratic complexity, with a lot of
> wasted effort. It unuses swap entries one by one, potentially
> iterating over all the page tables for all the processes in the
> system for each one.
>
> This new proposed implementation of try_to_unuse simplifies its
> complexity to linear. It iterates over the system's mms once, unusing
> all the affected entries as it walks each set of page tables. It also
> makes similar changes to shmem_unuse.

Hi Vineeth, please fold in fixes below before reposting your "mm,swap:
rid swapoff of quadratic complexity" patch - or ask for more detail if
unclear. I could split it up, of course, but since they should all
(except perhaps one) just be merged into the base patch before going
any further, it'll save me time to keep them together here and just
explain:-

shmem_unuse_swap_entries(): If a user fault races with swapoff, it's
very normal for shmem_swapin_page() to return -EEXIST, and the old
code was careful not to pass back any error but -ENOMEM; whereas on
mmotm, /usr/sbin/swapoff often failed silently because it got that
EEXIST.

shmem_unuse(): A couple of crashing bugs there: a list_del_init
without holding the mutex, and too much faith in the "safe" in
list_for_each_entry_safe(): it does assume that the mutex has been
held throughout, you (very nicely!) drop it, but that does require
"next" to be re-evaluated.
shmem_writepage(): Not a bug fix, this is the "except perhaps one":
minor optimization, could be left out, but if shmem_unuse() is going
through the list in the forward direction, and may completely unswap a
file and del it from the list, then pages from that file can be
swapped out to *other* swap areas after that, and it be reinserted in
the list: better to reinsert it behind shmem_unuse()'s cursor than in
front of it, which would entail a second pointless pass over that
file.

try_to_unuse(): Moved up the assignment of "oldi = i" (and changed the
test to "oldi <= i"), so as not to get trapped in that
find_next_to_unuse() loop when find_get_page() does not find it.

try_to_unuse(): But the main problem was passing entry.val to
find_get_page() there: that used to be correct, but since f6ab1f7f6b2d
we need to pass just the offset - as it stood, it could only find the
pages when swapping off area 0 (a similar issue was fixed in
shmem_replace_page() recently). That (together with the late oldi
assignment) was why my swapoffs were hanging on SWAP_HAS_CACHE
swap_map entries.

With those changes, it all seems to work rather well, and is a nice
simplification of the source, in addition to removing the quadratic
complexity. To my great surprise, the KSM pages are already handled
fairly well - the ksm_might_need_to_copy() that has long been in
unuse_pte() turns out to do (almost) a good enough job already, so
most users of KSM and swapoff would never see any problem. And I'd
been afraid of swapin readahead causing spurious -ENOMEMs, but have
seen nothing of that in practice (though something else in mmotm does
appear to use up more memory than before).
My remaining criticisms would be:

As Huang Ying pointed out in other mail, there is a danger of livelock
(or rather, hitting the MAX_RETRIES limit) when a multiply mapped page
(most especially a KSM page, whose mappings are not likely to be
nearby in the mmlist) is swapped out then partially swapped off then
some ptes swapped back out. As indeed the "Under global memory
pressure" comment admits. I have hit the MAX_RETRIES 3 limit several
times in load testing, not investigated but I presume due to such a
multiply mapped page, so at present we do have a regression there.

A very simple answer would be to remove the retries limiting - perhaps
you just added that to get around the find_get_page() failure before
it was understood? That does then tend towards the livelock
alternative, but you've kept the signal_pending() check, so there's
still the same way out as the old technique had (but greater
likelihood of needing it with the new technique). The right fix will
be to do an rmap walk to unuse all the swap entries of a single
anon_vma while holding page lock (with KSM needing that page
force-deleted from swap cache before moving on); but none of us have
written that code yet, maybe just removing the retries limit is good
enough.

Two dislikes on the code structure, probably one solution: the "goto
retry", up two levels from inside the lower loop, is easy to
misunderstand; and the oldi business is ugly - find_next_to_unuse()
was written to wrap around continuously to suit the old loop, but now
it's left with its "++i >= max" code to achieve that, then your
"i <= oldi" code to detect when it did, to
[PATCH v3 2/2] mm: rid swapoff of quadratic complexity
This patch was initially posted by Kelley(kelley...@gmail.com).
Reposting the patch with all review comments addressed and with minor
modifications and optimizations. Tests were rerun and commit message
updated with new results.

The function try_to_unuse() is of quadratic complexity, with a lot of
wasted effort. It unuses swap entries one by one, potentially
iterating over all the page tables for all the processes in the system
for each one.

This new proposed implementation of try_to_unuse simplifies its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in
a VM (8cpu, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation .. about 1200M calls (8min, avg 80% cpu util).
Prototype ............... about 9.0K calls  (3min, avg  5% cpu util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type
and unusing them one by one, perhaps walking all the page tables for
all the processes for each one, iterate over the mmlist, making one
pass. Pass each mm to unuse_mm() to begin its page table walk, and
during the walk, unuse all the ptes that have backing store in the
swap type received by try_to_unuse(). After the walk, check the type
for orphaned swap entries with find_next_to_unuse(), and remove them
from the swap cache. If find_next_to_unuse() starts over at the
beginning of the type, repeat the check of the shmem_swaplist and the
walk a maximum of three times.
Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range
using the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.

Change in v3:
- Addressed review comments
- Refactored out swap-in logic from shmem_getpage_fp

Changes in v2:
- Updated patch to use Xarray instead of radix tree

Signed-off-by: Vineeth Remanan Pillai
Signed-off-by: Kelley Nielsen
CC: Rik van Riel
---
 include/linux/shmem_fs.h |   2 +-
 mm/shmem.c               | 218 ++---
 mm/swapfile.c            | 413 +++
 3 files changed, 260 insertions(+), 373 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index f155dc607112..1dd02592bb53 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -72,7 +72,7 @@ extern void shmem_unlock_mapping(struct address_space *mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 					pgoff_t index, gfp_t gfp_mask);
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
-extern int shmem_unuse(swp_entry_t entry, struct page *page);
+extern int shmem_unuse(unsigned int type);
 
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
diff --git a/mm/shmem.c b/mm/shmem.c
index 035ea2c10f54..404f7b785fce 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1093,159 +1093,143 @@ static void shmem_evict_inode(struct inode *inode)
 	clear_inode(inode);
 }
 
-static unsigned long find_swap_entry(struct xarray *xa, void *item)
+static int shmem_find_swap_entries(struct address_space *mapping,
+				   pgoff_t start, unsigned int nr_entries,
+				   struct page **entries, pgoff_t *indices)
 {
-	XA_STATE(xas, xa, 0);
-	unsigned int checked = 0;
-	void *entry;
+	XA_STATE(xas, &mapping->i_pages, start);
+	struct page *page;
+	unsigned int ret = 0;
+
+	if (!nr_entries)
+		return 0;
 
 	rcu_read_lock();
-	xas_for_each(&xas, entry, ULONG_MAX) {
-		if (xas_retry(&xas, entry))
+	xas_for_each(&xas, page, ULONG_MAX) {
+		if (xas_retry(&xas, page))
 			continue;
-		if (entry == item)
-			break;
-		checked++;
-		if ((checked % XA_CHECK_SCHED) != 0)
+
+		if (!xa_is_value(page))
 			continue;
-		xas_pause();