On Wed, Jul 22, 2020 at 07:04:25PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 22, 2020 at 07:44:36PM +0200, Andrea Righi wrote:
> > Waiting for lock_page() with mm->mmap_sem held in unuse_pte_range() can
> > lead to stalls while running swapoff (i.e., not being able to ssh into
> > the system, inability to execute simple commands like 'ps', etc.).
> >
> > Replace lock_page() with trylock_page() and release mm->mmap_sem if we
> > fail to lock it, giving other tasks a chance to continue and prevent
> > the stall.
>
> I think you've removed the warning at the expense of turning a stall
> into a potential livelock.
>
> > @@ -1977,7 +1977,11 @@ static int unuse_pte_range(struct vm_area_struct
> > *vma, pmd_t *pmd,
> >  			return -ENOMEM;
> >  		}
> >
> > -		lock_page(page);
> > +		if (!trylock_page(page)) {
> > +			ret = -EAGAIN;
> > +			put_page(page);
> > +			goto out;
> > +		}
>
> If you look at the patterns we have elsewhere in the MM for doing
> this kind of thing (eg truncate_inode_pages_range()), we iterate over the
> entire range, take care of the easy cases, then go back and deal with the
> hard cases later.
>
> So that would argue for skipping any page that we can't trylock, but
> continue over at least the VMA, and quite possibly the entire MM until
> we're convinced that we have unused all of the required pages.
>
> Another thing we could do is drop the MM semaphore _here_, sleep on this
> page until it's unlocked, then go around again.
>
> 		if (!trylock_page(page)) {
> 			mmap_read_unlock(mm);
> 			lock_page(page);
> 			unlock_page(page);
> 			put_page(page);
> 			ret = -EAGAIN;
> 			goto out;
> 		}
>
> (I haven't checked the call paths; maybe you can't do this because
> sometimes it's called with the mmap sem held for write)
>
> Also, if we're trying to scale this better, there are some fun
> workloads where readers block writers who block subsequent readers
> and we shouldn't wait for I/O in swapin_readahead(). See patches like
> 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11 for more on this kind of thing.

Thanks for the review, Matthew. I'll see if I can find a better solution
following your useful hints!

-Andrea
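
For illustration only, the "take care of the easy cases, then go back and
deal with the hard cases later" pattern discussed in the quoted review can
be sketched in userspace C. This is not kernel code and not the patch under
discussion: pthread mutexes stand in for page locks, and struct item and
process_all() are hypothetical names made up for the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a page: a lock plus a "done" flag. */
struct item {
	pthread_mutex_t lock;
	int processed;
};

/*
 * Pass 1: handle every item whose lock is free right now (trylock),
 * skipping the contended ones instead of sleeping on them.
 * Pass 2: go back and block on the items that were skipped.
 * Returns how many items had to be deferred to pass 2.
 */
static size_t process_all(struct item *items, size_t n)
{
	size_t deferred = 0;
	size_t i;

	assert(items != NULL || n == 0);

	/* Easy cases: never sleep, just skip what we cannot lock. */
	for (i = 0; i < n; i++) {
		if (pthread_mutex_trylock(&items[i].lock) != 0) {
			deferred++;
			continue;
		}
		items[i].processed = 1;
		pthread_mutex_unlock(&items[i].lock);
	}

	/* Hard cases: now it is acceptable to wait for each lock. */
	for (i = 0; i < n; i++) {
		if (items[i].processed)
			continue;
		pthread_mutex_lock(&items[i].lock);
		items[i].processed = 1;
		pthread_mutex_unlock(&items[i].lock);
	}

	return deferred;
}
```

In the swapoff case the second pass would presumably also have to drop and
retake mmap_sem around the blocking wait; the sketch deliberately leaves
that out.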