Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-14 Thread Miaohe Lin
On 2021/4/15 0:13, Tim Chen wrote:
> 
> 
> On 4/13/21 6:04 PM, Huang, Ying wrote:
>> Tim Chen  writes:
>>
>>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>>

 This isn't the commit that introduces the race.  You can use `git blame`
 find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
 swap: skip swapcache for swapin of synchronous device".

 And I suggest to merge 1/5 and 2/5 to make it easy to get the full
 picture.
>>>
>>> I'll suggest make fix to do_swap_page race with get/put_swap_device
>>> as a first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
>>> be combined together.
>>
>> The original get/put_swap_device() use rcu_read_lock/unlock().  I don't
>> think it's good to wrap swap_read_page() with it.  After all, some
>> complex operations are done in swap_read_page(), including
>> blk_io_schedule().
>>
> 
> In that case then have the patches to make get/put_swap_device to use
> percpu_ref first.  And the patch to to fix the race in do_swap_page
> later in another patch.
> 
> Patch 2 is mixing the two.
> 

Looks like a good way to organize this patch series. Many thanks!

> Tim
> .
> 



Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-14 Thread Tim Chen



On 4/13/21 6:04 PM, Huang, Ying wrote:
> Tim Chen  writes:
> 
>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>
>>>
>>> This isn't the commit that introduces the race.  You can use `git blame`
>>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>
>> I'll suggest make fix to do_swap_page race with get/put_swap_device
>> as a first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
>> be combined together.
> 
> The original get/put_swap_device() use rcu_read_lock/unlock().  I don't
> think it's good to wrap swap_read_page() with it.  After all, some
> complex operations are done in swap_read_page(), including
> blk_io_schedule().
> 

In that case then have the patches to make get/put_swap_device to use
percpu_ref first.  And the patch to to fix the race in do_swap_page
later in another patch.

Patch 2 is mixing the two.

Tim


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Miaohe Lin
On 2021/4/14 11:07, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> On 2021/4/13 9:27, Huang, Ying wrote:
>>> Miaohe Lin  writes:
>>>
 When I was investigating the swap code, I found the below possible race
 window:

 CPU 1  CPU 2
 -  -
 do_swap_page
   synchronous swap_readpage
 alloc_page_vma
swapoff
  release swap_file, bdev, or ...
   swap_readpage
check sis->flags is ok
  access swap_file, bdev...[oops!]
si->flags = 0

 Using current get/put_swap_device() to guard against concurrent swapoff for
 swap_readpage() looks terrible because swap_readpage() may take really long
 time. And this race may not be really pernicious because swapoff is usually
 done when system shutdown only. To reduce the performance overhead on the
 hot-path as much as possible, it appears we can use the percpu_ref to close
 this race window(as suggested by Huang, Ying).

 Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>>
>>> This isn't the commit that introduces the race.  You can use `git blame`
>>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>
>> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race 
>> between
>> swapoff and some swap operations"). And I think this commit does not fix the 
>> race
>> condition completely, so I reuse the Fixes tag inside it.
>>
>>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>>
 Signed-off-by: Miaohe Lin 
 ---
  include/linux/swap.h |  2 +-
  mm/memory.c  | 10 ++
  mm/swapfile.c| 28 +++-
  3 files changed, 22 insertions(+), 18 deletions(-)

 diff --git a/include/linux/swap.h b/include/linux/swap.h
 index 849ba5265c11..9066addb57fd 100644
 --- a/include/linux/swap.h
 +++ b/include/linux/swap.h
 @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
  
  static inline void put_swap_device(struct swap_info_struct *si)
  {
 -  rcu_read_unlock();
 +  percpu_ref_put(>users);
  }
  
  #else /* CONFIG_SWAP */
 diff --git a/mm/memory.c b/mm/memory.c
 index cc71a445c76c..8543c47b955c 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
  {
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
 +  struct swap_info_struct *si = NULL;
swp_entry_t entry;
pte_t pte;
int locked;
 @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
  

>>>
>>> I suggest to add comments here as follows (words copy from Matthew Wilcox)
>>>
>>> /* Prevent swapoff from happening to us */
>>
>> Ok.
>>
>>>
 +  si = get_swap_device(entry);
 +  /* In case we raced with swapoff. */
 +  if (unlikely(!si))
 +  goto out;
 +
>>>
>>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>>> now.  We can remove several get/put_swap_device() for function called by
>>> do_swap_page().  That can be another optimization patch.
>>
>> I tried to remove several get/put_swap_device() for function called
>> by do_swap_page() only before I send this series. But it seems they have
>> other callers without proper get/put_swap_device().
> 
> Then we need to revise these callers instead.  Anyway, can be another
> series.

Yes. can be another series.
Thanks.

> 
> Best Regards,
> Huang, Ying
> 
> .
> 



Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/13 9:27, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1   CPU 2
>>> -   -
>>> do_swap_page
>>>   synchronous swap_readpage
>>> alloc_page_vma
>>> swapoff
>>>   release swap_file, bdev, or ...
>>>   swap_readpage
>>> check sis->flags is ok
>>>   access swap_file, bdev...[oops!]
>>> si->flags = 0
>>>
>>> Using current get/put_swap_device() to guard against concurrent swapoff for
>>> swap_readpage() looks terrible because swap_readpage() may take really long
>>> time. And this race may not be really pernicious because swapoff is usually
>>> done when system shutdown only. To reduce the performance overhead on the
>>> hot-path as much as possible, it appears we can use the percpu_ref to close
>>> this race window(as suggested by Huang, Ying).
>>>
>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>> 
>> This isn't the commit that introduces the race.  You can use `git blame`
>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>> 
>
> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race 
> between
> swapoff and some swap operations"). And I think this commit does not fix the 
> race
> condition completely, so I reuse the Fixes tag inside it.
>
>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>> picture.
>> 
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  include/linux/swap.h |  2 +-
>>>  mm/memory.c  | 10 ++
>>>  mm/swapfile.c| 28 +++-
>>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 849ba5265c11..9066addb57fd 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>  
>>>  static inline void put_swap_device(struct swap_info_struct *si)
>>>  {
>>> -   rcu_read_unlock();
>>> +   percpu_ref_put(>users);
>>>  }
>>>  
>>>  #else /* CONFIG_SWAP */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index cc71a445c76c..8543c47b955c 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  {
>>> struct vm_area_struct *vma = vmf->vma;
>>> struct page *page = NULL, *swapcache;
>>> +   struct swap_info_struct *si = NULL;
>>> swp_entry_t entry;
>>> pte_t pte;
>>> int locked;
>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>  
>>>
>> 
>> I suggest to add comments here as follows (words copy from Matthew Wilcox)
>> 
>>  /* Prevent swapoff from happening to us */
>
> Ok.
>
>> 
>>> +   si = get_swap_device(entry);
>>> +   /* In case we raced with swapoff. */
>>> +   if (unlikely(!si))
>>> +   goto out;
>>> +
>> 
>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>> now.  We can remove several get/put_swap_device() for function called by
>> do_swap_page().  That can be another optimization patch.
>
> I tried to remove several get/put_swap_device() for function called
> by do_swap_page() only before I send this series. But it seems they have
> other callers without proper get/put_swap_device().

Then we need to revise these callers instead.  Anyway, can be another
series.

Best Regards,
Huang, Ying


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Miaohe Lin
On 2021/4/13 9:27, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1CPU 2
>> --
>> do_swap_page
>>   synchronous swap_readpage
>> alloc_page_vma
>>  swapoff
>>release swap_file, bdev, or ...
>>   swap_readpage
>>  check sis->flags is ok
>>access swap_file, bdev...[oops!]
>>  si->flags = 0
>>
>> Using current get/put_swap_device() to guard against concurrent swapoff for
>> swap_readpage() looks terrible because swap_readpage() may take really long
>> time. And this race may not be really pernicious because swapoff is usually
>> done when system shutdown only. To reduce the performance overhead on the
>> hot-path as much as possible, it appears we can use the percpu_ref to close
>> this race window(as suggested by Huang, Ying).
>>
>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
> 
> This isn't the commit that introduces the race.  You can use `git blame`
> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
> swap: skip swapcache for swapin of synchronous device".
> 

Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race 
between
swapoff and some swap operations"). And I think this commit does not fix the 
race
condition completely, so I reuse the Fixes tag inside it.

> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
> picture.
> 
>> Signed-off-by: Miaohe Lin 
>> ---
>>  include/linux/swap.h |  2 +-
>>  mm/memory.c  | 10 ++
>>  mm/swapfile.c| 28 +++-
>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 849ba5265c11..9066addb57fd 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>  
>>  static inline void put_swap_device(struct swap_info_struct *si)
>>  {
>> -rcu_read_unlock();
>> +percpu_ref_put(>users);
>>  }
>>  
>>  #else /* CONFIG_SWAP */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index cc71a445c76c..8543c47b955c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  {
>>  struct vm_area_struct *vma = vmf->vma;
>>  struct page *page = NULL, *swapcache;
>> +struct swap_info_struct *si = NULL;
>>  swp_entry_t entry;
>>  pte_t pte;
>>  int locked;
>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  }
>>  
>>
> 
> I suggest to add comments here as follows (words copy from Matthew Wilcox)
> 
>   /* Prevent swapoff from happening to us */

Ok.

> 
>> +si = get_swap_device(entry);
>> +/* In case we raced with swapoff. */
>> +if (unlikely(!si))
>> +goto out;
>> +
> 
> Because we wrap the whole do_swap_page() with get/put_swap_device()
> now.  We can remove several get/put_swap_device() for function called by
> do_swap_page().  That can be another optimization patch.

I tried to remove several get/put_swap_device() for function called
by do_swap_page() only before I send this series. But it seems they have
other callers without proper get/put_swap_device().

> 
>>  delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>>  page = lookup_swap_cache(entry, vma, vmf->address);
>>  swapcache = page;
>> @@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  unlock:
>>  pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  out:
>> +if (si)
>> +put_swap_device(si);
>>  return ret;
>>  out_nomap:
>>  pte_unmap_unlock(vmf->pte, vmf->ptl);
>> @@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  unlock_page(swapcache);
>>  put_page(swapcache);
>>  }
>> +if (si)
>> +put_swap_device(si);
>>  return ret;
>>  }
>>  
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 724173cd7d0c..01032c72ceae 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct 
>> swap_info_struct *p,
>>   * via preventing the swap device from being swapoff, until
>>   * put_swap_device() is called.  Otherwise return NULL.
>>   *
>> - * The entirety of the RCU read critical section must come before the
>> - * return from or after the call to synchronize_rcu() in
>> - * enable_swap_info() or swapoff().  So if "si->flags & SWP_VALID" is
>> - * true, the si->map, si->cluster_info, etc. must be valid in the
>> - * critical section.
>> - *
>>   * Notice that swapoff or swapoff+swapon can still happen before the
>> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
>> - * in put_swap_device() if there isn't any other way to prevent

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Miaohe Lin
On 2021/4/14 9:04, Huang, Ying wrote:
> Tim Chen  writes:
> 
>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>
>>>
>>> This isn't the commit that introduces the race.  You can use `git blame`
>>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>
>> I'll suggest make fix to do_swap_page race with get/put_swap_device
>> as a first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
>> be combined together.
> 
> The original get/put_swap_device() use rcu_read_lock/unlock().  I don't
> think it's good to wrap swap_read_page() with it.  After all, some
> complex operations are done in swap_read_page(), including
> blk_io_schedule().
> 

The patch was split to make it easier to review originally, i.e. 1/5 introduces
the percpu_ref to swap and 2/5 uses it to fix the race between do_swap_page()
and swapoff.
Btw, I have no preference for merging 1/5 and 2/5 or not.

> Best Regards,
> Huang, Ying
> .
> 



Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Huang, Ying
Tim Chen  writes:

> On 4/12/21 6:27 PM, Huang, Ying wrote:
>
>> 
>> This isn't the commit that introduces the race.  You can use `git blame`
>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>> 
>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>> picture.
>
> I'll suggest make fix to do_swap_page race with get/put_swap_device
> as a first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
> be combined together.

The original get/put_swap_device() use rcu_read_lock/unlock().  I don't
think it's good to wrap swap_read_page() with it.  After all, some
complex operations are done in swap_read_page(), including
blk_io_schedule().

Best Regards,
Huang, Ying


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Tim Chen



On 4/12/21 6:27 PM, Huang, Ying wrote:

> 
> This isn't the commit that introduces the race.  You can use `git blame`
> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
> swap: skip swapcache for swapin of synchronous device".
> 
> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
> picture.

I'll suggest make fix to do_swap_page race with get/put_swap_device
as a first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
be combined together.

Tim


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-12 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> - -
> do_swap_page
>   synchronous swap_readpage
> alloc_page_vma
>   swapoff
> release swap_file, bdev, or ...
>   swap_readpage
>   check sis->flags is ok
> access swap_file, bdev...[oops!]
>   si->flags = 0
>
> Using current get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take really long
> time. And this race may not be really pernicious because swapoff is usually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window(as suggested by Huang, Ying).
>
> Fixes: 235b62176712 ("mm/swap: add cluster lock")

This isn't the commit that introduces the race.  You can use `git blame`
find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
swap: skip swapcache for swapin of synchronous device".

And I suggest to merge 1/5 and 2/5 to make it easy to get the full
picture.

> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h |  2 +-
>  mm/memory.c  | 10 ++
>  mm/swapfile.c| 28 +++-
>  3 files changed, 22 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 849ba5265c11..9066addb57fd 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>  
>  static inline void put_swap_device(struct swap_info_struct *si)
>  {
> - rcu_read_unlock();
> + percpu_ref_put(>users);
>  }
>  
>  #else /* CONFIG_SWAP */
> diff --git a/mm/memory.c b/mm/memory.c
> index cc71a445c76c..8543c47b955c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
>   swp_entry_t entry;
>   pte_t pte;
>   int locked;
> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   }
>  
>

I suggest to add comments here as follows (words copy from Matthew Wilcox)

/* Prevent swapoff from happening to us */

> + si = get_swap_device(entry);
> + /* In case we raced with swapoff. */
> + if (unlikely(!si))
> + goto out;
> +

Because we wrap the whole do_swap_page() with get/put_swap_device()
now.  We can remove several get/put_swap_device() for function called by
do_swap_page().  That can be another optimization patch.

>   delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>   page = lookup_swap_cache(entry, vma, vmf->address);
>   swapcache = page;
> @@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
> + if (si)
> + put_swap_device(si);
>   return ret;
>  out_nomap:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   unlock_page(swapcache);
>   put_page(swapcache);
>   }
> + if (si)
> + put_swap_device(si);
>   return ret;
>  }
>  
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 724173cd7d0c..01032c72ceae 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct 
> swap_info_struct *p,
>   * via preventing the swap device from being swapoff, until
>   * put_swap_device() is called.  Otherwise return NULL.
>   *
> - * The entirety of the RCU read critical section must come before the
> - * return from or after the call to synchronize_rcu() in
> - * enable_swap_info() or swapoff().  So if "si->flags & SWP_VALID" is
> - * true, the si->map, si->cluster_info, etc. must be valid in the
> - * critical section.
> - *
>   * Notice that swapoff or swapoff+swapon can still happen before the
> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
> - * in put_swap_device() if there isn't any other way to prevent
> - * swapoff, such as page lock, page table lock, etc.  The caller must
> - * be prepared for that.  For example, the following situation is
> - * possible.
> + * percpu_ref_tryget_live() in get_swap_device() or after the
> + * percpu_ref_put() in put_swap_device() if there isn't any other way
> + * to prevent swapoff, such as page lock, page table lock, etc.  The
> + * caller must be prepared for that.  For example, the following
> + * situation is possible.
>   *
>   *   CPU1CPU2
>   *   do_swap_page()
> @@ -1319,21 +1313,21 @@ struct swap_info_struct 

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-11 Thread Miaohe Lin
On 2021/4/12 9:44, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> On 2021/4/10 1:17, Tim Chen wrote:
>>>
>>>
>>> On 4/9/21 1:42 AM, Miaohe Lin wrote:
 On 2021/4/9 5:34, Tim Chen wrote:
>
>
> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1CPU 2
>> --
>> do_swap_page
>>   synchronous swap_readpage
>> alloc_page_vma
>>  swapoff
>>release swap_file, bdev, or ...
>

 Many thanks for quick review and reply!

> Perhaps I'm missing something.  The release of swap_file, bdev etc
> happens after we have cleared the SWP_VALID bit in si->flags in 
> destroy_swap_extents
> if I read the swapoff code correctly.
 Agree. Let's look this more close:
 CPU1   CPU2
 -  -
 swap_readpage
   if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
  p->swap_file 
 = NULL;
 struct file *swap_file = sis->swap_file;
 struct address_space *mapping = swap_file->f_mapping;[oops!]
  ...
  p->flags = 0;
 ...

 Does this make sense for you?
>>>
>>> p->swapfile = NULL happens after the 
>>> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence 
>>> in swapoff().
>>>
>>> So I don't think the sequence you illustrated on CPU2 is in the right order.
>>> That said, without get_swap_device/put_swap_device in swap_readpage, you 
>>> could
>>> potentially blow pass synchronize_rcu() on CPU2 and causes a problem.  so I 
>>> think
>>> the problematic race looks something like the following:
>>>
>>>
>>> CPU1CPU2
>>> -   -
>>> swap_readpage
>>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>> swapoff
>>>   p->flags = &= 
>>> ~SWP_VALID;
>>>   ..
>>>   
>>> synchronize_rcu();
>>>   ..
>>>   p->swap_file 
>>> = NULL;
>>> struct file *swap_file = sis->swap_file;
>>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>>   ...
>>> ...
>>>
>>
>> Agree. This is also what I meant to illustrate. And you provide a better 
>> one. Many thanks!
> 
> For the pages that are swapped in through swap cache.  That isn't an
> issue.  Because the page is locked, the swap entry will be marked with
> SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
> unlocked.
> 
> So the race is for the fast path as follows,
> 
>   if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>   __swap_count(entry) == 1)
> 
> I found it in your original patch description.  But please make it more
> explicit to reduce the potential confusing.

Sure. Should I rephrase the commit log to clarify this or add a comment in the 
code?

Thanks.

> 
> Best Regards,
> Huang, Ying
> .
> 



Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-11 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/10 1:17, Tim Chen wrote:
>> 
>> 
>> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>>> On 2021/4/9 5:34, Tim Chen wrote:


 On 4/8/21 6:08 AM, Miaohe Lin wrote:
> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> - -
> do_swap_page
>   synchronous swap_readpage
> alloc_page_vma
>   swapoff
> release swap_file, bdev, or ...

>>>
>>> Many thanks for quick review and reply!
>>>
 Perhaps I'm missing something.  The release of swap_file, bdev etc
 happens after we have cleared the SWP_VALID bit in si->flags in 
 destroy_swap_extents
 if I read the swapoff code correctly.
>>> Agree. Let's look this more close:
>>> CPU1CPU2
>>> -   -
>>> swap_readpage
>>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>> swapoff
>>>   p->swap_file 
>>> = NULL;
>>> struct file *swap_file = sis->swap_file;
>>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>>   ...
>>>   p->flags = 0;
>>> ...
>>>
>>> Does this make sense for you?
>> 
>> p->swapfile = NULL happens after the 
>> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence 
>> in swapoff().
>> 
>> So I don't think the sequence you illustrated on CPU2 is in the right order.
>> That said, without get_swap_device/put_swap_device in swap_readpage, you 
>> could
>> potentially blow pass synchronize_rcu() on CPU2 and causes a problem.  so I 
>> think
>> the problematic race looks something like the following:
>> 
>> 
>> CPU1 CPU2
>> --
>> swap_readpage
>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>  swapoff
>>p->flags = &= 
>> ~SWP_VALID;
>>..
>>
>> synchronize_rcu();
>>..
>>p->swap_file 
>> = NULL;
>> struct file *swap_file = sis->swap_file;
>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>...
>> ...
>> 
>
> Agree. This is also what I meant to illustrate. And you provide a better one. 
> Many thanks!

For the pages that are swapped in through swap cache.  That isn't an
issue.  Because the page is locked, the swap entry will be marked with
SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
unlocked.

So the race is for the fast path as follows,

if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1)

I found it in your original patch description.  But please make it more
explicit to reduce the potential confusing.

Best Regards,
Huang, Ying


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-09 Thread Miaohe Lin
On 2021/4/10 1:17, Tim Chen wrote:
> 
> 
> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>> On 2021/4/9 5:34, Tim Chen wrote:
>>>
>>>
>>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
 When I was investigating the swap code, I found the below possible race
 window:

 CPU 1  CPU 2
 -  -
 do_swap_page
   synchronous swap_readpage
 alloc_page_vma
swapoff
  release swap_file, bdev, or ...
>>>
>>
>> Many thanks for quick review and reply!
>>
>>> Perhaps I'm missing something.  The release of swap_file, bdev etc
>>> happens after we have cleared the SWP_VALID bit in si->flags in 
>>> destroy_swap_extents
>>> if I read the swapoff code correctly.
>> Agree. Let's look this more close:
>> CPU1 CPU2
>> --
>> swap_readpage
>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>  swapoff
>>p->swap_file 
>> = NULL;
>> struct file *swap_file = sis->swap_file;
>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>...
>>p->flags = 0;
>> ...
>>
>> Does this make sense for you?
> 
> p->swapfile = NULL happens after the 
> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in 
> swapoff().
> 
> So I don't think the sequence you illustrated on CPU2 is in the right order.
> That said, without get_swap_device/put_swap_device in swap_readpage, you could
> potentially blow pass synchronize_rcu() on CPU2 and causes a problem.  so I 
> think
> the problematic race looks something like the following:
> 
> 
> CPU1  CPU2
> - -
> swap_readpage
>   if (data_race(sis->flags & SWP_FS_OPS)) {
>   swapoff
> p->flags = &= 
> ~SWP_VALID;
> ..
> 
> synchronize_rcu();
> ..
> p->swap_file 
> = NULL;
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[oops!]
> ...
> ...
> 

Agree. This is also what I meant to illustrate. And you provide a better one. 
Many thanks!

> By adding get_swap_device/put_swap_device, then the race is fixed.
> 
> 
> CPU1  CPU2
> - -
> swap_readpage
>   get_swap_device()
>   ..
>   if (data_race(sis->flags & SWP_FS_OPS)) {
>   swapoff
> p->flags = &= 
> ~SWP_VALID;
> ..
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[valid value]
>   ..
>   put_swap_device()
> 
> synchronize_rcu();
> ..
> p->swap_file 
> = NULL;
> 
> 
>>

   swap_readpage
check sis->flags is ok
  access swap_file, bdev...[oops!]
si->flags = 0
>>>
>>> This happens after we clear the si->flags
>>> synchronize_rcu()
>>> release swap_file, bdev, in 
>>> destroy_swap_extents()
>>>
>>> So I think if we have get_swap_device/put_swap_device in do_swap_page,
>>> it should fix the race you've pointed out here.  
>>> Then synchronize_rcu() will wait till we have completed do_swap_page and
>>> call put_swap_device.
>>
>> Right, get_swap_device/put_swap_device could fix this race. __But__ 
>> rcu_read_lock()
>> in get_swap_device() could disable preempt and do_swap_page() may take a 
>> really long
>> time because it involves I/O. It may not be acceptable to disable preempt 
>> for such a
>> long time. :(
> 
> I can see that it is not a good idea to hold rcu read lock for a long
> time over slow file I/O operation, which will be the side effect of
> introducing 

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-09 Thread Tim Chen



On 4/9/21 1:42 AM, Miaohe Lin wrote:
> On 2021/4/9 5:34, Tim Chen wrote:
>>
>>
>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1   CPU 2
>>> -   -
>>> do_swap_page
>>>   synchronous swap_readpage
>>> alloc_page_vma
>>> swapoff
>>>   release swap_file, bdev, or ...
>>
> 
> Many thanks for quick review and reply!
> 
>> Perhaps I'm missing something.  The release of swap_file, bdev etc
>> happens after we have cleared the SWP_VALID bit in si->flags in 
>> destroy_swap_extents
>> if I read the swapoff code correctly.
> Agree. Let's look this more close:
> CPU1  CPU2
> - -
> swap_readpage
>   if (data_race(sis->flags & SWP_FS_OPS)) {
>   swapoff
> p->swap_file 
> = NULL;
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[oops!]
> ...
> p->flags = 0;
> ...
> 
> Does this make sense for you?

p->swapfile = NULL happens after the 
p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in 
swapoff().

So I don't think the sequence you illustrated on CPU2 is in the right order.
That said, without get_swap_device/put_swap_device in swap_readpage, you could
potentially blow pass synchronize_rcu() on CPU2 and causes a problem.  so I 
think
the problematic race looks something like the following:


CPU1CPU2
-   -
swap_readpage
  if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
  p->flags = &= 
~SWP_VALID;
  ..
  
synchronize_rcu();
  ..
  p->swap_file 
= NULL;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;[oops!]
  ...
...

By adding get_swap_device/put_swap_device, then the race is fixed.


CPU1CPU2
-   -
swap_readpage
  get_swap_device()
  ..
  if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
  p->flags = &= 
~SWP_VALID;
  ..
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;[valid value]
  ..
  put_swap_device()
  
synchronize_rcu();
  ..
  p->swap_file 
= NULL;


> 
>>>
>>>   swap_readpage
>>> check sis->flags is ok
>>>   access swap_file, bdev...[oops!]
>>> si->flags = 0
>>
>> This happens after we clear the si->flags
>>  synchronize_rcu()
>>  release swap_file, bdev, in 
>> destroy_swap_extents()
>>
>> So I think if we have get_swap_device/put_swap_device in do_swap_page,
>> it should fix the race you've pointed out here.  
>> Then synchronize_rcu() will wait till we have completed do_swap_page and
>> call put_swap_device.
> 
> Right, get_swap_device/put_swap_device could fix this race. __But__ 
> rcu_read_lock()
> in get_swap_device() could disable preempt and do_swap_page() may take a 
> really long
> time because it involves I/O. It may not be acceptable to disable preempt for 
> such a
> long time. :(

I can see that it is not a good idea to hold rcu read lock for a long
time over slow file I/O operation, which will be the side effect of
introducing get/put_swap_device to swap_readpage.  So using percpu_ref
will then be preferable for synchronization once we introduce 
get/put_swap_device into swap_readpage.

Tim


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-09 Thread Miaohe Lin
On 2021/4/9 5:34, Tim Chen wrote:
> 
> 
> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1CPU 2
>> --
>> do_swap_page
>>   synchronous swap_readpage
>> alloc_page_vma
>>  swapoff
>>release swap_file, bdev, or ...
> 

Many thanks for quick review and reply!

> Perhaps I'm missing something.  The release of swap_file, bdev etc
> happens after we have cleared the SWP_VALID bit in si->flags in 
> destroy_swap_extents
> if I read the swapoff code correctly.
Agree. Let's look this more close:
CPU1CPU2
-   -
swap_readpage
  if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
  p->swap_file 
= NULL;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;[oops!]
  ...
  p->flags = 0;
...

Does this make sense for you?

> >
>>   swap_readpage
>>  check sis->flags is ok
>>access swap_file, bdev...[oops!]
>>  si->flags = 0
> 
> This happens after we clear the si->flags
>   synchronize_rcu()
>   release swap_file, bdev, in 
> destroy_swap_extents()
> 
> So I think if we have get_swap_device/put_swap_device in do_swap_page,
> it should fix the race you've pointed out here.  
> Then synchronize_rcu() will wait till we have completed do_swap_page and
> call put_swap_device.

Right, get_swap_device/put_swap_device could fix this race. __But__ 
rcu_read_lock()
in get_swap_device() could disable preempt and do_swap_page() may take a really 
long
time because it involves I/O. It may not be acceptable to disable preempt for 
such a
long time. :(

>   
>>
>> Using current get/put_swap_device() to guard against concurrent swapoff for
>> swap_readpage() looks terrible because swap_readpage() may take really long
>> time. And this race may not be really pernicious because swapoff is usually
>> done when system shutdown only. To reduce the performance overhead on the
>> hot-path as much as possible, it appears we can use the percpu_ref to close
>> this race window(as suggested by Huang, Ying).
> 
> I think it is better to break this patch into two.
> > One patch is to fix the race in do_swap_page and swapoff
> by adding get_swap_device/put_swap_device in do_swap_page.
> 
> The second patch is to modify get_swap_device and put_swap_device
> with percpu_ref. But swapoff is a relatively rare events.  

Sounds reasonable. Will do it.

> 
> I am not sure making percpu_ref change for performance is really beneficial.
> Did you encounter a real use case where you see a problem with swapoff?
> The delay in swapoff is primarily in try_to_unuse to bring all
> the swapped off pages back into memory.  Synchronizing with other
> CPU for paging in probably is a small component in overall scheme
> of things.
> 

I can't find a more simple and stable way to fix this potential and 
*theoretical* issue.
This could happen in real word but the race window should be very small. While 
swapoff
is usually done when system shutdown only, I'am not really sure if this effort 
is worth.

But IMO, we should eliminate any potential trouble. :)

> Thanks.
> 

Thanks again.

> Tim
> 
> .
> 


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-08 Thread Tim Chen



On 4/8/21 6:08 AM, Miaohe Lin wrote:
> When I was investigating the swap code, I found the below possible race
> window:
> 
> CPU 1 CPU 2
> - -
> do_swap_page
>   synchronous swap_readpage
> alloc_page_vma
>   swapoff
> release swap_file, bdev, or ...

Perhaps I'm missing something.  The release of swap_file, bdev etc
happens after we have cleared the SWP_VALID bit in si->flags in 
destroy_swap_extents
if I read the swapoff code correctly.
 

>   swap_readpage
>   check sis->flags is ok
> access swap_file, bdev...[oops!]
>   si->flags = 0

This happens after we clear the si->flags
synchronize_rcu()
release swap_file, bdev, in 
destroy_swap_extents()

So I think if we have get_swap_device/put_swap_device in do_swap_page,
it should fix the race you've pointed out here.  
Then synchronize_rcu() will wait till we have completed do_swap_page and
call put_swap_device.

> 
> Using current get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take really long
> time. And this race may not be really pernicious because swapoff is usually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window(as suggested by Huang, Ying).

I think it is better to break this patch into two.

One patch is to fix the race in do_swap_page and swapoff
by adding get_swap_device/put_swap_device in do_swap_page.

The second patch is to modify get_swap_device and put_swap_device
with percpu_ref. But swapoff is a relatively rare events.  

I am not sure making percpu_ref change for performance is really beneficial.
Did you encounter a real use case where you see a problem with swapoff?
The delay in swapoff is primarily in try_to_unuse to bring all
the swapped off pages back into memory.  Synchronizing with other
CPU for paging in probably is a small component in overall scheme
of things.

Thanks.

Tim



[PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-08 Thread Miaohe Lin
When I was investigating the swap code, I found the below possible race
window:

CPU 1   CPU 2
-   -
do_swap_page
  synchronous swap_readpage
alloc_page_vma
swapoff
  release swap_file, bdev, or ...
  swap_readpage
check sis->flags is ok
  access swap_file, bdev...[oops!]
si->flags = 0

Using current get/put_swap_device() to guard against concurrent swapoff for
swap_readpage() looks terrible because swap_readpage() may take really long
time. And this race may not be really pernicious because swapoff is usually
done when system shutdown only. To reduce the performance overhead on the
hot-path as much as possible, it appears we can use the percpu_ref to close
this race window(as suggested by Huang, Ying).

Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: Miaohe Lin 
---
 include/linux/swap.h |  2 +-
 mm/memory.c  | 10 ++
 mm/swapfile.c| 28 +++-
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 849ba5265c11..9066addb57fd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
 
 static inline void put_swap_device(struct swap_info_struct *si)
 {
-   rcu_read_unlock();
+   percpu_ref_put(>users);
 }
 
 #else /* CONFIG_SWAP */
diff --git a/mm/memory.c b/mm/memory.c
index cc71a445c76c..8543c47b955c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
+   struct swap_info_struct *si = NULL;
swp_entry_t entry;
pte_t pte;
int locked;
@@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
 
 
+   si = get_swap_device(entry);
+   /* In case we raced with swapoff. */
+   if (unlikely(!si))
+   goto out;
+
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry, vma, vmf->address);
swapcache = page;
@@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
+   if (si)
+   put_swap_device(si);
return ret;
 out_nomap:
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock_page(swapcache);
put_page(swapcache);
}
+   if (si)
+   put_swap_device(si);
return ret;
 }
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 724173cd7d0c..01032c72ceae 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct 
swap_info_struct *p,
  * via preventing the swap device from being swapoff, until
  * put_swap_device() is called.  Otherwise return NULL.
  *
- * The entirety of the RCU read critical section must come before the
- * return from or after the call to synchronize_rcu() in
- * enable_swap_info() or swapoff().  So if "si->flags & SWP_VALID" is
- * true, the si->map, si->cluster_info, etc. must be valid in the
- * critical section.
- *
  * Notice that swapoff or swapoff+swapon can still happen before the
- * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
- * in put_swap_device() if there isn't any other way to prevent
- * swapoff, such as page lock, page table lock, etc.  The caller must
- * be prepared for that.  For example, the following situation is
- * possible.
+ * percpu_ref_tryget_live() in get_swap_device() or after the
+ * percpu_ref_put() in put_swap_device() if there isn't any other way
+ * to prevent swapoff, such as page lock, page table lock, etc.  The
+ * caller must be prepared for that.  For example, the following
+ * situation is possible.
  *
  *   CPU1  CPU2
  *   do_swap_page()
@@ -1319,21 +1313,21 @@ struct swap_info_struct *get_swap_device(swp_entry_t 
entry)
si = swp_swap_info(entry);
if (!si)
goto bad_nofile;
-
-   rcu_read_lock();
if (data_race(!(si->flags & SWP_VALID)))
-   goto unlock_out;
+   goto out;
+   if (!percpu_ref_tryget_live(>users))
+   goto out;
offset = swp_offset(entry);
if (offset >= si->max)
-   goto unlock_out;
+   goto put_out;
 
return si;
 bad_nofile:
pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
 out:
return NULL;
-unlock_out:
-   rcu_read_unlock();
+put_out:
+   percpu_ref_put(>users);
return NULL;
 }
 
-- 
2.19.1