Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Gao Xiang
Hi Ying,

On Thu, Aug 20, 2020 at 12:36:08PM +0800, Huang, Ying wrote:
> Gao Xiang  writes:
> 
> > SWP_FS doesn't mean the device is file-backed swap device,
> > which just means each writeback request should go through fs
> > by DIO. Or it'll just use extents added by .swap_activate(),
> > but it also works as file-backed swap device.
> >
> > So in order to achieve the goal of the original patch,
> > SWP_BLKDEV should be used instead.
> >
> > FS corruption can be observed with SSD device + XFS +
> > fragmented swapfile due to CONFIG_THP_SWAP=y.
> >
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> > Cc: "Huang, Ying" 
> > Cc: stable 
> > Signed-off-by: Gao Xiang 
> 
> Good catch!  The fix itself looks good to me, although the description
> is a little confusing.
> 
> After some digging, it seems that SWP_FS is set on the swap devices
> which make swap entry read/write go through the file system specific
> callback (now used by swap over NFS only).

Okay, let me send out v2 with the updated commit message in
https://lore.kernel.org/r/20200820012409.gb5...@xiangao.remote.csb/

Thanks,
Gao Xiang

> 
> Best Regards,
> Huang, Ying
> 
> > ---
> >
> > I reproduced the issue with the following details:
> >
> > Environment:
> > QEMU + upstream kernel + buildroot + NVMe (2 GB)
> >
> > Kernel config:
> > CONFIG_BLK_DEV_NVME=y
> > CONFIG_THP_SWAP=y
> >
> > Some reproducible steps:
> > mkfs.xfs -f /dev/nvme0n1
> > mkdir /tmp/mnt
> > mount /dev/nvme0n1 /tmp/mnt
> > bs="32k"
> > sz="1024m"	# doesn't matter too much, I also tried 16m
> > xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> > xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> > xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> > xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> > xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
> >
> > mkswap /tmp/mnt/sw
> > swapon /tmp/mnt/sw
> >
> > stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
> >
> > Symptoms:
> >  - FS corruption (e.g. checksum failure)
> >  - memory corruption at: 0xd2808010
> >  - segfault
> >  ... 
> >
> >  mm/swapfile.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 6c26916e95fd..2937daf3ca02 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> >  			goto nextsi;
> >  		}
> >  		if (size == SWAPFILE_CLUSTER) {
> > -			if (!(si->flags & SWP_FS))
> > +			if (si->flags & SWP_BLKDEV)
> >  				n_ret = swap_alloc_cluster(si, swp_entries);
> >  		} else
> >  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> 



Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Huang, Ying
Gao Xiang  writes:

> SWP_FS doesn't mean the device is file-backed swap device,
> which just means each writeback request should go through fs
> by DIO. Or it'll just use extents added by .swap_activate(),
> but it also works as file-backed swap device.
>
> So in order to achieve the goal of the original patch,
> SWP_BLKDEV should be used instead.
>
> FS corruption can be observed with SSD device + XFS +
> fragmented swapfile due to CONFIG_THP_SWAP=y.
>
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> Cc: "Huang, Ying" 
> Cc: stable 
> Signed-off-by: Gao Xiang 

Good catch!  The fix itself looks good to me, although the description
is a little confusing.

After some digging, it seems that SWP_FS is set on the swap devices
which make swap entry read/write go through the file system specific
callback (now used by swap over NFS only).
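
To make the difference between the two checks concrete, here is a minimal
user-space sketch (not kernel code). The bit values, the swap_kind enum and
the helper names are illustrative assumptions; only the two conditions
themselves are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; assumptions for this sketch only. */
enum {
	SWP_BLKDEV = 1 << 6,	/* swap area is a block device */
	SWP_FS     = 1 << 8,	/* swap I/O goes through the filesystem */
};

/* Hypothetical classification of the three cases discussed in the thread. */
enum swap_kind { KIND_BLKDEV, KIND_LOCAL_FILE, KIND_NFS_FILE };

/* Assumed flag sets: a local swapfile carries neither of the two bits. */
static unsigned int model_flags(enum swap_kind kind)
{
	switch (kind) {
	case KIND_BLKDEV:
		return SWP_BLKDEV;
	case KIND_NFS_FILE:
		return SWP_FS;
	case KIND_LOCAL_FILE:
		return 0;
	}
	return 0;
}

/* Condition before the patch: allow cluster allocation unless SWP_FS. */
static bool old_check(unsigned int flags)
{
	return !(flags & SWP_FS);
}

/* Condition after the patch: allow cluster allocation only on block devices. */
static bool new_check(unsigned int flags)
{
	return (flags & SWP_BLKDEV) != 0;
}

/* The two checks differ only for a local (non-NFS) swapfile. */
static void compare_checks(void)
{
	assert(old_check(model_flags(KIND_BLKDEV)) &&
	       new_check(model_flags(KIND_BLKDEV)));
	assert(!old_check(model_flags(KIND_NFS_FILE)) &&
	       !new_check(model_flags(KIND_NFS_FILE)));
	assert(old_check(model_flags(KIND_LOCAL_FILE)) &&
	       !new_check(model_flags(KIND_LOCAL_FILE)));
}
```

Under these assumptions, a swapfile on a local filesystem such as XFS is the
only case where the old condition still allowed a huge cluster and the new
one does not.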

Best Regards,
Huang, Ying

> ---
>
> I reproduced the issue with the following details:
>
> Environment:
> QEMU + upstream kernel + buildroot + NVMe (2 GB)
>
> Kernel config:
> CONFIG_BLK_DEV_NVME=y
> CONFIG_THP_SWAP=y
>
> Some reproducible steps:
> mkfs.xfs -f /dev/nvme0n1
> mkdir /tmp/mnt
> mount /dev/nvme0n1 /tmp/mnt
> bs="32k"
> sz="1024m"	# doesn't matter too much, I also tried 16m
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
>
> mkswap /tmp/mnt/sw
> swapon /tmp/mnt/sw
>
> stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
>
> Symptoms:
>  - FS corruption (e.g. checksum failure)
>  - memory corruption at: 0xd2808010
>  - segfault
>  ... 
>
>  mm/swapfile.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6c26916e95fd..2937daf3ca02 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  			goto nextsi;
>  		}
>  		if (size == SWAPFILE_CLUSTER) {
> -			if (!(si->flags & SWP_FS))
> +			if (si->flags & SWP_BLKDEV)
>  				n_ret = swap_alloc_cluster(si, swp_entries);
>  		} else
>  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,


Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Gao Xiang
Hi Yang,

On Wed, Aug 19, 2020 at 02:41:08PM -0700, Yang Shi wrote:
> On Wed, Aug 19, 2020 at 1:15 PM Gao Xiang  wrote:
> >
> > Hi Andrew,
> >
> > On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang  wrote:
> > >
> > > > SWP_FS doesn't mean the device is file-backed swap device,
> > > > which just means each writeback request should go through fs
> > > > by DIO. Or it'll just use extents added by .swap_activate(),
> > > > but it also works as file-backed swap device.
> > >
> > > This is very hard to understand :(
> >
> > Thanks for your reply...
> >
> > The related logic is in __swap_writepage() and setup_swap_extents(),
> > and also see e.g generic_swapfile_activate() or iomap_swapfile_activate()...
> 
> I think just NFS falls into this case, so you may rephrase it to:
> 
> SWP_FS is only used for swap files over NFS. So, !SWP_FS means non NFS
> swap, it could be either file backed or device backed.

Thanks for your suggestion...

That looks reasonable, and after I looked at
bc4ae27d817a ("mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS")

I think it could be rephrased into

"
The SWP_FS flag is used to make swap_{read,write}page() go
through the filesystem, and it's only used for swap files
over NFS. So, !SWP_FS means non NFS for now, it could be
either file backed or device backed. Something similar goes
with legacy SWP_FILE.
"

Does it look sane? I will wait a while for further suggestions
about this.

And IMO, the SWP_FS flag might be useful for other uses later
(e.g. for some CoW swapfile use, but I haven't thought
 carefully about whether that's practical or not...)

Thanks,
Gao Xiang



Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Yang Shi
On Wed, Aug 19, 2020 at 1:15 PM Gao Xiang  wrote:
>
> Hi Andrew,
>
> On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang  wrote:
> >
> > > SWP_FS doesn't mean the device is file-backed swap device,
> > > which just means each writeback request should go through fs
> > > by DIO. Or it'll just use extents added by .swap_activate(),
> > > but it also works as file-backed swap device.
> >
> > This is very hard to understand :(
>
> Thanks for your reply...
>
> The related logic is in __swap_writepage() and setup_swap_extents(),
> and also see e.g generic_swapfile_activate() or iomap_swapfile_activate()...

I think just NFS falls into this case, so you may rephrase it to:

SWP_FS is only used for swap files over NFS. So, !SWP_FS means non NFS
swap, it could be either file backed or device backed.

Does this look more understandable?

> I will also talk with "Huang, Ying" in person if no response here.
>
> >
> > > So in order to achieve the goal of the original patch,
> > > SWP_BLKDEV should be used instead.
> > >
> > > FS corruption can be observed with SSD device + XFS +
> > > fragmented swapfile due to CONFIG_THP_SWAP=y.
> > >
> > > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> >
> > Why do you think it has taken three years to discover this?
>
> I'm not sure if the Red Hat BZ is publicly available; it can be reproduced
> since RHEL 8:
> https://bugzilla.redhat.com/show_bug.cgi?id=1855474
>
> It seems hard to believe, but I think it's just because few users use the
> SSD device + THP + file-backed swap device combination... maybe I'm wrong
> here, but that is what my test shows.
>
> Thanks,
> Gao Xiang
>
> >
> >
> >
>
>


Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Gao Xiang
Hi Rafael,

On Wed, Aug 19, 2020 at 04:44:05PM -0400, Rafael Aquini wrote:
> On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang  wrote:
> > 
> > > SWP_FS doesn't mean the device is file-backed swap device,
> > > which just means each writeback request should go through fs
> > > by DIO. Or it'll just use extents added by .swap_activate(),
> > > but it also works as file-backed swap device.
> > 
> > This is very hard to understand :(
> > 
> 
> I'll work with Gao to rephrase that message. Sorry!

Sorry about that :( I just finished the test and went through
the related swap code and finally saw this, so I think it wouldn't
work entirely for the current swap code... and sorry about
my limited English.

Kindly feel free to repost the patch with rephrased commit
message. Anyway, I've done this task :)

Thanks,
Gao Xiang



Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Rafael Aquini
On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang  wrote:
> 
> > SWP_FS doesn't mean the device is file-backed swap device,
> > which just means each writeback request should go through fs
> > by DIO. Or it'll just use extents added by .swap_activate(),
> > but it also works as file-backed swap device.
> 
> This is very hard to understand :(
> 

I'll work with Gao to rephrase that message. Sorry!


> > So in order to achieve the goal of the original patch,
> > SWP_BLKDEV should be used instead.
> > 
> > FS corruption can be observed with SSD device + XFS +
> > fragmented swapfile due to CONFIG_THP_SWAP=y.
> > 
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> 
> Why do you think it has taken three years to discover this?
>

My bet here is that it's rare to go for a swapfile on non-rotational
devices, and even rarer to create the swapfile when the filesystem is
already fragmented. 
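
As an illustration of why fragmentation matters here, the sketch below
checks whether a huge-page-sized run of swap slots maps to one contiguous
on-disk range. It is a user-space toy model, not kernel code: struct extent,
CLUSTER_PAGES (assumed to be 512, i.e. a 2 MiB THP with 4 KiB pages) and
cluster_is_contiguous() are hypothetical names, and the idea that a
non-contiguous cluster is what leads to the corruption is an assumption
drawn from this thread rather than a confirmed analysis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a swap extent: a run of swap pages that is
 * contiguous on disk. This is not the kernel's structure. */
struct extent {
	unsigned long start_page;	/* first swap page offset covered */
	unsigned long nr_pages;		/* length of the run, in pages */
	unsigned long start_block;	/* on-disk location of the first page */
};

#define CLUSTER_PAGES 512UL	/* assumed: 2 MiB THP / 4 KiB pages */

/* True if [first, first + CLUSTER_PAGES) lies inside a single extent. */
static bool cluster_is_contiguous(const struct extent *ext, size_t n,
				  unsigned long first)
{
	for (size_t i = 0; i < n; i++) {
		unsigned long end = ext[i].start_page + ext[i].nr_pages;

		if (first >= ext[i].start_page &&
		    first + CLUSTER_PAGES <= end)
			return true;
	}
	return false;
}

/* A block device behaves like one big extent; a fragmented file does not. */
static void contiguity_example(void)
{
	const struct extent whole[] = { { 0, 1024, 5000 } };
	const struct extent frag[] = {
		{ 0, 8, 5000 }, { 8, 8, 9000 }, { 16, 1008, 20000 },
	};

	assert(cluster_is_contiguous(whole, 1, 0));
	assert(!cluster_is_contiguous(frag, 3, 0));
	assert(cluster_is_contiguous(frag, 3, 512));
}
```

In this model a swapfile laid out in many small extents has clusters that
straddle extent boundaries, which a single block device never does.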
 
RHEL-8, v4.18-based, is starting to see more adopters among Red Hat's
customer base, thus the report now. We are also working on a secondary
issue related to CONFIG_THP_SWAP, where the registered deferred THP split
shrinker hits a NULL pointer dereference in case the
swap device is backed by a rotational drive.

-- Rafael



Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Gao Xiang
Hi Andrew,

On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang  wrote:
> 
> > SWP_FS doesn't mean the device is file-backed swap device,
> > which just means each writeback request should go through fs
> > by DIO. Or it'll just use extents added by .swap_activate(),
> > but it also works as file-backed swap device.
> 
> This is very hard to understand :(

Thanks for your reply...

The related logic is in __swap_writepage() and setup_swap_extents(),
and also see e.g generic_swapfile_activate() or iomap_swapfile_activate()...
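
A rough user-space model of the two paths described above may help. It is
only a sketch: toy_extent, toy_route_write() and the TOY_SWP_FS bit value
are hypothetical and do not reproduce the kernel functions named here. With
the flag set, the write is handed to the filesystem callback; otherwise the
page offset is translated to a disk block through the extent list built at
swapon time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent record: swap pages [start_page, start_page + nr_pages)
 * live at consecutive disk blocks starting at start_block. */
struct toy_extent {
	unsigned long start_page;
	unsigned long nr_pages;
	unsigned long start_block;
};

enum toy_path { PATH_FS_CALLBACK, PATH_BLOCK_IO, PATH_UNMAPPED };

#define TOY_SWP_FS (1u << 8)	/* illustrative bit value, an assumption */

/* Decide how a write of swap page `page` would be routed in this model. */
static enum toy_path toy_route_write(unsigned int flags,
				     const struct toy_extent *ext, size_t n,
				     unsigned long page, unsigned long *block)
{
	if (flags & TOY_SWP_FS)
		return PATH_FS_CALLBACK;	/* filesystem handles the I/O */

	for (size_t i = 0; i < n; i++) {
		if (page >= ext[i].start_page &&
		    page < ext[i].start_page + ext[i].nr_pages) {
			/* translate the page offset within the extent */
			*block = ext[i].start_block +
				 (page - ext[i].start_page);
			return PATH_BLOCK_IO;
		}
	}
	return PATH_UNMAPPED;
}

/* Page 9 falls one page into the second extent of this made-up layout. */
static void route_example(void)
{
	const struct toy_extent ext[] = { { 0, 8, 5000 }, { 8, 8, 9000 } };
	unsigned long block = 0;

	assert(toy_route_write(0, ext, 2, 9, &block) == PATH_BLOCK_IO);
	assert(block == 9001);
	assert(toy_route_write(TOY_SWP_FS, ext, 2, 9, &block) ==
	       PATH_FS_CALLBACK);
}
```

The point of the model is that the absence of the flag says nothing about
whether the extents come from a block device or from a regular file.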

I will also talk with "Huang, Ying" in person if no response here.

> 
> > So in order to achieve the goal of the original patch,
> > SWP_BLKDEV should be used instead.
> > 
> > FS corruption can be observed with SSD device + XFS +
> > fragmented swapfile due to CONFIG_THP_SWAP=y.
> > 
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> 
> Why do you think it has taken three years to discover this?

I'm not sure if the Red Hat BZ is publicly available; it can be reproduced
since RHEL 8:
https://bugzilla.redhat.com/show_bug.cgi?id=1855474

It seems hard to believe, but I think it's just because few users use the
SSD device + THP + file-backed swap device combination... maybe I'm wrong
here, but that is what my test shows.

Thanks,
Gao Xiang

> 
> 
> 



Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Andrew Morton
On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang  wrote:

> SWP_FS doesn't mean the device is file-backed swap device,
> which just means each writeback request should go through fs
> by DIO. Or it'll just use extents added by .swap_activate(),
> but it also works as file-backed swap device.

This is very hard to understand :(

> So in order to achieve the goal of the original patch,
> SWP_BLKDEV should be used instead.
> 
> FS corruption can be observed with SSD device + XFS +
> fragmented swapfile due to CONFIG_THP_SWAP=y.
> 
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")

Why do you think it has taken three years to discover this?




[PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Gao Xiang
SWP_FS doesn't mean the device is file-backed swap device,
which just means each writeback request should go through fs
by DIO. Or it'll just use extents added by .swap_activate(),
but it also works as file-backed swap device.

So in order to achieve the goal of the original patch,
SWP_BLKDEV should be used instead.

FS corruption can be observed with SSD device + XFS +
fragmented swapfile due to CONFIG_THP_SWAP=y.

Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
Cc: "Huang, Ying" 
Cc: stable 
Signed-off-by: Gao Xiang 
---

I reproduced the issue with the following details:

Environment:
QEMU + upstream kernel + buildroot + NVMe (2 GB)

Kernel config:
CONFIG_BLK_DEV_NVME=y
CONFIG_THP_SWAP=y

Some reproducible steps:
mkfs.xfs -f /dev/nvme0n1
mkdir /tmp/mnt
mount /dev/nvme0n1 /tmp/mnt
bs="32k"
sz="1024m"	# doesn't matter too much, I also tried 16m
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

mkswap /tmp/mnt/sw
swapon /tmp/mnt/sw

stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well

Symptoms:
 - FS corruption (e.g. checksum failure)
 - memory corruption at: 0xd2808010
 - segfault
 ... 

 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c26916e95fd..2937daf3ca02 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			goto nextsi;
 		}
 		if (size == SWAPFILE_CLUSTER) {
-			if (!(si->flags & SWP_FS))
+			if (si->flags & SWP_BLKDEV)
 				n_ret = swap_alloc_cluster(si, swp_entries);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-- 
2.18.1