Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

2024-06-06 Thread Yu Zhao
On Thu, Jun 6, 2024 at 11:42 AM Yosry Ahmed  wrote:
>
> On Thu, Jun 6, 2024 at 10:14 AM Takero Funaki  wrote:
> >
> > 2024年6月6日(木) 8:42 Yosry Ahmed :
> >
> > > I think there are multiple ways to go forward here:
> > > (a) Make the number of zpools a config option, leave the default as
> > > 32, but allow special use cases to set it to 1 or similar. This is
> > > probably not preferable because it is not clear to users how to set
> > > it, but the idea is that no one will have to set it except special use
> > > cases such as Erhard's (who will want to set it to 1 in this case).
> > >
> > > (b) Make the number of zpools scale linearly with the number of CPUs.
> > > Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> > > approach is that with a large number of CPUs, too many zpools will
> > > start having diminishing returns. Fragmentation will keep increasing,
> > > while the scalability/concurrency gains will diminish.
> > >
> > > (c) Make the number of zpools scale logarithmically with the number of
> > > CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> > > of zpools from increasing too much and close to the status quo. The
> > > problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> > > will actually give a nr_zpools > nr_cpus. So we will need to come up
> > > with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
> > >
> >
> > I just posted a patch to limit the number of zpools, with some
> > theoretical background explained in the code comments. I believe that
> > scaling linearly at 2 zpools per CPU is sufficient to reduce contention,
> > though the factor could be reduced further. It is unlikely that all CPUs
> > are allocating/freeing zswap entries at the same time.
> > How many concurrent accesses were the original 32 zpools supposed to
> > handle? I think they were meant for 16 CPUs or more. Or would nr_cpus/4
> > be enough?
>
> We use 32 zpools on machines with 100s of CPUs. Two zpools per CPU is
> an overkill imo.

Not to choose a camp; just a friendly note on why I strongly disagree
with the N zpools per CPU approach:
1. It is fundamentally flawed to assume the system is linear;
2. Nonlinear systems usually have diminishing returns.

For Google data centers, using nr_cpus as the scaling factor had long
passed the acceptable ROI threshold. Per-CPU data, especially when
compounded per memcg or even per process, is probably the number-one
overhead in terms of DRAM efficiency.
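
To make the compounding concrete, a purely illustrative back-of-the-envelope
calculation (the numbers are made up to show the effect, not measurements):

    64 bytes (one cache line) per CPU  x  256 CPUs  x  1,000 memcgs
    = ~16 MB of DRAM for a single per-CPU, per-memcg field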



> I have further comments that I will leave on the patch, but I mainly
> think this should be driven by real data, not theoretical possibility
> of lock contention.


Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

2024-06-05 Thread Yu Zhao
On Wed, Jun 5, 2024 at 9:12 PM Michael Ellerman  wrote:
>
> David Hildenbrand  writes:
> > On 01.06.24 08:01, Yu Zhao wrote:
> >> On Wed, May 15, 2024 at 4:06 PM Yu Zhao  wrote:
> ...
> >>
> >> Your system has 2GB memory and it uses zswap with zsmalloc (which is
> >> good since it can allocate from the highmem zone) and zstd/lzo (which
> >> doesn't matter much). Somehow -- I couldn't figure out why -- it
> >> splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
> >>
> >> [0.00] Zone ranges:
> >> [0.00]   DMA  [mem 0x-0x2fff]
> >> [0.00]   Normal   empty
> >> [0.00]   HighMem  [mem 0x3000-0x7fff]
> >
> > That's really odd. But we are messing with "PowerMac3,6", so I don't
> > really know what's right or wrong ...
>
> The DMA zone exists because 9739ab7eda45 ("powerpc: enable a 30-bit
> ZONE_DMA for 32-bit pmac") selects it.
>
> It's 768MB (not 0.25GB) because it's clamped at max_low_pfn:

Right. (I meant 0.75GB.)

> #ifdef CONFIG_ZONE_DMA
> max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
>   1UL << (zone_dma_bits - PAGE_SHIFT));
> #endif
>
> Which comes eventually from CONFIG_LOWMEM_SIZE, which defaults to 768MB.

I see. I grep'ed for VMSPLIT, which is used on x86 and arm but apparently
not on powerpc.

> I think it's 768MB because the user:kernel split is 3G:1G, and then the
> kernel needs some of that 1G virtual space for vmalloc/ioremap/highmem,
> so it splits it 768M:256M.
>
> Then ZONE_NORMAL is empty because it is also limited to max_low_pfn:
>
> max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
>
> The rest of RAM is highmem.
>
> So I think that's all behaving as expected, but I don't know 32-bit /
> highmem stuff that well so I could be wrong.

Yes, the three zones work as intended.

Erhard,

Since your system only has 2GB memory, I'd try the 2G:2G split, which
would in theory allow both the kernel and userspace to use all memory.

CONFIG_LOWMEM_SIZE_BOOL=y
CONFIG_LOWMEM_SIZE=0x700

(Michael, please correct me if the above wouldn't work.)
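
To spell out the arithmetic behind the current 3G:1G layout (an illustrative
sketch with sizes instead of pfns; not the exact powerpc code path):

    kernel virtual space (1G of the 3G:1G split)      = 1024MB
    reserved for vmalloc/ioremap/fixmap               ~  256MB
    lowmem (direct map), i.e. max_low_pfn             =  768MB

    max_zone_pfns[ZONE_DMA]    = min(max_low_pfn, 1 << (zone_dma_bits - PAGE_SHIFT))
                               = min(768MB, 1GB)      =  768MB
    max_zone_pfns[ZONE_NORMAL] = max_low_pfn          =  768MB  -> ZONE_NORMAL empty
    ZONE_HIGHMEM               = 2GB - 768MB          = 1.25GB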


Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

2024-06-05 Thread Yu Zhao
On Wed, Jun 5, 2024 at 5:42 PM Yosry Ahmed  wrote:
>
> On Wed, Jun 5, 2024 at 4:04 PM Erhard Furtner  wrote:
> >
> > On Tue, 4 Jun 2024 20:03:27 -0700
> > Yosry Ahmed  wrote:
> >
> > > Could you check if the attached patch helps? It basically changes the
> > > number of zpools from 32 to min(32, nr_cpus).
> >
> > Thanks! The patch does not fix the issue but it helps.
> >
> > Means I still get to see the 'kswapd0: page allocation failure' in the 
> > dmesg, a 'stress-ng-vm: page allocation failure' later on, another kswapd0 
> > error later on, etc. _but_ the machine keeps running the workload, stays 
> > usable via VNC and I get no hard crash any longer.
> >
> > Without patch kswapd0 error and hard crash (need to power-cycle) <3min. 
> > With patch several kswapd0 errors but running for 2 hrs now. I double 
> > checked this to be sure.
>
> Thanks for trying this out. This is interesting, so even two zpools is
> too much fragmentation for your use case.

Now I'm a little bit skeptical that the problem is due to fragmentation.

> I think there are multiple ways to go forward here:
> (a) Make the number of zpools a config option, leave the default as
> 32, but allow special use cases to set it to 1 or similar. This is
> probably not preferable because it is not clear to users how to set
> it, but the idea is that no one will have to set it except special use
> cases such as Erhard's (who will want to set it to 1 in this case).
>
> (b) Make the number of zpools scale linearly with the number of CPUs.
> Maybe something like nr_cpus/4 or nr_cpus/8. The problem with this
> approach is that with a large number of CPUs, too many zpools will
> start having diminishing returns. Fragmentation will keep increasing,
> while the scalability/concurrency gains will diminish.
>
> (c) Make the number of zpools scale logarithmically with the number of
> CPUs. Maybe something like 4log2(nr_cpus). This will keep the number
> of zpools from increasing too much and close to the status quo. The
> problem is that at a small number of CPUs (e.g. 2), 4log2(nr_cpus)
> will actually give a nr_zpools > nr_cpus. So we will need to come up
> with a more fancy magic equation (e.g. 4log2(nr_cpus/4)).
>
> (d) Make the number of zpools scale linearly with memory. This makes
> more sense than scaling with CPUs because increasing the number of
> zpools increases fragmentation, so it makes sense to limit it by the
> available memory. This is also more consistent with other magic
> numbers we have (e.g. SWAP_ADDRESS_SPACE_SHIFT).
>
> The problem is that unlike zswap trees, the zswap pool is not
> connected to the swapfile size, so we don't have an indication for how
> much memory will be in the zswap pool. We can scale the number of
> zpools with the entire memory on the machine during boot, but this
> seems like it would be difficult to figure out, and will not take into
> consideration memory hotplugging and the zswap global limit changing.
>
> (e) A creative mix of the above.
>
> (f) Something else (probably simpler).
>
> I am personally leaning toward (c), but I want to hear the opinions of
> other people here. Yu, Vlastimil, Johannes, Nhat? Anyone else?

I double-checked that commit and didn't find anything wrong. If we are
all in the mood to get to the bottom of this, can we try using only 1
zpool while there are 2 available? I.e.,

static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
{
 - return entry->pool->zpools[hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS))];
 + return entry->pool->zpools[0];
}
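
Separately, for concreteness, option (c) from the list above could be sketched
roughly as follows -- a hypothetical boot-time helper with illustrative names,
clamped so the result never exceeds nr_cpus or the current 32; not a proposed
patch:

    /* hypothetical sketch of option (c); names are illustrative */
    static unsigned int nr_zswap_zpools;

    static void zswap_setup_nr_zpools(void)
    {
            unsigned int n = 4 * ilog2(num_possible_cpus() + 1);

            /* keep at least 1 zpool, and never more than nr_cpus or 32 */
            nr_zswap_zpools = clamp(n, 1U, min(num_possible_cpus(), 32U));
    }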

> In the long-term, I think we may want to address the lock contention
> in zsmalloc itself instead of zswap spawning multiple zpools.
>
> >
> > The patch did not apply cleanly on v6.9.3 so I applied it on v6.10-rc2. 
> > dmesg of the current v6.10-rc2 run attached.
> >
> > Regards,
> > Erhard


Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

2024-06-04 Thread Yu Zhao
On Tue, Jun 4, 2024 at 11:34 AM Yosry Ahmed  wrote:
>
> On Tue, Jun 4, 2024 at 10:19 AM Yu Zhao  wrote:
> >
> > On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed  wrote:
> > >
> > > On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner  
> > > wrote:
> > > >
> > > > On Mon, 3 Jun 2024 16:24:02 -0700
> > > > Yosry Ahmed  wrote:
> > > >
> > > > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > > > have a very limited area of memory to allocate kernel memory from. One
> > > > > possible reason why that commit can cause an issue is because we will
> > > > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > > > >
> > > > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > > > >
> > > > > Also, could you check if the attached patch helps? It makes sure that
> > > > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > > > cache of each type.
> > > >
> > > > Thanks for looking into this! I got you 'cat /proc/slabinfo' from a 
> > > > good HEAD, from a bad HEAD and from the bad HEAD + your patch applied.
> > > >
> > > > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was 
> > > > b8cf32dc6e8c75b712cbf638e0fd210101c22f17 which I got both from my 
> > > > bisect.log. I got the slabinfo shortly after boot and a 2nd time 
> > > > shortly before the OOM or the kswapd0: page allocation failure happens. 
> > > > I terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify 
> > > > -v) manually shortly before the 2 GiB RAM exhausted and got the 
> > > > slabinfo then.
> > > >
> > > > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 
> > > > unfortunately didn't make a difference, I got the kswapd0: page 
> > > > allocation failure nevertheless.
> > >
> > > Thanks for trying this out. The patch reduces the amount of wasted
> > > memory due to the 'zs_handle' and 'zspage' caches by an order of
> > > magnitude, but it was a small number to begin with (~250K).
> > >
> > > I cannot think of other reasons why having multiple zsmalloc pools
> > > will end up using more memory in the 0.25GB zone that the kernel
> > > allocations can be made from.
> > >
> > > The number of zpools can be made configurable or determined at runtime
> > > by the size of the machine, but I don't want to do this without
> > > understanding the problem here first. Adding other zswap and zsmalloc
> > > folks in case they have any ideas.
> >
> > Hi Erhard,
> >
> > If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
> > on kernels before and after the bad commit? It'd be great if you could
> > run the grep command right before the OOM kills.
> >
> > The overall internal fragmentation of multiple zsmalloc pools might be
> > higher than a single one. I suspect this might be the cause.
>
> I thought about the internal fragmentation of pools, but zsmalloc
> should have access to highmem, and if I understand correctly the
> problem here is that we are running out of space in the DMA zone when
> making kernel allocations.
>
> Do you suspect zsmalloc is allocating memory from the DMA zone
> initially, even though it has access to highmem?

There was a lot of user memory in the DMA zone. So at some point the
highmem zone filled up and allocations fell back to the lower zones.

The problem with zone fallback is that recent allocations go into
lower zones, meaning they are further back on the LRU list. This
applies to both user memory and zsmalloc memory -- the latter has a
writeback LRU. On top of this, neither the zswap shrinker nor the
zsmalloc shrinker (compaction) is zone aware. So page reclaim might
have trouble hitting the right target zone.
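
To illustrate the fallback behavior described above, a rough sketch of the
allocation/reclaim asymmetry on this machine (comments only, not actual code):

    /*
     * Zonelist fallback for a highmem-capable allocation (user pages,
     * zsmalloc pages) on this 2GB machine:
     *
     *     try ZONE_HIGHMEM -> try ZONE_NORMAL (empty here) -> try ZONE_DMA
     *
     * Once HighMem fills up, new allocations land in ZONE_DMA. Being the
     * most recently allocated, those pages sit at the young end of the LRU,
     * so reclaim -- which works from the old end -- reaches them last.
     */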

We can't really tell how zspages are distributed across zones, but the
overall number might still be helpful. It'd be great if someone could
make nr_zspages a per-zone counter :)


Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

2024-06-04 Thread Yu Zhao
On Tue, Jun 4, 2024 at 10:12 AM Yosry Ahmed  wrote:
>
> On Tue, Jun 4, 2024 at 4:45 AM Erhard Furtner  wrote:
> >
> > On Mon, 3 Jun 2024 16:24:02 -0700
> > Yosry Ahmed  wrote:
> >
> > > Thanks for bisecting. Taking a look at the thread, it seems like you
> > > have a very limited area of memory to allocate kernel memory from. One
> > > possible reason why that commit can cause an issue is because we will
> > > have multiple instances of the zsmalloc slab caches 'zspage' and
> > > 'zs_handle', which may contribute to fragmentation in slab memory.
> > >
> > > Do you have /proc/slabinfo from a good and a bad run by any chance?
> > >
> > > Also, could you check if the attached patch helps? It makes sure that
> > > even when we use multiple zsmalloc zpools, we will use a single slab
> > > cache of each type.
> >
> > Thanks for looking into this! I got you 'cat /proc/slabinfo' from a good 
> > HEAD, from a bad HEAD and from the bad HEAD + your patch applied.
> >
> > Good was 6be3601517d90b728095d70c14f3a04b9adcb166, bad was 
> > b8cf32dc6e8c75b712cbf638e0fd210101c22f17 which I got both from my 
> > bisect.log. I got the slabinfo shortly after boot and a 2nd time shortly 
> > before the OOM or the kswapd0: page allocation failure happens. I 
> > terminated the workload (stress-ng --vm 2 --vm-bytes 1930M --verify -v) 
> > manually shortly before the 2 GiB RAM exhausted and got the slabinfo then.
> >
> > The patch applied to git b8cf32dc6e8c75b712cbf638e0fd210101c22f17 
> > unfortunately didn't make a difference, I got the kswapd0: page allocation 
> > failure nevertheless.
>
> Thanks for trying this out. The patch reduces the amount of wasted
> memory due to the 'zs_handle' and 'zspage' caches by an order of
> magnitude, but it was a small number to begin with (~250K).
>
> I cannot think of other reasons why having multiple zsmalloc pools
> will end up using more memory in the 0.25GB zone that the kernel
> allocations can be made from.
>
> The number of zpools can be made configurable or determined at runtime
> by the size of the machine, but I don't want to do this without
> understanding the problem here first. Adding other zswap and zsmalloc
> folks in case they have any ideas.

Hi Erhard,

If it's not too much trouble, could you "grep nr_zspages /proc/vmstat"
on kernels before and after the bad commit? It'd be great if you could
run the grep command right before the OOM kills.
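
One way to capture that continuously right up to the OOM kill (assuming the
output goes to a serial/netconsole log, as in the earlier dmesgs):

    # sample once per second; the last samples before the OOM kill are the
    # interesting ones
    while true; do grep nr_zspages /proc/vmstat; sleep 1; done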

The overall internal fragmentation of multiple zsmalloc pools might be
higher than a single one. I suspect this might be the cause.

Thank you.


Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

2024-06-02 Thread Yu Zhao
On Sun, Jun 2, 2024 at 12:03 PM Erhard Furtner  wrote:
>
> On Sat, 1 Jun 2024 00:01:48 -0600
> Yu Zhao  wrote:
>
> > Hi Erhard,
> >
> > The OOM kills on both kernel versions seem to be reasonable to me.
> >
> > Your system has 2GB memory and it uses zswap with zsmalloc (which is
> > good since it can allocate from the highmem zone) and zstd/lzo (which
> > doesn't matter much). Somehow -- I couldn't figure out why -- it
> > splits the 2GB into a 0.25GB DMA zone and a 1.75GB highmem zone:
> >
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x-0x2fff]
> > [0.00]   Normal   empty
> > [0.00]   HighMem  [mem 0x3000-0x7fff]
> >
> > The kernel can't allocate from the highmem zone -- only userspace and
> > zsmalloc can. OOM kills were due to the low memory conditions in the
> > DMA zone where the kernel itself failed to allocate from.
> >
> > Do you know a kernel version that doesn't have OOM kills while running
> > the same workload? If so, could you send that .config to me? If not,
> > could you try disabling CONFIG_HIGHMEM? (It might not help but I'm out
> > of ideas at the moment.)
> >
> > Thanks!
>
> Hi Yu!
>
> Thanks for looking into this.
>
> The reason for this 0.25GB DMA / 1.75GB highmem split is beyond my knowledge. 
> I can only tell this much that it's like this at least since kernel v4.14.x 
> (dmesg of an old bugreport of mine at 
> https://bugzilla.kernel.org/show_bug.cgi?id=201723), I guess earlier kernel 
> versions too.
>
> Without CONFIG_HIGHMEM the memory layout looks like this:
>
> Total memory = 768MB; using 2048kB for hash table
> [...]
> Top of RAM: 0x3000, Total RAM: 0x3000
> Memory hole size: 0MB
> Zone ranges:
>   DMA  [mem 0x-0x2fff]
>   Normal   empty
> Movable zone start for each node
> Early memory node ranges
>   node   0: [mem 0x-0x2fff]
> Initmem setup node 0 [mem 0x-0x2fff]
> percpu: Embedded 29 pages/cpu s28448 r8192 d82144 u118784
> pcpu-alloc: s28448 r8192 d82144 u118784 alloc=29*4096
> pcpu-alloc: [0] 0 [0] 1
> Kernel command line: ro root=/dev/sda5 slub_debug=FZP page_poison=1 
> netconsole=@192.168.2.8/eth0,@192.168.2.3/A8:A1:59:16:4F:EA debug
> Dentry cache hash table entries: 131072 (order: 7, 524288 bytes, linear)
> Inode-cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
> Built 1 zonelists, mobility grouping on.  Total pages: 194880
> mem auto-init: stack:all(pattern), heap alloc:off, heap free:off
> Kernel virtual memory layout:
>   * 0xffbdf000..0xf000  : fixmap
>   * 0xff8f4000..0xffbdf000  : early ioremap
>   * 0xf100..0xff8f4000  : vmalloc & ioremap
>   * 0xb000..0xc000  : modules
> Memory: 761868K/786432K available (7760K kernel code, 524K rwdata, 4528K 
> rodata, 1100K init, 253K bss, 24564K reserved, 0K cma-reserved)
> [...]
>
> With only 768 MB RAM and 2048K hashtable I get pretty much the same "kswapd0: 
> page allocation failure: order:0, 
> mode:0xcc0(GFP_KERNEL),nodemask=(null),cpuset=/,mems_allowed=0" as with the 
> HIGHMEM enabled kernel at
> running "stress-ng --vm 2 --vm-bytes 1930M --verify -v".
>
> I tried the workload on v6.6.32 LTS where the issue shows up too. But v6.1.92 
> LTS seems ok! Triple checked v6.1.92 to be sure.
>
> Attached please find kernel v6.9.3 dmesg (without HIGHMEM) and kernel v6.1.92 
> .config.

Thanks.

I compared the .config between v6.8.9 (you attached previously) and
v6.1.92 -- I didn't see any major differences (both have ZONE_DMA,
HIGHMEM, MGLRU and zswap/zsmalloc). Either there is something broken
between v6.1.92 and v6.6.32 (as you mentioned above), or it's just
kernel allocation bloat which puts the DMA zone (0.25GB) under too
heavy pressure. The latter isn't uncommon when upgrading to a newer
version of the kernel.

Could you please attach the dmesg from v6.1.92? I want to compare the
dmesg between the two kernel versions as well -- that might provide
some hints.


Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

2024-06-01 Thread Yu Zhao
On Wed, May 15, 2024 at 4:06 PM Yu Zhao  wrote:
>
> On Wed, May 15, 2024 at 2:45 PM Erhard Furtner  wrote:
> >
> > On Wed, 8 May 2024 20:21:11 +0200
> > Erhard Furtner  wrote:
> >
> > > Greetings!
> > >
> > > Got that on my dual CPU PowerMac G4 DP shortly after boot. This does not 
> > > happen every time at bootup though:
> > >
> > > [...]
> > > kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), 
> > > nodemask=(null),cpuset=/,mems_allowed=0
> > > CPU: 1 PID: 40 Comm: kswapd0 Not tainted 6.8.9-gentoo-PMacG4 #1
> > > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> >
> > Very similar page allocation failure on the same machine on kernel 6.9.0 
> > too. Seems it can easily be provoked by running a memory stressor, e.g. 
> > "stress-ng --vm 2 --vm-bytes 1930M --verify -v":
> >
> > [...]
> > kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), 
> > nodemask=(null),cpuset=/,mems_allowed=0
> > CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
> > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> > Call Trace:
> > [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
> > [c1c65960] [c01b6234] warn_alloc+0x100/0x178
> > [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
> > [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
> > [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
> > [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
> > [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
> > [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
> > [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
> > [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
> > [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
> > [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
> > [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
> > [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
> > [c1c65f50] [c017f270] kswapd+0x228/0x25c
> > [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
> > [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> > order: 1, min order: 0
> >   node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> > order: 1, min order: 0
> >   node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> > order: 1, min order: 0
> >   node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> > order: 1, min order: 0
> >   node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> > order: 1, min order: 0
> >   node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default 
> > order: 3, min order: 1
> >   kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to 
> > disable.
> >   node 0: slabs: 33, objs: 165, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> > order: 1, min order: 0
> >   node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default 
> > order: 3, min order: 1
> >   kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to 
> > disable.
> >   node 0: slabs: 33, objs: 165, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> > order: 1, min order: 0
> >   node 0: slabs: 15, objs: 225, free: 0
> > SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
> >   cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default 
> > order: 3, min order: 1
> >   kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to 
> > disable.
> >   node 0: slabs: 33, objs: 165,

Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-31 Thread Yu Zhao
On Fri, May 31, 2024 at 1:24 AM Oliver Upton  wrote:
>
> On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote:
> > On Wed, May 29, 2024 at 12:05 PM James Houghton  
> > wrote:
> > >
> > > Secondary MMUs are currently consulted for access/age information at
> > > eviction time, but before then, we don't get accurate age information.
> > > That is, pages that are mostly accessed through a secondary MMU (like
> > > guest memory, used by KVM) will always just proceed down to the oldest
> > > generation, and then at eviction time, if KVM reports the page to be
> > > young, the page will be activated/promoted back to the youngest
> > > generation.
> >
> > Correct, and as I explained offline, this is the only reasonable
> > behavior if we can't locklessly walk secondary MMUs.
> >
> > Just for the record, the (crude) analogy I used was:
> > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > but you are only allowed to pick up 10 of them (and put them in your
> > pocket). A smart move would be to survey the room *first and then*
> > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > you would just want to pick up whichever that's in front of you rather
> > than walk the entire room.
> >
> > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > done lockless. Otherwise, it should just fall back to the existing
> > approach, which existed in previous versions but is removed in this
> > version.
>
> Grabbing the MMU lock for write to scan sucks, no argument there. But
> can you please be specific about the impact of read lock v. RCU in the
> case of arm64? I had asked about this before and you never replied.
>
> My concern remains that adding support for software table walkers
> outside of the MMU lock entirely requires more work than just deferring
> the deallocation to an RCU callback. Walkers that previously assumed
> 'exclusive' access while holding the MMU lock for write must now cope
> with volatile PTEs.
>
> Yes, this problem already exists when hardware sets the AF, but the
> lock-free walker implementation needs to be generic so it can be applied
> for other PTE bits.

Direct reclaim is multi-threaded and each reclaimer can take the mmu
lock for read (testing the A-bit) or write (unmapping before paging
out) on arm64. The fundamental problem of using the readers-writer
lock in this case is priority inversion: the readers have lower
priority than the writers, so ideally, we don't want the readers to
block the writers at all.
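
Illustratively, the two contending paths look roughly like this (kvm->mmu_lock
is an rwlock on arm64; the helpers below are hypothetical, not the actual KVM
code):

    /* aging: lower priority, opportunistic */
    read_lock(&kvm->mmu_lock);
    young = test_and_clear_sptes_accessed(range);   /* hypothetical helper */
    read_unlock(&kvm->mmu_lock);

    /* eviction: higher priority, must make progress to free memory */
    write_lock(&kvm->mmu_lock);
    unmap_sptes(range);                             /* hypothetical helper */
    write_unlock(&kvm->mmu_lock);

Any reader holding the lock delays the writer, which is the inversion
described above.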

Using my previous (crude) analogy: picking up the bill right in front of
you (the writers) pays off immediately, whereas searching for the
largest bill (the readers) can be futile.

As I said earlier, I prefer we drop the arm64 support for now, but I
will not object to taking the mmu lock for read when clearing the
A-bit, as long as we fully understand the problem here and document it
clearly.


Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-31 Thread Yu Zhao
On Fri, May 31, 2024 at 1:03 AM Oliver Upton  wrote:
>
> On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote:

Let me add back what I said earlier:

  I'm not convinced, but it doesn't mean your point of view is
  invalid. If you fully understand the implications of your design
  choice and document them, I will not object.

> > All optimizations in v2 were measured step by step. Even that bitmap,
> > which might be considered overengineered, brought a readily
> > measurable 4% improvement in memcached throughput on Altra Max
> > swapping to Optane:
>
> That's great, but taking an iterative approach to the problem allows
> the reviewers and maintainers to come to their own conclusions about
> each optimization independently. Squashing all of that together and
> posting the result doesn't allow for this.

That's your methodology, which I respect: as I said I won't stand in your way.

But mine is backed by data; please respect that as well, by doing
what I asked: document your justifications.

> Even if we were to take the series as-is, the door is wide open to
> subsequent improvements.
>
> > What I don't think is acceptable is simplifying those optimizations
> > out without documenting your justifications (I would even call it a
> > design change, rather than simplification, from v3 to v4).
>
> No, sorry, there's nothing wrong with James' approach here.

Sorry, are you saying there is nothing wrong with "without documenting
your justifications"? If so, please elaborate.

> The discussion that led to the design of v4 happened on list; you were
> on CC. The general consensus on the KVM side was that the bitmap was
> complicated and lacked independent justification. There was ample
> opportunity to voice your concerns before he spent the time on v4.

Please re-read my previous emails -- I never objected to the removal of
the bitmap or to James' approach.

And please stop making assumptions -- I did voice my concerns with
James privately.

> You seriously cannot fault a contributor for respinning their work based
> on the provided feedback.

Are you saying I faulted James for taking others' feedback? If so,
where? And I'll make sure I don't give such an impression in the
future.

Also, what do you think about the technical flaws and inaccurate
understandings I pointed out? You seem to have a strong opinion on
your iterative approach, but I hope you didn't choose to overlook the
real meat of this discussion.


Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-29 Thread Yu Zhao
On Wed, May 29, 2024 at 3:59 PM Sean Christopherson  wrote:
>
> On Wed, May 29, 2024, Yu Zhao wrote:
> > On Wed, May 29, 2024 at 12:05 PM James Houghton  
> > wrote:
> > >
> > > Secondary MMUs are currently consulted for access/age information at
> > > eviction time, but before then, we don't get accurate age information.
> > > That is, pages that are mostly accessed through a secondary MMU (like
> > > guest memory, used by KVM) will always just proceed down to the oldest
> > > generation, and then at eviction time, if KVM reports the page to be
> > > young, the page will be activated/promoted back to the youngest
> > > generation.
> >
> > Correct, and as I explained offline, this is the only reasonable
> > behavior if we can't locklessly walk secondary MMUs.
> >
> > Just for the record, the (crude) analogy I used was:
> > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > but you are only allowed to pick up 10 of them (and put them in your
> > pocket). A smart move would be to survey the room *first and then*
> > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > you would just want to pick up whichever that's in front of you rather
> > than walk the entire room.
> >
> > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > done lockless. Otherwise, it should just fall back to the existing
> > approach, which existed in previous versions but is removed in this
> > version.
>
> IIUC, by "existing approach" you mean completely ignore secondary MMUs that 
> don't
> implement a lockless walk?

No, the existing approach only checks secondary MMUs for LRU folios,
i.e., those at the end of the LRU list. It might not find the best
candidates (the coldest ones) on the entire list, but it doesn't pay
as much for the locking. MGLRU can *optionally* scan MMUs (secondary
included) to find the best candidates, but it can only be a win if the
scanning incurs a relatively low overhead, e.g., done locklessly for
the secondary MMU. IOW, this is a balance between the cost of
reclaiming not-so-cold (warm) folios and that of finding the coldest
folios.

Scanning host MMUs is likely to be a win because 1) there is usually
access locality and 2) there is no coarse-grained locking. If neither holds,
scanning secondary MMUs would likely be a loss. And 1) is generally
weaker for secondary MMUs, since it's about (guest) physical address
space.


Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-29 Thread Yu Zhao
On Wed, May 29, 2024 at 12:05 PM James Houghton  wrote:
>
> Secondary MMUs are currently consulted for access/age information at
> eviction time, but before then, we don't get accurate age information.
> That is, pages that are mostly accessed through a secondary MMU (like
> guest memory, used by KVM) will always just proceed down to the oldest
> generation, and then at eviction time, if KVM reports the page to be
> young, the page will be activated/promoted back to the youngest
> generation.

Correct, and as I explained offline, this is the only reasonable
behavior if we can't locklessly walk secondary MMUs.

Just for the record, the (crude) analogy I used was:
Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
but you are only allowed to pick up 10 of them (and put them in your
pocket). A smart move would be to survey the room *first and then*
pick up the largest ones. But if you are carrying a 500 lbs backpack,
you would just want to pick up whichever that's in front of you rather
than walk the entire room.

MGLRU should only scan (or lookaround) secondary MMUs if it can be
done lockless. Otherwise, it should just fall back to the existing
approach, which existed in previous versions but is removed in this
version.

> Do not do look around if there is a secondary MMU we have to interact
> with.
>
> The added feature bit (0x8), if disabled, will make MGLRU behave as if
> there are no secondary MMUs subscribed to MMU notifiers except at
> eviction time.
>
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 

This is not what I suggested, and it would have been done in the first
place if it hadn't regressed the non-lockless case.

NAK.


Re: kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 (Kernel v6.5.9, 32bit ppc)

2024-05-15 Thread Yu Zhao
On Wed, May 15, 2024 at 2:45 PM Erhard Furtner  wrote:
>
> On Wed, 8 May 2024 20:21:11 +0200
> Erhard Furtner  wrote:
>
> > Greetings!
> >
> > Got that on my dual CPU PowerMac G4 DP shortly after boot. This does not 
> > happen every time at bootup though:
> >
> > [...]
> > kswapd0: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), 
> > nodemask=(null),cpuset=/,mems_allowed=0
> > CPU: 1 PID: 40 Comm: kswapd0 Not tainted 6.8.9-gentoo-PMacG4 #1
> > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
>
> Very similar page allocation failure on the same machine on kernel 6.9.0 too. 
> Seems it can easily be provoked by running a memory stressor, e.g. "stress-ng 
> --vm 2 --vm-bytes 1930M --verify -v":
>
> [...]
> kswapd0: page allocation failure: order:0, mode:0xcc0(GFP_KERNEL), 
> nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 0 PID: 41 Comm: kswapd0 Not tainted 6.9.0-gentoo-PMacG4 #1
> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> Call Trace:
> [c1c65940] [c07926d4] dump_stack_lvl+0x80/0xac (unreliable)
> [c1c65960] [c01b6234] warn_alloc+0x100/0x178
> [c1c659c0] [c01b661c] __alloc_pages+0x370/0x8d0
> [c1c65a80] [c01c4854] __read_swap_cache_async+0xc0/0x1cc
> [c1c65ad0] [c01cb580] zswap_writeback_entry+0x50/0x154
> [c1c65be0] [c01cb6f4] shrink_memcg_cb+0x70/0xec
> [c1c65c10] [c019518c] __list_lru_walk_one+0xa0/0x154
> [c1c65c70] [c01952a4] list_lru_walk_one+0x64/0x7c
> [c1c65ca0] [c01cad00] zswap_shrinker_scan+0xac/0xc4
> [c1c65cd0] [c018052c] do_shrink_slab+0x18c/0x304
> [c1c65d20] [c0180a40] shrink_slab+0x174/0x260
> [c1c65da0] [c017cb0c] shrink_one+0xbc/0x134
> [c1c65dd0] [c017e3e4] shrink_node+0x238/0x84c
> [c1c65e50] [c017ed38] balance_pgdat+0x340/0x650
> [c1c65f50] [c017f270] kswapd+0x228/0x25c
> [c1c65fc0] [c006bbac] kthread+0xe4/0xe8
> [c1c65ff0] [c0015304] start_kernel_thread+0x10/0x14
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> order: 1, min order: 0
>   node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> order: 1, min order: 0
>   node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> order: 1, min order: 0
>   node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> order: 1, min order: 0
>   node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> order: 1, min order: 0
>   node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default 
> order: 3, min order: 1
>   kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to 
> disable.
>   node 0: slabs: 33, objs: 165, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> order: 1, min order: 0
>   node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default 
> order: 3, min order: 1
>   kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to 
> disable.
>   node 0: slabs: 33, objs: 165, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: skbuff_small_head, object size: 480, buffer size: 544, default 
> order: 1, min order: 0
>   node 0: slabs: 15, objs: 225, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>   cache: kmalloc-rnd-15-2k, object size: 2048, buffer size: 6144, default 
> order: 3, min order: 1
>   kmalloc-rnd-15-2k debugging increased min order, use slab_debug=O to 
> disable.
>   node 0: slabs: 33, objs: 165, free: 0
> Mem-Info:
> active_anon:340071 inactive_anon:139179 isolated_anon:0
>  active_file:8297 inactive_file:2506 isolated_file:0
>  unevictable:4 dirty:1 writeback:18
>  slab_reclaimable:1377 slab_unreclaimable:7426
>  mapped:6804 shmem:112 pagetables:946
>  sec_pagetables:0 bounce:0
>  kernel_misc_reclaimable:0
>  free:1141 free_pcp:7 free_cma:0
> Node 0 active_anon:1360284kB inactive_anon:556716kB active_file:33188kB 
> inactive_file:10024kB unevictable:16kB isolated(anon):0kB isolated(file):0kB 
> mapped:27216kB dirty:4kB writeback:72kB shmem:448kB writeback_tmp:0kB 
> kernel_stack:1560kB pagetables:3784kB sec_pagetables:0kB all_unreclaimable? no
> DMA free:56kB boost:7756kB min:11208kB low:12068kB high:12928kB 
> reserved_highatomic:0KB active_anon:635128kB inactive_anon:58260kB 
> 

Re: [PATCH 3/3] mm/lru_gen: Don't build multi-gen LRU page table walk code on architecture not supported

2023-06-27 Thread Yu Zhao
On Tue, Jun 27, 2023 at 5:48 AM Aneesh Kumar K V
 wrote:
>
> On 6/26/23 10:34 PM, Yu Zhao wrote:
> > On Mon, Jun 26, 2023 at 4:52 AM Aneesh Kumar K V
> >  wrote:
> >>
> >> On 6/26/23 1:04 AM, Yu Zhao wrote:
> >>> On Sat, Jun 24, 2023 at 8:54 AM Aneesh Kumar K.V
> >>>  wrote:
> >>>>
> >>>> Hi Yu Zhao,
> >>>>
> >>>> "Aneesh Kumar K.V"  writes:
> >>>>
> >>>>> Not all architecture supports hardware atomic updates of access bits. On
> >>>>> such an arch, we don't use page table walk to classify pages into
> >>>>> generations. Add a kernel config option and remove adding all the page
> >>>>> table walk code on such architecture.
> >>>>>
> >>>>> No performance change observed with the mongodb ycsb test:
> >>>>>
> >>>>> Patch details      Throughput (Ops/sec)
> >>>>> Without patch      93278
> >>>>> With patch         93400
> >>>>>
> >>>>> Without patch:
> >>>>> $ size mm/vmscan.o
> >>>>>    text    data  bss     dec    hex filename
> >>>>>  112102   42721   40  154863  25cef mm/vmscan.o
> >>>>>
> >>>>> With patch
> >>>>>
> >>>>> $ size  mm/vmscan.o
> >>>>>    text    data  bss     dec    hex filename
> >>>>>  105430   41333   24  146787  23d63 mm/vmscan.o
> >>>>>
> >>>>
> >>>> Any feedback on this patch? Can we look at merging this change?
> >>>
> >>> Just want to make sure I fully understand the motivation: are there
> >>> any other end goals besides reducing the footprint mentioned above?
> >>> E.g., preparing for HCA, etc. (My current understanding is that HCA
> >>> shouldn't care about it, since it's already runtime disabled if HCA
> >>> doesn't want to use it.)
> >>>
> >>
> >> My goal with this change was to remove all that dead code from getting
> >> compiled in for ppc64.
> >
> > I see. But the first thing (lru_gen_add_folio()) you moved has nothing
> > to do with this goal, because it's still compiled after the entire
> > series.
> >
>
> Sure. will drop that change.
>
> >>> Also as explained offline, solely relying on folio_activate() in
> >>> lru_gen_look_around() can cause a measurable regression on powerpc,
> >>> because
> >>> 1. PAGEVEC_SIZE is 15 whereas pglist_data->mm_walk.batched is
> >>> virtually unlimited.
> >>> 2. Once folio_activate() reaches that limit, it takes the LRU lock on
> >>> top of the PTL, which can be shared by multiple page tables on
> >>> powerpc.
> >>>
> >>> In fact, I think we should try the opposite direction first, before
> >>> arriving at any conclusions, i.e.,
> >>> #define arch_has_hw_pte_young() radix_enabled()
> >>
> >> The reason it is disabled on powerpc is that a reference bit update
> >> takes a page fault irrespective of the translation mode.
> >
> > This is not true.
> >
> > From "IBM POWER9 Processor User Manual":
> > https://openpowerfoundation.org/resources/ibmpower9usermanual/
> >
> >   4.10.14 Reference and Change Bits
> >   ...
> >   When performing HPT translation, the hardware performs the R and C
> > bit updates nonatomically.
> >   ...
> >
> > The radix case is more complex, and I'll leave it to you to interpret
> > what it means:
> >
> > From "Power ISA Version 3.0 B":
> > https://openpowerfoundation.org/specifications/isa/
> >
> >   5.7.12 Reference and Change Recording
> >   ...
> >   For Radix Tree translation, the Reference and Change bits are set 
> > atomically.
> >   ...
> >
>
> It is atomic in that software uses ldarx/stdcx. to update these bits. The
> hardware/core won't update them directly, even though the Nest can update
> them directly without taking a fault. So for all practical purposes we can
> assume that on radix the R/C bits are updated by the page fault handler.

Thanks. To me, that sounds like a statement about functionality provided by
the h/w, not a requirement on s/w. (IMO, the latter would read something
like "software must/should set the bits atomically.") But I'll take your
word for it.


Re: [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()

2023-06-20 Thread Yu Zhao
On Tue, Jun 20, 2023 at 1:48 AM Nicholas Piggin  wrote:
>
> On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > Implement kvm_arch_test_clear_young() to support the fast path in
> > mmu_notifier_ops->test_clear_young().
> >
> > It focuses on a simple case, i.e., radix MMU sets the accessed bit in
> > KVM PTEs and VMs are not nested, where it can rely on RCU and
> > pte_xchg() to safely clear the accessed bit without taking
> > kvm->mmu_lock. Complex cases fall back to the existing slow path
> > where kvm->mmu_lock is then taken.
> >
> > Signed-off-by: Yu Zhao 
> > ---
> >  arch/powerpc/include/asm/kvm_host.h|  8 
> >  arch/powerpc/include/asm/kvm_ppc.h |  1 +
> >  arch/powerpc/kvm/book3s.c  |  6 +++
> >  arch/powerpc/kvm/book3s.h  |  1 +
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++
> >  arch/powerpc/kvm/book3s_hv.c   |  5 +++
> >  6 files changed, 80 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_host.h 
> > b/arch/powerpc/include/asm/kvm_host.h
> > index 14ee0dece853..75c260ea8a9e 100644
> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu 
> > *vcpu, int cpu) {}
> >  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> >  static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> >
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
> > +cpu_has_feature(CPU_FTR_HVMODE) && 
> > cpu_has_feature(CPU_FTR_ARCH_300) &&
> > +radix_enabled();
>
> This could probably be radix_enabled() && !kvmhv_on_pseries().

Will do. (I used !kvmhv_on_pseries() in v1 but had second thoughts on
moving kvmhv_on_pseries() into this file.)
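
For reference, a sketch of the revised predicate being discussed (assuming
kvmhv_on_pseries() is usable here, which was the open question above; not the
final patch):

    #define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
    static inline bool kvm_arch_has_test_clear_young(void)
    {
            return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
                   radix_enabled() && !kvmhv_on_pseries();
    }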

> Although unclear why not nested hypervisor... I'd have to think about it a bit
> more. Do you have any idea what might go wrong, or just didn't have the
> time to consider it? (Not just powerpc nested but any others).

Yes, this series excludes nested KVM support on all architectures. The
common reason for such a decision on powerpc and x86 (aarch64 doesn't
support nested yet) is that it's quite challenging to make the rmap, a
complex data structure that maps one PFN to multiple GFNs, lockless.
(See kvmhv_insert_nest_rmap().) It might be possible; however, the
potential ROI would be in question.

> > +}
> > +
> >  #endif /* __POWERPC_KVM_HOST_H__ */
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> > b/arch/powerpc/include/asm/kvm_ppc.h
> > index 79a9c0bb8bba..ff1af6a7b44f 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -287,6 +287,7 @@ struct kvmppc_ops {
> >   bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
> >   bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> >   bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > + bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range 
> > *range);
> >   bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> >   void (*free_memslot)(struct kvm_memory_slot *slot);
> >   int (*init_vm)(struct kvm *kvm);
> > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> > index 686d8d9eda3e..37bf40b0c4ff 100644
> > --- a/arch/powerpc/kvm/book3s.c
> > +++ b/arch/powerpc/kvm/book3s.c
> > @@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
> > kvm_gfn_range *range)
> >   return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
> >  }
> >
> > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range 
> > *range)
> > +{
> > + return !kvm->arch.kvm_ops->test_clear_young ||
> > +kvm->arch.kvm_ops->test_clear_young(kvm, range);
> > +}
> > +
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >   return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> > diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> > index 58391b4b32ed..fa2659e21ccc 100644
> > --- a/arch/powerpc/kvm/book3s.h
> > +++ b/arch/powerpc/kvm/book3s.h
> > @@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
> >  extern bool kvm_unmap_gfn_range_hv(struct kvm *

Re: [6.4.0-rc7-next-20230620] Boot failure on IBM Power LPAR

2023-06-20 Thread Yu Zhao
On Tue, Jun 20, 2023 at 05:41:57PM +0530, Sachin Sant wrote:
> 6.4.0-rc7-next-20230620 fails to boot on IBM Power LPAR with following

Sorry for hijacking this thread -- I've been seeing another crash on
NV since -rc1 but I haven't had the time to bisect. Just FYI.

[0.814500] BUG: Unable to handle kernel data access on read at 
0xc00a03f9
[0.814814] Faulting instruction address: 0xc0c77324
[0.814988] Oops: Kernel access of bad area, sig: 11 [#1]
[0.815185] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[0.815487] Modules linked in:
[0.815653] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 6.4.0-rc7 #1
[0.815980] Hardware name: ZAIUS_FX_10  POWER9 (raw) 0x4e1202 
opal:custom PowerNV
[0.816293] NIP:  c0c77324 LR: c0c7c2c8 CTR: c0c77270
[0.816525] REGS: c0002a416de0 TRAP: 0300   Not tainted  (6.4.0-rc7)
[0.816778] MSR:  90009033   CR: 24008280  
XER: 2004
[0.817113] CFAR: c0c772c8 DAR: c00a03f9 DSISR: 4000 
IRQMASK: 1
[0.817113] GPR00: c0c7c2c8 c0002a417080 c1deea00 
03f9
[0.817113] GPR04: 0001  c27f3b28 
c16e0610
[0.817113] GPR08:  c2b9db10 c2b903e0 
84008282
[0.817113] GPR12:  c03dba00 c00128c8 

[0.817113] GPR16:    
0007
[0.817113] GPR20: c27903f0 c20034a4 c29db000 

[0.817113] GPR24: c29dad70 c29dad50  
0001
[0.817113] GPR28: c2d57378  c2d47378 
c00a03f9
[0.820185] NIP [c0c77324] io_serial_in+0xb4/0x130
[0.820383] LR [c0c7c2c8] serial8250_config_port+0x4b8/0x1680
[0.820610] Call Trace:
[0.820733] [c0002a417080] [c0c768c8] 
serial8250_request_std_resource+0x88/0x200 (unreliable)
[0.821221] [c0002a4170c0] [c0c7be50] 
serial8250_config_port+0x40/0x1680
[0.821623] [c0002a417190] [c0c733ec] 
univ8250_config_port+0x11c/0x1e0
[0.821956] [c0002a4171f0] [c0c71824] 
uart_add_one_port+0x244/0x750
[0.822244] [c0002a417310] [c0c73958] 
serial8250_register_8250_port+0x3b8/0x780
[0.822504] [c0002a4173c0] [c0c7411c] 
serial8250_probe+0x14c/0x1e0
[0.822833] [c0002a417760] [c0cd87e8] platform_probe+0x98/0x1b0
[0.823157] [c0002a4177e0] [c0cd2a50] really_probe+0x130/0x5b0
[0.823517] [c0002a417870] [c0cd2f94] 
__driver_probe_device+0xc4/0x240
[0.823827] [c0002a4178f0] [c0cd3164] 
driver_probe_device+0x54/0x180
[0.824096] [c0002a417930] [c0cd3618] __driver_attach+0x168/0x300
[0.824330] [c0002a4179b0] [c0ccf468] bus_for_each_dev+0xa8/0x130
[0.824650] [c0002a417a10] [c0cd1ef4] driver_attach+0x34/0x50
[0.825094] [c0002a417a30] [c0cd112c] bus_add_driver+0x16c/0x310
[0.825445] [c0002a417ac0] [c0cd55d4] driver_register+0xa4/0x1b0
[0.825787] [c0002a417b30] [c0cd7548] 
__platform_driver_register+0x38/0x50
[0.826037] [c0002a417b50] [c206be38] serial8250_init+0x1f8/0x270
[0.826267] [c0002a417c00] [c0012260] do_one_initcall+0x60/0x300
[0.826529] [c0002a417ce0] [c20052c4] 
kernel_init_freeable+0x3c0/0x484
[0.826883] [c0002a417de0] [c00128f4] kernel_init+0x34/0x1e0
[0.827187] [c0002a417e50] [c000d014] 
ret_from_kernel_user_thread+0x14/0x1c
[0.827595] --- interrupt: 0 at 0x0
[0.827722] NIP:   LR:  CTR: 
[0.827951] REGS: c0002a417e80 TRAP:    Not tainted  (6.4.0-rc7)
[0.828162] MSR:   <>  CR:   XER: 
[0.828394] CFAR:  IRQMASK: 0
[0.828394] GPR00:    

[0.828394] GPR04:    

[0.828394] GPR08:    

[0.828394] GPR12:    

[0.828394] GPR16:    

[0.828394] GPR20:    

[0.828394] GPR24:    

[0.828394] GPR28:    

[0.831329] NIP [] 0x0
[0.831445] LR [] 0x0
[0.831554] --- interrupt: 0
[0.831664] Code: 7bc30020 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 6000 
6042 3d2200db 3929f110 

Re: [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe

2023-06-20 Thread Yu Zhao
On Tue, Jun 20, 2023 at 12:33 AM Nicholas Piggin  wrote:
>
> On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > KVM page tables are currently not RCU safe against remapping, i.e.,
> > kvmppc_unmap_free_pmd_entry_table() et al. The previous
>
> Minor nit but the "page table" is not RCU-safe against something. It
> is RCU-freed, and therefore some algorithm that accesses it can have
> the existence guarantee provided by RCU (usually there still needs
> to be more to it).
>
> > mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
> > that operation.
> >
> > However, the new mmu_notifier_ops member test_clear_young() provides
> > a fast path that does not take kvm->mmu_lock. To implement
> > kvm_arch_test_clear_young() for that path, orphan page tables need to
> > be freed by RCU.
>
> Short version: clear the referenced bit using RCU instead of MMU lock
> to protect against page table freeing, and there is no problem with
> clearing the bit in a table that has been freed.
>
> Seems reasonable.

Thanks. All above points taken.

> > Unmapping, specifically kvm_unmap_radix(), does not free page tables,
> > hence not a concern.
>
> Not sure if you really need to make the distinction about why the page
> table is freed, we might free them via unmapping. The point is just
> anything that frees them while there can be concurrent access, right?

Correct.

> > Signed-off-by: Yu Zhao 
> > ---
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
> > b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > index 461307b89c3a..3b65b3b11041 100644
> > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > @@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
> >  {
> >   unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
> >
> > - kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
> > + kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
> > +   SLAB_TYPESAFE_BY_RCU, pte_ctor);
> >   if (!kvm_pte_cache)
> >   return -ENOMEM;
> >
> >   size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
> >
> > - kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
> > + kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
> > +   SLAB_TYPESAFE_BY_RCU, pmd_ctor);
> >   if (!kvm_pmd_cache) {
> >   kmem_cache_destroy(kvm_pte_cache);
> >   return -ENOMEM;
>
> KVM PPC HV radix PUD level page tables use the arch/powerpc allocators
> (for some reason), which are not RCU freed. I think you need them too?

We don't. The use of the arch/powerpc allocator for PUD tables seems
appropriate to me because, unlike PMD/PTE tables, we never free PUD
tables during the lifetime of a VM:
* We don't free PUD/PMD/PTE tables when they become empty, i.e., not
mapping any pages but still attached. (We could in theory, as
x86/aarch64 do.)
* We have to free PMD/PTE tables when we replace them with 1GB/2MB
pages. (Otherwise we'd lose track of detached tables.) And we
currently don't support huge pages at P4D level, so we never detach
and free PUD tables.
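
For readers less familiar with SLAB_TYPESAFE_BY_RCU, a rough sketch of the
guarantee the patch relies on (a generic pattern with hypothetical names, not
the actual KVM walker):

    /*
     * With SLAB_TYPESAFE_BY_RCU, a freed object may be reused immediately for
     * another object of the same type, but its memory only goes back to the
     * page allocator after an RCU grace period. A lockless walker therefore
     * keeps a valid mapping inside an RCU read-side critical section and only
     * has to tolerate the object being recycled under it.
     */
    rcu_read_lock();
    ptep = READ_ONCE(*slot);            /* may point to a table being freed */
    if (ptep)
            clear_accessed_bit(ptep);   /* hypothetical, recycling-tolerant */
    rcu_read_unlock();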


Re: [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit

2023-06-19 Thread Yu Zhao
On Fri, Jun 9, 2023 at 3:08 AM Paolo Bonzini  wrote:
>
> On 5/27/23 01:44, Yu Zhao wrote:
> > TLDR
> > 
> > This patchset adds a fast path to clear the accessed bit without
> > taking kvm->mmu_lock. It can significantly improve the performance of
> > guests when the host is under heavy memory pressure.
> >
> > ChromeOS has been using a similar approach [1] since mid 2021 and it
> > was proven successful on tens of millions devices.
> >
> > This v2 addressed previous requests [2] on refactoring code, removing
> > inaccurate/redundant texts, etc.
> >
> > [1]https://crrev.com/c/2987928
> > [2]https://lore.kernel.org/r/20230217041230.2417228-1-yuz...@google.com/
>
>  From the KVM point of view the patches look good (though I wouldn't
> mind if Nicholas took a look at the ppc part).  Jason's comment on the
> MMU notifier side are promising as well.  Can you send v3 with Oliver's
> comments addressed?

Thanks. I'll address all the comments in v3 and post it asap.

Meanwhile, some updates on the recent progress from my side:
1. I've asked some downstream kernels to pick up v2 for testing, the
Archlinux Zen kernel did. I don't really expect its enthusiastic
testers to find this series relevant to their use cases. But who
knows.
2. I've also asked openbenchmarking.org to run their popular highmem
benchmark suites with v2. Hopefully they'll have some independent
results soon.
3. I've backported v2 to v5.15 and v6.1 and started an A/B experiment
involving ~1 million devices, as I mentioned in another email in this
thread. I should have some results to share when posting v3.


Re: [PATCH v2 07/23 replacement] mips: add pte_unmap() to balance pte_offset_map()

2023-06-18 Thread Yu Zhao
On Fri, Jun 16, 2023 at 9:54 PM Yu Zhao  wrote:
>
> On Thu, Jun 15, 2023 at 04:02:43PM -0700, Hugh Dickins wrote:
> > To keep balance in future, __update_tlb() remember to pte_unmap() after
> > pte_offset_map().  This is an odd case, since the caller has already done
> > pte_offset_map_lock(), then mips forgets the address and recalculates it;
> > but my two naive attempts to clean that up did more harm than good.
> >
> > Tested-by: Nathan Chancellor 
> > Signed-off-by: Hugh Dickins 
>
> FWIW: Tested-by: Yu Zhao 
>
> There is another problem, likely caused by khugepaged, that has happened
> multiple times. But I don't think it's related to your series; just FYI.
>
>   Got mcheck at 81134ef0
>   CPU: 3 PID: 36 Comm: khugepaged Not tainted 
> 6.4.0-rc6-00049-g62d8779610bb-dirty #1

...

>   Kernel panic - not syncing: Caught Machine Check exception - caused by 
> multiple matching entries in the TLB.

In case anyone plans to try to fix this: the problem goes back to at
least 5.15 stable. My educated guess is that nobody complained about
it because all the testing is done in QEMU, which does NOT detect
conflicting TLBs. This means a fix would need to be verified on a real
piece of hardware or on an updated QEMU.

In target/mips/tcg/sysemu/tlb_helper.c:

static void r4k_fill_tlb(CPUMIPSState *env, int idx)
{
    r4k_tlb_t *tlb;
    uint64_t mask = env->CP0_PageMask >> (TARGET_PAGE_BITS + 1);

    /* XXX: detect conflicting TLBs and raise a MCHECK exception when needed */
    ...
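
For anyone attempting the QEMU side, here is a minimal standalone
sketch of the missing overlap check; the struct fields and the
single-function shape are assumptions for illustration, not QEMU's
actual data structures, and ASID/global matching is elided. Two
entries conflict when their virtual ranges overlap at the granularity
of the coarser page mask, which is what the Octeon hardware reports as
a machine check.

#include <stdbool.h>
#include <stdint.h>

struct tlb_entry {
    uint64_t vpn2;      /* EntryHi VPN2 value (assumed field) */
    uint64_t page_mask; /* CP0_PageMask "don't care" bits (assumed field) */
};

static bool tlb_entries_overlap(const struct tlb_entry *a,
                                const struct tlb_entry *b)
{
    /* Compare at the granularity of the larger of the two masks. */
    uint64_t dont_care = a->page_mask | b->page_mask;

    return (a->vpn2 & ~dont_care) == (b->vpn2 & ~dont_care);
}

On a fill, the helper would be run against every other valid entry,
raising the machine-check exception on a hit instead of silently
installing the conflicting entry.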


Re: kvm/arm64: Spark benchmark

2023-06-18 Thread Yu Zhao
On Fri, Jun 9, 2023 at 7:04 AM Marc Zyngier  wrote:
>
> On Fri, 09 Jun 2023 01:59:35 +0100,
> Yu Zhao  wrote:
> >
> > TLDR
> > 
> > Apache Spark spent 12% less time sorting four billion random integers 
> > twenty times (in ~4 hours) after this patchset [1].
>
> Why are the 3 architectures you have considered being evaluated with 3
> different benchmarks?

I was hoping that people with a special interest in a particular
architecture might try to reproduce the benchmarks I didn't report
(but did cover) and see what happens.

> I am not suspecting you to have cherry-picked
> the best results

I'm generally very conservative when reporting *synthetic* results.
For example, the same memcached benchmark used on powerpc yielded a >50%
improvement on aarch64, because the default Ubuntu Kconfig uses a 64KB
base page size for powerpc but 4KB for aarch64. (Before the series, the
reclaim (swap) path takes kvm->mmu_lock for *write* O(nr of all pages
considered) times; after the series, that becomes O(nr of pages actually
swapped), which is <10% of the former given how the benchmark was set up.)

          Ops/sec    Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency

Before    639511.40  0.09940       0.04700      0.27100      22.52700
After     974184.60  0.06471       0.04700      0.15900      3.75900

> but I'd really like to see a variety of benchmarks
> that exercise this stuff differently.

I'd be happy to try other synthetic workloads that people think are
relatively representative. Also, I've backported the series and
started an A/B experiment involving ~1 million devices (real-world
workloads). We should have the preliminary results by the time I post
the next version.


Re: kvm/x86: multichase benchmark

2023-06-18 Thread Yu Zhao
On Thu, Jun 8, 2023 at 6:59 PM Yu Zhao  wrote:
>
> TLDR
> 
> Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after 
> this patchset [1].
>
> Hardware
> 
> HOST $ lscpu
> Architecture:x86_64
>   CPU op-mode(s):32-bit, 64-bit
>   Address sizes: 43 bits physical, 48 bits virtual
>   Byte Order:Little Endian
> CPU(s):  128
>   On-line CPU(s) list:   0-127
> Vendor ID:   AuthenticAMD
>   Model name:AMD Ryzen Threadripper PRO 3995WX 64-Cores
> CPU family:  23
> Model:   49
> Thread(s) per core:  2
> Core(s) per socket:  64
> Socket(s):   1
> Stepping:0
> Frequency boost: disabled
> CPU max MHz: 4308.3979
> CPU min MHz: 2200.
> BogoMIPS:5390.20
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> mca cmov pat pse36 clflush mmx fxsr sse sse2
>  ...
> Virtualization features:
>   Virtualization:AMD-V
> Caches (sum of all):
>   L1d:   2 MiB (64 instances)
>   L1i:   2 MiB (64 instances)
>   L2:32 MiB (64 instances)
>   L3:256 MiB (16 instances)
> NUMA:
>   NUMA node(s):  1
>   NUMA node0 CPU(s): 0-127
> Vulnerabilities:
>   Itlb multihit: Not affected
>   L1tf:  Not affected
>   Mds:   Not affected
>   Meltdown:  Not affected
>   Mmio stale data:   Not affected
>   Retbleed:  Mitigation; untrained return thunk; SMT enabled with 
> STIBP protection
>   Spec store bypass: Mitigation; Speculative Store Bypass disabled via 
> prctl
>   Spectre v1:Mitigation; usercopy/swapgs barriers and __user 
> pointer sanitization
>   Spectre v2:Mitigation; Retpolines, IBPB conditional, STIBP 
> always-on, RSB filling, PBRSB-eIBRS Not affected
>   Srbds: Not affected
>   Tsx async abort:   Not affected
>
> HOST $ numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0-127
> node 0 size: 257542 MB
> node 0 free: 224855 MB
> node distances:
> node   0
>   0:  10
>
> HOST $ cat /sys/class/nvme/nvme0/model
> INTEL SSDPF21Q800GB
>
> HOST $ cat /sys/class/nvme/nvme0/numa_node
> 0
>
> Software
> 
> HOST $ cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=22.04
> DISTRIB_CODENAME=jammy
> DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
>
> HOST $ uname -a
> Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun  7 22:17:47 UTC 2023 
> x86_64 x86_64 x86_64 GNU/Linux
>
> HOST $ cat /proc/swaps
> Filename  Type Size UsedPriority
> /dev/nvme0n1p2partition4668383560   -2
>
> HOST $ cat /sys/kernel/mm/lru_gen/enabled
> 0x000f
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer defer+madvise madvise [never]
>
> Procedure
> =
> HOST $ git clone https://github.com/google/multichase
>
> HOST $ 
> HOST $ 
>
> HOST $ cp multichase/multichase ./initrd/bin/
> HOST $ sed -i \
> "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \

I was reminded that I missed one parameter above, i.e.,

"/^maybe_break top$/i multichase -N -t 2 -m 4g -n 28800; poweroff" \
 ^^

> ./initrd/init
>
> HOST $ 
>
> HOST $ cat run_microvms.sh
> memcgs=64
>
> run() {
> path=/sys/fs/cgroup/memcg$1
>
> mkdir $path
> echo $BASHPID >$path/cgroup.procs

And one line here:

echo 4000m >$path/memory.min # or the largest size that doesn't cause OOM kills

> qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
> -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
> -append "console=ttyS0 loglevel=0"
> }
>
> for ((memcg = 0; memcg < $memcgs; memcg++)); do
> run $memcg &
> done
>
> wait
>
> Results
> ===
>  Before [1]AfterChange
> --
> Total samples6824  7237 +6%
>
> Notes
> =
> [1] "mm: rmap: Don't flush TLB after checking PTE young for page
> reference" was included so that the comparison is apples to
> Apples.
> https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/


Re: [PATCH v2 07/23 replacement] mips: add pte_unmap() to balance pte_offset_map()

2023-06-16 Thread Yu Zhao
On Thu, Jun 15, 2023 at 04:02:43PM -0700, Hugh Dickins wrote:
> To keep balance in future, __update_tlb() remember to pte_unmap() after
> pte_offset_map().  This is an odd case, since the caller has already done
> pte_offset_map_lock(), then mips forgets the address and recalculates it;
> but my two naive attempts to clean that up did more harm than good.
> 
> Tested-by: Nathan Chancellor 
> Signed-off-by: Hugh Dickins 

FWIW: Tested-by: Yu Zhao 

There is another problem, likely caused by khugepaged, that has happened
multiple times. But I don't think it's related to your series; just FYI.

  Got mcheck at 81134ef0
  CPU: 3 PID: 36 Comm: khugepaged Not tainted 
6.4.0-rc6-00049-g62d8779610bb-dirty #1
  $ 0   :  0014 411ac004 4000
  $ 4   : c000 0045 00011a80045b 00011a80045b
  $ 8   : 800080188000 81b526c0 0200 
  $12   : 0028 81910cb4  0207
  $16   : 00aaab80 837ee990 81b50200 85066ae0
  $20   : 0001 8000 81c1 00aaab80
  $24   : 0002 812b75f8
  $28   : 8231 82313b00 81b5 81134d88
  Hi: 017a
  Lo: 
  epc   : 81134ef0 __update_tlb+0x260/0x2a0
  ra: 81134d88 __update_tlb+0xf8/0x2a0
  Status: 14309ce2  KX SX UX KERNEL EXL
  Cause : 00800060 (ExcCode 18)
  PrId  : 000d9602 (Cavium Octeon III)
  CPU: 3 PID: 36 Comm: khugepaged Not tainted 
6.4.0-rc6-00049-g62d8779610bb-dirty #1
  Stack : 0001  0008 82313768
  82313768 823138f8  
  a6c8cd76e1667e00 81db4f28 0001 30302d3663722d30
  643236672d393430 0010 81910cc0 
  81d96bcc   81a68ed0
  81b5 0001 8000 81c1
  00aaab80 0002 815b78c0 a184e710
  00c0 8231 82313760 81b5
  8111c9cc   
    8111c9ec 
  ...
  Call Trace:
  [] show_stack+0x64/0x158
  [] dump_stack_lvl+0x5c/0x7c
  [] do_mcheck+0x2c/0x98
  [] handle_mcheck_int+0x38/0x50
  
  Index: 8000
  PageMask : 1fe000
  EntryHi  : 00aaab8000bd
  EntryLo0 : 411a8004
  EntryLo1 : 411ac004
  Wired: 0
  PageGrain: e800
  
  Index:  2 pgmask=4kb va=c0fe4000 asid=b9
[ri=0 xi=0 pa=22a7000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=22af000 
c=0 d=1 v=1 g=1]
  Index:  3 pgmask=4kb va=c0fe8000 asid=b9
[ri=0 xi=0 pa=238 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2381000 
c=0 d=1 v=1 g=1]
  Index:  4 pgmask=4kb va=c0fea000 asid=b9
[ri=0 xi=0 pa=23e9000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=23ea000 
c=0 d=1 v=1 g=1]
  Index:  5 pgmask=4kb va=c0fee000 asid=b9
[ri=0 xi=0 pa=2881000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2882000 
c=0 d=1 v=1 g=1]
  Index:  6 pgmask=4kb va=c0fefffb asid=b9
[ri=0 xi=0 pa=2cc2000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2cc3000 
c=0 d=1 v=1 g=1]
  Index:  7 pgmask=4kb va=c0fec000 asid=b9
[ri=0 xi=0 pa=23eb000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=288 
c=0 d=1 v=1 g=1]
  Index:  8 pgmask=4kb va=c0fe6000 asid=b9
[ri=0 xi=0 pa=237e000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=237f000 
c=0 d=1 v=1 g=1]
  Index: 14 pgmask=4kb va=c0fefff62000 asid=8e
[ri=0 xi=0 pa=7477000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=745e000 
c=0 d=1 v=1 g=1]
  Index: 15 pgmask=4kb va=c0fefff52000 asid=8e
[ri=0 xi=0 pa=744c000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=616d000 
c=0 d=1 v=1 g=1]
  Index: 16 pgmask=4kb va=c0fefff42000 asid=8e
[ri=0 xi=0 pa=6334000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=616b000 
c=0 d=1 v=1 g=1]
  Index: 19 pgmask=4kb va=c0fefffb6000 asid=8e
[ri=0 xi=0 pa=505 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=5051000 
c=0 d=1 v=1 g=1]
  Index: 20 pgmask=4kb va=c0fefff72000 asid=b9
[ri=0 xi=0 pa=7504000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=7503000 
c=0 d=1 v=1 g=1]
  Index: 58 pgmask=4kb va=c0fefffaa000 asid=8e
[ri=0 xi=0 pa=5126000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=5127000 
c=0 d=1 v=1 g=1]
  Index: 59 pgmask=4kb va=c0fefffba000 asid=8e
[ri=0 xi=0 pa=5129000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=512a000 
c=0 d=1 v=1 g=1]
  Index: 79 pgmask=4kb va=c006 asid=8e
[ri=0 xi=0 pa=534b000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=62f9000 
c=0 d=1 v=1 g=1]
  Index: 80 pgmask=4kb va=c00

Re: [PATCH v2 07/23] mips: update_mmu_cache() can replace __update_tlb()

2023-06-15 Thread Yu Zhao
On Wed, Jun 14, 2023 at 04:17:58PM -0700, Nathan Chancellor wrote:
> Hi Hugh,
> 
> On Thu, Jun 08, 2023 at 12:17:24PM -0700, Hugh Dickins wrote:
> > Don't make update_mmu_cache() a wrapper around __update_tlb(): call it
> > directly, and use the ptep (or pmdp) provided by the caller, instead of
> > re-calling pte_offset_map() - which would raise a question of whether a
> > pte_unmap() is needed to balance it.
> > 
> > Check whether the "ptep" provided by the caller is actually the pmdp,
> > instead of testing pmd_huge(): or test pmd_huge() too and warn if it
> > disagrees?  This is "hazardous" territory: needs review and testing.
> > 
> > Signed-off-by: Hugh Dickins 
> > ---
> >  arch/mips/include/asm/pgtable.h | 15 +++
> >  arch/mips/mm/tlb-r3k.c  |  5 +++--
> >  arch/mips/mm/tlb-r4k.c  |  9 +++--
> >  3 files changed, 9 insertions(+), 20 deletions(-)
> > 
> 
> I just bisected a crash while powering down a MIPS machine in QEMU to
> this change as commit 8044511d3893 ("mips: update_mmu_cache() can
> replace __update_tlb()") in linux-next. Unfortunately, I can still
> reproduce it with the existing fix you have for this change on the
> mailing list, which is present in next-20230614.
> 
> I can reproduce it with the GCC 13.1.0 on kernel.org [1].
> 
>   $ make -skj"$(nproc)" ARCH=mips CROSS_COMPILE=mips-linux- mrproper 
> malta_defconfig vmlinux
> 
>   $ qemu-system-mipsel \
>   -display none \
>   -nodefaults \
>   -cpu 24Kf \
>   -machine malta \
>   -kernel vmlinux \
>   -initrd rootfs.cpio \
>   -m 512m \
>   -serial mon:stdio
>   ...
>   Linux version 6.4.0-rc6-next-20230614 (nathan@dev-arch.thelio-3990X) 
> (mips-linux-gcc (GCC) 13.1.0, GNU ld (GNU Binutils) 2.40) #1 SMP Wed Jun 14 
> 16:13:02 MST 2023
>   ...
>   Run /init as init process
>   process '/bin/busybox' started with executable stack
>   do_page_fault(): sending SIGSEGV to init for invalid read access from 
> 003c
>   epc = 77b893dc in ld-uClibc-1.0.39.so[77b84000+8000]
>   ra  = 77b8930c in ld-uClibc-1.0.39.so[77b84000+8000]
>   Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b
>   ---[ end Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x000b ]---
> 
> The rootfs is available at [2] if it is needed. I am more than happy to
> provide additional information or test patches if necessary.
> 
> [1]: https://mirrors.edge.kernel.org/pub/tools/crosstool/
> [2]: 
> https://github.com/ClangBuiltLinux/boot-utils/releases/download/20230609-194440/mipsel-rootfs.cpio.zst

Seeing this on real h/w as well (just to confirm).

Linux version 6.4.0-rc4-00437-g4bab5c42a698 (r...@yuzhao.bld.corp.google.com) 
(mips64el-linux-gnuabi64-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils 
for Debian) 2.40) #3 SMP PREEMPT Thu Jun 15 01:05:20 MDT 2023
Skipping L2 locking due to reduced L2 cache size
CVMSEG size: 2 cache lines (256 bytes)
printk: bootconsole [early0] enabled
CPU0 revision is: 000d9602 (Cavium Octeon III)
FPU revision is: 00739600
Kernel sections are not in the memory maps
Wasting 243712 bytes for tracking 4352 unused pages
Initrd not found or empty - disabling initrd
Using passed Device Tree.
software IO TLB: SWIOTLB bounce buffer size adjusted to 0MB
software IO TLB: area num 1.
software IO TLB: mapped [mem 0x0370d000-0x0374d000] (0MB)
Primary instruction cache 78kB, virtually tagged, 39 way, 16 sets, linesize 128 
bytes.
Primary data cache 32kB, 32-way, 8 sets, linesize 128 bytes.
Zone ranges:
  DMA32[mem 0x0110-0xefff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0110-0x03646fff]
  node   0: [mem 0x0370-0x0faf]
  node   0: [mem 0x2000-0x4ebf]
Initmem setup node 0 [mem 0x0110-0x4ebf]
On node 0, zone DMA32: 4352 pages in unavailable ranges
On node 0, zone DMA32: 185 pages in unavailable ranges
On node 0, zone DMA32: 1280 pages in unavailable ranges
On node 0, zone DMA32: 5120 pages in unavailable ranges
percpu: Embedded 15 pages/cpu s24368 r8192 d28880 u61440
pcpu-alloc: s24368 r8192 d28880 u61440 alloc=15*4096
pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 
Kernel command line:  loglevel=8 console=ttyS0,115200
printk: log_buf_len individual max cpu contribution: 4096 bytes
printk: log_buf_len total cpu_extra contributions: 12288 bytes
printk: log_buf_len min size: 16384 bytes
printk: log_buf_len: 32768 bytes
printk: early log buf free: 14184(86%)
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
Built 1 zonelists, mobility grouping on.  Total pages: 247772
mem auto-init: stack:all(zero), heap alloc:off, heap free:off
Memory: 950032K/1004828K available (8058K kernel code, 575K rwdata, 1880K 
rodata, 27488K init, 158K bss, 54796K reserved, 0K cma-reserved)
rcu: Preemptible 

Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

2023-06-08 Thread Yu Zhao
On Tue, Jun 6, 2023 at 2:34 AM Tzung-Bi Shih  wrote:
>
> On Fri, May 26, 2023 at 05:44:26PM -0600, Yu Zhao wrote:
> > +/*
> > + * Architectures that implement kvm_arch_test_clear_young() should override
> > + * kvm_arch_has_test_clear_young().
> > + *
> > + * kvm_arch_has_test_clear_young() is allowed to return false positive, 
> > i.e., it
> > + * can return true if kvm_arch_test_clear_young() is supported but 
> > disabled due
> > + * to some runtime constraint. In this case, kvm_arch_test_clear_young() 
> > should
>
> Is it a typo here?  
> s/kvm_arch_test_clear_young/kvm_arch_has_test_clear_young/.

Not a typo.


kvm/x86: multichase benchmark

2023-06-08 Thread Yu Zhao
TLDR

Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after 
this patchset [1].

Hardware

HOST $ lscpu
Architecture:x86_64
  CPU op-mode(s):32-bit, 64-bit
  Address sizes: 43 bits physical, 48 bits virtual
  Byte Order:Little Endian
CPU(s):  128
  On-line CPU(s) list:   0-127
Vendor ID:   AuthenticAMD
  Model name:AMD Ryzen Threadripper PRO 3995WX 64-Cores
CPU family:  23
Model:   49
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):   1
Stepping:0
Frequency boost: disabled
CPU max MHz: 4308.3979
CPU min MHz: 2200.
BogoMIPS:5390.20
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2
 ...
Virtualization features:
  Virtualization:AMD-V
Caches (sum of all):
  L1d:   2 MiB (64 instances)
  L1i:   2 MiB (64 instances)
  L2:32 MiB (64 instances)
  L3:256 MiB (16 instances)
NUMA:
  NUMA node(s):  1
  NUMA node0 CPU(s): 0-127
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf:  Not affected
  Mds:   Not affected
  Meltdown:  Not affected
  Mmio stale data:   Not affected
  Retbleed:  Mitigation; untrained return thunk; SMT enabled with 
STIBP protection
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:Mitigation; usercopy/swapgs barriers and __user 
pointer sanitization
  Spectre v2:Mitigation; Retpolines, IBPB conditional, STIBP 
always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds: Not affected
  Tsx async abort:   Not affected

HOST $ numactl -H
available: 1 nodes (0)
node 0 cpus: 0-127
node 0 size: 257542 MB
node 0 free: 224855 MB
node distances:
node   0
  0:  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software

HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun  7 22:17:47 UTC 2023 x86_64 
x86_64 x86_64 GNU/Linux

HOST $ cat /proc/swaps
Filename        Type       Size       Used  Priority
/dev/nvme0n1p2  partition  466838356  0     -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000f

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

Procedure
=
HOST $ git clone https://github.com/google/multichase

HOST $ 
HOST $ 

HOST $ cp multichase/multichase ./initrd/bin/
HOST $ sed -i \
"/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \
./initrd/init

HOST $ 

HOST $ cat run_microvms.sh
memcgs=64

run() {
path=/sys/fs/cgroup/memcg$1

mkdir $path
echo $BASHPID >$path/cgroup.procs

qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
-nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
-append "console=ttyS0 loglevel=0"
}

for ((memcg = 0; memcg < $memcgs; memcg++)); do
run $memcg &
done

wait

Results
===
                Before [1]  After  Change
------------------------------------------
Total samples   6824        7237   +6%

Notes
=
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/


kvm/arm64: Spark benchmark

2023-06-08 Thread Yu Zhao
TLDR

Apache Spark spent 12% less time sorting four billion random integers twenty 
times (in ~4 hours) after this patchset [1].

Hardware

HOST $ lscpu
Architecture:   aarch64
  CPU op-mode(s):   32-bit, 64-bit
  Byte Order:   Little Endian
CPU(s): 128
  On-line CPU(s) list:  0-127
Vendor ID:  ARM
  Model name:   Neoverse-N1
Model:  1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s):  2
Stepping:   r3p1
Frequency boost:disabled
CPU max MHz:2800.
CPU min MHz:1000.
BogoMIPS:   50.00
Flags:  fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp 
asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
  L1d:  8 MiB (128 instances)
  L1i:  8 MiB (128 instances)
  L2:   128 MiB (128 instances)
NUMA:
  NUMA node(s): 2
  NUMA node0 CPU(s):0-63
  NUMA node1 CPU(s):64-127
Vulnerabilities:
  Itlb multihit:Not affected
  L1tf: Not affected
  Mds:  Not affected
  Meltdown: Not affected
  Mmio stale data:  Not affected
  Retbleed: Not affected
  Spec store bypass:Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:   Mitigation; __user pointer sanitization
  Spectre v2:   Mitigation; CSV2, BHB
  Srbds:Not affected
  Tsx async abort:  Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-63
node 0 size: 257730 MB
node 0 free: 1447 MB
node 1 cpus: 64-127
node 1 size: 256877 MB
node 1 free: 256093 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software

HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux arm 6.4.0-rc4 #1 SMP Sat Jun  3 05:30:06 UTC 2023 aarch64 aarch64 aarch64 
GNU/Linux

HOST $ cat /proc/swaps
Filename        Type       Size       Used       Priority
/dev/nvme0n1p2  partition  466838356  116922112  -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000b

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-aarch64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

GUEST $ java --version
openjdk 17.0.7 2023-04-18
OpenJDK Runtime Environment (build 17.0.7+7-Ubuntu-0ubuntu122.04.2)
OpenJDK 64-Bit Server VM (build 17.0.7+7-Ubuntu-0ubuntu122.04.2, mixed mode, 
sharing)

GUEST $ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 17.0.7
Branch HEAD
Compiled by user xinrong.meng on 2023-04-07T02:18:01Z
Revision 87a5442f7ed96b11051d8a9333476d080054e5a0
Url https://github.com/apache/spark
Type --help for more information.

Procedure
=
HOST $ sudo numactl -N 0 -m 0 qemu-system-aarch64 \
-M virt,accel=kvm -cpu host -smp 64 -m 300g -nographic -nic user \
-bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd \
-drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer

object GenData {
    def main(args: Array[String]): Unit = {
        val file = new File("/dev/shm/dataset.txt")
        val writer = new BufferedWriter(new FileWriter(file))
        val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
        for (_ <- 0 until 400000000) {
            for (i <- 0 until 10) {
                buf.update(i, scala.util.Random.nextLong())
            }
            writer.write(s"${buf.mkString(",")}\n")
        }
        writer.close()
    }
}
GenData.main(Array())

GUEST $ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

object SparkSort {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().getOrCreate()
        val file = sc.textFile("/dev/shm/dataset.txt", 64)
        val results = file.flatMap(_.split(",")).map(x => (x, 1)).sortByKey().takeOrdered(10)
        results.foreach(println)
        spark.stop()
    }
}
SparkSort.main(Array())

GUEST $ cat run_spark.sh
export SPARK_LOCAL_DIRS=/dev/shm/

spark-shell 

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/


kvm/powerpc: memcached benchmark

2023-06-08 Thread Yu Zhao
TLDR

Memcached achieved 10% more operations per second (in ~4 hours) after this 
patchset [1].

Hardware

HOST $ lscpu
Architecture:  ppc64le
  Byte Order:  Little Endian
CPU(s):184
  On-line CPU(s) list: 0-183
Model name:POWER9 (raw), altivec supported
  Model:   2.2 (pvr 004e 1202)
  Thread(s) per core:  4
  Core(s) per socket:  23
  Socket(s):   2
  CPU max MHz: 3000.
  CPU min MHz: 2300.
Caches (sum of all):
  L1d: 1.4 MiB (46 instances)
  L1i: 1.4 MiB (46 instances)
  L2:  12 MiB (24 instances)
  L3:  240 MiB (24 instances)
NUMA:
  NUMA node(s):2
  NUMA node0 CPU(s):   0-91
  NUMA node1 CPU(s):   92-183
Vulnerabilities:
  Itlb multihit:   Not affected
  L1tf:Mitigation; RFI Flush, L1D private per thread
  Mds: Not affected
  Meltdown:Mitigation; RFI Flush, L1D private per thread
  Mmio stale data: Not affected
  Retbleed:Not affected
  Spec store bypass:   Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:  Mitigation; __user pointer sanitization, ori31 
speculation barrier enabled
  Spectre v2:  Mitigation; Indirect branch serialisation (kernel only), 
Indirect branch cache disabled, Software link stack flush
  Srbds:   Not affected
  Tsx async abort: Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-91
node 0 size: 261659 MB
node 0 free: 259152 MB
node 1 cpus: 92-183
node 1 size: 261713 MB
node 1 free: 261076 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software

HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

HOST $ uname -a
Linux ppc 6.3.0 #1 SMP Sun Jun  4 18:26:37 UTC 2023 ppc64le ppc64le ppc64le 
GNU/Linux

HOST $ cat /proc/swaps
Filename        Type       Size       Used  Priority
/dev/nvme0n1p2  partition  466838272  0     -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x0009

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-ppc64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

GUEST $ cat /etc/memcached.conf
...
-t 92
-m 262144
-B binary
-s /var/run/memcached/memcached.sock
-a 0766

GUEST $ memtier_benchmark -v
memtier_benchmark 1.4.0
Copyright (C) 2011-2022 Redis Ltd.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License .
There is NO WARRANTY, to the extent permitted by law.

Procedure
=
HOST $ sudo numactl -N 0 -m 0 qemu-system-ppc64 \
-M pseries,accel=kvm,kvm-type=HV -cpu host -smp 92 -m 270g \
-nographic -nic user \
-drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 1:0 \
--key-minimum=1 --key-maximum=12000 --key-pattern=P:P \
-n allkeys -d 2000

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 0:1 \
--key-minimum=1 --key-maximum=12000 --key-pattern=R:R \
-n allkeys --randomize --distinct-client-seed

Results
===
              Before [1]  After      Change
--------------------------------------------
Ops/sec       721586.10   800210.12  +10%
Avg. Latency  0.12546     0.11260    -10%
p50 Latency   0.08700     0.08700    N/C
p99 Latency   0.28700     0.24700    -13%

Notes
=
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/


Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-31 Thread Yu Zhao
On Wed, May 31, 2023 at 5:23 PM Oliver Upton  wrote:
>
> On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> > On Wed, May 31, 2023 at 1:28 PM Oliver Upton  wrote:
> > > On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > > > On Tue, May 30, 2023 at 1:37 PM Oliver Upton  
> > > > wrote:
> > > > > As it is currently implemented, yes. But, there's potential to 
> > > > > fast-path
> > > > > the implementation by checking page_count() before starting the walk.
> > > >
> > > > Do you mind posting another patch? I'd be happy to ack it, as well as
> > > > the one you suggested above.
> > >
> > > I'd rather not take such a patch independent of the test_clear_young
> > > series if you're OK with that. Do you mind implementing something
> > > similar to the above patch w/ the proposed optimization if you need it?
> >
> > No worries. I can take the above together with the following, which
> > would form a new series with its own merits, since apparently you
> > think the !AF case is important.
>
> Sorry if my suggestion was unclear.
>
> I thought we were talking about ->free_removed_table() being called from
> the stage-2 unmap path

Yes, we were; or more generally, about how to make KVM page tables RCU safe for ARM.

So I'm thinking about taking 1) your patch above, 2) what I just
suggested, and 3) what you suggested below to form a mini series, which
could land independently and would make my job here easier.

> in which case we wind up unnecessarily visiting
> PTEs on a table known to be empty. You could fast-path that by only
> initiating a walk if  page_count() > 1:

Yes, this is what I meant.

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 95dae02ccc2e..766563dc465c 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct 
> kvm_pgtable_mm_ops *mm_ops, void *pg
> .end= kvm_granule_size(level),
> };
>
> -   WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
> +   if (mm_ops->page_count(pgtable) > 1)
> +   WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
>
> WARN_ON(mm_ops->page_count(pgtable) != 1);
> mm_ops->put_page(pgtable);
>
>
> A lock-free access fault walker is interesting, but in my testing it hasn't
> led to any significant improvements over acquiring the MMU lock for
> read. Because of that I hadn't bothered with posting the series upstream.

It's hard to measure but we have perf benchmarks on ChromeOS which should help.


Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-31 Thread Yu Zhao
On Wed, May 31, 2023 at 1:28 PM Oliver Upton  wrote:
>
> On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > On Tue, May 30, 2023 at 1:37 PM Oliver Upton  wrote:
> > >
> > > Hi Yu,
> > >
> > > On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > > > On Sat, May 27, 2023 at 12:08 PM Oliver Upton  
> > > > wrote:
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c 
> > > > > b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct 
> > > > > kvm_pgtable_visit_ctx *ctx,
> > > > >
> > > > > kvm_granule_size(ctx->level));
> > > > >
> > > > > if (childp)
> > > > > -   mm_ops->put_page(childp);
> > > > > +   mm_ops->free_removed_table(childp, ctx->level);
> > > >
> > > > Thanks, Oliver.
> > > >
> > > > A couple of things I haven't had the chance to verify -- I'm hoping
> > > > you could help clarify:
> > > > 1. For unmapping, with free_removed_table(), wouldn't we have to look
> > > > into the table we know it's empty unnecessarily?
> > >
> > > As it is currently implemented, yes. But, there's potential to fast-path
> > > the implementation by checking page_count() before starting the walk.
> >
> > Do you mind posting another patch? I'd be happy to ack it, as well as
> > the one you suggested above.
>
> I'd rather not take such a patch independent of the test_clear_young
> series if you're OK with that. Do you mind implementing something
> similar to the above patch w/ the proposed optimization if you need it?

No worries. I can take the above together with the following, which
would form a new series with its own merits, since apparently you
think the !AF case is important.

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 26a8d955b49c..6ce73ce9f146 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1453,10 +1453,10 @@ static void handle_access_fault(struct
kvm_vcpu *vcpu, phys_addr_t fault_ipa)

trace_kvm_access_fault(fault_ipa);

-   read_lock(&vcpu->kvm->mmu_lock);
+   rcu_read_lock();
mmu = vcpu->arch.hw_mmu;
pte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa);
-   read_unlock(&vcpu->kvm->mmu_lock);
+   rcu_read_unlock();

if (kvm_pte_valid(pte))
kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));


Re: [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()

2023-05-31 Thread Yu Zhao
On Wed, May 31, 2023 at 1:56 PM Oliver Upton  wrote:
>
> Hi Yu,
>
> On Fri, May 26, 2023 at 05:44:30PM -0600, Yu Zhao wrote:
> > Implement kvm_arch_test_clear_young() to support the fast path in
> > mmu_notifier_ops->test_clear_young().
> >
> > It focuses on a simple case, i.e., hardware sets the accessed bit in
> > KVM PTEs and VMs are not protected, where it can rely on RCU and
> > cmpxchg to safely clear the accessed bit without taking
> > kvm->mmu_lock. Complex cases fall back to the existing slow path
> > where kvm->mmu_lock is then taken.
> >
> > Signed-off-by: Yu Zhao 
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  6 ++
> >  arch/arm64/kvm/mmu.c  | 36 +++
> >  2 files changed, 42 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > b/arch/arm64/include/asm/kvm_host.h
> > index 7e7e19ef6993..da32b0890716 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { }
> >  void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> >  bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> >
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return cpu_has_hw_af() && !is_protected_kvm_enabled();
> > +}
>
> I would *strongly* suggest you consider supporting test_clear_young on
> systems that do software Access Flag management. FEAT_HAFDBS is an
> *optional* extension to the architecture, so we're going to support
> software AF management for a very long time in KVM. It is also a valid
> fallback option in the case of hardware errata which render HAFDBS
> broken.

Hi Oliver,

It's not about willingness but resources. Ideally we want to make
everything perfect, but in reality, we can only move forward one step
at a time.

If I looked at your request from ARM's POV, I would agree with you.
But my goal is to lay the foundation for all architectures that could
benefit, so I may not be able to cover a lot for each architecture.
Specifically, I don't have the bandwidth to test the !FEAT_HAFDBS case
for ARM.

So here are some options I could offer, ordered by my preferences:
1. We proceed as it is for now. I *will* find someone from my team (or
yours) to follow up -- this way we can make sure !FEAT_HAFDBS is well
tested.
2. I drop the cpu_has_hw_af() check above. Not that I think there is
much risk, I'm just trying to be cautious.
3. I drop the entire ARM support (and include the RISC-V support which
I previously deprioritized). We revisit after the test is done.

Sounds reasonable?

> So, we should expect (and support) systems of all shapes and sizes that
> do software AF. I'm sure we'll hear about more in the not-too-distant
> future...
>
> For future reference (even though I'm suggesting you support software
> AF), decisions such of these need an extremely verbose comment
> describing the rationale behind the decision.
>
> > +
> >  #endif /* __ARM64_KVM_HOST_H__ */
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index c3b3e2afe26f..26a8d955b49c 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
>
> Please do not implement page table walkers outside of hyp/pgtable.c
>
> > @@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
> > kvm_gfn_range *range)
> >  range->start << PAGE_SHIFT);
> >  }
> >
> > +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
> > +enum kvm_pgtable_walk_flags flags)
> > +{
> > + kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
> > +
> > + VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
>
> This sort of sanity checking is a bit excessive. Isn't there a risk of
> false negatives here too? IOW, if we tragically mess up RCU in the page
> table code, what's stopping a prematurely freed page from being
> allocated to another user?

Yes, but from my aforementioned POV (the breadth I'm focusing on),
this is a good practice. I can live without this assertion if you feel
strongly about it.

> > + if (!kvm_pte_valid(new))
> > + return 0;
> > +
> > + if (new == ctx->old)
> > + return 0;
> > +
> > + if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE))
> > + stage2_try_set_pte(ctx, new);
> > +
> > + return 0;
> > +}
> > +
> > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range 
> > *range)
> > +{
> > + u64 start = range->start * PAGE_SIZE;
> > + u64 end = range->end * PAGE_SIZE;
> > + struct kvm_pgtable_walker walker = {
> > + .cb = stage2_test_clear_young,
> > + .arg= range,
> > + .flags  = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
> > + };
> > +
> > + BUILD_BUG_ON(is_hyp_code());
>
> Delete this assertion.

Will do.


Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-30 Thread Yu Zhao
On Tue, May 30, 2023 at 1:37 PM Oliver Upton  wrote:
>
> Hi Yu,
>
> On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > On Sat, May 27, 2023 at 12:08 PM Oliver Upton  
> > wrote:
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct 
> > > kvm_pgtable_visit_ctx *ctx,
> > >
> > > kvm_granule_size(ctx->level));
> > >
> > > if (childp)
> > > -   mm_ops->put_page(childp);
> > > +   mm_ops->free_removed_table(childp, ctx->level);
> >
> > Thanks, Oliver.
> >
> > A couple of things I haven't had the chance to verify -- I'm hoping
> > you could help clarify:
> > 1. For unmapping, with free_removed_table(), wouldn't we have to look
> > into the table we know it's empty unnecessarily?
>
> As it is currently implemented, yes. But, there's potential to fast-path
> the implementation by checking page_count() before starting the walk.

Do you mind posting another patch? I'd be happy to ack it, as well as
the one you suggested above.

> > 2. For remapping and unmapping, how does free_removed_table() put the
> > final refcnt on the table passed in? (Previously we had
> > put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
> > have to do something equivalent with free_removed_table().)
>
> Heh, that's a bug, and an embarrassing one at that!
>
> Sent out a fix for that, since it would appear we leak memory on
> table->block transitions. PTAL if you have a chance.
>
> https://lore.kernel.org/all/20230530193213.1663411-1-oliver.up...@linux.dev/

Awesome.


Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-27 Thread Yu Zhao
On Sat, May 27, 2023 at 12:08 PM Oliver Upton  wrote:
>
> Yu,
>
> On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> > Stage2 page tables are currently not RCU safe against unmapping or VM
> > destruction. The previous mmu_notifier_ops members rely on
> > kvm->mmu_lock to synchronize with those operations.
> >
> > However, the new mmu_notifier_ops member test_clear_young() provides
> > a fast path that does not take kvm->mmu_lock. To implement
> > kvm_arch_test_clear_young() for that path, unmapped page tables need
> > to be freed by RCU and kvm_free_stage2_pgd() needs to be after
> > mmu_notifier_unregister().
> >
> > Remapping, specifically stage2_free_removed_table(), is already RCU
> > safe.
> >
> > Signed-off-by: Yu Zhao 
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h |  2 ++
> >  arch/arm64/kvm/arm.c |  1 +
> >  arch/arm64/kvm/hyp/pgtable.c |  8 ++--
> >  arch/arm64/kvm/mmu.c | 17 -
> >  4 files changed, 25 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
> > b/arch/arm64/include/asm/kvm_pgtable.h
> > index ff520598b62c..5cab52e3a35f 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 
> > level)
> >   * @put_page:Decrement the refcount on a page. 
> > When the
> >   *   refcount reaches 0 the page is automatically
> >   *   freed.
> > + * @put_page_rcu:RCU variant of the above.
>
> You don't need to add yet another hook to implement this. I was working
> on lock-free walks in a separate context and arrived at the following:
>
> commit f82d264a37745e07ee28e116c336f139f681fd7f
> Author: Oliver Upton 
> Date:   Mon May 1 08:53:37 2023 +
>
> KVM: arm64: Consistently use free_removed_table() for stage-2
>
> free_removed_table() is essential to the RCU-protected parallel walking
> scheme, as behind the scenes the cleanup is deferred until an RCU grace
> period. Nonetheless, the stage-2 unmap path calls put_page() directly,
> which leads to table memory being freed inline with the table walk.
>
> This is safe for the time being, as the stage-2 unmap walker is called
> while holding the write lock. A future change to KVM will further relax
> the locking mechanics around the stage-2 page tables to allow lock-free
> walkers protected only by RCU. As such, switch to the RCU-safe mechanism
> for freeing table memory.
>
> Signed-off-by: Oliver Upton 
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3d61bd3e591d..bfbebdcb4ef0 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct 
> kvm_pgtable_visit_ctx *ctx,
>kvm_granule_size(ctx->level));
>
> if (childp)
> -   mm_ops->put_page(childp);
> +   mm_ops->free_removed_table(childp, ctx->level);

Thanks, Oliver.

A couple of things I haven't had the chance to verify -- I'm hoping
you could help clarify:
1. For unmapping, with free_removed_table(), wouldn't we have to look
into a table we already know is empty, unnecessarily?
2. For remapping and unmapping, how does free_removed_table() put the
final refcnt on the table passed in? (Previously we had
put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
have to do something equivalent with free_removed_table().)


[PATCH mm-unstable v2 10/10] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

2023-05-26 Thread Yu Zhao
Use mmu_notifier_test_clear_young() to handle KVM PTEs in batches
when the fast path is supported. This reduces the contention on
kvm->mmu_lock when the host is under heavy memory pressure.

An existing selftest can quickly demonstrate the effectiveness of
this patch. On a generic workstation equipped with 128 CPUs and 256GB
DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250

  MGLRU run2
  --
  Before [1]~64s
  After ~51s

  kswapd (MGLRU before)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.71%  evict_folios
  97.29%  shrink_folio_list
  ==>>  13.05%  folio_referenced
  12.83%  rmap_walk_file
12.31%  folio_referenced_one
  7.90%  __mmu_notifier_clear_young
7.72%  kvm_mmu_notifier_clear_young
  7.34%  _raw_write_lock

  kswapd (MGLRU after)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.59%  evict_folios
  80.37%  shrink_folio_list
  ==>>  3.74%  folio_referenced
  3.59%  rmap_walk_file
3.19%  folio_referenced_one
  2.53%  lru_gen_look_around
1.06%  __mmu_notifier_test_clear_young

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/

Signed-off-by: Yu Zhao 
---
 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 include/linux/mmzone.h|   6 +-
 mm/rmap.c |   8 +-
 mm/vmscan.c   | 139 --
 4 files changed, 138 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst 
b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..0ae2a6d4d94c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
+0x0008 Clearing the accessed bit in KVM page table entries in large
+   batches, when KVM MMU sets it (e.g., on x86_64). This can
+   improve the performance of guests when the host is under memory
+   pressure.
 [yYnN] Apply to all the components above.
 == ===
 
@@ -56,7 +60,7 @@ E.g.,
 
 echo y >/sys/kernel/mm/lru_gen/enabled
 cat /sys/kernel/mm/lru_gen/enabled
-0x0007
+0x000f
 echo 5 >/sys/kernel/mm/lru_gen/enabled
 cat /sys/kernel/mm/lru_gen/enabled
 0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5a7ada0413da..1b148f39fabc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+   LRU_GEN_KVM_MMU_WALK,
NR_LRU_GEN_CAPS
 };
 
@@ -471,7 +472,7 @@ struct lru_gen_mm_walk {
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
 
@@ -559,8 +560,9 @@ static inline void lru_gen_init_lruvec(struct lruvec 
*lruvec)
 {
 }
 
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
+   return false;
 }
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/rmap.c b/mm/rmap.c
index ae127f60a4fb..3a2c4e6a0887 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -825,12 +825,10 @@ static bool folio_referenced_one(struct folio *folio,
return false; /* To break the loop */
}
 
-   if (pvmw.pte) {
-   if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
-   lru_gen_look_around(&pvmw);
+   if (lru_gen_enabled() && pvmw.pte) {
+   if (lru_gen_look_around(&pvmw))
referenced++;
-   }
-
+   } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ef687f9be13c..3f734588b600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -3244,6 +3245,20 @@ static bool should_clear_pmd_young(voi

[PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()

2023-05-26 Thread Yu Zhao
Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., TDP MMU sets the accessed bit in
KVM PTEs and VMs are not nested, where it can rely on RCU and
clear_bit() to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao 
---
 arch/x86/include/asm/kvm_host.h |  7 +++
 arch/x86/kvm/mmu/tdp_mmu.c  | 34 +
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 753c67072c47..d6dfdebe3d94 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2223,4 +2223,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
unsigned long npages);
  */
 #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return IS_ENABLED(CONFIG_X86_64) &&
+  (!IS_REACHABLE(CONFIG_KVM) || (tdp_mmu_enabled && 
shadow_accessed_mask));
+}
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 08340219c35a..6875a819e007 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1232,6 +1232,40 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct 
kvm_gfn_range *range)
return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+   struct kvm_mmu_page *root;
+   int offset = ffs(shadow_accessed_mask) - 1;
+
+   if (kvm_shadow_root_allocated(kvm))
+   return true;
+
+   rcu_read_lock();
+
+   list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+   struct tdp_iter iter;
+
+   if (kvm_mmu_page_as_id(root) != range->slot->as_id)
+   continue;
+
+   tdp_root_for_each_leaf_pte(iter, root, range->start, 
range->end) {
+   u64 *sptep = rcu_dereference(iter.sptep);
+
+   VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));
+
+   if (!(iter.old_spte & shadow_accessed_mask))
+   continue;
+
+   if (kvm_should_clear_young(range, iter.gfn))
+   clear_bit(offset, (unsigned long *)sptep);
+   }
+   }
+
+   rcu_read_unlock();
+
+   return false;
+}
+
 static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 struct kvm_gfn_range *range)
 {
-- 
2.41.0.rc0.172.g3f132b7071-goog
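
The commit message's reliance on clear_bit() deserves a brief
illustration. Below is a hedged userspace sketch (the bit positions and
the stdatomic idiom are stand-ins, not KVM code): a single-bit atomic
RMW only touches the accessed bit, so concurrent updates to other SPTE
bits, e.g. the dirty bit set by hardware, can never be overwritten,
which is what makes clearing the bit safe without kvm->mmu_lock or a
cmpxchg loop.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_ACCESSED (1ULL << 5)   /* illustrative bit positions */
#define SPTE_DIRTY    (1ULL << 6)

int main(void)
{
    _Atomic uint64_t spte = SPTE_ACCESSED | SPTE_DIRTY;

    /*
     * Unsafe pattern (what the fast path must avoid): a plain
     * read-modify-write of the whole entry can overwrite bits set by
     * hardware in between, e.g.:
     *
     *     uint64_t tmp = spte;              // dirty not yet set
     *     // ... hardware sets SPTE_DIRTY ...
     *     spte = tmp & ~SPTE_ACCESSED;      // dirty update lost
     *
     * Safe pattern, the equivalent of clear_bit(): one atomic RMW that
     * only clears the accessed bit.
     */
    atomic_fetch_and(&spte, ~SPTE_ACCESSED);

    printf("spte = %#llx\n", (unsigned long long)spte); /* dirty kept */
    return 0;
}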



[PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask

2023-05-26 Thread Yu Zhao
tdp_mmu_enabled and shadow_accessed_mask are needed to implement
kvm_arch_has_test_clear_young().

Signed-off-by: Yu Zhao 
---
 arch/x86/include/asm/kvm_host.h | 6 ++
 arch/x86/kvm/mmu.h  | 6 --
 arch/x86/kvm/mmu/spte.h | 1 -
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb9d1f2d6136..753c67072c47 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1772,6 +1772,7 @@ struct kvm_arch_async_pf {
 
 extern u32 __read_mostly kvm_nr_uret_msrs;
 extern u64 __read_mostly host_efer;
+extern u64 __read_mostly shadow_accessed_mask;
 extern bool __read_mostly allow_smaller_maxphyaddr;
 extern bool __read_mostly enable_apicv;
 extern struct kvm_x86_ops kvm_x86_ops;
@@ -1855,6 +1856,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned 
irqchip, unsigned pin,
 bool mask);
 
 extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif
 
 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..84aedb2671ef 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -253,12 +253,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm 
*kvm)
return smp_load_acquire(&kvm->arch.shadow_root_allocated);
 }
 
-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..a82c4fa1c47b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
 extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
 extern u64 __read_mostly shadow_dirty_mask;
 extern u64 __read_mostly shadow_mmio_value;
 extern u64 __read_mostly shadow_mmio_mask;
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()

2023-05-26 Thread Yu Zhao
Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., radix MMU sets the accessed bit in
KVM PTEs and VMs are not nested, where it can rely on RCU and
pte_xchg() to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao 
---
 arch/powerpc/include/asm/kvm_host.h|  8 
 arch/powerpc/include/asm/kvm_ppc.h |  1 +
 arch/powerpc/kvm/book3s.c  |  6 +++
 arch/powerpc/kvm/book3s.h  |  1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++
 arch/powerpc/kvm/book3s_hv.c   |  5 +++
 6 files changed, 80 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..75c260ea8a9e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu 
*vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
+  cpu_has_feature(CPU_FTR_HVMODE) && 
cpu_has_feature(CPU_FTR_ARCH_300) &&
+  radix_enabled();
+}
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 79a9c0bb8bba..ff1af6a7b44f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -287,6 +287,7 @@ struct kvmppc_ops {
bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+   bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
void (*free_memslot)(struct kvm_memory_slot *slot);
int (*init_vm)(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 686d8d9eda3e..37bf40b0c4ff 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
kvm_gfn_range *range)
return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+   return !kvm->arch.kvm_ops->test_clear_young ||
+  kvm->arch.kvm_ops->test_clear_young(kvm, range);
+}
+
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 58391b4b32ed..fa2659e21ccc 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range 
*range);
 extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range 
*range);
 extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 
 extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 3b65b3b11041..0a392e9a100a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
return ref;
 }
 
+bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+   bool err;
+   gfn_t gfn = range->start;
+
+   rcu_read_lock();
+
+   err = !kvm_is_radix(kvm);
+   if (err)
+   goto unlock;
+
+   /*
+* Case 1:  This function  kvmppc_switch_mmu_to_hpt()
+*
+*  rcu_read_lock()
+*  Test kvm_is_radix()kvm->arch.radix = 0
+*  Use kvm->arch.pgtable  synchronize_rcu()
+*  rcu_read_unlock()
+* kvmppc_free_radix()
+*
+*
+* Case 2:  This function  kvmppc_switch_mmu_to_radix()
+*
+* kvmppc_init_vm_radix()
+* smp_wmb()
+*  Test kvm_is_radix()kvm->arch.radix = 1
+ 
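
The Case 1 ordering in the comment above is the classic flag-plus-pointer
RCU pattern; here is a hedged, self-contained sketch of it, with invented
names ("mmu_mode" and the "use"/"free_radix" callbacks) standing in for
kvm->arch.radix, kvm->arch.pgtable and kvmppc_free_radix():

#include <linux/rcupdate.h>

struct mmu_mode {
    bool radix;     /* plays the role of kvm->arch.radix */
    void *pgtable;  /* plays the role of kvm->arch.pgtable */
};

/* Reader side ("This function"): test the flag and use the page table
 * inside one RCU read-side critical section. */
static bool use_radix_table(struct mmu_mode *m, void (*use)(void *pgtable))
{
    bool radix;

    rcu_read_lock();
    radix = READ_ONCE(m->radix);
    if (radix)
        use(m->pgtable);    /* still valid: the free below is deferred */
    rcu_read_unlock();

    return radix;
}

/* Writer side (kvmppc_switch_mmu_to_hpt()): clear the flag, wait out
 * every reader that might have seen radix == true, then free. */
static void switch_to_hpt(struct mmu_mode *m, void (*free_radix)(void *pgtable))
{
    WRITE_ONCE(m->radix, false);
    synchronize_rcu();
    free_radix(m->pgtable);
    m->pgtable = NULL;
}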

[PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe

2023-05-26 Thread Yu Zhao
KVM page tables are currently not RCU safe against remapping, i.e.,
kvmppc_unmap_free_pmd_entry_table() et al. The previous
mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
that operation.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, orphan page tables need to
be freed by RCU.

Unmapping, specifically kvm_unmap_radix(), does not free page tables,
hence not a concern.

Signed-off-by: Yu Zhao 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..3b65b3b11041 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
 {
unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
 
-   kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
+   kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
+ SLAB_TYPESAFE_BY_RCU, pte_ctor);
if (!kvm_pte_cache)
return -ENOMEM;
 
size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
 
-   kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
+   kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
+ SLAB_TYPESAFE_BY_RCU, pmd_ctor);
if (!kvm_pmd_cache) {
kmem_cache_destroy(kvm_pte_cache);
return -ENOMEM;
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()

2023-05-26 Thread Yu Zhao
Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., where hardware sets the accessed
bit in KVM PTEs and VMs are not protected, so that it can rely on RCU
and cmpxchg to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao 
---
 arch/arm64/include/asm/kvm_host.h |  6 ++
 arch/arm64/kvm/mmu.c  | 36 +++
 2 files changed, 42 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7e7e19ef6993..da32b0890716 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { }
 void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
 bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return cpu_has_hw_af() && !is_protected_kvm_enabled();
+}
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c3b3e2afe26f..26a8d955b49c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
kvm_gfn_range *range)
   range->start << PAGE_SHIFT);
 }
 
+static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
+  enum kvm_pgtable_walk_flags flags)
+{
+   kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
+
+   VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
+
+   if (!kvm_pte_valid(new))
+   return 0;
+
+   if (new == ctx->old)
+   return 0;
+
+   if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE))
+   stage2_try_set_pte(ctx, new);
+
+   return 0;
+}
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+   u64 start = range->start * PAGE_SIZE;
+   u64 end = range->end * PAGE_SIZE;
+   struct kvm_pgtable_walker walker = {
+   .cb = stage2_test_clear_young,
+   .arg= range,
+   .flags  = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
+   };
+
+   BUILD_BUG_ON(is_hyp_code());
+
+   kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
+
+   return false;
+}
+
 phys_addr_t kvm_mmu_get_httbr(void)
 {
return __pa(hyp_pgtable->pgd);
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit

2023-05-26 Thread Yu Zhao
TLDR

This patchset adds a fast path to clear the accessed bit without
taking kvm->mmu_lock. It can significantly improve the performance of
guests when the host is under heavy memory pressure.

ChromeOS has been using a similar approach [1] since mid 2021 and it
was proven successful on tens of millions of devices.

This v2 addresses previous requests [2] by refactoring the code,
removing inaccurate/redundant text, etc.

[1] https://crrev.com/c/2987928
[2] https://lore.kernel.org/r/20230217041230.2417228-1-yuz...@google.com/

Overview

The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on a simple yet
common case where hardware sets the accessed bit in KVM PTEs and VMs
are not nested. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

The fast path relies on two techniques to safely clear the accessed
bit: RCU and CAS. The former protects KVM page tables from being
freed while the latter clears the accessed bit atomically against
both the hardware and other software page table walkers.
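
As a rough illustration of the CAS half (a minimal userspace sketch,
not KVM code; the simulated PTE layout, bit positions and helper name
are made up for this example, and the RCU half is not shown):

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define PTE_VALID    (1ULL << 0)        /* illustrative bits only */
  #define PTE_ACCESSED (1ULL << 5)

  /* Clear the accessed bit iff the entry has not changed under us. */
  static int test_clear_accessed(_Atomic uint64_t *ptep)
  {
          uint64_t old = atomic_load(ptep);

          if (!(old & PTE_VALID) || !(old & PTE_ACCESSED))
                  return 0;       /* not present or not young */

          /*
           * If the hardware or another walker modified the entry in the
           * meantime, the CAS fails and the entry is simply reported as
           * young without being cleared -- the same harmless fallback
           * the fast path relies on.
           */
          atomic_compare_exchange_strong(ptep, &old, old & ~PTE_ACCESSED);
          return 1;
  }

  int main(void)
  {
          _Atomic uint64_t pte = PTE_VALID | PTE_ACCESSED;
          int young = test_clear_accessed(&pte);

          printf("young=%d pte=%#llx\n", young,
                 (unsigned long long)atomic_load(&pte));
          return 0;
  }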

A new mmu_notifier_ops member, test_clear_young(), supersedes the
existing clear_young() and test_young(). This extended callback can
operate on a range of KVM PTEs individually according to a bitmap, if
the caller provides it.

Evaluation
==
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 128 CPUs and
256GB DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250
  
  MGLRU       run2
  ----------------
  Before [1]  ~64s
  After       ~51s
  
  kswapd (MGLRU before)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.71%  evict_folios
  97.29%  shrink_folio_list
  ==>>  13.05%  folio_referenced
  12.83%  rmap_walk_file
12.31%  folio_referenced_one
  7.90%  __mmu_notifier_clear_young
7.72%  kvm_mmu_notifier_clear_young
  7.34%  _raw_write_lock
  
  kswapd (MGLRU after)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.59%  evict_folios
  80.37%  shrink_folio_list
  ==>>  3.74%  folio_referenced
  3.59%  rmap_walk_file
3.19%  folio_referenced_one
  2.53%  lru_gen_look_around
1.06%  __mmu_notifier_test_clear_young

Comprehensive benchmarks are coming soon.

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
 reference" was included so that the comparison is apples to
 apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/

Yu Zhao (10):
  mm/kvm: add mmu_notifier_ops->test_clear_young()
  mm/kvm: use mmu_notifier_ops->test_clear_young()
  kvm/arm64: export stage2_try_set_pte() and macros
  kvm/arm64: make stage2 page tables RCU safe
  kvm/arm64: add kvm_arch_test_clear_young()
  kvm/powerpc: make radix page tables RCU safe
  kvm/powerpc: add kvm_arch_test_clear_young()
  kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
  kvm/x86: add kvm_arch_test_clear_young()
  mm: multi-gen LRU: use mmu_notifier_test_clear_young()

 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 arch/arm64/include/asm/kvm_host.h |   6 +
 arch/arm64/include/asm/kvm_pgtable.h  |  55 +++
 arch/arm64/kvm/arm.c  |   1 +
 arch/arm64/kvm/hyp/pgtable.c  |  61 +---
 arch/arm64/kvm/mmu.c  |  53 ++-
 arch/powerpc/include/asm/kvm_host.h   |   8 +
 arch/powerpc/include/asm/kvm_ppc.h|   1 +
 arch/powerpc/kvm/book3s.c |   6 +
 arch/powerpc/kvm/book3s.h |   1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c|  65 +++-
 arch/powerpc/kvm/book3s_hv.c  |   5 +
 arch/x86/include/asm/kvm_host.h   |  13 ++
 arch/x86/kvm/mmu.h|   6 -
 arch/x86/kvm/mmu/spte.h   |   1 -
 arch/x86/kvm/mmu/tdp_mmu.c|  34 +
 include/linux/kvm_host.h  |  22 +++
 include/linux/mmu_notifier.h  |  79 ++
 include/linux/mmzone.h|   6 +-
 include/trace/events/kvm.h|  15 --
 mm/mmu_notifier.c |  48 ++
 mm/rmap.c |   8 +-
 mm/vmscan.c   | 139 --
 virt/kvm/kvm_main.c   | 114 --
 24 files changed, 546 insertions(+), 207 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

2023-05-26 Thread Yu Zhao
Stage2 page tables are currently not RCU safe against unmapping or VM
destruction. The previous mmu_notifier_ops members rely on
kvm->mmu_lock to synchronize with those operations.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, unmapped page tables need
to be freed by RCU, and kvm_free_stage2_pgd() needs to be called after
mmu_notifier_unregister().

Remapping, specifically stage2_free_removed_table(), is already RCU
safe.

Signed-off-by: Yu Zhao 
---
 arch/arm64/include/asm/kvm_pgtable.h |  2 ++
 arch/arm64/kvm/arm.c |  1 +
 arch/arm64/kvm/hyp/pgtable.c |  8 ++--
 arch/arm64/kvm/mmu.c | 17 -
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index ff520598b62c..5cab52e3a35f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 
level)
  * @put_page:  Decrement the refcount on a page. When the
  * refcount reaches 0 the page is automatically
  * freed.
+ * @put_page_rcu:  RCU variant of the above.
  * @page_count:Return the refcount of a page.
  * @phys_to_virt:  Convert a physical address into a virtual
  * address mapped in the current context.
@@ -170,6 +171,7 @@ struct kvm_pgtable_mm_ops {
void(*free_removed_table)(void *addr, u32 level);
void(*get_page)(void *addr);
void(*put_page)(void *addr);
+   void(*put_page_rcu)(void *addr);
int (*page_count)(void *addr);
void*   (*phys_to_virt)(phys_addr_t phys);
phys_addr_t (*virt_to_phys)(void *addr);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 14391826241c..ee93271035d9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, 
struct vm_fault *vmf)
  */
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+   kvm_free_stage2_pgd(&kvm->arch.mmu);
bitmap_free(kvm->arch.pmu_filter);
free_cpumask_var(kvm->arch.supported_cpus);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 24678ccba76a..dbace4c6a841 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -988,8 +988,12 @@ static int stage2_unmap_walker(const struct 
kvm_pgtable_visit_ctx *ctx,
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
   kvm_granule_size(ctx->level));
 
-   if (childp)
-   mm_ops->put_page(childp);
+   if (childp) {
+   if (mm_ops->put_page_rcu)
+   mm_ops->put_page_rcu(childp);
+   else
+   mm_ops->put_page(childp);
+   }
 
return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3b9d4d24c361..c3b3e2afe26f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -172,6 +172,21 @@ static int kvm_host_page_count(void *addr)
return page_count(virt_to_page(addr));
 }
 
+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+   put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+   struct page *page = virt_to_page(addr);
+
+   if (kvm_host_page_count(addr) == 1)
+   kvm_account_pgtable_pages(addr, -1);
+
+   call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
 static phys_addr_t kvm_host_pa(void *addr)
 {
return __pa(addr);
@@ -704,6 +719,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
.free_removed_table = stage2_free_removed_table,
.get_page   = kvm_host_get_page,
.put_page   = kvm_s2_put_page,
+   .put_page_rcu   = kvm_s2_put_page_rcu,
.page_count = kvm_host_page_count,
.phys_to_virt   = kvm_host_va,
.virt_to_phys   = kvm_host_pa,
@@ -1877,7 +1893,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-   kvm_free_stage2_pgd(&kvm->arch.mmu);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young()

2023-05-26 Thread Yu Zhao
Replace test_young() and clear_young() with test_clear_young().

Signed-off-by: Yu Zhao 
---
 include/linux/mmu_notifier.h | 29 ++-
 include/trace/events/kvm.h   | 15 --
 mm/mmu_notifier.c| 42 
 virt/kvm/kvm_main.c  | 54 
 4 files changed, 2 insertions(+), 138 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index dfdbb370682d..c8f35fc08703 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -104,26 +104,6 @@ struct mmu_notifier_ops {
 unsigned long start,
 unsigned long end);
 
-   /*
-* clear_young is a lightweight version of clear_flush_young. Like the
-* latter, it is supposed to test-and-clear the young/accessed bitflag
-* in the secondary pte, but it may omit flushing the secondary tlb.
-*/
-   int (*clear_young)(struct mmu_notifier *subscription,
-  struct mm_struct *mm,
-  unsigned long start,
-  unsigned long end);
-
-   /*
-* test_young is called to check the young/accessed bitflag in
-* the secondary pte. This is used to know if the page is
-* frequently used without actually clearing the flag or tearing
-* down the secondary mapping on the page.
-*/
-   int (*test_young)(struct mmu_notifier *subscription,
- struct mm_struct *mm,
- unsigned long address);
-
int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
unsigned long start, unsigned long end,
bool clear, unsigned long *bitmap);
@@ -393,11 +373,6 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long start,
  unsigned long end);
-extern int __mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
-extern int __mmu_notifier_test_young(struct mm_struct *mm,
-unsigned long address);
 extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
   unsigned long start, unsigned long end,
   bool clear, unsigned long *bitmap);
@@ -437,7 +412,7 @@ static inline int mmu_notifier_clear_young(struct mm_struct 
*mm,
   unsigned long end)
 {
if (mm_has_notifiers(mm))
-   return __mmu_notifier_clear_young(mm, start, end);
+   return __mmu_notifier_test_clear_young(mm, start, end, true, NULL);
return 0;
 }
 
@@ -445,7 +420,7 @@ static inline int mmu_notifier_test_young(struct mm_struct 
*mm,
  unsigned long address)
 {
if (mm_has_notifiers(mm))
-   return __mmu_notifier_test_young(mm, address);
+   return __mmu_notifier_test_clear_young(mm, address, address + 1, false, NULL);
return 0;
 }
 
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 3bd31ea23fee..46c347e56e60 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,21 +489,6 @@ TRACE_EVENT(kvm_age_hva,
  __entry->start, __entry->end)
 );
 
-TRACE_EVENT(kvm_test_age_hva,
-   TP_PROTO(unsigned long hva),
-   TP_ARGS(hva),
-
-   TP_STRUCT__entry(
-   __field(unsigned long,  hva )
-   ),
-
-   TP_fast_assign(
-   __entry->hva= hva;
-   ),
-
-   TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
-);
-
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7e6aba4bddcb..c7e9747c9920 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -382,48 +382,6 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
return young;
 }
 
-int __mmu_notifier_clear_young(struct mm_struct *mm,
-  unsigned long start,
-  unsigned long end)
-{
-   struct mmu_notifier *subscription;
-   int young = 0, id;
-
-   id = srcu_read_lock(&srcu);
-   hlist_for_each_entry_rcu(subscription,
-    &mm->notifier_subscriptions->list, hlist,
-    srcu_read_lock_held(&srcu)) {
-   if (subscription->ops->clear_young)
-   young |= subscription->ops->clear_young(subscription,
- 

[PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

2023-05-26 Thread Yu Zhao
Add mmu_notifier_ops->test_clear_young() to supersede test_young()
and clear_young().

test_clear_young() has a fast path, which if supported, allows its
callers to safely clear the accessed bit without taking
kvm->mmu_lock.

The fast path requires arch-specific code that generally relies on
RCU and CAS: the former protects KVM page tables from being freed
while the latter clears the accessed bit atomically against both the
hardware and other software page table walkers. If the fast path is
unsupported, test_clear_young() falls back to the existing slow path
where kvm->mmu_lock is then taken.

test_clear_young() can also operate on a range of KVM PTEs
individually according to a bitmap, if the caller provides it.
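
For orientation, a caller on the MM side is expected to look roughly
like the sketch below (illustrative only; it assumes the non-__ wrapper
keeps __mmu_notifier_test_clear_young()'s signature and skips the
bitmap form):

  /*
   * Test and clear the accessed bit for every KVM PTE mapping
   * [start, end); a NULL bitmap means only a yes/no answer is wanted.
   */
  static bool range_was_young(struct mm_struct *mm,
                              unsigned long start, unsigned long end)
  {
          int ret = mmu_notifier_test_clear_young(mm, start, end,
                                                  true /* clear */,
                                                  NULL /* bitmap */);

          /*
           * Nonzero means at least one PTE was young:
           * MMU_NOTIFIER_RANGE_LOCKLESS if the fast path served the
           * range, MMU_NOTIFIER_RANGE_YOUNG if a listener had to fall
           * back to the kvm->mmu_lock slow path.
           */
          return ret;
  }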

Signed-off-by: Yu Zhao 
---
 include/linux/kvm_host.h | 22 +++
 include/linux/mmu_notifier.h | 52 
 mm/mmu_notifier.c| 24 
 virt/kvm/kvm_main.c  | 76 +++-
 4 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e571e973bc2..374262545f96 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 struct kvm_gfn_range {
struct kvm_memory_slot *slot;
+   void *args;
gfn_t start;
gfn_t end;
pte_t pte;
@@ -267,6 +268,27 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn);
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range);
+#endif
+
+/*
+ * Architectures that implement kvm_arch_test_clear_young() should override
+ * kvm_arch_has_test_clear_young().
+ *
+ * kvm_arch_has_test_clear_young() is allowed to return a false positive, i.e., it
+ * can return true if kvm_arch_test_clear_young() is supported but disabled due
+ * to some runtime constraint. In this case, kvm_arch_test_clear_young() should
+ * return true; otherwise, it should return false.
+ *
+ * For each young KVM PTE, kvm_arch_test_clear_young() should call
+ * kvm_should_clear_young() to decide whether to clear the accessed bit.
+ */
+#ifndef kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return false;
+}
 #endif
 
 enum {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..dfdbb370682d 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -60,6 +60,8 @@ enum mmu_notifier_event {
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
+#define MMU_NOTIFIER_RANGE_LOCKLESS(1 << 1)
+#define MMU_NOTIFIER_RANGE_YOUNG   (1 << 2)
 
 struct mmu_notifier_ops {
/*
@@ -122,6 +124,10 @@ struct mmu_notifier_ops {
  struct mm_struct *mm,
  unsigned long address);
 
+   int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
+   unsigned long start, unsigned long end,
+   bool clear, unsigned long *bitmap);
+
/*
 * change_pte is called in cases that pte mapping to page is changed:
 * for example, when ksm remaps pte to point to a new shared page.
@@ -392,6 +398,9 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
  unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 unsigned long address);
+extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+  unsigned long start, unsigned long end,
+  bool clear, unsigned long *bitmap);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
  unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
@@ -440,6 +449,35 @@ static inline int mmu_notifier_test_young(struct mm_struct 
*mm,
return 0;
 }
 
+/*
+ * mmu_notifier_test_clear_young() returns nonzero if any of the KVM PTEs within
+ * a given range was young. Specifically, it returns MMU_NOTIFIER_RANGE_LOCKLESS
+ * if the fast path was successful, MMU_NOTIFIER_RANGE_YOUNG otherwise.
+ *
+ * The last parameter to the function is a bitmap and only the fast path
+ * supports it: if it is NULL, the function falls back to the slow path if the
+ * fast path was unsuccessful; otherwise, the function bails out.
+ *
+ * The bitmap has the following specifications:
+ * 1. The number of bits should be at least (end-start)/PAGE_SIZE.
+ 

[PATCH mm-unstable v2 03/10] kvm/arm64: export stage2_try_set_pte() and macros

2023-05-26 Thread Yu Zhao
stage2_try_set_pte() and KVM_PTE_LEAF_ATTR_LO_S2_AF are needed to
implement kvm_arch_test_clear_young().

Signed-off-by: Yu Zhao 
---
 arch/arm64/include/asm/kvm_pgtable.h | 53 
 arch/arm64/kvm/hyp/pgtable.c | 53 
 2 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index dc3c072e862f..ff520598b62c 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -44,6 +44,49 @@ typedef u64 kvm_pte_t;
 
 #define KVM_PHYS_INVALID   (-1ULL)
 
+#define KVM_PTE_TYPE   BIT(1)
+#define KVM_PTE_TYPE_BLOCK 0
+#define KVM_PTE_TYPE_PAGE  1
+#define KVM_PTE_TYPE_TABLE 1
+
+#define KVM_PTE_LEAF_ATTR_LO   GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDXGENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO  3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW  1
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS  3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTRGENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS  3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI   GENMASK(63, 51)
+
+#define KVM_PTE_LEAF_ATTR_HI_SWGENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
+KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
+#define KVM_MAX_OWNER_ID   1
+
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED BIT(10)
+
 static inline bool kvm_pte_valid(kvm_pte_t pte)
 {
return pte & KVM_PTE_VALID;
@@ -224,6 +267,16 @@ static inline bool kvm_pgtable_walk_shared(const struct 
kvm_pgtable_visit_ctx *c
return ctx->flags & KVM_PGTABLE_WALK_SHARED;
 }
 
+static inline bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, 
kvm_pte_t new)
+{
+   if (!kvm_pgtable_walk_shared(ctx)) {
+   WRITE_ONCE(*ctx->ptep, new);
+   return true;
+   }
+
+   return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
+}
+
 /**
  * struct kvm_pgtable_walker - Hook into a page-table walk.
  * @cb:Callback function to invoke during the walk.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 5282cb9ca4cf..24678ccba76a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -12,49 +12,6 @@
 #include 
 
 
-#define KVM_PTE_TYPE   BIT(1)
-#define KVM_PTE_TYPE_BLOCK 0
-#define KVM_PTE_TYPE_PAGE  1
-#define KVM_PTE_TYPE_TABLE 1
-
-#define KVM_PTE_LEAF_ATTR_LO   GENMASK(11, 2)
-
-#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDXGENMASK(4, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO  3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW  1
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS  3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTRGENMASK(5, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS  3
-#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_HI   GENMASK(63, 51)
-
-#define KVM_PTE_LEAF_ATTR_HI_SWGENMASK(58, 55)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
-KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
-KVM_PTE_LEAF_ATTR_HI_S2_XN)
-
-#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID   1
-
-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED BIT(10)
-
 struct kvm_pgtable_walk_data {
struct kvm_pgtable_walker   *walker;
 
@@ -702,16 +659,6 @@ static bool stage2_pte_is_locked(kvm_pte_t pte)
return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
 }
 
-static bool stage2_try_set_pte(const struct kvm_pgt

Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 1:29 PM Sean Christopherson  wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 12:58 PM Sean Christopherson  
> > wrote:
> > >
> > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson 
> > > >  wrote:
> > > > >
> > > > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > > > > As alluded to in patch 1, unless batching the walks even if KVM 
> > > > > > > does _not_ support
> > > > > > > a lockless walk is somehow _worse_ than using the existing 
> > > > > > > mmu_notifier_clear_flush_young(),
> > > > > > > I think batching the calls should be conditional only on 
> > > > > > > LRU_GEN_SPTE_WALK.  Or
> > > > > > > if we want to avoid batching when there are no mmu_notifier 
> > > > > > > listeners, probe
> > > > > > > mmu_notifiers.  But don't call into KVM directly.
> > > > > >
> > > > > > I'm not sure I fully understand. Let's present the problem on the MM
> > > > > > side: assuming KVM supports lockless walks, batching can still be
> > > > > > worse (very unlikely), because GFNs can exhibit no memory locality 
> > > > > > at
> > > > > > all. So this option allows userspace to disable batching.
> > > > >
> > > > > I'm asking the opposite.  Is there a scenario where batching+lock is 
> > > > > worse than
> > > > > !batching+lock?  If not, then don't make batching depend on lockless 
> > > > > walks.
> > > >
> > > > Yes, absolutely. batching+lock means we take/release mmu_lock for
> > > > every single PTE in the entire VA space -- each small batch contains
> > > > 64 PTEs but the entire batch is the whole KVM.
> > >
> > > Who is "we"?
> >
> > Oops -- shouldn't have used "we".
> >
> > > I don't see anything in the kernel that triggers walking the whole
> > > VMA, e.g. lru_gen_look_around() limits the walk to a single PMD.  I feel 
> > > like I'm
> > > missing something...
> >
> > walk_mm() -> walk_pud_range() -> walk_pmd_range() -> walk_pte_range()
> > -> test_spte_young() -> mmu_notifier_test_clear_young().
> >
> > MGLRU takes two passes: during the first pass, it sweeps the entire VA
> > space on each MM (per MM/KVM); during the second pass, it uses the rmap on 
> > each
> > folio (per folio).
>
> Ah.  IIUC, userspace can use LRU_GEN_SPTE_WALK to control whether or not to 
> walk
> secondary MMUs, and the kernel further restricts LRU_GEN_SPTE_WALK to 
> secondary
> MMUs that implement a lockless walk.  And if the answer is "no", secondary 
> MMUs
> are simply not consulted.

Sorry for the bad naming -- probably LRU_GEN_SPTE_BATCH_WALK would be
less confusing.

MGLRU always consults the secondary MMU for each page it's going to
reclaim (during the second pass), i.e., via the rmap it checks the
A-bit in the SPTEs mapping a page it plans to reclaim, so that it
won't take a hot page away from KVM.

If the lockless walk is supported, MGLRU doesn't need to work at page
granularity: (physical) pages on the LRU list may have nothing in
common (e.g., from different processes), checking their PTEs/SPTEs one
by one is expensive. Instead, it sweeps the entire KVM space in the
first pass and checks the *adjacent SPTEs* of a page it plans to
reclaim in the second pass. Both rely on the *assumption* there would
be some spatial locality to exploit. This assumption can be wrong, and
LRU_GEN_SPTE_WALK disables it.
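
To make the "adjacent SPTEs" idea concrete, here is a toy userspace
model (purely illustrative; the 64-entry window, the table size and the
bit layout are arbitrary choices for the sketch, not MGLRU or KVM code):

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define ACCESSED   (1ULL << 5)          /* arbitrary bit for the toy */
  #define NR_ENTRIES 512
  #define WINDOW     64

  static _Atomic uint64_t table[NR_ENTRIES];

  /*
   * Test and clear the accessed bit of every entry in the 64-entry
   * window around @index; return a bitmap of the entries that were
   * young. One sweep replaces 64 individual per-page checks.
   */
  static uint64_t look_around(size_t index)
  {
          size_t base = index & ~(size_t)(WINDOW - 1);
          uint64_t young = 0;

          for (size_t i = 0; i < WINDOW; i++) {
                  uint64_t old = atomic_fetch_and(&table[base + i],
                                                  ~ACCESSED);

                  if (old & ACCESSED)
                          young |= 1ULL << i;
          }
          return young;
  }

  int main(void)
  {
          atomic_store(&table[70], ACCESSED);
          atomic_store(&table[100], ACCESSED);
          /* Both hits land in the same window, so one sweep finds them. */
          printf("young bitmap around index 75: %#llx\n",
                 (unsigned long long)look_around(75));
          return 0;
  }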

> If that's correct, then the proper way to handle this is by extending 
> mmu_notifier_ops
> to query (a) if there's at least one register listeners that implements
> test_clear_young() and (b) if all registered listeners that implement 
> test_clear_young()
> support lockless walks.  That avoids direct dependencies on KVM, and avoids 
> making
> assumptions that may not always hold true, e.g. that KVM is the only 
> mmu_notifier
> user that supports the young APIs.
>
> P.S. all of this info absolutely belongs in documentation and/or changelogs.

Will do.


Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 12:58 PM Sean Christopherson  wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson  
> > wrote:
> > >
> > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > > As alluded to in patch 1, unless batching the walks even if KVM does 
> > > > > _not_ support
> > > > > a lockless walk is somehow _worse_ than using the existing 
> > > > > mmu_notifier_clear_flush_young(),
> > > > > I think batching the calls should be conditional only on 
> > > > > LRU_GEN_SPTE_WALK.  Or
> > > > > if we want to avoid batching when there are no mmu_notifier 
> > > > > listeners, probe
> > > > > mmu_notifiers.  But don't call into KVM directly.
> > > >
> > > > I'm not sure I fully understand. Let's present the problem on the MM
> > > > side: assuming KVM supports lockless walks, batching can still be
> > > > worse (very unlikely), because GFNs can exhibit no memory locality at
> > > > all. So this option allows userspace to disable batching.
> > >
> > > I'm asking the opposite.  Is there a scenario where batching+lock is 
> > > worse than
> > > !batching+lock?  If not, then don't make batching depend on lockless 
> > > walks.
> >
> > Yes, absolutely. batching+lock means we take/release mmu_lock for
> > every single PTE in the entire VA space -- each small batch contains
> > 64 PTEs but the entire batch is the whole KVM.
>
> Who is "we"?

Oops -- shouldn't have used "we".

> I don't see anything in the kernel that triggers walking the whole
> VMA, e.g. lru_gen_look_around() limits the walk to a single PMD.  I feel like 
> I'm
> missing something...

walk_mm() -> walk_pud_range() -> walk_pmd_range() -> walk_pte_range()
-> test_spte_young() -> mmu_notifier_test_clear_young().

MGLRU takes two passes: during the first pass, it sweeps the entire VA
space on each MM (per MM/KVM); during the second pass, it uses the rmap on each
folio (per folio). The look around exploits the (spatial) locality in
the second pass, to get the best out of the expensive per folio rmap
walk.

(The first pass can't handle shared mappings; the second pass can.)


Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson  wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson  
> > wrote:
> > >
> > > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > > >   kswapd (MGLRU before)
> > > > 100.00%  balance_pgdat
> > > >   100.00%  shrink_node
> > > > 100.00%  shrink_one
> > > >   99.97%  try_to_shrink_lruvec
> > > > 99.06%  evict_folios
> > > >   97.41%  shrink_folio_list
> > > > 31.33%  folio_referenced
> > > >   31.06%  rmap_walk_file
> > > > 30.89%  folio_referenced_one
> > > >   20.83%  __mmu_notifier_clear_flush_young
> > > > 20.54%  kvm_mmu_notifier_clear_flush_young
> > > >   =>  19.34%  _raw_write_lock
> > > >
> > > >   kswapd (MGLRU after)
> > > > 100.00%  balance_pgdat
> > > >   100.00%  shrink_node
> > > > 100.00%  shrink_one
> > > >   99.97%  try_to_shrink_lruvec
> > > > 99.51%  evict_folios
> > > >   71.70%  shrink_folio_list
> > > > 7.08%  folio_referenced
> > > >   6.78%  rmap_walk_file
> > > > 6.72%  folio_referenced_one
> > > >   5.60%  lru_gen_look_around
> > > >   =>1.53%  __mmu_notifier_test_clear_young
> > >
> > > Do you happen to know how much of the improvement is due to batching, and 
> > > how
> > > much is due to using a walkless walk?
> >
> > No. I have three benchmarks running at the moment:
> > 1. Windows SQL server guest on x86 host,
> > 2. Apache Spark guest on arm64 host, and
> > 3. Memcached guest on ppc64 host.
> >
> > If you are really interested in that, I can reprioritize -- I need to
> > stop 1) and use that machine to get the number for you.
>
> After looking at the "MGLRU before" stack again, it's definitely worth getting
> those numbers.  The "before" isn't just taking mmu_lock, it's taking mmu_lock 
> for
> write _and_ flushing remote TLBs on _every_ PTE.

Correct.

> I suspect the batching is a
> tiny percentage of the overall win (might be larger with RETPOLINE and 
> friends),

Same here.

> and that the bulk of the improvement comes from avoiding the insanity of
> kvm_mmu_notifier_clear_flush_young().
>
> Speaking of which, what would it take to drop mmu_notifier_clear_flush_young()
> entirely?

That's not my call :)

Adding Johannes.

> I.e. why can MGLRU tolerate stale information but !MGLRU cannot?

Good question. The native clear API doesn't flush:

  int ptep_clear_flush_young(struct vm_area_struct *vma,
 unsigned long address, pte_t *ptep)
  {
  /*
   * On x86 CPUs, clearing the accessed bit without a TLB flush
   * doesn't cause data corruption. [ It could cause incorrect
   * page aging and the (mistaken) reclaim of hot pages, but the
   * chance of that should be relatively low. ]
   *
   * So as a performance optimization don't flush the TLB when
   * clearing the accessed bit, it will eventually be flushed by
   * a context switch or a VM operation anyway. [ In the rare
   * event of it not getting flushed for a long time the delay
   * shouldn't really matter because there's no real memory
   * pressure for swapout to react to. ]
   */
  return ptep_test_and_clear_young(vma, address, ptep);
  }

> If
> we simply deleted mmu_notifier_clear_flush_young() and used 
> mmu_notifier_clear_young()
> instead, would anyone notice, let alone care?

I tend to agree.

> > > > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, 
> > > > struct kobj_attribute *attr, c
> > > >   if (arch_has_hw_nonleaf_pmd_young() && 
> > > > get_cap(LRU_GEN_NONLEAF_YOUNG))
> > > >   caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> > > >
> > > > + if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > > > + caps |= BIT(LRU_GEN_SPTE_WALK);
> > >
> > > As alluded to in patch 1, unless batching the walks even if KVM does 
> > > _not_ support
> > > a lockless walk is somehow _worse_ than using the existing 
> > > mmu_notifier_clear_flush_young(),
> > > I

Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 12:21 PM Sean Christopherson  wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 11:47 AM Sean Christopherson  
> > wrote:
> > >
> > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > On Thu, Feb 23, 2023 at 11:24 AM Sean Christopherson 
> > > >  wrote:
> > > > >
> > > > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > > > On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson 
> > > > > >  wrote:
> > > > > > > > I'll take a look at that series. clear_bit() probably won't 
> > > > > > > > cause any
> > > > > > > > practical damage but is technically wrong because, for example, 
> > > > > > > > it can
> > > > > > > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just 
> > > > > > > > fail
> > > > > > > > in this case, obviously.)
> > > > > > >
> > > > > > > Eh, not really.  By that argument, clearing an A-bit in a huge 
> > > > > > > PTE is also technically
> > > > > > > wrong because the target gfn may or may not have been accessed.
> > > > > >
> > > > > > Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> > > > > > technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> > > > > > is not.)
> > > > > >
> > > > > > > The only way for
> > > > > > > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a 
> > > > > > > huge PTE, but was
> > > > > > > replaced between the "is leaf" and the clear_bit().
> > > > > >
> > > > > > I think there is a misunderstanding here. Let me be more specific:
> > > > > > 1. Clearing the A-bit in a non-leaf entry is technically wrong 
> > > > > > because
> > > > > > that's not our intention.
> > > > > > 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> > > > > > become a non-leaf PMD, which causes 1) above, and therefore is
> > > > > > technically wrong.
> > > > > > 3. I don't think 2) could do any real harm, so practically no
> > > > > > problem.
> > > > > > 4. cmpxchg() can avoid 2).
> > > > > >
> > > > > > Does this make sense?
> > > > >
> > > > > I understand what you're saying, but clearing an A-bit on a non-leaf 
> > > > > PMD that
> > > > > _just_ got converted from a leaf PMD is "wrong" if and only if the 
> > > > > intended
> > > > > behavior is nonsensical.
> > > >
> > > > Sorry, let me rephrase:
> > > > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > > > we didn't make sure there is the A-bit there --  the bit we are
> > > > clearing can be something else. (Yes, we know it's not, but we didn't
> > > > define this behavior, e.g., a macro to designate that bit for non-leaf
> > > > entries.
> > >
> > > Heh, by that definition, anything and everything is "technically wrong".
> >
> > I really don't see how what I said, in our context,
> >
> >   "Clearing the A-bit in a non-leaf entry is technically wrong because
> > we didn't make sure there is the A-bit there"
> >
> > can infer
> >
> >   "anything and everything is "technically wrong"."
> >
> > And how what I said can be an analogy to
> >
> >   "An Intel CPU might support SVM, even though we know no such CPUs
> > exist, so requiring AMD or Hygon to enable SVM is technically wrong."
> >
> > BTW, here is a bug caused by clearing the A-bit in non-leaf entries in
> > a different scenario:
> > https://lore.kernel.org/linux-mm/20221123064510.16225-1-jgr...@suse.com/
> >
> > Let's just agree to disagree.
>
> No, because I don't want anyone to leave with the impression that relying on 
> the
> Accessed bit to uniformly exist (or not) at all levels in the TDP MMU is 
> somehow
> technically wrong.  The link you posted is about running as a Xen guest, and 
> is
> in arch-agnostic code.  That is wildly different than what we are talking 
> about
> here, where the targets are strictly limited to x86-64 TDP, and the existence 
> of
> the Accessed bit is architecturally defined.

Yes, I see now.

Sorry to say this, but this is all I needed to hear: "the existence of
the Accessed bit is architecturally defined". (I didn't know, and didn't
see, that this is what you meant.)

> In this code, there are exactly two flavors of paging that can be in use, and
> using clear_bit() to clear shadow_accessed_mask is safe for both, full stop.


Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 11:47 AM Sean Christopherson  wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 11:24 AM Sean Christopherson  
> > wrote:
> > >
> > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson 
> > > >  wrote:
> > > > > > I'll take a look at that series. clear_bit() probably won't cause 
> > > > > > any
> > > > > > practical damage but is technically wrong because, for example, it 
> > > > > > can
> > > > > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > > > > > in this case, obviously.)
> > > > >
> > > > > Eh, not really.  By that argument, clearing an A-bit in a huge PTE is 
> > > > > also technically
> > > > > wrong because the target gfn may or may not have been accessed.
> > > >
> > > > Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> > > > technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> > > > is not.)
> > > >
> > > > > The only way for
> > > > > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge 
> > > > > PTE, but was
> > > > > replaced between the "is leaf" and the clear_bit().
> > > >
> > > > I think there is a misunderstanding here. Let me be more specific:
> > > > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > > > that's not our intention.
> > > > 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> > > > become a non-leaf PMD, which causes 1) above, and therefore is
> > > > technically wrong.
> > > > 3. I don't think 2) could do any real harm, so practically no
> > > > problem.
> > > > 4. cmpxchg() can avoid 2).
> > > >
> > > > Does this make sense?
> > >
> > > I understand what you're saying, but clearing an A-bit on a non-leaf PMD 
> > > that
> > > _just_ got converted from a leaf PMD is "wrong" if and only if the 
> > > intended
> > > behavior is nonsensical.
> >
> > Sorry, let me rephrase:
> > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > we didn't make sure there is the A-bit there --  the bit we are
> > clearing can be something else. (Yes, we know it's not, but we didn't
> > define this behavior, e.g., a macro to designate that bit for non-leaf
> > entries.
>
> Heh, by that definition, anything and everything is "technically wrong".

I really don't see how what I said, in our context,

  "Clearing the A-bit in a non-leaf entry is technically wrong because
we didn't make sure there is the A-bit there"

can infer

  "anything and everything is "technically wrong"."

And how what I said can be an analogy to

  "An Intel CPU might support SVM, even though we know no such CPUs
exist, so requiring AMD or Hygon to enable SVM is technically wrong."

BTW, here is a bug caused by clearing the A-bit in non-leaf entries in
a different scenario:
https://lore.kernel.org/linux-mm/20221123064510.16225-1-jgr...@suse.com/

Let's just agree to disagree.


Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 11:24 AM Sean Christopherson  wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson  
> > wrote:
> > > > I'll take a look at that series. clear_bit() probably won't cause any
> > > > practical damage but is technically wrong because, for example, it can
> > > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > > > in this case, obviously.)
> > >
> > > Eh, not really.  By that argument, clearing an A-bit in a huge PTE is 
> > > also technically
> > > wrong because the target gfn may or may not have been accessed.
> >
> > Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> > technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> > is not.)
> >
> > > The only way for
> > > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge 
> > > PTE, but was
> > > replaced between the "is leaf" and the clear_bit().
> >
> > I think there is a misunderstanding here. Let me be more specific:
> > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > that's not our intention.
> > 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> > become a non-leaf PMD, which causes 1) above, and therefore is
> > technically wrong.
> > 3. I don't think 2) could do any real harm, so practically no problem.
> > 4. cmpxchg() can avoid 2).
> >
> > Does this make sense?
>
> I understand what you're saying, but clearing an A-bit on a non-leaf PMD that
> _just_ got converted from a leaf PMD is "wrong" if and only if the intended
> behavior is nonsensical.

Sorry, let me rephrase:
1. Clearing the A-bit in a non-leaf entry is technically wrong because
we didn't make sure there is the A-bit there --  the bit we are
clearing can be something else. (Yes, we know it's not, but we didn't
define this behavior, e.g., a macro to designate that bit for non-leaf
entries. Also I didn't check the spec -- does EPT actually support the
A-bit in non-leaf entries? My guess is that NPT does.)
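
The difference can be shown with a plain userspace model (illustrative
only; the "leaf"/"non-leaf" encoding below is invented for the sketch):
clear_bit() unconditionally clears the bit of whatever value the slot
holds *now*, while cmpxchg() refuses to touch the slot once it no
longer holds the value that was inspected.

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define ACCESSED (1ULL << 5)            /* invented bit layout */

  int main(void)
  {
          _Atomic uint64_t spte = 0x1000 | ACCESSED;      /* "leaf" */
          uint64_t seen = atomic_load(&spte);     /* walker samples it */

          /* Concurrently, the entry is replaced (e.g. leaf -> non-leaf). */
          atomic_store(&spte, 0x2000 | ACCESSED);

          /* clear_bit()-style: blindly clears the bit of the new entry. */
          atomic_fetch_and(&spte, ~ACCESSED);
          printf("after blind clear: %#llx\n",
                 (unsigned long long)atomic_load(&spte));

          /* cmpxchg()-style: fails because the entry changed -- a no-op. */
          atomic_store(&spte, 0x2000 | ACCESSED);
          if (!atomic_compare_exchange_strong(&spte, &seen,
                                              seen & ~ACCESSED))
                  printf("cmpxchg failed, entry untouched: %#llx\n",
                         (unsigned long long)atomic_load(&spte));
          return 0;
  }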


Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson  wrote:
>
> On Thu, Feb 16, 2023, Yu Zhao wrote:
> > An existing selftest can quickly demonstrate the effectiveness of this
> > patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM:
>
> Not my area of maintenance, but a non-existent changelog (for all intents and
> purposes) for a change of this size and complexity is not acceptable.

Will fix.

> >   $ sudo max_guest_memory_test -c 64 -m 250 -s 250
> >
> >   MGLRU    run2
> >   --------------
> >   Before  ~600s
> >   After    ~50s
> >   Off     ~250s
> >
> >   kswapd (MGLRU before)
> > 100.00%  balance_pgdat
> >   100.00%  shrink_node
> > 100.00%  shrink_one
> >   99.97%  try_to_shrink_lruvec
> > 99.06%  evict_folios
> >   97.41%  shrink_folio_list
> > 31.33%  folio_referenced
> >   31.06%  rmap_walk_file
> > 30.89%  folio_referenced_one
> >   20.83%  __mmu_notifier_clear_flush_young
> > 20.54%  kvm_mmu_notifier_clear_flush_young
> >   =>  19.34%  _raw_write_lock
> >
> >   kswapd (MGLRU after)
> > 100.00%  balance_pgdat
> >   100.00%  shrink_node
> > 100.00%  shrink_one
> >   99.97%  try_to_shrink_lruvec
> > 99.51%  evict_folios
> >   71.70%  shrink_folio_list
> > 7.08%  folio_referenced
> >   6.78%  rmap_walk_file
> > 6.72%  folio_referenced_one
> >   5.60%  lru_gen_look_around
> >   =>1.53%  __mmu_notifier_test_clear_young
>
> Do you happen to know how much of the improvement is due to batching, and how
> much is due to using a walkless walk?

No. I have three benchmarks running at the moment:
1. Windows SQL server guest on x86 host,
2. Apache Spark guest on arm64 host, and
3. Memcached guest on ppc64 host.

If you are really interested in that, I can reprioritize -- I need to
stop 1) and use that machine to get the number for you.

> > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, 
> > struct kobj_attribute *attr, c
> >   if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> >   caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> >
> > + if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > + caps |= BIT(LRU_GEN_SPTE_WALK);
>
> As alluded to in patch 1, unless batching the walks even if KVM does _not_ 
> support
> a lockless walk is somehow _worse_ than using the existing 
> mmu_notifier_clear_flush_young(),
> I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK.  
> Or
> if we want to avoid batching when there are no mmu_notifier listeners, probe
> mmu_notifiers.  But don't call into KVM directly.

I'm not sure I fully understand. Let's present the problem on the MM
side: assuming KVM supports lockless walks, batching can still be
worse (very unlikely), because GFNs can exhibit no memory locality at
all. So this option allows userspace to disable batching.

I fully understand why you don't want MM to call into KVM directly. No
acceptable ways to set up a clear interface between MM and KVM other
than the MMU notifier?


Re: [PATCH mm-unstable v1 1/5] mm/kvm: add mmu_notifier_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 10:14 AM Sean Christopherson  wrote:
>
> On Thu, Feb 16, 2023, Yu Zhao wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9c60384b5ae0..1b465df4a93d 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -875,6 +875,63 @@ static int kvm_mmu_notifier_clear_young(struct 
> > mmu_notifier *mn,
> >   return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> >  }
> >
> > +static bool kvm_test_clear_young(struct kvm *kvm, unsigned long start,
> > +  unsigned long end, unsigned long *bitmap)
> > +{
> > + int i;
> > + int key;
> > + bool success = true;
> > +
> > + trace_kvm_age_hva(start, end);
> > +
> > + key = srcu_read_lock(&kvm->srcu);
> > +
> > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > + struct interval_tree_node *node;
> > + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_hva_range(node, slots, start, end - 
> > 1) {
> > + gfn_t lsb_gfn;
> > + unsigned long hva_start, hva_end;
> > + struct kvm_gfn_range range = {
> > + .slot = container_of(node, struct 
> > kvm_memory_slot,
> > +  
> > hva_node[slots->node_idx]),
> > + };
> > +
> > + hva_start = max(start, range.slot->userspace_addr);
> > + hva_end = min(end - 1, range.slot->userspace_addr +
> > +range.slot->npages * PAGE_SIZE 
> > - 1);
> > +
> > + range.start = hva_to_gfn_memslot(hva_start, 
> > range.slot);
> > + range.end = hva_to_gfn_memslot(hva_end, range.slot) + 
> > 1;
> > +
> > + if (WARN_ON_ONCE(range.end <= range.start))
> > + continue;
>
> Extend __kvm_handle_hva_range() instead of copy-pasting.  At a very quick 
> glance,
> I believe all that is needed is (minus sanity checks):

Yes, will do.

I do need to add one more parameter to kvm_gfn_range, because that's
what the current kvm_arch_test_clear_young() needs, assuming that
function is acceptable.

Also, just a side note, from MM's POV, the following in
__kvm_handle_hva_range() seems to forget to handle end == 0, if that's
possible?

  hva_end = min(range->end, slot->userspace_addr + (slot->npages << PAGE_SHIFT));

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d255964ec331..3296ae2cf6fa 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -544,6 +544,7 @@ struct kvm_hva_range {
> hva_handler_t handler;
> on_lock_fn_t on_lock;
> on_unlock_fn_t on_unlock;
> +   bool lockless;
> bool flush_on_ret;
> bool may_block;
>  };
> @@ -616,7 +617,7 @@ static __always_inline int __kvm_handle_hva_range(struct 
> kvm *kvm,
> gfn_range.end = hva_to_gfn_memslot(hva_end + 
> PAGE_SIZE - 1, slot);
> gfn_range.slot = slot;
>
> -   if (!locked) {
> +   if (!range->lockless && !locked) {
> locked = true;
> KVM_MMU_LOCK(kvm);
> if (!IS_KVM_NULL_FN(range->on_lock))
>
> > +
> > + /* see the comments on the generic 
> > kvm_arch_has_test_clear_young() */
> > + lsb_gfn = hva_to_gfn_memslot(end - 1, range.slot);
> > +
> > + success = kvm_arch_test_clear_young(kvm, &range, lsb_gfn, bitmap);
> > + if (!success)
> > + break;
> > + }
> > + }
> > +
> > + srcu_read_unlock(&kvm->srcu, key);
> > +
> > + return success;
> > +}


Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson  wrote:
>
> On Wed, Feb 22, 2023, Yu Zhao wrote:
> > On Fri, Feb 17, 2023 at 9:27 AM Sean Christopherson  
> > wrote:
> > >
> > > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > > b/arch/x86/include/asm/kvm_host.h
> > > > index 6aaae18f1854..d2995c9e8f07 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1367,6 +1367,12 @@ struct kvm_arch {
> > > >*  the MMU lock in read mode + the tdp_mmu_pages_lock or
> > > >*  the MMU lock in write mode
> > > >*
> > > > +  * kvm_arch_test_clear_young() is a special case. It relies on two
> > >
> > > No, it's not.  The TDP MMU already employs on RCU and CMPXCHG.
> >
> > It is -- you read it out of context :)
>
> Ah, the special case is that it's fully lockless.  That's still not all that
> special, e.g. see kvm_tdp_mmu_walk_lockless_{begin,end}().
>
> >  * For reads, this list is protected by:
> >  *  the MMU lock in read mode + RCU or
> >  *  the MMU lock in write mode
> >  *
> >  * For writes, this list is protected by:
> >  *  the MMU lock in read mode + the tdp_mmu_pages_lock or
> >  *  the MMU lock in write mode
> >  *
> >  * kvm_arch_test_clear_young() is a special case.
> >  ...
> >
> > struct list_head tdp_mmu_roots;
> >
> > > Just drop the
> > > entire comment.
> >
> > Let me move it into kvm_arch_test_clear_young().
>
> No, I do not want kvm_arch_test_clear_young(), or any other one-off function, 
> to
> be "special".  I love the idea of a lockless walk, but I want it to be a 
> formal,
> documented way to walk TDP MMU roots.  I.e. add macro to go with 
> for_each_tdp_mmu_root()
> and the yield-safe variants.

I see what you mean now. will do.

> /* blah blah blah */
> #define for_each_tdp_mmu_root_lockless(_kvm, _root, _as_id) \
> list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)  \
> if (refcount_read(&_root->tdp_mmu_root_count) && \
> kvm_mmu_page_as_id(_root) != _as_id) {  \
> } else
>
> > Also I want to be clear:
> > 1. We can't just focus on here and now; we need to consider the distant 
> > future.
>
> I 100% agree, but those words need to be backed up by actions.  This series is
> littered with code that is not maintainable long term, e.g. open coding stuff
> that belongs in helpers and/or for which KVM already provides helpers, 
> copy-pasting
> __kvm_handle_hva_range() instead of extending it to have a lockless option, 
> poking
> directly into KVM from mm/ code, etc.
>
> I apologize for being so blunt.  My intent isn't to be rude/snarky, it's to 
> set
> very clear expectations for getting any of these changes merged.

No worries at all. I appreciate you directly telling me how you prefer
it to be done, and that makes the job easier for both of us. Please do
bear with me though, because I'm very unfamiliar with the KVM side of
expectations.

> I absolutely do
> want to land improvements to KVM's test+clear young flows, but it needs to be
> done
> in a way that is maintainable and doesn't saddle KVM with more tech debt.

Agreed.

> > 2. From my POV, "see the comments on ..." is like the index of a book.
>
> And my _very_ strong preference is to provide the "index" via code, not 
> comments.

Will do.

> > > Clearing a single bit doesn't need a CMPXCHG.  Please weigh in on a 
> > > relevant series
> > > that is modifying the aging flows[*], I want to have exactly one helper 
> > > for aging
> > > TDP MMU SPTEs.
> > >
> > > [*] 
> > > https://lore.kernel.org/all/20230211014626.3659152-5-vipi...@google.com
> >
> > I'll take a look at that series. clear_bit() probably won't cause any
> > practical damage but is technically wrong because, for example, it can
> > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > in this case, obviously.)
>
> Eh, not really.  By that argument, clearing an A-bit in a huge PTE is also 
> technically
> wrong because the target gfn may or may not have been accessed.

Sorry, I don't understand. You mean clear_bit() on a huge PTE is
technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
is not.)

> The only way for
> KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge PTE, 
> but was
> replaced between the "is leaf" and the clear_bit().

I think there is a misunderstanding here. Let me be more specific:
1. Clearing the A-bit in a non-leaf entry is technically wrong because
that's not our intention.
2. When we try to clear_bit() on a leaf PMD, it can at the same time
become a non-leaf PMD, which causes 1) above, and therefore is
technically wrong.
3. I don't think 2) could do any real harm, so practically no problem.
4. cmpxchg() can avoid 2).

Does this make sense?

Thanks.


Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

2023-02-23 Thread Yu Zhao
On Thu, Feb 23, 2023 at 2:03 AM Marc Zyngier  wrote:
>
> On Thu, 23 Feb 2023 03:58:47 +0000,
> Yu Zhao  wrote:
> >
> > On Fri, Feb 17, 2023 at 2:00 AM Marc Zyngier  wrote:
> > >
> > > On Fri, 17 Feb 2023 04:21:28 +,
> > > Yu Zhao  wrote:
> > > >
> > > > On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao  wrote:
> > > > >
> > > > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > > > in KVM page tables.
> > >
> > > I'm really interested in how you can back this statement. 90% of the
> > > HW I have access to is not FEAT_HWAFDB capable, either because it
> > > predates the feature or because the feature is too buggy to be useful.
> >
This is my experience too -- most devices are pre v8.2.
>
> And yet you have no issue writing the above. Puzzling.

That's to the best of my knowledge. Mind enlightening me?

> > > Do you have numbers?
> >
> > Let's do a quick market survey by segment. The following only applies
> > to ARM CPUs:
> >
> > 1. Phones: none of the major Android phone vendors sell phones running
> > VMs; no other major Linux phone vendors.
>
> Maybe you should have a reality check and look at what your own
> employer is shipping.

Which model? I'll look it up and see how/why I missed it.

> > 2. Laptops: only a very limited number of Chromebooks run VMs, namely
> > ACRVM. No other major Linux laptop vendors.
>
> Again, your employer disagree.

What do you mean? Sorry, I'm a little surprised here... I do know *a
lot* about Chromebooks.

> > 3. Desktops: no major Linux desktop vendors.
>
> My desktop disagree (I send this from my arm64 desktop VM ).

A model number please?

> > 4. Embedded/IoT/Router: no major Linux vendors run VMs (Android Auto
> > can be a VM guest on QNX host).
>
> This email is brought to you via a router VM on an arm64 box.

More details?

> > 5. Cloud: this is where the vast majority VMs come from. Among the
> > vendors available to the general public, Ampere is the biggest player.
> > Here [1] is a list of its customers. The A-bit works well even on its
> > EVT products (Neoverse cores).
>
> Just the phone stuff dwarfs the number of cloud hosts.

Please point me to something that I can work on so that I wouldn't
sound so ignorant next time.


Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

2023-02-22 Thread Yu Zhao
On Fri, Feb 17, 2023 at 9:27 AM Sean Christopherson  wrote:
>
> On Thu, Feb 16, 2023, Yu Zhao wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 6aaae18f1854..d2995c9e8f07 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1367,6 +1367,12 @@ struct kvm_arch {
> >*  the MMU lock in read mode + the tdp_mmu_pages_lock or
> >*  the MMU lock in write mode
> >*
> > +  * kvm_arch_test_clear_young() is a special case. It relies on two
>
> No, it's not.  The TDP MMU already employs on RCU and CMPXCHG.

It is -- you read it out of context :)

 * For reads, this list is protected by:
 *  the MMU lock in read mode + RCU or
 *  the MMU lock in write mode
 *
 * For writes, this list is protected by:
 *  the MMU lock in read mode + the tdp_mmu_pages_lock or
 *  the MMU lock in write mode
 *
 * kvm_arch_test_clear_young() is a special case.
 ...

struct list_head tdp_mmu_roots;

> Just drop the
> entire comment.

Let me move it into kvm_arch_test_clear_young().

Also I want to be clear:
1. We can't just focus on here and now; we need to consider the distant future.
2. From my POV, "see the comments on ..." is like the index of a book.

> > +  * techniques, RCU and cmpxchg, to safely test and clear the accessed
> > +  * bit without taking the MMU lock. The former protects KVM page 
> > tables
> > +  * from being freed while the latter clears the accessed bit 
> > atomically
> > +  * against both the hardware and other software page table walkers.
> > +  *
> >* Roots will remain in the list until their tdp_mmu_root_count
> >* drops to zero, at which point the thread that decremented the
> >* count to zero should removed the root from the list and clean
> > @@ -2171,4 +2177,25 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
> > unsigned long npages);
> >KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \
> >KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
> >
> > +extern u64 __read_mostly shadow_accessed_mask;
> > +
> > +/*
> > + * Returns true if A/D bits are supported in hardware and are enabled by 
> > KVM.
> > + * When enabled, KVM uses A/D bits for all non-nested MMUs.  Because L1 can
> > + * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle 
> > the
> > + * scenario where KVM is using A/D bits for L1, but not L2.
> > + */
> > +static inline bool kvm_ad_enabled(void)
> > +{
> > + return shadow_accessed_mask;
> > +}
>
> Absolutely not.  This information is not getting directly exposed outside of 
> KVM.

Will do.

> > +
> > +/* see the comments on the generic kvm_arch_has_test_clear_young() */
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_X86_64) &&
> > +(!IS_REACHABLE(CONFIG_KVM) || (kvm_ad_enabled() && 
> > tdp_enabled));
> > +}
>
> Pending the justification for why this is KVM-only

Nothing else has *_young()... IOW, KVM is the only user of *_young().

> I would strongly prefer we
> find a way to have the mmu_notifier framework track whether or not any 
> listeners
> have a test_clear_young().  E.g. have KVM nullify its hook during module load.

It's already done that way. This function is just for the caller to
avoid unnecessary trips into the MMU notifier on archs that don't
support this capability. (The caller would find it unsupported.)

> > +
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index 6f54dc9409c9..0dc7fed1f3fd 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
> >  extern u64 __read_mostly shadow_nx_mask;
> >  extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> >  extern u64 __read_mostly shadow_user_mask;
> > -extern u64 __read_mostly shadow_accessed_mask;
> >  extern u64 __read_mostly shadow_dirty_mask;
> >  extern u64 __read_mostly shadow_mmio_value;
> >  extern u64 __read_mostly shadow_mmio_mask;
> > @@ -247,17 +246,6 @@ static inline bool is_shadow_present_pte(u64 pte)
> >   return !!(pte & SPTE_MMU_PRESENT_MASK);
> >  }
> >
> > -/*
> > - * Return

Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

2023-02-22 Thread Yu Zhao
On Fri, Feb 17, 2023 at 9:00 AM Sean Christopherson  wrote:
>
> On Fri, Feb 17, 2023, Oliver Upton wrote:
> > Hi Yu,
> >
> > scripts/get_maintainers.pl is your friend for getting the right set of
> > emails for a series :) Don't know about others, but generally I would
> > prefer to be Cc'ed on an entire series (to gather context) than just an
> > individual patch.
>
> +1
>
> >
> > On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > in KVM page tables.
>
> At least for the x86 changes, please read 
> Documentation/process/maintainer-tip.rst
> and rewrite the changelogs.

I see -- will remove "this patch".

> > > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > > the accessed bit without taking the MMU lock. The former protects KVM
> > > page tables from being freed while the latter clears the accessed bit
> > > atomically against both the hardware and other software page table
> > > walkers.
> > >
> > > Signed-off-by: Yu Zhao 
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h   |  7 +++
> > >  arch/arm64/include/asm/kvm_pgtable.h|  8 +++
> > >  arch/arm64/include/asm/stage2_pgtable.h | 43 ++
> > >  arch/arm64/kvm/arm.c|  1 +
> > >  arch/arm64/kvm/hyp/pgtable.c| 51 ++--
> > >  arch/arm64/kvm/mmu.c| 77 -
> > >  6 files changed, 141 insertions(+), 46 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > > b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..572bcd321586 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> > >  void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> > >  bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> > >
> > > +/* see the comments on the generic kvm_arch_has_test_clear_young() */
>
> Please eliminate all of these "see the comments on blah", in every case they 
> do
> nothing more than redirect the reader to something they're likely already 
> aware of.
>
> > > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > > +static inline bool kvm_arch_has_test_clear_young(void)
> > > +{
> > > +   return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && 
> > > !is_protected_kvm_enabled();
> > > +}
>
> ...
>
> > Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
> > My expectation is that we should provide an implementation that returns
> > false if !CONFIG_KVM, avoiding the need to repeat that bit in every
> > single implementation of the function.
>
> mm/vmscan.c uses kvm_arch_has_test_clear_young().  I have opinions on that, 
> but
> I'll hold off on expressing them until there's actual justification presented
> somewhere.
>
> Yu, this series and each patch needs a big pile of "why".  I get that the goal
> is to optimize memory oversubscribe, but there needs to be justification for
> why this is KVM only, why nested VMs and !A/D hardware are out of scope, why 
> yet
> another mmu_notifier hook is being added, etc.

I totally agree.

This is an optimization, not a bug fix. It can't be justified without
performance numbers from some common use cases. That burden of proof
clearly rests on me -- I will follow up on that.

For now, I want to make sure the methodical part is clear:
1. We only have limited resources and we need to prioritize major use cases.
2. We can only improve one thing at a time and we can't cover
everything at the same time.
3. We need to focus on the return on investment and the future.

I hope everyone by now agrees with my "the vast majority of VMs ..."
assertion. If not, I'm happy to revisit that [1]. If so, the next step
would be whether we want to focus on the vast majority first. I think
this naturally answers why the nested VMs and !AD h/w are out of
scope, at the moment (I didn't spell this out; probably I should in
v2). After we have taken the first step, we can probably decide
whether there are enough resources and demand to cover the lower
return-on-investment part (more complexity but less common use cases).

[1] https://lore.kernel.org/linux-mm/20230217041230.2417228-1-yuz...@google.com/


Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

2023-02-22 Thread Yu Zhao
On Fri, Feb 17, 2023 at 2:09 AM Oliver Upton  wrote:
>
> Hi Yu,
>
> scripts/get_maintainers.pl is your friend for getting the right set of
> emails for a series :) Don't know about others, but generally I would
> prefer to be Cc'ed on an entire series (to gather context) than just an
> individual patch.

Will do. Thank you.

> On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > VMs that are not pKVM and run on hardware that sets the accessed bit
> > in KVM page tables.
> >
> > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > the accessed bit without taking the MMU lock. The former protects KVM
> > page tables from being freed while the latter clears the accessed bit
> > atomically against both the hardware and other software page table
> > walkers.
> >
> > Signed-off-by: Yu Zhao 
> > ---
> >  arch/arm64/include/asm/kvm_host.h   |  7 +++
> >  arch/arm64/include/asm/kvm_pgtable.h|  8 +++
> >  arch/arm64/include/asm/stage2_pgtable.h | 43 ++
> >  arch/arm64/kvm/arm.c|  1 +
> >  arch/arm64/kvm/hyp/pgtable.c| 51 ++--
> >  arch/arm64/kvm/mmu.c| 77 -
> >  6 files changed, 141 insertions(+), 46 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..572bcd321586 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> >  void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> >  bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> >
> > +/* see the comments on the generic kvm_arch_has_test_clear_young() */
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && 
> > !is_protected_kvm_enabled();
> > +}
>
> Why does the lack of FEAT_HAFDBS preclude the use of the test-and-clear
> notifier?

This all comes down to the return on investment. We could
theoretically make it work but the complexity and the poor performance
would outweigh the benefits -- VM memory overcommit mostly happens in
Cloud and none of the major Cloud vendors use pre v8.2 [1].

[1] 
https://lore.kernel.org/linux-mm/caouhufbbs2gg+dpvsow_n_kx7fwdzvpdjuvlzko-bdq8vfd...@mail.gmail.com/

> On implementations without FEAT_HAFDBS, hardware will generate a data
> abort for software to set the access flag. Regardless of whether
> software or hardware is responsible for updating the PTE that
> information is available in the page tables.

Agreed, the s/w emulation of the A-bit has poor performance. Looking
forward, businesses that wish to overcommit host memory will
eventually all move onto v8.2 and later. This is another reason
not to worry about pre v8.2 (or 32-bit for that matter).

> Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
> My expectation is that we should provide an implementation that returns
> false if !CONFIG_KVM, avoiding the need to repeat that bit in every
> single implementation of the function.

We do have that default implementation. But we still need to disable
this implementation when !CONFIG_KVM (it isn't protected by #ifdef
CONFIG_KVM).

> > +
> >  #endif /* __ARM64_KVM_HOST_H__ */
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
> > b/arch/arm64/include/asm/kvm_pgtable.h
> > index 63f81b27a4e3..8c9a04388c88 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -105,6 +105,7 @@ static inline bool kvm_level_supports_block_mapping(u32 
> > level)
> >   * @put_page:Decrement the refcount on a page. 
> > When the
> >   *   refcount reaches 0 the page is automatically
> >   *   freed.
> > + * @put_page_rcu:RCU variant of put_page().
> >   * @page_count:  Return the refcount of a page.
> >   * @phys_to_virt:Convert a physical address into a virtual
> >   *   address mapped in the current context.
> > @@ -122,6 +123,7 @@ struct kvm_pgtable_mm_ops {
> >   void(*free_removed_table)(void *addr, u32 level);
> >   void(*get_page)(void *addr);
> >   void(*put_page)(void *addr);
> > + void  

Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

2023-02-22 Thread Yu Zhao
On Fri, Feb 17, 2023 at 2:00 AM Marc Zyngier  wrote:
>
> On Fri, 17 Feb 2023 04:21:28 +0000,
> Yu Zhao  wrote:
> >
> > On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao  wrote:
> > >
> > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > in KVM page tables.
>
> I'm really interested in how you can back this statement. 90% of the
> HW I have access to is not FEAT_HWAFDB capable, either because it
> predates the feature or because the feature is too buggy to be useful.

This is my experience too -- most devices are pre v8.2.

> Do you have numbers?

Let's do a quick market survey by segment. The following only applies
to ARM CPUs:

1. Phones: none of the major Android phone vendors sell phones running
VMs; no other major Linux phone vendors.
2. Laptops: only a very limited number of Chromebooks run VMs, namely
ACRVM. No other major Linux laptop vendors.
3. Desktops: no major Linux desktop vendors.
4. Embedded/IoT/Router: no major Linux vendors run VMs (Android Auto
can be a VM guest on QNX host).
5. Cloud: this is where the vast majority of VMs come from. Among the
vendors available to the general public, Ampere is the biggest player.
Here [1] is a list of its customers. The A-bit works well even on its
EVT products (Neoverse cores).

[1] https://en.wikipedia.org/wiki/Ampere_Computing

> > > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > > the accessed bit without taking the MMU lock. The former protects KVM
> > > page tables from being freed while the latter clears the accessed bit
> > > atomically against both the hardware and other software page table
> > > walkers.
> > >
> > > Signed-off-by: Yu Zhao 
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h   |  7 +++
> > >  arch/arm64/include/asm/kvm_pgtable.h|  8 +++
> > >  arch/arm64/include/asm/stage2_pgtable.h | 43 ++
> > >  arch/arm64/kvm/arm.c|  1 +
> > >  arch/arm64/kvm/hyp/pgtable.c| 51 ++--
> > >  arch/arm64/kvm/mmu.c| 77 -
> > >  6 files changed, 141 insertions(+), 46 deletions(-)
> >
> > Adding Marc and Will.
> >
> > Can you please add other interested parties that I've missed?
>
> The MAINTAINERS file has it all:
>
> KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
> M:  Marc Zyngier 
> M:  Oliver Upton 
> R:  James Morse 
> R:  Suzuki K Poulose 
> R:  Zenghui Yu 
> L:  kvm...@lists.linux.dev
>
> May I suggest that you repost your patch and Cc the interested
> parties yourself? I guess most folks will want to see this in context,
> and not as a random, isolated change with no rationale.

This clarified it. Thanks. (I was hesitant to spam people with the
entire series containing changes to other architectures.)


Re: [PATCH mm-unstable v1 4/5] kvm/powerpc: add kvm_arch_test_clear_young()

2023-02-16 Thread Yu Zhao
On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao  wrote:
>
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not nested and run on hardware with Radix MMU enabled.
>
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
>
> Signed-off-by: Yu Zhao 
> ---
>  arch/powerpc/include/asm/kvm_host.h| 18 ++
>  arch/powerpc/include/asm/kvm_ppc.h | 14 +
>  arch/powerpc/kvm/book3s.c  |  7 +++
>  arch/powerpc/kvm/book3s.h  |  2 +
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 78 +-
>  arch/powerpc/kvm/book3s_hv.c   | 10 ++--
>  6 files changed, 110 insertions(+), 19 deletions(-)

Adding Michael, Nicholas and Christophe.

I'm not sure who I should add for this patch. Can you please add any
interested parties that I've missed?

Thank you.


Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

2023-02-16 Thread Yu Zhao
On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao  wrote:
>
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not pKVM and run on hardware that sets the accessed bit
> in KVM page tables.
>
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
>
> Signed-off-by: Yu Zhao 
> ---
>  arch/arm64/include/asm/kvm_host.h   |  7 +++
>  arch/arm64/include/asm/kvm_pgtable.h|  8 +++
>  arch/arm64/include/asm/stage2_pgtable.h | 43 ++
>  arch/arm64/kvm/arm.c|  1 +
>  arch/arm64/kvm/hyp/pgtable.c| 51 ++--
>  arch/arm64/kvm/mmu.c| 77 -
>  6 files changed, 141 insertions(+), 46 deletions(-)

Adding Marc and Will.

Can you please add other interested parties that I've missed?

Thanks.


Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

2023-02-16 Thread Yu Zhao
On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao  wrote:
>
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not nested and run on hardware that sets the accessed bit
> in TDP MMU page tables.
>
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
>
> Signed-off-by: Yu Zhao 
> ---
>  arch/x86/include/asm/kvm_host.h | 27 ++
>  arch/x86/kvm/mmu/spte.h | 12 --
>  arch/x86/kvm/mmu/tdp_mmu.c  | 41 +
>  3 files changed, 68 insertions(+), 12 deletions(-)

Adding Sean and David.

Can you please add other interested parties that I've missed?

Thanks.


[PATCH mm-unstable v1 4/5] kvm/powerpc: add kvm_arch_test_clear_young()

2023-02-16 Thread Yu Zhao
This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not nested and run on hardware with Radix MMU enabled.

It relies on two techniques, RCU and cmpxchg, to safely test and clear
the accessed bit without taking the MMU lock. The former protects KVM
page tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

Signed-off-by: Yu Zhao 
---
 arch/powerpc/include/asm/kvm_host.h| 18 ++
 arch/powerpc/include/asm/kvm_ppc.h | 14 +
 arch/powerpc/kvm/book3s.c  |  7 +++
 arch/powerpc/kvm/book3s.h  |  2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 78 +-
 arch/powerpc/kvm/book3s_hv.c   | 10 ++--
 6 files changed, 110 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index caea15dcb91d..996850029ce0 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -886,4 +886,22 @@ static inline void kvm_arch_exit(void) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
+static inline int kvmppc_radix_possible(void)
+{
+   return cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled();
+}
+
+static inline bool kvmhv_on_pseries(void)
+{
+   return IS_ENABLED(CONFIG_PPC_PSERIES) && 
!cpu_has_feature(CPU_FTR_HVMODE);
+}
+
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return IS_ENABLED(CONFIG_KVM) && 
IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
+  kvmppc_radix_possible() && !kvmhv_on_pseries();
+}
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index eae9619b6190..0bb772fc12b1 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -277,6 +277,8 @@ struct kvmppc_ops {
bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+   bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range,
+gfn_t lsb_gfn, unsigned long *bitmap);
bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
void (*free_memslot)(struct kvm_memory_slot *slot);
int (*init_vm)(struct kvm *kvm);
@@ -580,18 +582,6 @@ static inline bool kvm_hv_mode_active(void)
{ return false; }
 
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
-static inline bool kvmhv_on_pseries(void)
-{
-   return !cpu_has_feature(CPU_FTR_HVMODE);
-}
-#else
-static inline bool kvmhv_on_pseries(void)
-{
-   return false;
-}
-#endif
-
 #ifdef CONFIG_KVM_XICS
 static inline int kvmppc_xics_enabled(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 6d525285dbe8..f4cf330e3e81 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -877,6 +877,13 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
kvm_gfn_range *range)
return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+  gfn_t lsb_gfn, unsigned long *bitmap)
+{
+   return kvm->arch.kvm_ops->test_clear_young &&
+  kvm->arch.kvm_ops->test_clear_young(kvm, range, lsb_gfn, bitmap);
+}
+
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 58391b4b32ed..fe9cac423817 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -12,6 +12,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range 
*range);
 extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvmhv_test_clear_young(struct kvm *kvm, struct kvm_gfn_range 
*range,
+  gfn_t lsb_gfn, unsigned long *bitmap);
 extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 
 extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 9d3743ca16d5..8476646c554c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1083,6 +1083,78 @@ bool kvm_test_age_radix(struct kv

[PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

2023-02-16 Thread Yu Zhao
An existing selftest can quickly demonstrate the effectiveness of this
patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250

  MGLRU     run2
  ---------------
  Before    ~600s
  After     ~50s
  Off       ~250s

  kswapd (MGLRU before)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.97%  try_to_shrink_lruvec
99.06%  evict_folios
  97.41%  shrink_folio_list
31.33%  folio_referenced
  31.06%  rmap_walk_file
30.89%  folio_referenced_one
  20.83%  __mmu_notifier_clear_flush_young
20.54%  kvm_mmu_notifier_clear_flush_young
  =>  19.34%  _raw_write_lock

  kswapd (MGLRU after)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.97%  try_to_shrink_lruvec
99.51%  evict_folios
  71.70%  shrink_folio_list
7.08%  folio_referenced
  6.78%  rmap_walk_file
6.72%  folio_referenced_one
  5.60%  lru_gen_look_around
  =>1.53%  __mmu_notifier_test_clear_young

  kswapd (MGLRU off)
100.00%  balance_pgdat
  100.00%  shrink_node
99.92%  shrink_lruvec
  69.95%  shrink_folio_list
19.35%  folio_referenced
  18.37%  rmap_walk_file
17.88%  folio_referenced_one
  13.20%  __mmu_notifier_clear_flush_young
11.64%  kvm_mmu_notifier_clear_flush_young
  =>  9.93%  _raw_write_lock
  26.23%  shrink_active_list
25.50%  folio_referenced
  25.35%  rmap_walk_file
25.28%  folio_referenced_one
  23.87%  __mmu_notifier_clear_flush_young
23.69%  kvm_mmu_notifier_clear_flush_young
  =>  18.98%  _raw_write_lock

Signed-off-by: Yu Zhao 
---
 include/linux/mmzone.h |   6 +-
 mm/rmap.c  |   8 +--
 mm/vmscan.c| 127 -
 3 files changed, 121 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9fb1b03b83b2..ce34b7ea8e4c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -379,6 +379,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+   LRU_GEN_SPTE_WALK,
NR_LRU_GEN_CAPS
 };
 
@@ -485,7 +486,7 @@ struct lru_gen_mm_walk {
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
 
@@ -573,8 +574,9 @@ static inline void lru_gen_init_lruvec(struct lruvec 
*lruvec)
 {
 }
 
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
+   return false;
 }
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..eb0089f8f112 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -823,12 +823,10 @@ static bool folio_referenced_one(struct folio *folio,
return false; /* To break the loop */
}
 
-   if (pvmw.pte) {
-   if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
-   lru_gen_look_around(&pvmw);
+   if (lru_gen_enabled() && pvmw.pte) {
+   if (lru_gen_look_around(&pvmw))
referenced++;
-   }
-
+   } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..d6d69f0baabf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -3923,6 +3925,55 @@ static struct folio *get_pfn_folio(unsigned long pfn, 
struct mem_cgroup *memcg,
return folio;
 }
 
+static bool test_spte_young(struct mm_struct *mm, unsigned long addr, unsigned 
long end,
+   unsigned long *bitmap, unsigned long *last)
+{
+   if (!kvm_arch_has_test_clear_young() || !get_cap(LRU_GEN_SPTE_WALK))
+   return false;
+
+   if (*last > addr)
+   goto done;
+
+   *last = end - addr > MIN_LRU_BATCH * PAGE_SIZE ?
+   addr + MIN_LRU_BATCH * PAGE_SIZE - 1 : end - 1;
+   bitmap_zero(bitmap, MIN_LRU_BATCH);
+
+   mmu_notifier_test_clear_young(mm, addr, *last + 1, false, bitmap);
+done:
+   return test_bit((*last - addr) / PAGE_SIZE, bitmap);
+}
+
+static void clear_spte_young(struct

[PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

2023-02-16 Thread Yu Zhao
This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not pKVM and run on hardware that sets the accessed bit
in KVM page tables.

It relies on two techniques, RCU and cmpxchg, to safely test and clear
the accessed bit without taking the MMU lock. The former protects KVM
page tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

Signed-off-by: Yu Zhao 
---
 arch/arm64/include/asm/kvm_host.h   |  7 +++
 arch/arm64/include/asm/kvm_pgtable.h|  8 +++
 arch/arm64/include/asm/stage2_pgtable.h | 43 ++
 arch/arm64/kvm/arm.c|  1 +
 arch/arm64/kvm/hyp/pgtable.c| 51 ++--
 arch/arm64/kvm/mmu.c| 77 -
 6 files changed, 141 insertions(+), 46 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..572bcd321586 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
 void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
 bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
 
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && 
!is_protected_kvm_enabled();
+}
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index 63f81b27a4e3..8c9a04388c88 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -105,6 +105,7 @@ static inline bool kvm_level_supports_block_mapping(u32 
level)
  * @put_page:  Decrement the refcount on a page. When the
  * refcount reaches 0 the page is automatically
  * freed.
+ * @put_page_rcu:  RCU variant of put_page().
  * @page_count:Return the refcount of a page.
  * @phys_to_virt:  Convert a physical address into a virtual
  * address mapped in the current context.
@@ -122,6 +123,7 @@ struct kvm_pgtable_mm_ops {
void(*free_removed_table)(void *addr, u32 level);
void(*get_page)(void *addr);
void(*put_page)(void *addr);
+   void(*put_page_rcu)(void *addr);
int (*page_count)(void *addr);
void*   (*phys_to_virt)(phys_addr_t phys);
phys_addr_t (*virt_to_phys)(void *addr);
@@ -188,6 +190,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 
end,
  * children.
  * @KVM_PGTABLE_WALK_SHARED:   Indicates the page-tables may be shared
  * with other software walkers.
+ *
+ * kvm_arch_test_clear_young() is a special case. It relies on two
+ * techniques, RCU and cmpxchg, to safely test and clear the accessed
+ * bit without taking the MMU lock. The former protects KVM page tables
+ * from being freed while the latter clears the accessed bit atomically
+ * against both the hardware and other software page table walkers.
  */
 enum kvm_pgtable_walk_flags {
KVM_PGTABLE_WALK_LEAF   = BIT(0),
diff --git a/arch/arm64/include/asm/stage2_pgtable.h 
b/arch/arm64/include/asm/stage2_pgtable.h
index c8dca8ae359c..350437661d4b 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -30,4 +30,47 @@
  */
 #define kvm_mmu_cache_min_pages(kvm)   (kvm_stage2_levels(kvm) - 1)
 
+#define KVM_PTE_TYPE   BIT(1)
+#define KVM_PTE_TYPE_BLOCK 0
+#define KVM_PTE_TYPE_PAGE  1
+#define KVM_PTE_TYPE_TABLE 1
+
+#define KVM_PTE_LEAF_ATTR_LO   GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDXGENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO  3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW  1
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS  3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTRGENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS  3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI   GENMASK(63, 51)
+
+#define KVM_PTE_LEAF_ATTR_HI_SWGENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
+
+#define KVM_PTE_LE

[PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

2023-02-16 Thread Yu Zhao
This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not nested and run on hardware that sets the accessed bit
in TDP MMU page tables.

It relies on two techniques, RCU and cmpxchg, to safely test and clear
the accessed bit without taking the MMU lock. The former protects KVM
page tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

Signed-off-by: Yu Zhao 
---
 arch/x86/include/asm/kvm_host.h | 27 ++
 arch/x86/kvm/mmu/spte.h | 12 --
 arch/x86/kvm/mmu/tdp_mmu.c  | 41 +
 3 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6aaae18f1854..d2995c9e8f07 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1367,6 +1367,12 @@ struct kvm_arch {
 *  the MMU lock in read mode + the tdp_mmu_pages_lock or
 *  the MMU lock in write mode
 *
+* kvm_arch_test_clear_young() is a special case. It relies on two
+* techniques, RCU and cmpxchg, to safely test and clear the accessed
+* bit without taking the MMU lock. The former protects KVM page tables
+* from being freed while the latter clears the accessed bit atomically
+* against both the hardware and other software page table walkers.
+*
 * Roots will remain in the list until their tdp_mmu_root_count
 * drops to zero, at which point the thread that decremented the
 * count to zero should removed the root from the list and clean
@@ -2171,4 +2177,25 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
unsigned long npages);
 KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \
 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
 
+extern u64 __read_mostly shadow_accessed_mask;
+
+/*
+ * Returns true if A/D bits are supported in hardware and are enabled by KVM.
+ * When enabled, KVM uses A/D bits for all non-nested MMUs.  Because L1 can
+ * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
+ * scenario where KVM is using A/D bits for L1, but not L2.
+ */
+static inline bool kvm_ad_enabled(void)
+{
+   return shadow_accessed_mask;
+}
+
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_X86_64) &&
+  (!IS_REACHABLE(CONFIG_KVM) || (kvm_ad_enabled() && tdp_enabled));
+}
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 6f54dc9409c9..0dc7fed1f3fd 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
 extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
 extern u64 __read_mostly shadow_dirty_mask;
 extern u64 __read_mostly shadow_mmio_value;
 extern u64 __read_mostly shadow_mmio_mask;
@@ -247,17 +246,6 @@ static inline bool is_shadow_present_pte(u64 pte)
return !!(pte & SPTE_MMU_PRESENT_MASK);
 }
 
-/*
- * Returns true if A/D bits are supported in hardware and are enabled by KVM.
- * When enabled, KVM uses A/D bits for all non-nested MMUs.  Because L1 can
- * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
- * scenario where KVM is using A/D bits for L1, but not L2.
- */
-static inline bool kvm_ad_enabled(void)
-{
-   return !!shadow_accessed_mask;
-}
-
 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
 {
return sp->role.ad_disabled;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d6df38d371a0..9028e09f1aab 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1309,6 +1309,47 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range)
return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+  gfn_t lsb_gfn, unsigned long *bitmap)
+{
+   struct kvm_mmu_page *root;
+
+   if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
+   return false;
+
+   if (kvm_memslots_have_rmaps(kvm))
+   return false;
+
+   /* see the comments on kvm_arch->tdp_mmu_roots */
+   rcu_read_lock();
+
+   list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+   struct tdp_iter iter;
+
+   if (kvm_mmu_page_as_id(root) != range->slot->as_id)
+   continue;
+
+   tdp_root_for_each_lea

[PATCH mm-unstable v1 1/5] mm/kvm: add mmu_notifier_test_clear_young()

2023-02-16 Thread Yu Zhao
mmu_notifier_test_clear_young() allows the caller to safely test and
clear the accessed bit in KVM PTEs without taking the MMU lock.

This patch adds the generic infrastructure to invoke the subsequent
arch-specific patches. The arch-specific implementations generally
rely on two techniques: RCU and cmpxchg. The former protects KVM page
tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

mmu_notifier_test_clear_young() follows two design patterns: fallback
and batching. For any unsupported cases, it can optionally fall back
to mmu_notifier_ops->clear_young(). For a range of KVM PTEs, it can
test or test and clear their accessed bits according to a bitmap
provided by the caller.

mmu_notifier_test_clear_young() always returns 0 if fallback is not
allowed. If fallback happens, its return value is similar to that of
mmu_notifier_clear_young().

The bitmap parameter has the following specifications:
1. The number of bits should be at least (end-start)/PAGE_SIZE.
2. The offset of each bit is relative to the end. E.g., the offset
   corresponding to addr is (end-addr)/PAGE_SIZE-1. This is to better
   suit batching while forward looping.
3. For each KVM PTE with the accessed bit set (young), arch-specific
   implementations flip the corresponding bit in the bitmap. It only
   clears the accessed bit if the old value is 1. A caller can test or
   test and clear the accessed bit by setting the corresponding bit in
   the bitmap to 0 or 1, and the new value will be 1 or 0 for a young
   KVM PTE.
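
Purely as an illustration of points 2 and 3 above, here is a hedged
caller-side sketch; the helper names are made up for this example and are
not part of the patch:

#include <stdbool.h>

#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define BITS_PER_LONG   (8 * sizeof(unsigned long))

/* Bit offset for addr in [start, end): relative to end, per point 2. */
static unsigned long young_bitmap_idx(unsigned long addr, unsigned long end)
{
        return (end - addr) / PAGE_SIZE - 1;
}

/*
 * Per point 3: a caller that preset the bit to 1 asked for test and clear,
 * so a young KVM PTE reads back as 0; a caller that left the bit at 0 asked
 * for test only, so a young KVM PTE reads back as 1.
 */
static bool pte_was_young(const unsigned long *bitmap, bool test_and_clear,
                          unsigned long addr, unsigned long end)
{
        unsigned long idx = young_bitmap_idx(addr, end);
        bool bit = bitmap[idx / BITS_PER_LONG] >> (idx % BITS_PER_LONG) & 1;

        return test_and_clear ? !bit : bit;
}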

Signed-off-by: Yu Zhao 
---
 include/linux/kvm_host.h | 29 ++
 include/linux/mmu_notifier.h | 40 +
 mm/mmu_notifier.c| 26 
 virt/kvm/kvm_main.c  | 58 
 4 files changed, 153 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..df46fc815c8b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2281,4 +2281,33 @@ static inline void kvm_account_pgtable_pages(void *virt, 
int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+/*
+ * Architectures that implement kvm_arch_test_clear_young() should override
+ * kvm_arch_has_test_clear_young().
+ *
+ * kvm_arch_has_test_clear_young() is allowed to return false positive. It can
+ * return true if kvm_arch_test_clear_young() is supported but disabled due to
+ * some runtime constraint. In this case, kvm_arch_test_clear_young() should
+ * return false.
+ *
+ * The last parameter to kvm_arch_test_clear_young() is a bitmap with the
+ * following specifications:
+ * 1. The offset of each bit is relative to the second to the last parameter
+ *lsb_gfn. E.g., the offset corresponding to gfn is lsb_gfn-gfn. This is to
+ *better suit batching while forward looping.
+ * 2. For each KVM PTE with the accessed bit set, the implementation should 
flip
+ *the corresponding bit in the bitmap. It should only clear the accessed 
bit
+ *if the old value is 1. This allows the caller to test or test and clear
+ *the accessed bit.
+ */
+#ifndef kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+   return false;
+}
+#endif
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+  gfn_t lsb_gfn, unsigned long *bitmap);
+
 #endif
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..432b51cd6843 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -122,6 +122,11 @@ struct mmu_notifier_ops {
  struct mm_struct *mm,
  unsigned long address);
 
+   /* see the comments on mmu_notifier_test_clear_young() */
+   bool (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
+unsigned long start, unsigned long end,
+unsigned long *bitmap);
+
/*
 * change_pte is called in cases that pte mapping to page is changed:
 * for example, when ksm remaps pte to point to a new shared page.
@@ -390,6 +395,9 @@ extern int __mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
 extern int __mmu_notifier_clear_young(struct mm_struct *mm,
  unsigned long start,
  unsigned long end);
+extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+  unsigned long start, unsigned long 
end,
+  bool fallback, unsigned long 
*bitmap);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
@@ -43

[PATCH mm-unstable v1 0/5] mm/kvm: lockless accessed bit harvest

2023-02-16 Thread Yu Zhao
TLDR

This patchset RCU-protects KVM page tables and compare-and-exchanges
KVM PTEs with the accessed bit set by hardware. It significantly
improves the performance of guests when the host is under heavy
memory pressure.

ChromeOS has been using a similar approach [1] since mid 2021 and it
was proven successful on tens of millions of devices.

[1] https://crrev.com/c/2987928

Overview

The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on the vast
majority of VMs that are not nested and run on hardware that sets the
accessed bit in KVM page tables.

Note that nested VMs and hardware that does not support the accessed
bit are both out of scope.

This patchset relies on two techniques, RCU and cmpxchg, to safely
test and clear the accessed bit without taking kvm->mmu_lock. The
former protects KVM page tables from being freed while the latter
clears the accessed bit atomically against both hardware and other
software page table walkers.

A new MMU notifier API, mmu_notifier_test_clear_young(), is
introduced. It follows two design patterns: fallback and batching.
For any unsupported cases, it can optionally fall back to
mmu_notifier_ops->clear_young(). For a range of KVM PTEs, it can test
or test and clear their accessed bits according to a bitmap provided
by the caller.
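
To make the bitmap contract concrete, below is a hedged sketch of the
per-PTE step an arch-side implementation performs inside its RCU-protected
walk. A_BIT and the bm_*() helpers are placeholders for this sketch (the
kernel has its own bitmap ops and per-arch accessed-bit masks), so this is
not the actual patch code:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t kvm_pte_t;
typedef uint64_t gfn_t;
#define A_BIT           (1ULL << 8)     /* placeholder accessed bit */
#define BITS_PER_LONG   (8 * sizeof(unsigned long))

static bool bm_test(const unsigned long *bm, unsigned long i)
{
        return bm[i / BITS_PER_LONG] >> (i % BITS_PER_LONG) & 1;
}

static void bm_flip(unsigned long *bm, unsigned long i)
{
        bm[i / BITS_PER_LONG] ^= 1UL << (i % BITS_PER_LONG);
}

/*
 * Visit one leaf PTE during the walk: report a young PTE by flipping its
 * bit (offset relative to lsb_gfn), and clear the accessed bit with a
 * compare-and-exchange only if the caller preset the bit to 1, i.e. asked
 * for test and clear rather than test only.
 */
static void visit_leaf(kvm_pte_t *sptep, gfn_t gfn, gfn_t lsb_gfn,
                       unsigned long *bitmap)
{
        kvm_pte_t old = __atomic_load_n(sptep, __ATOMIC_RELAXED);

        if (!(old & A_BIT))
                return;
        if (bm_test(bitmap, lsb_gfn - gfn))
                __atomic_compare_exchange_n(sptep, &old, old & ~A_BIT,
                                            false, __ATOMIC_RELAXED,
                                            __ATOMIC_RELAXED);
        bm_flip(bitmap, lsb_gfn - gfn);
}

The rcu_read_lock()-protected loop that calls such a step (not shown) is
what keeps the KVM page tables from being freed while sptep is dereferenced.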

This patchset only applies mmu_notifier_test_clear_young() to MGLRU.
A follow-up patchset will apply it to /proc/PID/pagemap and
/proc/PID/clear_refs.

Evaluation
==
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 64 CPUs and
256GB DRAM:

  $ sudo max_guest_memory_test -c 64 -m 256 -s 256

  MGLRU     run2
  ---------------
  Before    ~600s
  After     ~50s
  Off       ~250s

  kswapd (MGLRU before)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.97%  try_to_shrink_lruvec
99.06%  evict_folios
  97.41%  shrink_folio_list
31.33%  folio_referenced
  31.06%  rmap_walk_file
30.89%  folio_referenced_one
  20.83%  __mmu_notifier_clear_flush_young
20.54%  kvm_mmu_notifier_clear_flush_young
  =>  19.34%  _raw_write_lock

  kswapd (MGLRU after)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.97%  try_to_shrink_lruvec
99.51%  evict_folios
  71.70%  shrink_folio_list
7.08%  folio_referenced
  6.78%  rmap_walk_file
6.72%  folio_referenced_one
  5.60%  lru_gen_look_around
  =>1.53%  __mmu_notifier_test_clear_young

  kswapd (MGLRU off)
100.00%  balance_pgdat
  100.00%  shrink_node
99.92%  shrink_lruvec
  69.95%  shrink_folio_list
19.35%  folio_referenced
  18.37%  rmap_walk_file
17.88%  folio_referenced_one
  13.20%  __mmu_notifier_clear_flush_young
11.64%  kvm_mmu_notifier_clear_flush_young
  =>  9.93%  _raw_write_lock
  26.23%  shrink_active_list
25.50%  folio_referenced
  25.35%  rmap_walk_file
25.28%  folio_referenced_one
  23.87%  __mmu_notifier_clear_flush_young
23.69%  kvm_mmu_notifier_clear_flush_young
  =>  18.98%  _raw_write_lock

Comprehensive benchmarks are coming soon.

Yu Zhao (5):
  mm/kvm: add mmu_notifier_test_clear_young()
  kvm/x86: add kvm_arch_test_clear_young()
  kvm/arm64: add kvm_arch_test_clear_young()
  kvm/powerpc: add kvm_arch_test_clear_young()
  mm: multi-gen LRU: use mmu_notifier_test_clear_young()

 arch/arm64/include/asm/kvm_host.h   |   7 ++
 arch/arm64/include/asm/kvm_pgtable.h|   8 ++
 arch/arm64/include/asm/stage2_pgtable.h |  43 
 arch/arm64/kvm/arm.c|   1 +
 arch/arm64/kvm/hyp/pgtable.c|  51 ++
 arch/arm64/kvm/mmu.c|  77 +-
 arch/powerpc/include/asm/kvm_host.h |  18 
 arch/powerpc/include/asm/kvm_ppc.h  |  14 +--
 arch/powerpc/kvm/book3s.c   |   7 ++
 arch/powerpc/kvm/book3s.h   |   2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  78 ++-
 arch/powerpc/kvm/book3s_hv.c|  10 +-
 arch/x86/include/asm/kvm_host.h |  27 +
 arch/x86/kvm/mmu/spte.h |  12 ---
 arch/x86/kvm/mmu/tdp_mmu.c  |  41 
 include/linux/kvm_host.h|  29 ++
 include/linux/mmu_notifier.h|  40 
 include/linux/mmzone.h  |   6 +-
 mm/mmu_notifier.c   |  26 +
 mm/rmap.c   |   8 +-
 mm/vmscan.c