Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Huang, Ying
ose memory types that are not >> > initialized by device drivers. >> > Because late initialized memory and default DRAM memory need to be managed, >> > a default memory type is created for storing all memory types that are >> > not initialized by device drivers and as

Re: [PATCH v8 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-28 Thread Huang, Ying
ory types that are > not initialized by device drivers and as a fallback. > > Signed-off-by: Ho-Ren (Jack) Chuang > Signed-off-by: Hao Xiang > Reviewed-by: "Huang, Ying" > --- > mm/memory-tiers.c | 94 +++ > 1 file chan

Re: [PATCH v6 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-27 Thread Huang, Ying
e_memory_types[nid].memtype" will be !NULL. And it's possible (in theory) that some nodes become "node_state(nid, N_CPU) == true" between memory_tier_init() and memory_tier_late_init(). Otherwise, looks good to me. Feel free to add Reviewed-by: "Huang, Ying" in the fu

Re: [PATCH v5 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-26 Thread Huang, Ying
pe = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> +	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> +						      &default_memory_types);
> 	if (IS_ERR(default_dram_type))
> 		panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -868,6 +919,14 @@ static int __init memory_tier_init(void)
> 	 * types assigned.
> 	 */
> 	for_each_node_state(node, N_MEMORY) {
> +		if (!node_state(node, N_CPU))
> +			/*
> +			 * Defer memory tier initialization on CPUless numa nodes.
> +			 * These will be initialized after firmware and devices are
> +			 * initialized.
> +			 */
> +			continue;
> +
> 		memtier = set_node_memory_tier(node);
> 		if (IS_ERR(memtier))
> 			/*
--
Best Regards,
Huang, Ying
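
The deferral in the hunk above can be illustrated with a small userspace sketch. It is not the kernel code: struct node_info and both functions are hypothetical stand-ins. It assumes only what the thread describes: an early pass that skips CPUless memory nodes, and a late pass that handles whatever is still unassigned.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node description; not the kernel's data structures. */
struct node_info {
	bool has_memory;	/* stands in for node_state(nid, N_MEMORY) */
	bool has_cpu;		/* stands in for node_state(nid, N_CPU) */
	bool tier_set;		/* true once a memory tier has been assigned */
};

/* Early pass: assign tiers only to memory nodes that also have CPUs. */
static int early_tier_init(struct node_info *nodes, size_t n)
{
	int assigned = 0;

	for (size_t i = 0; i < n; i++) {
		if (!nodes[i].has_memory)
			continue;
		if (!nodes[i].has_cpu)
			continue;	/* defer CPUless nodes to the late pass */
		nodes[i].tier_set = true;
		assigned++;
	}
	return assigned;
}

/* Late pass: pick up every memory node that still lacks a tier. */
static int late_tier_init(struct node_info *nodes, size_t n)
{
	int assigned = 0;

	for (size_t i = 0; i < n; i++) {
		if (!nodes[i].has_memory || nodes[i].tier_set)
			continue;
		nodes[i].tier_set = true;
		assigned++;
	}
	return assigned;
}
```

In this sketch the late pass tests tier_set rather than has_cpu, so a node that gains CPUs between the two passes is not skipped. That is one possible way to handle the window raised in the v6 review, not necessarily what the patch does.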

Re: [External] Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-25 Thread Huang, Ying
"Ho-Ren (Jack) Chuang" writes: > On Fri, Mar 22, 2024 at 1:41 AM Huang, Ying wrote: >> >> "Ho-Ren (Jack) Chuang" writes: >> >> > The current implementation treats emulated memory devices, such as >> > CXL1.1 type3 mem

Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-22 Thread Huang, Ying
> &default_memory_types);
> 	if (IS_ERR(default_dram_type))
> 		panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -868,6 +913,14 @@ static int __init memory_tier_init(void)
> 	 * types assigned.
> 	 */
> 	for_each_node_state(node, N_MEMORY) {
> +		if (!node_state(node, N_CPU))
> +			/*
> +			 * Defer memory tier initialization on CPUless numa nodes.
> +			 * These will be initialized after firmware and devices are
> +			 * initialized.
> +			 */
> +			continue;
> +
> 		memtier = set_node_memory_tier(node);
> 		if (IS_ERR(memtier))
> 			/*
--
Best Regards,
Huang, Ying

Re: [PATCH v3 1/2] memory tier: dax/kmem: create CPUless memory tiers after obtaining HMAT info

2024-03-20 Thread Huang, Ying
>
> 	return 0;
> }
> @@ -826,7 +897,8 @@ static int __init memory_tier_init(void)
> 	 * For now we can have 4 faster memory tiers with smaller adistance
> 	 * than default DRAM tier.
> 	 */
> -	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> +	default_dram_type = mt_find_alloc_memory_type(
> +			MEMTIER_ADISTANCE_DRAM, &default_memory_types);
> 	if (IS_ERR(default_dram_type))
> 		panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -836,6 +908,14 @@ static int __init memory_tier_init(void)
> 	 * types assigned.
> 	 */
> 	for_each_node_state(node, N_MEMORY) {
> +		if (!node_state(node, N_CPU))
> +			/*
> +			 * Defer memory tier initialization on CPUless numa nodes.
> +			 * These will be initialized after firmware and devices are
> +			 * initialized.
> +			 */
> +			continue;
> +
> 		memtier = set_node_memory_tier(node);
> 		if (IS_ERR(memtier))
> 			/*
--
Best Regards,
Huang, Ying

Re: [External] Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-14 Thread Huang, Ying
"Ho-Ren (Jack) Chuang" writes: > On Tue, Mar 12, 2024 at 2:21 AM Huang, Ying wrote: >> >> "Ho-Ren (Jack) Chuang" writes: >> >> > The current implementation treats emulated memory devices, such as >> > CXL1.1 type3 mem

Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-12 Thread Huang, Ying
tance(struct access_coordinate *perf, int *adist)
> 		(default_dram_perf.read_latency + default_dram_perf.write_latency) *
> 		(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
> 		(perf->read_bandwidth + perf->write_bandwidth);
> -	mutex_unlock(_tier_lock);
> +	mutex_unlock(_perf_lock);
>
> 	return 0;
> }
> @@ -836,6 +890,14 @@ static int __init memory_tier_init(void)
> 	 * types assigned.
> 	 */
> 	for_each_node_state(node, N_MEMORY) {
> +		if (!node_state(node, N_CPU))
> +			/*
> +			 * Defer memory tier initialization on CPUless numa nodes.
> +			 * These will be initialized when HMAT information is
HMAT is platform specific; we should avoid mentioning it in general code if possible.
> +			 * available.
> +			 */
> +			continue;
> +
> 		memtier = set_node_memory_tier(node);
> 		if (IS_ERR(memtier))
> 			/*
--
Best Regards,
Huang, Ying

Re: [PATCH v6 4/4] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-14 Thread Huang, Ying
emmap_on_memory semantics, to > preserve legacy behavior. For dax devices via CXL, the default is on. > The sysfs control allows the administrator to override the above > defaults if needed. > > Cc: David Hildenbrand > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen >

Re: [PATCH v5 4/4] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-14 Thread Huang, Ying
emmap_on_memory semantics, to > preserve legacy behavior. For dax devices via CXL, the default is on. > The sysfs control allows the administrator to override the above > defaults if needed. > > Cc: David Hildenbrand > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen >

Re: [PATCH v4 3/3] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-12 Thread Huang, Ying
emmap_on_memory semantics, to > preserve legacy behavior. For dax devices via CXL, the default is on. > The sysfs control allows the administrator to override the above > defaults if needed. > > Cc: David Hildenbrand > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen >

Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-11 Thread Huang, Ying
"Verma, Vishal L" writes: > On Tue, 2023-12-12 at 08:30 +0800, Huang, Ying wrote: >> Vishal Verma writes: >> >> > Add a sysfs knob for dax devices to control the memmap_on_memory setting >> > if the dax device were to be hotplugged as system mem

Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-11 Thread Huang, Ying
emmap_on_memory semantics, to > preserve legacy behavior. For dax devices via CXL, the default is on. > The sysfs control allows the administrator to override the above > defaults if needed. > > Cc: David Hildenbrand > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen >

Re: [PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-07 Thread Huang, Ying
emmap_on_memory semantics, to > preserve legacy behavior. For dax devices via CXL, the default is on. > The sysfs control allows the administrator to override the above > defaults if needed. > > Cc: David Hildenbrand > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen >

Re: [PATCH v9 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-11-02 Thread Huang, Ying
6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Suggested-by: David Hildenbrand > Reviewed-by

Re: [PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-11-01 Thread Huang, Ying
6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Suggested-by: David Hildenbr

Re: [PATCH v7 3/3] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-29 Thread Huang, Ying
s via CXL. For non-CXL dax regions, retain the existing > default behavior of hot adding without memmap_on_memory semantics. > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hans

Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-29 Thread Huang, Ying
6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Suggested-by: David Hildenbr

Re: [PATCH v6 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-17 Thread Huang, Ying
6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Suggested-by: David Hildenbr

Re: [PATCH v5 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-16 Thread Huang, Ying
"Verma, Vishal L" writes: > On Tue, 2023-10-17 at 13:18 +0800, Huang, Ying wrote: >> "Verma, Vishal L" writes: >> >> > On Thu, 2023-10-05 at 14:16 -0700, Dan Williams wrote: >> > > Vishal Verma wrote: >> > &

Re: [PATCH v5 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-16 Thread Huang, Ying
if (!dax_region->dev->driver) { >> >> Is the polarity backwards here? I.e. if the device is already attached to >> the kmem driver it is too late to modify memmap_on_memory policy. > > Hm this sounded logical until I tried it. After a reconfigure-device to > devdax (i

Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-07 Thread Huang, Ying
have been split up into memblock sized chunks, > and to loop through those as needed. > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Suggested-by: David

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-08-21 Thread Huang, Ying
Alistair Popple writes: > "Huang, Ying" writes: > >> Alistair Popple writes: >> >>> "Huang, Ying" writes: >>> >>>> Hi, Alistair, >>>> >>>> Sorry for late response. Just come back from vacation. >

Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface

2023-08-21 Thread Huang, Ying
Alistair Popple writes: > "Huang, Ying" writes: > >> Alistair Popple writes: >> >>> Huang Ying writes: >>> >>>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is >>>> used for slow memory type i

Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT

2023-08-21 Thread Huang, Ying
Alistair Popple writes: > "Huang, Ying" writes: > >> Alistair Popple writes: >> >>> Huang Ying writes: >>> >>>> A memory tiering abstract distance calculation algorithm based on ACPI >>>> HMAT is implemented. The ba

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-08-21 Thread Huang, Ying
Alistair Popple writes: > "Huang, Ying" writes: > >> Hi, Alistair, >> >> Sorry for late response. Just come back from vacation. > > Ditto for this response :-) > > I see Andrew has taken this into mm-unstable though, so my bad for not >

Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-08-14 Thread Huang, Ying
"Verma, Vishal L" writes: > On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote: >> Vishal Verma writes: >> >> > >> > @@ -2035,12 +2056,38 @@ void try_offline_node(int nid) >> > } >> > EXPORT_SYMBOL(try_offline_node); >

Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-08-14 Thread Huang, Ying
"Verma, Vishal L" writes: > On Mon, 2023-07-24 at 11:16 +0800, Huang, Ying wrote: >> "Aneesh Kumar K.V" writes: >> > >> > > @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid, >> > > struct resource *res, mhp_t mhp_flags

Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT

2023-08-11 Thread Huang, Ying
bers reported by HMAT here, but FWIW, this patchset > puts the CXL nodes on a lower tier than DRAM nodes. Thank you very much! Can I add your "Tested-by" for the series? -- Best Regards, Huang, Ying

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-08-10 Thread Huang, Ying
Hi, Alistair, Sorry for late response. Just come back from vacation. Alistair Popple writes: > "Huang, Ying" writes: > >> Alistair Popple writes: >> >>> "Huang, Ying" writes: >>> >>>> Alistair Popple writes: >>

Re: [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators()

2023-08-10 Thread Huang, Ying
Hi, Jonathan, Thanks for review! Jonathan Cameron writes: > On Fri, 21 Jul 2023 09:29:30 +0800 > Huang Ying wrote: > >> Previously, in hmat_register_target_initiators(), the performance >> attributes are calculated and the corresponding sysfs links and files >>

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-26 Thread Huang, Ying
Alistair Popple writes: > "Huang, Ying" writes: > >> Alistair Popple writes: >> >>> "Huang, Ying" writes: >>> >>>>>> And, I don't think that we are forced to use the general notifier >>>>>>

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-26 Thread Huang, Ying
Alistair Popple writes: > "Huang, Ying" writes: > >>>> The other way (suggested by this series) is to make dax/kmem call a >>>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of >>>> device and calculate the distance

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-26 Thread Huang, Ying
Alistair Popple writes: > "Huang, Ying" writes: > >> Hi, Alistair, >> >> Thanks a lot for comments! >> >> Alistair Popple writes: >> >>> Huang Ying writes: >>> >>>> The abstract distance may be calculate

Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface

2023-07-25 Thread Huang, Ying
Alistair Popple writes: > Huang Ying writes: > >> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is >> used for slow memory type in kmem driver. This limits the usage of >> kmem driver, for example, it cannot be used for HBM (high bandwidth >

Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT

2023-07-25 Thread Huang, Ying
Alistair Popple writes: > Huang Ying writes: > >> A memory tiering abstract distance calculation algorithm based on ACPI >> HMAT is implemented. The basic idea is as follows. >> >> The performance attributes of system default DRAM nodes are recorded >>

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-24 Thread Huang, Ying
Hi, Alistair, Thanks a lot for comments! Alistair Popple writes: > Huang Ying writes: > >> The abstract distance may be calculated by various drivers, such as >> ACPI HMAT, CXL CDAT, etc. While it may be used by various code which >> hot-add memory node, such as da

Re: [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()

2023-07-24 Thread Huang, Ying
> Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Reviewed-by: David Hildenbrand > Signed-off-by: Vishal Verma > --- > include/linux/memory_hotplug.h | 5 + > mm/memory_hotplug.c | 1 + > 2 files chan

Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-07-23 Thread Huang, Ying
have been split up into memblock sized chunks, > and to loop through those as needed. > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Suggested-by: David Hil

Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-07-23 Thread Huang, Ying
> it are met. Teach try_remove_memory() to also expect that a memory >> range being removed might have been split up into memblock sized chunks, >> and to loop through those as needed. >> >> Cc: Andrew Morton >> Cc: David Hildenbrand >> Cc: Oscar Salvador >

[PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface

2023-07-20 Thread Huang Ying
k. Signed-off-by: "Huang, Ying" Cc: Aneesh Kumar K.V Cc: Wei Xu Cc: Alistair Popple Cc: Dan Williams Cc: Dave Hansen Cc: Davidlohr Bueso Cc: Johannes Weiner Cc: Jonathan Cameron Cc: Michal Hocko Cc: Yang Shi Cc: Rafael J Wysocki --- drivers/dax/k

[PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT

2023-07-20 Thread Huang Ying
distance of a memory node (target) to MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance attributes of the node to that of the default DRAM nodes. Signed-off-by: "Huang, Ying" Cc: Aneesh Kumar K.V Cc: Wei Xu Cc: Alistair Popple Cc: Dan Williams Cc: Dave Hansen Cc:

[PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-20 Thread Huang Ying
be specified via priority (notifier_block.priority). Signed-off-by: "Huang, Ying" Cc: Aneesh Kumar K.V Cc: Wei Xu Cc: Alistair Popple Cc: Dan Williams Cc: Dave Hansen Cc: Davidlohr Bueso Cc: Johannes Weiner Cc: Jonathan Cameron Cc: Michal Hocko Cc: Yang Shi Cc: Rafael J Wysocki --

[PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators()

2023-07-20 Thread Huang Ying
to calculate the performance attributes for a memory target without creating sysfs links and files. To do that, hmat_register_target_initiators() is refactored to make it possible to calculate performance attributes separately. Signed-off-by: "Huang, Ying" Cc: Aneesh Kumar K.V Cc: Wei Xu Cc

[PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT

2023-07-20 Thread Huang Ying
Optane DCPMM. Changelog: V1 (from RFC): - Added some comments per Aneesh's comments, Thanks! Best Regards, Huang, Ying

Re: [PATCH] memory tier: rename destroy_memory_type() to put_memory_type()

2023-07-06 Thread Huang, Ying
Miaohe Lin writes: > It appears that destroy_memory_type() isn't a very good name because > we usually will not free the memory_type here. So rename it to a more > appropriate name i.e. put_memory_type(). > > Suggested-by: Huang, Ying > Signed-off-by: Miaohe Lin LGTM,

Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

2023-06-16 Thread Huang, Ying
Use the > mhp_flag to force the memmap_on_memory checks regardless of the > respective module parameter setting. > > Cc: "Rafael J. Wysocki" > Cc: Len Brown > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang

Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param

2023-06-16 Thread Huang, Ying
"Rafael J. Wysocki" > Cc: Len Brown > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Signed-off-by: Vishal Verma > --- > include/linux/memory_hotplug.h | 2 +- >

Re: [PATCH v3 1/4] mm/swapfile: use percpu_ref to serialize against concurrent swapoff

2021-04-20 Thread Huang, Ying
tatic void _enable_swap_info(struct swap_info_struct *p) > { > - p->flags |= SWP_WRITEOK | SWP_VALID; > + p->flags |= SWP_WRITEOK; > atomic_long_add(p->pages, _swap_pages); > total_swap_pages += p->pages; > > @@ -2497,10 +2506,9 @@ static void

Re: [PATCH v3 4/4] mm/shmem: fix shmem_swapin() race with swapoff

2021-04-20 Thread Huang, Ying
};
>
> +	/* Prevent swapoff from happening to us. */
> +	si = get_swap_device(swap);
Better to put get/put_swap_device() in shmem_swapin_page(); that makes it possible for us to remove get/put_swap_device() in lookup_swap_cache().
Best Regards,
Huang, Ying
> +	if (unlikel

Re: [PATCH v3 3/4] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-20 Thread Huang, Ying
race isn't important because it will not cause problem. Best Regards, Huang, Ying > But the swap_entry > isn't used in this function and we will have enough checking when we really > operate the PTE entries later. So checking for non_swap_entry() is not > really needed here and shou

Re: [PATCH v3 2/4] swap: fix do_swap_page() race with swapoff

2021-04-20 Thread Huang, Ying
y > done when system shutdown only. To reduce the performance overhead on the > hot-path as much as possible, it appears we can use the percpu_ref to close > this race window(as suggested by Huang, Ying). This needs to be revised too. Unless you squash 1/4 and 2/4. > Fixes: 0bcac06

Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/19 15:09, Huang, Ying wrote: >> Miaohe Lin writes: >> >>> On 2021/4/19 10:48, Huang, Ying wrote: >>>> Miaohe Lin writes: >>>> >>>>> We will use percpu-refcount to serialize against concurrent

Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

2021-04-19 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/19 15:04, Huang, Ying wrote: >> Miaohe Lin writes: >> >>> On 2021/4/19 10:15, Huang, Ying wrote: >>>> Miaohe Lin writes: >>>> >>>>> When I was investigating the swap code,

Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/19 10:48, Huang, Ying wrote: >> Miaohe Lin writes: >> >>> We will use percpu-refcount to serialize against concurrent swapoff. This >>> patch adds the percpu_ref support for swap. >>> >>> Signed-off-by:

Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

2021-04-19 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/19 10:15, Huang, Ying wrote: >> Miaohe Lin writes: >> >>> When I was investigating the swap code, I found the below possible race >>> window: >>> >>> CP

Re: [PATCH v2 2/5] mm/swapfile: use percpu_ref to serialize against concurrent swapoff

2021-04-18 Thread Huang, Ying
es, _swap_pages); > total_swap_pages += p->pages; > > @@ -2507,7 +2504,7 @@ static void enable_swap_info(struct swap_info_struct > *p, int prio, > spin_unlock(_lock); > /* >* Guarantee swap_map, cluster_info, etc. fields are valid >

Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-18 Thread Huang, Ying
lags) > { > struct swap_info_struct *p; > - struct filename *name; > + struct filename *name = NULL; > struct file *swap_file = NULL; > struct address_space *mapping; > int prio; > @@ -3163,6 +3179,15 @@ SYSCALL_DEFINE2(swapon, const char __user *

Re: [PATCH v2 3/5] swap: fix do_swap_page() race with swapoff

2021-04-18 Thread Huang, Ying
is usually > done when system shutdown only. To reduce the performance overhead on the > hot-path as much as possible, it appears we can use the percpu_ref to close > this race window(as suggested by Huang, Ying). I still suggest to squash PATCH 1-3, at least PATCH 1-2. That will change th

Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

2021-04-18 Thread Huang, Ying
node *inode = si->swap_file->f_mapping->host;[oops!] > > Close this race window by using get/put_swap_device() to guard against > concurrent swapoff. > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") No. This isn't the commit that introduces the race condition

Re: [PATCH v2 4/5] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-18 Thread Huang, Ying
-blame to find out it. The patch itself looks good to me. Best Regards, Huang, Ying > Signed-off-by: Miaohe Lin > --- > mm/swap_state.c | 6 -- > 1 file changed, 6 deletions(-) > > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 272ea2108c9d..df5405384520 100644 > ---

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-16 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/15 22:31, Dennis Zhou wrote: >> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote: >>> Dennis Zhou writes: >>> >>>> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote: >>>>> Dennis Zhou w

Re: [RFC PATCH] percpu_ref: Make percpu_ref_tryget*() ACQUIRE operations

2021-04-16 Thread Huang, Ying
Kent Overstreet writes: > On Thu, Apr 15, 2021 at 09:42:56PM -0700, Paul E. McKenney wrote: >> On Tue, Apr 13, 2021 at 10:47:03AM +0800, Huang Ying wrote: >> > One typical use case of percpu_ref_tryget() family functions is as >> > follows, >> >

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-15 Thread Huang, Ying
Dennis Zhou writes: > On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote: >> Dennis Zhou writes: >> >> > On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote: >> >> Dennis Zhou writes: >> >> >> >> > On Wed, Apr 1

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-15 Thread Huang, Ying
ning and rmap scanning in the page reclaiming. For example, if the working-set is transitioned, we can take advantage of the fast page table scanning to identify the new working-set quickly. While we can fallback to the rmap scanning if the page table scanning doesn't help. Best Regards, Huang, Ying

Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count

2021-04-15 Thread Huang, Ying
"Zi Yan" writes: > On 13 Apr 2021, at 23:00, Huang, Ying wrote: > >> Yang Shi writes: >> >>> The generic migration path will check refcount, so no need check refcount >>> here. >>> But the old code actually prevents from migrating shared T

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-14 Thread Huang, Ying
Dennis Zhou writes: > On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote: >> Dennis Zhou writes: >> >> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote: >> >> Dennis Zhou writes: >> >> >> >> > Hello, >>

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Huang, Ying
Yu Zhao writes: > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying wrote: >> >> Yu Zhao writes: >> >> > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel wrote: >> >> >> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote: >> >

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Huang, Ying
rmap, we need to > scan a lot of pages anyway. Why not just scan them all? This may be not the case. For rmap scanning, it's possible to scan only a small portion of memory. But with the page table scanning, you need to scan almost all (I understand you have some optimization as above). As Rik shown i

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Dennis Zhou writes: > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote: >> Dennis Zhou writes: >> >> > Hello, >> > >> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote: >> >> Miaohe Lin writes: >> >>

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Dennis Zhou writes: > Hello, > > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote: >> Miaohe Lin writes: >> >> > On 2021/4/14 9:17, Huang, Ying wrote: >> >> Miaohe Lin writes: >> >> >> >>>

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/13 9:27, Huang, Ying wrote: >> Miaohe Lin writes: >> >>> When I was investigating the swap code, I found the below possible race >>> window: >>> >>>

Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count

2021-04-13 Thread Huang, Ying
us from migrating shared THP? If no, why not just remove the old refcount checking? Best Regards, Huang, Ying > Signed-off-by: Yang Shi > --- > mm/migrate.c | 16 > 1 file changed, 4 insertions(+), 12 deletions(-) > > diff --git a/mm/migrate.c b/mm/migrate.

Re: [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling

2021-04-13 Thread Huang, Ying
/mm/huge_memory.c > @@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) > { > struct vm_area_struct *vma = vmf->vma; > pmd_t pmd = vmf->orig_pmd; > - struct anon_vma *anon_vma = NULL; > + pmd_t oldpmd; nit: the usage of oldpmd and

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/14 9:17, Huang, Ying wrote: >> Miaohe Lin writes: >> >>> On 2021/4/12 15:24, Huang, Ying wrote: >>>> "Huang, Ying" writes: >>>> >>>>> Miaohe Lin writes: >>>>> >>>

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/12 15:24, Huang, Ying wrote: >> "Huang, Ying" writes: >> >>> Miaohe Lin writes: >>> >>>> We will use percpu-refcount to serialize against concurrent swapoff. This >>>> patch adds the perc

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Huang, Ying
Tim Chen writes: > On 4/12/21 6:27 PM, Huang, Ying wrote: > >> >> This isn't the commit that introduces the race. You can use `git blame` >> find out the correct commit. For this it's commit 0bcac06f27d7 "mm, >> swap: skip swapcache for swapin of synchr

Re: [RFC] mm: activate access-more-than-once page via NUMA balancing

2021-04-12 Thread Huang, Ying
Yu Zhao writes: > On Fri, Mar 26, 2021 at 12:21 AM Huang, Ying wrote: >> >> Mel Gorman writes: >> >> > On Thu, Mar 25, 2021 at 12:33:45PM +0800, Huang, Ying wrote: >> >> > I caution against this patch. >> >> > >> >>

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-04-12 Thread Huang, Ying
Yu Zhao writes: > On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying wrote: >> >> Yu Zhao writes: >> >> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote: >> >> Yu Zhao writes: >> >> >> >> > On Wed, Mar 17,

Re: [PATCH v1 00/14] Multigenerational LRU

2021-04-12 Thread Huang, Ying
gle-page VMAs, i.e., not returning to the PGD table for each
> of such VMAs. Just a heads-up.
>
> The rmap, on the other hand, had to
> 1) lock each (shmem) page it scans
> 2) go through five levels of page tables for each page, even though
> some of them have the same LCAs
> during the test. The second part is worse given that I have 5 levels
> of page tables configured.
>
> Any additional benchmarks you would suggest? Thanks.
Hi, Yu,
Thanks for your data. In addition to the data you measured above, is it possible for you to measure some raw data? For example, how many CPU cycles does it take to scan all pages in the system? For the page table scanning, the page tables of all processes will be scanned. For the rmap scanning, all pages in LRU will be scanned. And we can do that with different parameters, for example, shared vs. non-shared, sparse vs. dense. Then we can get an idea about how fast the page table scanning can be.
Best Regards,
Huang, Ying

[RFC PATCH] percpu_ref: Make percpu_ref_tryget*() ACQUIRE operations

2021-04-12 Thread Huang Ying
rom the other fields may be invalid or inconsistent. To guarantee the correct memory ordering, percpu_ref_tryget*() needs to be the ACQUIRE operations. This function implements that via using smp_load_acquire() in __ref_is_percpu() to read the percpu pointer. Signed-off-by: "Huang, Ying"
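
The ordering requirement can be illustrated outside the kernel with C11 atomics. This is a sketch of the general release/acquire pairing only, not the percpu_ref implementation: struct object, publish() and try_read() are hypothetical names, and atomic_load_explicit() with memory_order_acquire stands in for the kernel's smp_load_acquire().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object whose fields must be visible before it is published. */
struct object {
	int payload;
};

static _Atomic(struct object *) published = NULL;

/* Writer: initialize the fields first, then publish with RELEASE ordering. */
static void publish(struct object *obj, int value)
{
	obj->payload = value;
	atomic_store_explicit(&published, obj, memory_order_release);
}

/*
 * Reader: the ACQUIRE load pairs with the RELEASE store above, so once a
 * non-NULL pointer is observed, the field reads that follow cannot be
 * reordered before it and will see the initialized values.
 */
static int try_read(int *out)
{
	struct object *obj =
		atomic_load_explicit(&published, memory_order_acquire);

	if (!obj)
		return 0;	/* not published yet */
	*out = obj->payload;
	return 1;
}
```

If the reader used a relaxed load instead, the C11 memory model would no longer guarantee that payload is read after the pointer. That is the kind of stale or inconsistent field read the patch description is concerned with.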

Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-12 Thread Huang, Ying
ier 0 memory used by the cgroup exceeds > this high > boundary, allocation of tier 0 memory by the cgroup will > be throttled. The tier 0 memory used by this cgroup > will also be subjected to heavy demotion. I think we

Re: [PATCH 5/5] mm/swap_state: fix swap_cluster_readahead() race with swapoff

2021-04-12 Thread Huang, Ying
p_page() has been fixed. We need to fix shmem_swapin(). Best Regards, Huang, Ying > Signed-off-by: Miaohe Lin > --- > mm/swap_state.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 3bf0d0c297b

Re: [PATCH 3/5] mm/swap_state: fix get_shadow_from_swap_cache() race with swapoff

2021-04-12 Thread Huang, Ying
essary. The only caller has guaranteed the swap device from swapoff. Best Regards, Huang, Ying > --- > mm/swap_state.c | 9 ++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 272ea2108c9d..709c260d644a 100644

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-12 Thread Huang, Ying
e overhead on the > hot-path as much as possible, it appears we can use the percpu_ref to close > this race window(as suggested by Huang, Ying). > > Fixes: 235b62176712 ("mm/swap: add cluster lock") This isn't the commit that introduces the race. You can use `git blame` find

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-12 Thread Huang, Ying
"Huang, Ying" writes: > Miaohe Lin writes: > >> We will use percpu-refcount to serialize against concurrent swapoff. This >> patch adds the percpu_ref support for later fixup. >> >> Signed-off-by: Miaohe Lin >> --- >> includ

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-11 Thread Huang, Ying
pecialfile); > if (IS_ERR(name)) { > error = PTR_ERR(name); > @@ -3356,6 +3374,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, > specialfile, int, swap_flags) > bad_swap_unlock_inode: > inode_unlock(inode); > bad_swap: > + percpu_ref_exit(&p->users); Usually resources are freed in the reverse order of their allocation. So, if there's no special reason, please follow that rule. Best Regards, Huang, Ying > free_percpu(p->percpu_cluster); > p->percpu_cluster = NULL; > free_percpu(p->cluster_next_cpu);
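The reverse-order cleanup rule mentioned in this review comment is commonly written with stacked error labels. The following is a minimal user-space sketch, not code from the patch: `demo_swap_info`, `demo_setup()`, `demo_teardown()` and the `fail_at` knob are all hypothetical, and malloc()/free() stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical structure holding two resources. */
struct demo_swap_info {
	int *users;	/* allocated first */
	int *cluster;	/* allocated second */
};

/*
 * Allocate both resources. On failure, release only what was already
 * allocated, in the reverse order of allocation. fail_at is a
 * hypothetical test knob: 1 or 2 forces that step to fail, 0 disables it.
 */
static int demo_setup(struct demo_swap_info *p, int fail_at)
{
	p->users = NULL;
	p->cluster = NULL;

	p->users = (fail_at == 1) ? NULL : malloc(sizeof(*p->users));
	if (!p->users)
		goto err;

	p->cluster = (fail_at == 2) ? NULL : malloc(sizeof(*p->cluster));
	if (!p->cluster)
		goto err_free_users;

	return 0;

err_free_users:
	/* Step 2 failed: undo step 1. */
	free(p->users);
	p->users = NULL;
err:
	return -1;
}

/* Normal teardown also mirrors the allocation order in reverse. */
static void demo_teardown(struct demo_swap_info *p)
{
	free(p->cluster);
	p->cluster = NULL;
	free(p->users);
	p->users = NULL;
}
```

Keeping the labels in strict reverse order means a new allocation step only needs one new label, and each failure path falls through exactly the cleanups that apply to it.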

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-11 Thread Huang, Ying
.. >>p->swap_file >> = NULL; >> struct file *swap_file = sis->swap_file; >> struct address_space *mapping = swap_file->f_mapping;[oops!] >> ... >> ... >> > > Agree. This is also what I meant to illustrate. And you provide a better one. > Many thanks! For the pages that are swapped in through the swap cache, that isn't an issue. Because the page is locked, the swap entry will be marked with SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been unlocked. So the race is for the fast path as follows, if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) I found it in your original patch description. But please make it more explicit to reduce the potential confusion. Best Regards, Huang, Ying

Re: [PATCH 4/5] mm/swap_state: fix potential faulted in race in swap_ra_info()

2021-04-11 Thread Huang, Ying
Miaohe Lin writes: > On 2021/4/9 16:50, Huang, Ying wrote: >> Miaohe Lin writes: >> >>> While we released the pte lock, somebody else might faulted in this pte. >>> So we should check whether it's swap pte first to guard against such race >>> or swp

Re: [PATCH 4/5] mm/swap_state: fix potential faulted in race in swap_ra_info()

2021-04-09 Thread Huang, Ying
l issue. entry or swap_entry isn't used in this function. And we have enough checking when we really operate on the PTE entries later. But I admit it's confusing. So I suggest just removing the check. We will check it when necessary. Best Regards, Huang, Ying

Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-08 Thread Huang, Ying
secase which divides DRAM:PMEM ratio for different jobs or memcgs > when I was with Alibaba. > > In the first place I thought about per NUMA node limit, but it was > very hard to configure it correctly for users unless you know exactly > about your memory usage and hot/cold memory distribution. > > I'm wondering, just off the top of my head, if we could extend the > semantic of low and min limit. For example, just redefine low and min > to "the limit on top tier memory". Then we could have low priority > jobs have 0 low/min limit. Per my understanding, memory.low/min are for memory protection rather than memory limiting. memory.high is for memory limiting. Best Regards, Huang, Ying

Re: [PATCH -V2] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-04-08 Thread Huang, Ying
Mel Gorman writes: > On Fri, Apr 02, 2021 at 04:27:17PM +0800, Huang Ying wrote: >> With NUMA balancing, in hint page fault handler, the faulting page >> will be migrated to the accessing node if necessary. During the >> migration, TLB will be shot down on all CPUs tha

[PATCH -V3] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-04-08 Thread Huang Ying
) with about 9.2e6 pages (35.8GB) migrated. From the perf profile, it can be found that the CPU cycles spent by try_to_unmap() and its callees drop from 6.02% to 0.47%. That is, the CPU cycles spent on TLB shootdown decrease greatly. Signed-off-by: "Huang, Ying" Reviewed-by: Mel

[PATCH -V2] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-04-02 Thread Huang Ying
) with about 9.2e6 pages (35.8GB) migrated. From the perf profile, it can be found that the CPU cycles spent by try_to_unmap() and its callees drop from 6.02% to 0.47%. That is, the CPU cycles spent on TLB shootdown decrease greatly. Signed-off-by: "Huang, Ying" Cc: Peter Zijlstr

Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-03-31 Thread Huang, Ying
Mel Gorman writes: > On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote: >> Mel Gorman writes: >> >> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote: >> >> For NUMA balancing, in hint page fault handler, the faulting page will >

Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-03-31 Thread Huang, Ying
Mel Gorman writes: > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote: >> For NUMA balancing, in hint page fault handler, the faulting page will >> be migrated to the accessing node if necessary. During the migration, >> TLB will be shot down on all CPUs tha

Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage

2021-03-30 Thread Huang, Ying
Yu Zhao writes: > On Mon, Mar 29, 2021 at 9:44 PM Huang, Ying wrote: >> >> Miaohe Lin writes: >> >> > On 2021/3/30 9:57, Huang, Ying wrote: >> >> Hi, Miaohe, >> >> >> >> Miaohe Lin writes: >> >> >> >

Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage

2021-03-29 Thread Huang, Ying
Miaohe Lin writes: > On 2021/3/30 9:57, Huang, Ying wrote: >> Hi, Miaohe, >> >> Miaohe Lin writes: >> >>> Hi all, >>> I am investigating the swap code, and I found the below possible race >>> window: >>> >>

Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage

2021-03-29 Thread Huang, Ying
would be really grateful. Thanks! :) This appears possible. Even for the swapcache case, we can't guarantee that the swap entry gotten from the page table is always valid either. The underlying swap device can be swapped off at the same time. So we use get/put_swap_device() for that. Maybe we need similar stuff here. Best Regards, Huang, Ying
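The get/put pattern referred to in this reply can be sketched in user space as a reference count that refuses new references once teardown has started. This is an illustrative analogy only: `demo_dev`, `demo_dev_tryget()` and `demo_dev_put()` are hypothetical names, and the code does not reproduce the kernel's get_swap_device()/put_swap_device().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device guarded by a reference count. */
struct demo_dev {
	atomic_int refs;	/* starts at 1: the owner's reference */
};

static void demo_dev_init(struct demo_dev *d)
{
	atomic_init(&d->refs, 1);
}

/*
 * Take a reference only if the device is still live (refs > 0).
 * Once the count has reached zero, teardown may be in progress,
 * so the caller must not touch the device.
 */
static bool demo_dev_tryget(struct demo_dev *d)
{
	int old = atomic_load_explicit(&d->refs, memory_order_relaxed);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(&d->refs, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool demo_dev_put(struct demo_dev *d)
{
	return atomic_fetch_sub_explicit(&d->refs, 1,
					 memory_order_acq_rel) == 1;
}
```

A reader brackets each access with tryget/put and bails out if tryget fails; the teardown path drops the owner's reference and frees the device's data only after the count has reached zero.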
