Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
>  wrote:
>>
>> On Fri,  5 Apr 2024 00:07:06 +
>> "Ho-Ren (Jack) Chuang"  wrote:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> >   default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> >   initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > Reviewed-by: "Huang, Ying" 
>>
>> Hi - one remaining question. Why can't we delay init for all nodes
>> to either drivers or your fallback late_initcall code?
>> It would be nice to reduce the possible code paths.
>
> I try not to change too much of the existing code structure in
> this patchset.
>
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
>
> After memory_tier_init(), hmat_init() is called, which requires
> `default_dram_type` to already be registered. This is when
> `default_dram_type` is needed. However, it is indeed possible to
> postpone the latter part, set_node_memory_tier(), to `late_initcall()`.
> So, memory_tier_init() can indeed be split into two parts, and the
> latter part can be moved to late_initcall() to be processed together.

I don't think that it's good to move all memory_tier initialization in
drivers to late_initcall().  It's natural to keep them at the
device_initcall() level.

If so, we can allocate default_dram_type in memory_tier_init(), and call
set_node_memory_tier() only in memory_tier_late_init().  We can call
memory_tier_late_init() at the device_initcall() level too.
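
A rough, untested sketch of what I mean (the function and variable names
below just follow the existing patch; nothing here is final):

	/* In memory_tier_init(): only allocate default_dram_type. */
	mutex_lock(&memory_tier_lock);
	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
						      &default_memory_types);
	if (IS_ERR(default_dram_type))
		panic("%s() failed to allocate default DRAM tier\n", __func__);
	mutex_unlock(&memory_tier_lock);

	/* In memory_tier_late_init(): set up tiers for all memory nodes. */
	mutex_lock(&memory_tier_lock);
	for_each_node_state(nid, N_MEMORY)
		if (node_memory_types[nid].memtype == NULL)
			set_node_memory_tier(nid);
	establish_demotion_targets();
	mutex_unlock(&memory_tier_lock);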

--
Best Regards,
Huang, Ying

> Doing this, all memory-type drivers would have to call late_initcall() to
> register a memory tier. I'm not sure how many of them there are.
>
> What do you guys think?
>
>>
>> Jonathan
>>
>>
>> > ---
>> >  mm/memory-tiers.c | 94 +++
>> >  1 file changed, 70 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 516b144fd45a..6632102bd5c9 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c

Re: [PATCH v8 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-28 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
>
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
>
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
>
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUlist nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids deadlock
> but also prevents holding a large lock simultaneously.
>
> * Upgrade `set_node_memory_tier` to support additional cases, including
>   default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
>
> * Introduce `default_memory_types` for those memory types that are not
>   initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
>
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Signed-off-by: Hao Xiang 
> Reviewed-by: "Huang, Ying" 
> ---
>  mm/memory-tiers.c | 94 +++
>  1 file changed, 78 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 974af10cfdd8..e24fc3bebae4 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -36,6 +36,11 @@ struct node_memory_type_map {
>  
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * The list is used to store all memory types that are not created
> + * by a device driver.
> + */
> +static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
>  
> @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
>  
>  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>  
> +/* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, 
> struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>   struct memory_tier *memtier;
> - struct memory_dev_type *memtype;
> + struct memory_dev_type *mtype = default_dram_type;
> + int adist = MEMTIER_ADISTANCE_DRAM;
>   pg_data_t *pgdat = NODE_DATA(node);
>  
>  
> @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int 
> node)
>   if (!node_state(node, N_MEMORY))
>   return ERR_PTR(-EINVAL);
>  
> - __init_node_memory_type(node, default_dram_type);
> + mt_calc_adistance(node, &adist);
> + if (node_memory_types[node].memtype == NULL) {
> + mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> + if (IS_ERR(mtype)) {
> + mtype = default_dram_type;
> + pr_info("Failed to allocate a memory type. Fall back.\n");
> + }
> + }
> +
> + __init_node_memory_type(node, mtype);
>  
> - memtype = node_memory_types[node].memtype;
>

Re: [PATCH v6 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-27 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

[snip]

> @@ -655,6 +672,34 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>  
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> + int nid;
> +
> + mutex_lock(&memory_tier_lock);
> + for_each_node_state(nid, N_MEMORY)
> + if (!node_state(nid, N_CPU) &&
> + node_memory_types[nid].memtype == NULL)

Thinking about this again, it seems better to check only
"node_memory_types[nid].memtype == NULL" here.  For all nodes with N_CPU
in memory_tier_init(), "node_memory_types[nid].memtype" will be !NULL.
And it's possible (in theory) that some nodes become
"node_state(nid, N_CPU) == true" between memory_tier_init() and
memory_tier_late_init().
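
That is, something like this (untested):

	for_each_node_state(nid, N_MEMORY)
		if (node_memory_types[nid].memtype == NULL)
			/*
			 * Nodes already set up by memory_tier_init() or by a
			 * device driver have a memory type; skip them.
			 */
			set_node_memory_tier(nid);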

Otherwise, looks good to me.  Feel free to add

Reviewed-by: "Huang, Ying" 

in the future version.

> + /*
> +  * Some device drivers may have initialized memory tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */
> + set_node_memory_tier(nid);
> +
> + establish_demotion_targets();
> + mutex_unlock(&memory_tier_lock);
> +
> + return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +

[snip]

--
Best Regards,
Huang, Ying



Re: [PATCH v5 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-26 Thread Huang, Ying
   
> &default_memory_types);
>   if (IS_ERR(default_dram_type))
>   panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> @@ -868,6 +919,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized after firmware and devices 
> are
> +  * initialized.
> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*

--
Best Regards,
Huang, Ying



Re: [External] Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-25 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Fri, Mar 22, 2024 at 1:41 AM Huang, Ying  wrote:
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> >   default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> >   initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > ---
>> >  mm/memory-tiers.c | 73 ---
>> >  1 file changed, 63 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 974af10cfdd8..9396330fa162 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c
>> > @@ -36,6 +36,11 @@ struct node_memory_type_map {
>> >
>> >  static DEFINE_MUTEX(memory_tier_lock);
>> >  static LIST_HEAD(memory_tiers);
>> > +/*
>> > + * The list is used to store all memory types that are not created
>> > + * by a device driver.
>> > + */
>> > +static LIST_HEAD(default_memory_types);
>> >  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>> >  struct memory_dev_type *default_dram_type;
>> >
>> > @@ -108,6 +113,7 @@ static struct demotion_nodes *node_demotion 
>> > __read_mostly;
>> >
>> >  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>> >
>> > +static DEFINE_MUTEX(default_dram_perf_lock);
>>
>> Better to add comments about what is protected by this lock.
>>
>
> Thank you. I will add a comment like this:
> + /* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>
> I also found an error path that was not handled, and
> found that the lock could be put closer to what it protects.
> I will have them fixed in V5.
>
>> >  static bool default_dram_perf_error;
>> >  static struct access_coordinate default_dram_perf;
>> >  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
>> > @@ -505,7 +511,8 @@ static inline void __init_node_memory_type(int node, 
>> > struct memory_dev_type *mem

Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-22 Thread Huang, Ying
ate_memory_tier(memtype);
> + __init_node_memory_type(node, mtype);
> +
> + mtype = node_memory_types[node].memtype;
> + node_set(node, mtype->nodes);
> + memtier = find_create_memory_tier(mtype);
>   if (!IS_ERR(memtier))
>   rcu_assign_pointer(pgdat->memtier, memtier);
>   return memtier;
> @@ -655,6 +671,34 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>  
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> + int nid;
> +
> + mutex_lock(&memory_tier_lock);
> + for_each_node_state(nid, N_MEMORY)
> + if (!node_state(nid, N_CPU) &&
> + node_memory_types[nid].memtype == NULL)
> + /*
> +  * Some device drivers may have initialized memory tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */
> + set_node_memory_tier(nid);
> +
> + establish_demotion_targets();
> + mutex_unlock(&memory_tier_lock);
> +
> + return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +
>  static void dump_hmem_attrs(struct access_coordinate *coord, const char 
> *prefix)
>  {
>   pr_info(
> @@ -668,7 +712,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>  {
>   int rc = 0;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&default_dram_perf_lock);
>   if (default_dram_perf_error) {
>   rc = -EIO;
>   goto out;
> @@ -716,7 +760,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>   }
>  
>  out:
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&default_dram_perf_lock);
>   return rc;
>  }
>  
> @@ -732,7 +776,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   perf->read_bandwidth + perf->write_bandwidth == 0)
>   return -EINVAL;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&default_dram_perf_lock);
>   /*
>* The abstract distance of a memory node is in direct proportion to
>* its memory latency (read + write) and inversely proportional to its
> @@ -745,7 +789,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   (default_dram_perf.read_latency + 
> default_dram_perf.write_latency) *
>   (default_dram_perf.read_bandwidth + 
> default_dram_perf.write_bandwidth) /
>   (perf->read_bandwidth + perf->write_bandwidth);
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&default_dram_perf_lock);
>  
>   return 0;
>  }
> @@ -858,7 +902,8 @@ static int __init memory_tier_init(void)
>* For now we can have 4 faster memory tiers with smaller adistance
>* than default DRAM tier.
>*/
> - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> + &default_memory_types);
>   if (IS_ERR(default_dram_type))
>   panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> @@ -868,6 +913,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized after firmware and devices 
> are
> +  * initialized.
> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*

--
Best Regards,
Huang, Ying



Re: [PATCH v3 1/2] memory tier: dax/kmem: create CPUless memory tiers after obtaining HMAT info

2024-03-20 Thread Huang, Ying
 tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */
> + set_node_memory_tier(nid);
> +
> + establish_demotion_targets();
> + mutex_unlock(_tier_lock);
> +
> + return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +
>  static void dump_hmem_attrs(struct access_coordinate *coord, const char 
> *prefix)
>  {
>   pr_info(
> @@ -631,12 +698,16 @@ static void dump_hmem_attrs(struct access_coordinate 
> *coord, const char *prefix)
>   coord->read_bandwidth, coord->write_bandwidth);
>  }
>  
> +/*
> + * The lock is used to protect the default_dram_perf.
> + */
> +static DEFINE_MUTEX(mt_perf_lock);

Miscommunication here too.  This should be moved to near the
"default_dram_perf" definition.  And it protects more than just
default_dram_perf.
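
I.e. move the lock declaration (and its comment) next to what it
protects, roughly (untested):

/* Protects the default_dram_perf* variables below. */
static DEFINE_MUTEX(mt_perf_lock);
static bool default_dram_perf_error;
static struct access_coordinate default_dram_perf;
static int default_dram_perf_ref_nid = NUMA_NO_NODE;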

>  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>const char *source)
>  {
>   int rc = 0;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   if (default_dram_perf_error) {
>   rc = -EIO;
>   goto out;
> @@ -684,7 +755,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>   }
>  
>  out:
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>   return rc;
>  }
>  
> @@ -700,7 +771,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   perf->read_bandwidth + perf->write_bandwidth == 0)
>   return -EINVAL;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   /*
>* The abstract distance of a memory node is in direct proportion to
>* its memory latency (read + write) and inversely proportional to its
> @@ -713,7 +784,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   (default_dram_perf.read_latency + 
> default_dram_perf.write_latency) *
>   (default_dram_perf.read_bandwidth + 
> default_dram_perf.write_bandwidth) /
>   (perf->read_bandwidth + perf->write_bandwidth);
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>  
>   return 0;
>  }
> @@ -826,7 +897,8 @@ static int __init memory_tier_init(void)
>* For now we can have 4 faster memory tiers with smaller adistance
>* than default DRAM tier.
>*/
> - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> + default_dram_type = mt_find_alloc_memory_type(
> + MEMTIER_ADISTANCE_DRAM, &default_memory_types);
>   if (IS_ERR(default_dram_type))
>   panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> @@ -836,6 +908,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized after firmware and devices 
> are
> +  * initialized.
> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*

--
Best Regards,
Huang, Ying



Re: [External] Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-14 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Tue, Mar 12, 2024 at 2:21 AM Huang, Ying  wrote:
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * Abstract common functions into `find_alloc_memory_type()`
>>
>> We should move kmem_put_memory_types() (renamed to
>> mt_put_memory_types()?) too.  This can be put in a separate patch.
>>
>
> Will do! Thanks,
>
>
>>
>> > Since different memory devices require finding or allocating a memory type,
>> > these common steps are abstracted into a single function,
>> > `find_alloc_memory_type()`, enhancing code scalability and conciseness.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Change adist calculation code to use another new lock, mt_perf_lock.
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `mt_perf_lock` to protect
>> > `default_dram_perf`. This approach not only avoids deadlock but also
>> > prevents holding a large lock simultaneously.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > ---
>> >  drivers/acpi/numa/hmat.c | 11 ++
>> >  drivers/dax/kmem.c   | 13 +--
>> >  include/linux/acpi.h |  6 
>> >  include/linux/memory-tiers.h |  8 +
>> >  mm/memory-tiers.c| 70 +---
>> >  5 files changed, 92 insertions(+), 16 deletions(-)
>> >
>> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> > index d6b85f0f6082..28812ec2c793 100644
>> > --- a/drivers/acpi/numa/hmat.c
>> > +++ b/drivers/acpi/numa/hmat.c
>> > @@ -38,6 +38,8 @@ static LIST_HEAD(targets);
>> >  static LIST_HEAD(initiators);
>> >  static LIST_HEAD(localities);
>> >
>> > +static LIST_HEAD(hmat_memory_types);
>> > +
>>
>> HMAT isn't a device driver for specific memory devices, so I don't think we
>> should manage memory types in HMAT.
>
> I can put it back in memory-tier.c. How about the list? Do we still
> need to keep a separate list for storing late-initialized memory nodes?
> And what should the list be named if we need to remove the "hmat_" prefix?

I don't think we need a separate list for CPU-less nodes.  Just
iterate all CPU-less memory nodes.

>
>> Instead, if the memory_type of a
>> node isn't set by the driver, we should manage it in memory-tier.c as
>> fallback.
>>
>
> Do you mean some device drivers may init memory tiers between
> memory_tier_init() and late_initcall(memory_tier_late_init)?
> And is this the reason why you mentioned excluding nodes with
> "node_memory_types[nid].memtype != NULL" in memory_tier_late_init()?
> Is my understanding correct?

Yes.

>> >  static DEFINE_MUTEX(target_lock);
>> >
>> >  /*
>> > @@ -149,6 +151,12 @@ int acpi_get_genport_coordinates(u32 uid,
>> >  }
>> >  EXPORT_SYMBOL_NS_GPL(acpi_get_genport_coordinates, CXL);
>> >
>> > +struct memory_dev_type *hmat_find_alloc_memory_type(int adist)
>> > +{
>> > + return find_alloc_memory_type(adist, &hmat_memory_types);
>> > +}
>> > +EXPORT_SYMBOL_GPL(hmat_find_alloc_memory_type);
>> > +
>> >  static __init void alloc_memory_initiator(unsigned int cpu_pxm)
>> >  {
>> >   struct memory_initiator *initiator;
>> > @@ -1038,6 +1046,9 @@ static __init int hmat_init(void)
>> >   if (!hmat_set_default_dram_perf())

Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-12 Thread Huang, Ying
>  {
>   pr_info(
> @@ -636,7 +690,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>  {
>   int rc = 0;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   if (default_dram_perf_error) {
>   rc = -EIO;
>   goto out;
> @@ -684,7 +738,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>   }
>  
>  out:
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>   return rc;
>  }
>  
> @@ -700,7 +754,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   perf->read_bandwidth + perf->write_bandwidth == 0)
>   return -EINVAL;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   /*
>* The abstract distance of a memory node is in direct proportion to
>* its memory latency (read + write) and inversely proportional to its
> @@ -713,7 +767,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   (default_dram_perf.read_latency + 
> default_dram_perf.write_latency) *
>   (default_dram_perf.read_bandwidth + 
> default_dram_perf.write_bandwidth) /
>   (perf->read_bandwidth + perf->write_bandwidth);
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>  
>   return 0;
>  }
> @@ -836,6 +890,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized when HMAT information is

HMAT is platform specific; we should avoid mentioning it in general code
if possible.

> +  * available.
> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*

--
Best Regards,
Huang, Ying



Re: [PATCH v6 4/4] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-14 Thread Huang, Ying
Vishal Verma  writes:

> Add a sysfs knob for dax devices to control the memmap_on_memory setting
> if the dax device were to be hotplugged as system memory.
>
> The default memmap_on_memory setting for dax devices originating via
> pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> preserve legacy behavior. For dax devices via CXL, the default is on.
> The sysfs control allows the administrator to override the above
> defaults if needed.
>
> Cc: David Hildenbrand 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Tested-by: Li Zhijian 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 

Looks good to me!  Thanks!

Reviewed-by: "Huang, Ying" 

> ---
>  drivers/dax/bus.c   | 36 
> +
>  Documentation/ABI/testing/sysfs-bus-dax | 17 
>  2 files changed, 53 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 6226de131d17..3622b3d1c0de 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1245,6 +1245,41 @@ static ssize_t numa_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(numa_node);
>  
> +static ssize_t memmap_on_memory_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + return sysfs_emit(buf, "%d\n", dev_dax->memmap_on_memory);
> +}
> +
> +static ssize_t memmap_on_memory_store(struct device *dev,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + ssize_t rc;
> + bool val;
> +
> + rc = kstrtobool(buf, &val);
> + if (rc)
> + return rc;
> +
> + if (val == true && !mhp_supports_memmap_on_memory()) {
> + dev_dbg(dev, "memmap_on_memory is not available\n");
> + return -EOPNOTSUPP;
> + }
> +
> + guard(device)(dev);
> + if (dev_dax->memmap_on_memory != val && dev->driver &&
> + to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE)
> + return -EBUSY;
> + dev_dax->memmap_on_memory = val;
> +
> + return len;
> +}
> +static DEVICE_ATTR_RW(memmap_on_memory);
> +
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
>   struct device *dev = container_of(kobj, struct device, kobj);
> @@ -1271,6 +1306,7 @@ static struct attribute *dev_dax_attributes[] = {
>   &dev_attr_align.attr,
>   &dev_attr_resource.attr,
>   &dev_attr_numa_node.attr,
> + &dev_attr_memmap_on_memory.attr,
>   NULL,
>  };
>  
> diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
> b/Documentation/ABI/testing/sysfs-bus-dax
> index 6359f7bc9bf4..b34266bfae49 100644
> --- a/Documentation/ABI/testing/sysfs-bus-dax
> +++ b/Documentation/ABI/testing/sysfs-bus-dax
> @@ -134,3 +134,20 @@ KernelVersion:   v5.1
>  Contact: nvd...@lists.linux.dev
>  Description:
>   (RO) The id attribute indicates the region id of a dax region.
> +
> +What:/sys/bus/dax/devices/daxX.Y/memmap_on_memory
> +Date:January, 2024
> +KernelVersion:   v6.8
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RW) Control the memmap_on_memory setting if the dax device
> + were to be hotplugged as system memory. This determines whether
> + the 'altmap' for the hotplugged memory will be placed on the
> + device being hotplugged (memmap_on_memory=1) or if it will be
> + placed on regular memory (memmap_on_memory=0). This attribute
> + must be set before the device is handed over to the 'kmem'
> + driver (i.e.  hotplugged into system-ram). Additionally, this
> + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
> + memmap_on_memory parameter for memory_hotplug. This is
> + typically set on the kernel command line -
> + memory_hotplug.memmap_on_memory set to 'true' or 'force'."



Re: [PATCH v5 4/4] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-14 Thread Huang, Ying
Vishal Verma  writes:

> Add a sysfs knob for dax devices to control the memmap_on_memory setting
> if the dax device were to be hotplugged as system memory.
>
> The default memmap_on_memory setting for dax devices originating via
> pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> preserve legacy behavior. For dax devices via CXL, the default is on.
> The sysfs control allows the administrator to override the above
> defaults if needed.
>
> Cc: David Hildenbrand 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Tested-by: Li Zhijian 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c   | 38 
> +
>  Documentation/ABI/testing/sysfs-bus-dax | 17 +++
>  2 files changed, 55 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 6226de131d17..f4d3beec507c 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1245,6 +1245,43 @@ static ssize_t numa_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(numa_node);
>  
> +static ssize_t memmap_on_memory_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> +}
> +
> +static ssize_t memmap_on_memory_store(struct device *dev,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct dax_device_driver *dax_drv;
> + ssize_t rc;
> + bool val;
> +
> + rc = kstrtobool(buf, &val);
> + if (rc)
> + return rc;
> +
> + if (val == true && !mhp_supports_memmap_on_memory()) {
> + dev_dbg(dev, "memmap_on_memory is not available\n");
> + return -EOPNOTSUPP;
> + }
> +
> + guard(device)(dev);
> + dax_drv = to_dax_drv(dev->driver);

Although "struct driver" is the first member of "struct
dax_device_driver", I feel the code is fragile to depends on that.  Can
we check dev->driver directly instead?
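
For example (untested):

	guard(device)(dev);
	if (dev_dax->memmap_on_memory != val && dev->driver &&
	    to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE)
		return -EBUSY;
	dev_dax->memmap_on_memory = val;

	return len;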

--
Best Regards,
Huang, Ying

> + if (dax_drv && dev_dax->memmap_on_memory != val &&
> + dax_drv->type == DAXDRV_KMEM_TYPE)
> + return -EBUSY;
> + dev_dax->memmap_on_memory = val;
> +
> + return len;
> +}
> +static DEVICE_ATTR_RW(memmap_on_memory);
> +
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
>   struct device *dev = container_of(kobj, struct device, kobj);
> @@ -1271,6 +1308,7 @@ static struct attribute *dev_dax_attributes[] = {
>   &dev_attr_align.attr,
>   &dev_attr_resource.attr,
>   &dev_attr_numa_node.attr,
> + &dev_attr_memmap_on_memory.attr,
>   NULL,
>  };
>  
> diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
> b/Documentation/ABI/testing/sysfs-bus-dax
> index 6359f7bc9bf4..40d9965733b2 100644
> --- a/Documentation/ABI/testing/sysfs-bus-dax
> +++ b/Documentation/ABI/testing/sysfs-bus-dax
> @@ -134,3 +134,20 @@ KernelVersion:   v5.1
>  Contact: nvd...@lists.linux.dev
>  Description:
>   (RO) The id attribute indicates the region id of a dax region.
> +
> +What:/sys/bus/dax/devices/daxX.Y/memmap_on_memory
> +Date:October, 2023
> +KernelVersion:   v6.8
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RW) Control the memmap_on_memory setting if the dax device
> + were to be hotplugged as system memory. This determines whether
> + the 'altmap' for the hotplugged memory will be placed on the
> + device being hotplugged (memmap_on_memory=1) or if it will be
> + placed on regular memory (memmap_on_memory=0). This attribute
> + must be set before the device is handed over to the 'kmem'
> + driver (i.e.  hotplugged into system-ram). Additionally, this
> + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
> + memmap_on_memory parameter for memory_hotplug. This is
> + typically set on the kernel command line -
> + memory_hotplug.memmap_on_memory set to 'true' or 'force'."



Re: [PATCH v4 3/3] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-12 Thread Huang, Ying
Vishal Verma  writes:

> Add a sysfs knob for dax devices to control the memmap_on_memory setting
> if the dax device were to be hotplugged as system memory.
>
> The default memmap_on_memory setting for dax devices originating via
> pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> preserve legacy behavior. For dax devices via CXL, the default is on.
> The sysfs control allows the administrator to override the above
> defaults if needed.
>
> Cc: David Hildenbrand 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Tested-by: Li Zhijian 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c   | 32 
>  Documentation/ABI/testing/sysfs-bus-dax | 17 +
>  2 files changed, 49 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index ce1356ac6dc2..423adee6f802 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1245,6 +1245,37 @@ static ssize_t numa_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(numa_node);
>  
> +static ssize_t memmap_on_memory_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> +}
> +
> +static ssize_t memmap_on_memory_store(struct device *dev,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + ssize_t rc;
> + bool val;
> +
> + rc = kstrtobool(buf, &val);
> + if (rc)
> + return rc;
> +
> + guard(device)(dev);
> + if (dev_dax->memmap_on_memory != val &&
> +     dax_drv->type == DAXDRV_KMEM_TYPE)

Should we check "dev->driver != NULL" here, and should we move

dax_drv = to_dax_drv(dev->driver);

here with the device lock held?
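
Something like this (untested):

	guard(device)(dev);
	dax_drv = to_dax_drv(dev->driver);
	if (dax_drv && dev_dax->memmap_on_memory != val &&
	    dax_drv->type == DAXDRV_KMEM_TYPE)
		return -EBUSY;
	dev_dax->memmap_on_memory = val;

	return len;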

--
Best Regards,
Huang, Ying

> + return -EBUSY;
> + dev_dax->memmap_on_memory = val;
> +
> + return len;
> +}
> +static DEVICE_ATTR_RW(memmap_on_memory);
> +
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
>   struct device *dev = container_of(kobj, struct device, kobj);
> @@ -1271,6 +1302,7 @@ static struct attribute *dev_dax_attributes[] = {
>   &dev_attr_align.attr,
>   &dev_attr_resource.attr,
>   &dev_attr_numa_node.attr,
> + &dev_attr_memmap_on_memory.attr,
>   NULL,
>  };
>  
> diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
> b/Documentation/ABI/testing/sysfs-bus-dax
> index a61a7b186017..b1fd8bf8a7de 100644
> --- a/Documentation/ABI/testing/sysfs-bus-dax
> +++ b/Documentation/ABI/testing/sysfs-bus-dax
> @@ -149,3 +149,20 @@ KernelVersion:   v5.1
>  Contact: nvd...@lists.linux.dev
>  Description:
>   (RO) The id attribute indicates the region id of a dax region.
> +
> +What:/sys/bus/dax/devices/daxX.Y/memmap_on_memory
> +Date:October, 2023
> +KernelVersion:   v6.8
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RW) Control the memmap_on_memory setting if the dax device
> + were to be hotplugged as system memory. This determines whether
> + the 'altmap' for the hotplugged memory will be placed on the
> + device being hotplugged (memmap_on_memory=1) or if it will be
> + placed on regular memory (memmap_on_memory=0). This attribute
> + must be set before the device is handed over to the 'kmem'
> + driver (i.e.  hotplugged into system-ram). Additionally, this
> + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
> + memmap_on_memory parameter for memory_hotplug. This is
> + typically set on the kernel command line -
> + memory_hotplug.memmap_on_memory set to 'true' or 'force'."



Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-11 Thread Huang, Ying
"Verma, Vishal L"  writes:

> On Tue, 2023-12-12 at 08:30 +0800, Huang, Ying wrote:
>> Vishal Verma  writes:
>>
>> > Add a sysfs knob for dax devices to control the memmap_on_memory setting
>> > if the dax device were to be hotplugged as system memory.
>> >
>> > The default memmap_on_memory setting for dax devices originating via
>> > pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
>> > preserve legacy behavior. For dax devices via CXL, the default is on.
>> > The sysfs control allows the administrator to override the above
>> > defaults if needed.
>> >
>> > Cc: David Hildenbrand 
>> > Cc: Dan Williams 
>> > Cc: Dave Jiang 
>> > Cc: Dave Hansen 
>> > Cc: Huang Ying 
>> > Tested-by: Li Zhijian 
>> > Reviewed-by: Jonathan Cameron 
>> > Reviewed-by: David Hildenbrand 
>> > Signed-off-by: Vishal Verma 
>> > ---
>> >  drivers/dax/bus.c   | 47 
>> > +
>> >  Documentation/ABI/testing/sysfs-bus-dax | 17 
>> >  2 files changed, 64 insertions(+)
>> >
>> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>> > index 1ff1ab5fa105..2871e5188f0d 100644
>> > --- a/drivers/dax/bus.c
>> > +++ b/drivers/dax/bus.c
>> > @@ -1270,6 +1270,52 @@ static ssize_t numa_node_show(struct device *dev,
>> >  }
>> >  static DEVICE_ATTR_RO(numa_node);
>> >
>> > +static ssize_t memmap_on_memory_show(struct device *dev,
>> > +struct device_attribute *attr, char 
>> > *buf)
>> > +{
>> > +   struct dev_dax *dev_dax = to_dev_dax(dev);
>> > +
>> > +   return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
>> > +}
>> > +
>> > +static ssize_t memmap_on_memory_store(struct device *dev,
>> > + struct device_attribute *attr,
>> > + const char *buf, size_t len)
>> > +{
>> > +   struct device_driver *drv = dev->driver;
>> > +   struct dev_dax *dev_dax = to_dev_dax(dev);
>> > +   struct dax_region *dax_region = dev_dax->region;
>> > +   struct dax_device_driver *dax_drv = to_dax_drv(drv);
>> > +   ssize_t rc;
>> > +   bool val;
>> > +
>> > +   rc = kstrtobool(buf, &val);
>> > +   if (rc)
>> > +   return rc;
>> > +
>> > +   if (dev_dax->memmap_on_memory == val)
>> > +   return len;
>> > +
>> > +   device_lock(dax_region->dev);
>> > +   if (!dax_region->dev->driver) {
>> > +   device_unlock(dax_region->dev);
>> > +   return -ENXIO;
>> > +   }
>>
>> I think that it should be OK to write to "memmap_on_memory" if no driver
>> is bound to the device.  We just need to avoid writing to it while the
>> kmem driver is bound.
>
> Oh this is just a check on the region driver, not for a dax driver
> being bound to the device. It's the same as what things like
> align_store(), size_store() etc. do for dax device reconfiguration.

Sorry, I misunderstood it.

> That said, it might be okay to remove this check, as this operation
> doesn't change any attributes of the dax region (the other interfaces I
> mentioned above can affect regions, so we want to lock the region
> device). If removing the check, we'd drop the region lock acquisition
> as well.

This sounds good to me.

And is it necessary to check the driver type with device_lock() held?
Can the driver be changed between the check and taking the lock?

--
Best Regards,
Huang, Ying



Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-11 Thread Huang, Ying
Vishal Verma  writes:

> Add a sysfs knob for dax devices to control the memmap_on_memory setting
> if the dax device were to be hotplugged as system memory.
>
> The default memmap_on_memory setting for dax devices originating via
> pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> preserve legacy behavior. For dax devices via CXL, the default is on.
> The sysfs control allows the administrator to override the above
> defaults if needed.
>
> Cc: David Hildenbrand 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Tested-by: Li Zhijian 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c   | 47 
> +
>  Documentation/ABI/testing/sysfs-bus-dax | 17 
>  2 files changed, 64 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1ff1ab5fa105..2871e5188f0d 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1270,6 +1270,52 @@ static ssize_t numa_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(numa_node);
>  
> +static ssize_t memmap_on_memory_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> +}
> +
> +static ssize_t memmap_on_memory_store(struct device *dev,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + struct device_driver *drv = dev->driver;
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct dax_region *dax_region = dev_dax->region;
> + struct dax_device_driver *dax_drv = to_dax_drv(drv);
> + ssize_t rc;
> + bool val;
> +
> + rc = kstrtobool(buf, &val);
> + if (rc)
> + return rc;
> +
> + if (dev_dax->memmap_on_memory == val)
> + return len;
> +
> + device_lock(dax_region->dev);
> + if (!dax_region->dev->driver) {
> + device_unlock(dax_region->dev);
> +     return -ENXIO;
> + }

I think that it should be OK to write to "memmap_on_memory" if no driver
is bound to the device.  We just need to avoid writing to it while the
kmem driver is bound.
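
For example (untested; reuses the existing names in the patch):

	device_lock(dev);
	if (dev->driver && to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE) {
		device_unlock(dev);
		return -EBUSY;
	}
	dev_dax->memmap_on_memory = val;
	device_unlock(dev);

	return len;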

--
Best Regards,
Huang, Ying

> +
> + if (dax_drv->type == DAXDRV_KMEM_TYPE) {
> + device_unlock(dax_region->dev);
> + return -EBUSY;
> + }
> +
> + device_lock(dev);
> + dev_dax->memmap_on_memory = val;
> + device_unlock(dev);
> +
> + device_unlock(dax_region->dev);
> + return len;
> +}
> +static DEVICE_ATTR_RW(memmap_on_memory);
> +
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
>   struct device *dev = container_of(kobj, struct device, kobj);
> @@ -1296,6 +1342,7 @@ static struct attribute *dev_dax_attributes[] = {
>   &dev_attr_align.attr,
>   &dev_attr_resource.attr,
>   &dev_attr_numa_node.attr,
> + &dev_attr_memmap_on_memory.attr,
>   NULL,
>  };
>  
> diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
> b/Documentation/ABI/testing/sysfs-bus-dax
> index a61a7b186017..b1fd8bf8a7de 100644
> --- a/Documentation/ABI/testing/sysfs-bus-dax
> +++ b/Documentation/ABI/testing/sysfs-bus-dax
> @@ -149,3 +149,20 @@ KernelVersion:   v5.1
>  Contact: nvd...@lists.linux.dev
>  Description:
>   (RO) The id attribute indicates the region id of a dax region.
> +
> +What:/sys/bus/dax/devices/daxX.Y/memmap_on_memory
> +Date:October, 2023
> +KernelVersion:   v6.8
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RW) Control the memmap_on_memory setting if the dax device
> + were to be hotplugged as system memory. This determines whether
> + the 'altmap' for the hotplugged memory will be placed on the
> + device being hotplugged (memmap_on_memory=1) or if it will be
> + placed on regular memory (memmap_on_memory=0). This attribute
> + must be set before the device is handed over to the 'kmem'
> + driver (i.e.  hotplugged into system-ram). Additionally, this
> + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
> + memmap_on_memory parameter for memory_hotplug. This is
> + typically set on the kernel command line -
> + memory_hotplug.memmap_on_memory set to 'true' or 'force'."



Re: [PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-07 Thread Huang, Ying
Vishal Verma  writes:

> Add a sysfs knob for dax devices to control the memmap_on_memory setting
> if the dax device were to be hotplugged as system memory.
>
> The default memmap_on_memory setting for dax devices originating via
> pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> preserve legacy behavior. For dax devices via CXL, the default is on.
> The sysfs control allows the administrator to override the above
> defaults if needed.
>
> Cc: David Hildenbrand 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c   | 40 
> +
>  Documentation/ABI/testing/sysfs-bus-dax | 13 +++
>  2 files changed, 53 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1ff1ab5fa105..11abb57cc031 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1270,6 +1270,45 @@ static ssize_t numa_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(numa_node);
>  
> +static ssize_t memmap_on_memory_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> +}
> +
> +static ssize_t memmap_on_memory_store(struct device *dev,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct dax_region *dax_region = dev_dax->region;
> + ssize_t rc;
> + bool val;
> +
> + rc = kstrtobool(buf, &val);
> + if (rc)
> + return rc;
> +
> + if (dev_dax->memmap_on_memory == val)
> + return len;
> +
> + device_lock(dax_region->dev);
> + if (!dax_region->dev->driver) {

This still doesn't look right.  Can we check whether the current driver
is kmem, and only allow the change if it's not kmem?

--
Best Regards,
Huang, Ying

> + device_unlock(dax_region->dev);
> + return -ENXIO;
> + }
> +
> + device_lock(dev);
> + dev_dax->memmap_on_memory = val;
> + device_unlock(dev);
> +
> + device_unlock(dax_region->dev);
> + return rc == 0 ? len : rc;
> +}
> +static DEVICE_ATTR_RW(memmap_on_memory);
> +
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
>   struct device *dev = container_of(kobj, struct device, kobj);
> @@ -1296,6 +1335,7 @@ static struct attribute *dev_dax_attributes[] = {
>   &dev_attr_align.attr,
>   &dev_attr_resource.attr,
>   &dev_attr_numa_node.attr,
> + &dev_attr_memmap_on_memory.attr,
>   NULL,
>  };
>  
> diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
> b/Documentation/ABI/testing/sysfs-bus-dax
> index a61a7b186017..bb063a004e41 100644
> --- a/Documentation/ABI/testing/sysfs-bus-dax
> +++ b/Documentation/ABI/testing/sysfs-bus-dax
> @@ -149,3 +149,16 @@ KernelVersion:   v5.1
>  Contact: nvd...@lists.linux.dev
>  Description:
>   (RO) The id attribute indicates the region id of a dax region.
> +
> +What:/sys/bus/dax/devices/daxX.Y/memmap_on_memory
> +Date:October, 2023
> +KernelVersion:   v6.8
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RW) Control the memmap_on_memory setting if the dax device
> + were to be hotplugged as system memory. This determines whether
> + the 'altmap' for the hotplugged memory will be placed on the
> + device being hotplugged (memmap_on_memory=1) or if it will be
> + placed on regular memory (memmap_on_memory=0). This attribute
> + must be set before the device is handed over to the 'kmem'
> + driver (i.e.  hotplugged into system-ram).



Re: [PATCH v9 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-11-02 Thread Huang, Ying
Vishal Verma  writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> This does preclude being able to use PUD mappings in the direct map; a
> proposal to how this could be optimized in the future is laid out
> here[1].
>
> [1]: 
> https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Suggested-by: David Hildenbrand 
> Reviewed-by: Dan Williams 
> Acked-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 

LGTM, Thanks!

Reviewed-by: "Huang, Ying" 

> ---
>  mm/memory_hotplug.c | 210 
> ++--
>  1 file changed, 136 insertions(+), 74 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6be7de9efa55..b380675ab932 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,85 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> long size)
>   return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>  
> +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> +
> + /*
> +  * For memmap_on_memory, the altmaps were added on a per-memblock
> +  * basis; we have to process each individual memory block.
> +  */
> + for (cur_start = start; cur_start < start + size;
> +  cur_start += memblock_size) {
> + struct vmem_altmap *altmap = NULL;
> + struct memory_block *mem;
> +
> + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start)));
> + if (WARN_ON_ONCE(!mem))
> + continue;
> +
> + altmap = mem->altmap;
> + mem->altmap = NULL;
> +
> + remove_memory_block_devices(cur_start, memblock_size);
> +
> + arch_remove_memory(cur_start, memblock_size, altmap);
> +
> + /* Verify that all vmemmap pages have actually been freed. */
> + WARN(altmap->alloc, "Altmap not fully unmapped");
> + kfree(altmap);
> + }
> +}
> +
> +static int create_altmaps_and_memory_blocks(int nid, struct memory_group 
> *group,
> + u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> + int ret;
> +
> + for (cur_start = start; cur_start < start + size;
> +  cur_start += memblock_size) {
> + struct mhp_params params = { .pgprot =
> +  pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn = PHYS_PFN(cur_start),
> + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> + };
> +
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> + GFP_KERNEL);
> + if (!params.altmap) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> + if (ret < 0) {
> + kfree(params.altmap);
> + goto out;
> + }
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(cur_start, memblock_size,
> +   params.altm

Re: [PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-11-01 Thread Huang, Ying
Vishal Verma  writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> This does preclude being able to use PUD mappings in the direct map; a
> proposal to how this could be optimized in the future is laid out
> here[1].
>
> [1]: 
> https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Suggested-by: David Hildenbrand 
> Reviewed-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  mm/memory_hotplug.c | 213 
> ++--
>  1 file changed, 138 insertions(+), 75 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6be7de9efa55..d242e49d7f7b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> long size)
>   return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>  
> +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> +
> + /*
> +  * For memmap_on_memory, the altmaps were added on a per-memblock
> +  * basis; we have to process each individual memory block.
> +  */
> + for (cur_start = start; cur_start < start + size;
> +  cur_start += memblock_size) {
> + struct vmem_altmap *altmap = NULL;
> + struct memory_block *mem;
> +
> + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start)));
> + WARN_ON_ONCE(!mem);
> + if (!mem)
> + continue;
> +
> + altmap = mem->altmap;
> + mem->altmap = NULL;
> +
> + remove_memory_block_devices(cur_start, memblock_size);
> +
> + arch_remove_memory(cur_start, memblock_size, altmap);
> +
> + /* Verify that all vmemmap pages have actually been freed. */
> + WARN(altmap->alloc, "Altmap not fully unmapped");
> + kfree(altmap);
> + }
> +}
> +
> +static int create_altmaps_and_memory_blocks(int nid, struct memory_group 
> *group,
> + u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> + int ret;
> +
> + for (cur_start = start; cur_start < start + size;
> +  cur_start += memblock_size) {
> + struct mhp_params params = { .pgprot =
> +  pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn = PHYS_PFN(cur_start),
> + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> + };
> +
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> + GFP_KERNEL);
> + if (!params.altmap)
> + return -ENOMEM;

Use "goto out" here too?

> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> + if (ret < 0) {
> + kfree(params.altmap);
> + goto out;
> + }
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(cur_start, memblock_size,
> +   params.altmap, group);
> + if (ret) {
> + arch_remove_memory(cur_start, memblo

Re: [PATCH v7 3/3] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-29 Thread Huang, Ying
Vishal Verma  writes:

> Large amounts of memory managed by the kmem driver may come in via CXL,
> and it is often desirable to have the memmap for this memory on the new
> memory itself.
>
> Enroll kmem-managed memory for memmap_on_memory semantics if the dax
> region originates via CXL. For non-CXL dax regions, retain the existing
> default behavior of hot adding without memmap_on_memory semantics.
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 

LGTM, Thanks!

Reviewed-by: "Huang, Ying" 

> ---
>  drivers/dax/bus.h | 1 +
>  drivers/dax/dax-private.h | 1 +
>  drivers/dax/bus.c | 3 +++
>  drivers/dax/cxl.c | 1 +
>  drivers/dax/hmem/hmem.c   | 1 +
>  drivers/dax/kmem.c| 8 +++-
>  drivers/dax/pmem.c| 1 +
>  7 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 1ccd23360124..cbbf64443098 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -23,6 +23,7 @@ struct dev_dax_data {
>   struct dev_pagemap *pgmap;
>   resource_size_t size;
>   int id;
> + bool memmap_on_memory;
>  };
>  
>  struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 27cf2d79..446617b73aea 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -70,6 +70,7 @@ struct dev_dax {
>   struct ida ida;
>   struct device dev;
>   struct dev_pagemap *pgmap;
> + bool memmap_on_memory;
>   int nr_range;
>   struct dev_dax_range {
>   unsigned long pgoff;
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 0ee96e6fc426..ad9f821b8c78 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -367,6 +367,7 @@ static ssize_t create_store(struct device *dev, struct 
> device_attribute *attr,
>   .dax_region = dax_region,
>   .size = 0,
>   .id = -1,
> + .memmap_on_memory = false,
>   };
>   struct dev_dax *dev_dax = devm_create_dev_dax(&data);
>  
> @@ -1400,6 +1401,8 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data 
> *data)
>   dev_dax->align = dax_region->align;
>   ida_init(&dev_dax->ida);
>  
> + dev_dax->memmap_on_memory = data->memmap_on_memory;
> +
>   inode = dax_inode(dax_dev);
>   dev->devt = inode->i_rdev;
>   dev->bus = _bus_type;
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 8bc9d04034d6..c696837ab23c 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -26,6 +26,7 @@ static int cxl_dax_region_probe(struct device *dev)
>   .dax_region = dax_region,
>   .id = -1,
> + .size = range_len(&cxlr_dax->hpa_range),
> + .memmap_on_memory = true,
>   };
>  
>   return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 5d2ddef0f8f5..b9da69f92697 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -36,6 +36,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
>   .dax_region = dax_region,
>   .id = -1,
>   .size = region_idle ? 0 : range_len(&mri->range),
> + .memmap_on_memory = false,
>   };
>  
>   return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index c57acb73e3db..0aa6c45a4e5a 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -56,6 +57,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   unsigned long total_len = 0;
>   struct dax_kmem_data *data;
>   int i, rc, mapped = 0;
> + mhp_t mhp_flags;
>   int numa_node;
>  
>   /*
> @@ -136,12 +138,16 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>*/
>   res->flags = IORESOURCE_SYSTEM_RAM;
>  
> + mhp_flags = MHP_NID_IS_MGID;
> + if (dev_dax->memmap_on_memory)
> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> +
>   /*
>* Ensure that future kexec'd kernels will not treat
>

Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-29 Thread Huang, Ying
Vishal Verma  writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> This does preclude being able to use PUD mappings in the direct map; a
> proposal to how this could be optimized in the future is laid out
> here[1].
>
> [1]: 
> https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Suggested-by: David Hildenbrand 
> Reviewed-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  mm/memory_hotplug.c | 209 
> 
>  1 file changed, 144 insertions(+), 65 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6be7de9efa55..b97035193090 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,48 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> long size)
>   return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>  
> +static int create_altmaps_and_memory_blocks(int nid, struct memory_group 
> *group,
> + u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> + int ret;
> +
> + for (cur_start = start; cur_start < start + size;
> +  cur_start += memblock_size) {
> + struct mhp_params params = { .pgprot =
> +  pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn = PHYS_PFN(cur_start),
> + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> + };
> +
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> + GFP_KERNEL);
> + if (!params.altmap)
> + return -ENOMEM;
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> + if (ret < 0) {
> + kfree(params.altmap);

Should we call

remove_memory_blocks_and_altmaps(start, cur_start - start);

here to clean up resources?
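
A rough sketch of that idea (hypothetical; it unwinds only the memblocks
added by earlier loop iterations before propagating the error):

        ret = arch_add_memory(nid, cur_start, memblock_size, &params);
        if (ret < 0) {
                kfree(params.altmap);
                if (cur_start != start)
                        remove_memory_blocks_and_altmaps(start,
                                        cur_start - start);
                return ret;
        }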

--
Best Regards,
Huang, Ying

> + return ret;
> + }
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(cur_start, memblock_size,
> +   params.altmap, group);
> + if (ret) {
> + arch_remove_memory(cur_start, memblock_size, NULL);
> + kfree(params.altmap);
> + return ret;
> + }
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1390,10 +1432,6 @@ int __ref add_memory_resource(int nid, struct resource 
> *res, mhp_t mhp_flags)
>  {
>   struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>   enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> - struct vmem_altmap mhp_altmap = {
> - .base_pfn =  PHYS_PFN(res->start),
> - .end_pfn  =  PHYS_PFN(res->end),
> - };
>   struct memory_group *group = NULL;
>   u64 start, size;
>   bool new_node = false;
> @@ -1436,28 +1474,22 @@ int __ref add_memory_resource(int nid, struct 
> resource *res, mhp_t mhp_flags)
>   /*
>* Self hosted memmap array
>*/
> - if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> - if

Re: [PATCH v6 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-17 Thread Huang, Ying
Vishal Verma  writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> This does preclude being able to use PUD mappings in the direct map; a
> proposal to how this could be optimized in the future is laid out
> here[1].
>
> [1]: 
> https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Suggested-by: David Hildenbrand 
> Reviewed-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  mm/memory_hotplug.c | 214 
> 
>  1 file changed, 148 insertions(+), 66 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6be7de9efa55..83e5ec377aad 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,43 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> long size)
>   return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +  u64 start, u64 size, mhp_t mhp_flags)
> +{
> + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn =  PHYS_PFN(start),
> + .end_pfn  =  PHYS_PFN(start + size - 1),
> + };
> + int ret;
> +
> + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> + GFP_KERNEL);
> + if (!params.altmap)
> + return -ENOMEM;
> + }
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, start, size, &params);
> + if (ret < 0)
> + goto error;
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(start, size, params.altmap, group);
> + if (ret)
> + goto err_bdev;
> +
> + return 0;
> +
> +err_bdev:
> + arch_remove_memory(start, size, NULL);
> +error:
> + kfree(params.altmap);
> + return ret;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1388,14 +1425,10 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> long size)
>   */
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
> - struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> + unsigned long memblock_size = memory_block_size_bytes();
>   enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> - struct vmem_altmap mhp_altmap = {
> - .base_pfn =  PHYS_PFN(res->start),
> - .end_pfn  =  PHYS_PFN(res->end),
> - };
>   struct memory_group *group = NULL;
> - u64 start, size;
> + u64 start, size, cur_start;
>   bool new_node = false;
>   int ret;
>  
> @@ -1436,28 +1469,21 @@ int __ref add_memory_resource(int nid, struct 
> resource *res, mhp_t mhp_flags)
>   /*
>* Self hosted memmap array
>*/
> - if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> - if (mhp_supports_memmap_on_memory(size)) {
> - mhp_altmap.free = memory_block_memmap_on_memory_pages();
> - params.altmap = kmemdup(&mhp_altmap,
> - sizeof(struct vmem_altmap),
> - GFP_KERNEL);
> - if (!params.altmap)
&

Re: [PATCH v5 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-16 Thread Huang, Ying
"Verma, Vishal L"  writes:

> On Tue, 2023-10-17 at 13:18 +0800, Huang, Ying wrote:
>> "Verma, Vishal L"  writes:
>>
>> > On Thu, 2023-10-05 at 14:16 -0700, Dan Williams wrote:
>> > > Vishal Verma wrote:
>> > > >
>> > <..>
>> >
>> > > > +
>> > > > +   rc = kstrtobool(buf, &val);
>> > > > +   if (rc)
>> > > > +   return rc;
>> > >
>> > > Perhaps:
>> > >
>> > > if (dev_dax->memmap_on_memory == val)
>> > > return len;
>> > >
>> > > ...and skip the check below when it is going to be a nop
>> > >
>> > > > +
>> > > > +   device_lock(dax_region->dev);
>> > > > +   if (!dax_region->dev->driver) {
>> > >
>> > > Is the polarity backwards here? I.e. if the device is already
>> > > attached to
>> > > the kmem driver it is too late to modify memmap_on_memory policy.
>> >
>> > Hm this sounded logical until I tried it. After a reconfigure-
>> > device to
>> > devdax (i.e. detach kmem), I get the -EBUSY if I invert this check.
>>
>> Can you try to unbind the device via sysfs by hand and retry?
>>
> I think what is happening maybe is while kmem gets detached, the device
> goes back to another dax driver (hmem in my tests). So either way, the
> check for if (driver) or if (!driver) won't distinguish between kmem
> vs. something else.
>
> Maybe we just remove this check? Or add an explicit kmem check somehow?

I think it's good to check kmem explicitly here.
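
One possible shape for such an explicit check (a hypothetical sketch; it
assumes the kmem driver is registered under the name "kmem" and that
rejecting the write with -EBUSY is the desired behavior):

        device_lock(&dev_dax->dev);
        if (dev_dax->dev.driver &&
            !strcmp(dev_dax->dev.driver->name, "kmem")) {
                /* already handed to kmem; too late to change the policy */
                device_unlock(&dev_dax->dev);
                return -EBUSY;
        }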

--
Best Regards,
Huang, Ying



Re: [PATCH v5 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory

2023-10-16 Thread Huang, Ying
"Verma, Vishal L"  writes:

> On Thu, 2023-10-05 at 14:16 -0700, Dan Williams wrote:
>> Vishal Verma wrote:
>> >
> <..>
>
>> > +
>> > +   rc = kstrtobool(buf, &val);
>> > +   if (rc)
>> > +   return rc;
>>
>> Perhaps:
>>
>> if (dev_dax->memmap_on_memory == val)
>> return len;
>>
>> ...and skip the check below when it is going to be a nop
>>
>> > +
>> > +   device_lock(dax_region->dev);
>> > +   if (!dax_region->dev->driver) {
>>
>> Is the polarity backwards here? I.e. if the device is already attached to
>> the kmem driver it is too late to modify memmap_on_memory policy.
>
> Hm this sounded logical until I tried it. After a reconfigure-device to
> devdax (i.e. detach kmem), I get the -EBUSY if I invert this check.

Can you try to unbind the device via sysfs by hand and retry?

--
Best Regards,
Huang, Ying

>>
>> > +   device_unlock(dax_region->dev);
>> > +   return -ENXIO;
>>

[snip]



Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-07 Thread Huang, Ying
Vishal Verma  writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Suggested-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  mm/memory_hotplug.c | 162 
> 
>  1 file changed, 99 insertions(+), 63 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8d3e7427e32..77ec6f15f943 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> long size)
>   return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +  u64 start, u64 size, mhp_t mhp_flags)
> +{
> + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn =  PHYS_PFN(start),
> + .end_pfn  =  PHYS_PFN(start + size - 1),
> + };
> + int ret;
> +
> + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> + if (!params.altmap)
> + return -ENOMEM;
> +
> + memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
> + }
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, start, size, &params);
> + if (ret < 0)
> + goto error;
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(start, size, params.altmap, group);
> + if (ret)
> + goto err_bdev;
> +
> + return 0;
> +
> +err_bdev:
> + arch_remove_memory(start, size, NULL);
> +error:
> + kfree(params.altmap);
> + return ret;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1388,14 +1426,10 @@ static bool mhp_supports_memmap_on_memory(unsigned 
> long size)
>   */
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
> - struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> + unsigned long memblock_size = memory_block_size_bytes();
>   enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> - struct vmem_altmap mhp_altmap = {
> - .base_pfn =  PHYS_PFN(res->start),
> - .end_pfn  =  PHYS_PFN(res->end),
> - };
>   struct memory_group *group = NULL;
> - u64 start, size;
> + u64 start, size, cur_start;
>   bool new_node = false;
>   int ret;
>  
> @@ -1436,28 +1470,21 @@ int __ref add_memory_resource(int nid, struct 
> resource *res, mhp_t mhp_flags)
>   /*
>* Self hosted memmap array
>*/
> - if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> - if (mhp_supports_memmap_on_memory(size)) {
> - mhp_altmap.free = memory_block_memmap_on_memory_pages();
> - params.altmap = kmalloc(sizeof(struct vmem_altmap), 
> GFP_KERNEL);
> - if (!params.altmap)
> + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> + mhp_supports_memmap_on_memory(memblock_size)) {
> + for (cur_start = start; cur_start < start + size;
> +  cur_start += memblock_size) {
> + ret = add_memory_create_devices(nid, group, cur_start,
> + memblock_size,
> +  

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-08-21 Thread Huang, Ying
Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>> Alistair Popple  writes:
>>
>>> "Huang, Ying"  writes:
>>>
>>>> Hi, Alistair,
>>>>
>>>> Sorry for late response.  Just come back from vacation.
>>>
>>> Ditto for this response :-)
>>>
>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>> getting around to following all this up sooner.
>>>
>>>> Alistair Popple  writes:
>>>>
>>>>> "Huang, Ying"  writes:
>>>>>
>>>>>> Alistair Popple  writes:
>>>>>>
>>>>>>> "Huang, Ying"  writes:
>>>>>>>
>>>>>>>> Alistair Popple  writes:
>>>>>>>>
>>>>>>>>>>>> While other memory device drivers can use the general notifier 
>>>>>>>>>>>> chain
>>>>>>>>>>>> interface at the same time.
>>>>>>>>>
>>>>>>>>> How would that work in practice though? The abstract distance as far 
>>>>>>>>> as
>>>>>>>>> I can tell doesn't have any meaning other than establishing 
>>>>>>>>> preferences
>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>> the rest of the calculations on the system. So if a driver does it's 
>>>>>>>>> own
>>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is 
>>>>>>>>> in
>>>>>>>>> coordinating all that through a standard interface, whether that is 
>>>>>>>>> HMAT
>>>>>>>>> or something else.
>>>>>>>>
>>>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device 
>>>>>>>> is
>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>>
>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there 
>>>>>>>> are
>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>> can use them too.
>>>>>>>
>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>> abstract distance is derived.
>>>>>>
>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>>>> more interfaces, we can add them.  For example, the interface you
>>>>>> suggested above.
>>>>>
>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>> is so they can pass it to alloc_memory_type():
>>>>>
>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>
>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>> and the calculation of abstract distance should be done there. That
>>>>> resovles the issues about how drivers are supposed to devine adistance
>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>> calculation code.
>>>>
>>>> In the current design, the abstract distance is the key concept of
>>>> memory types and memory tiers.  And it is used as interface to allocate
>>>> memory types.  This provides more flexibility than some other interfaces
>>>> (e.g. read/write ban

Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface

2023-08-21 Thread Huang, Ying
Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>> Alistair Popple  writes:
>>
>>> Huang Ying  writes:
>>>
>>>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>>>> used for slow memory type in kmem driver.  This limits the usage of
>>>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>>>> memory).
>>>>
>>>> So, we use the general abstract distance calculation mechanism in kmem
>>>> drivers to get more accurate abstract distance on systems with proper
>>>> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>>>> fallback only.
>>>>
>>>> Now, multiple memory types may be managed by kmem.  These memory types
>>>> are put into the "kmem_memory_types" list and protected by
>>>> kmem_memory_type_lock.
>>>
>>> See below but I wonder if kmem_memory_types could be a common helper
>>> rather than kdax specific?
>>>
>>>> Signed-off-by: "Huang, Ying" 
>>>> Cc: Aneesh Kumar K.V 
>>>> Cc: Wei Xu 
>>>> Cc: Alistair Popple 
>>>> Cc: Dan Williams 
>>>> Cc: Dave Hansen 
>>>> Cc: Davidlohr Bueso 
>>>> Cc: Johannes Weiner 
>>>> Cc: Jonathan Cameron 
>>>> Cc: Michal Hocko 
>>>> Cc: Yang Shi 
>>>> Cc: Rafael J Wysocki 
>>>> ---
>>>>  drivers/dax/kmem.c   | 54 +++-
>>>>  include/linux/memory-tiers.h |  2 ++
>>>>  mm/memory-tiers.c|  2 +-
>>>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>> index 898ca9505754..837165037231 100644
>>>> --- a/drivers/dax/kmem.c
>>>> +++ b/drivers/dax/kmem.c
>>>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>>>struct resource *res[];
>>>>  };
>>>>  
>>>> -static struct memory_dev_type *dax_slowmem_type;
>>>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>>>> +static LIST_HEAD(kmem_memory_types);
>>>> +
>>>> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
>>>> +{
>>>> +  bool found = false;
>>>> +  struct memory_dev_type *mtype;
>>>> +
>>>> +  mutex_lock(&kmem_memory_type_lock);
>>>> +  list_for_each_entry(mtype, &kmem_memory_types, list) {
>>>> +  if (mtype->adistance == adist) {
>>>> +  found = true;
>>>> +  break;
>>>> +  }
>>>> +  }
>>>> +  if (!found) {
>>>> +  mtype = alloc_memory_type(adist);
>>>> +  if (!IS_ERR(mtype))
>>>> +  list_add(&mtype->list, &kmem_memory_types);
>>>> +  }
>>>> +  mutex_unlock(&kmem_memory_type_lock);
>>>> +
>>>> +  return mtype;
>>>> +}
>>>> +
>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>  {
>>>>    struct device *dev = &dev_dax->dev;
>>>>unsigned long total_len = 0;
>>>>struct dax_kmem_data *data;
>>>> +  struct memory_dev_type *mtype;
>>>>int i, rc, mapped = 0;
>>>>int numa_node;
>>>> +  int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>>>  
>>>>/*
>>>> * Ensure good NUMA information for the persistent memory.
>>>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>return -EINVAL;
>>>>}
>>>>  
>>>> +  mt_calc_adistance(numa_node, &adist);
>>>> +  mtype = kmem_find_alloc_memorty_type(adist);
>>>> +  if (IS_ERR(mtype))
>>>> +  return PTR_ERR(mtype);
>>>> +
>>>
>>> I wrote my own quick and dirty module to test this and wrote basically
>>> the same code sequence.
>>>
>>> I notice your using a list of memory types here though. I think it would
>>> be nice to have a common helper that other users could call to do the
>>> mt_calc_adistance() / kmem_find_alloc_memory_type() /
>>> init_node_memory_type() sequence and cleanup as my naive approach would
>>> result in a new memory_dev_type per device even though adist might be
>>> the same. A common helper would make it easy to de-dup those.
>>
>> If it's useful, we can move kmem_find_alloc_memory_type() to
>> memory-tier.c after some revision.  But I tend to move it after we have
>> the second user.  What do you think about that?
>
> Usually I would agree, but this series already introduces a general
> interface for calculating adist even though there's only one user and
> implementation. So if we're going to add a general interface I think it
> would be better to make it more usable now rather than after variations
> of it have been cut and pasted into other drivers.

In general, I would prefer to introduce complexity only when it becomes
necessary.  So, let's discuss the necessity of the general interface first.
We can do that in [1/4] of the series.
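
For reference, a sketch of what a shared helper in mm/memory-tiers.c could
look like once there is a second user (the name, the per-caller list
parameter, and leaving locking to the caller are assumptions):

static struct memory_dev_type *
mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
{
        struct memory_dev_type *mtype;

        /* reuse an existing type with the same abstract distance */
        list_for_each_entry(mtype, memory_types, list)
                if (mtype->adistance == adist)
                        return mtype;

        mtype = alloc_memory_type(adist);
        if (!IS_ERR(mtype))
                list_add(&mtype->list, memory_types);

        return mtype;
}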

--
Best Regards,
Huang, Ying



Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT

2023-08-21 Thread Huang, Ying
Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>> Alistair Popple  writes:
>>
>>> Huang Ying  writes:
>>>
>>>> A memory tiering abstract distance calculation algorithm based on ACPI
>>>> HMAT is implemented.  The basic idea is as follows.
>>>>
>>>> The performance attributes of system default DRAM nodes are recorded
>>>> as the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.
>>>> Then, the ratio of the abstract distance of a memory node (target) to
>>>> MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
>>>> attributes of the node to that of the default DRAM nodes.
>>>
>>> The problem I encountered here with the calculations is that HBM memory
>>> ended up in a lower-tiered node which isn't what I wanted (at least when
>>> that HBM is attached to a GPU say).
>>
>> I have tested the series on a server machine with HBM (pure HBM, not
>> attached to a GPU).  Where, HBM is placed in a higher tier than DRAM.
>
> Good to know.
>
>>> I suspect this is because the calculations are based on the CPU
>>> point-of-view (access1) which still sees lower bandwidth to remote HBM
>>> than local DRAM, even though the remote GPU has higher bandwidth access
>>> to that memory. Perhaps we need to be considering access0 as well?
>>> Ie. HBM directly attached to a generic initiator should be in a higher
>>> tier regardless of CPU access characteristics?
>>
>> What's your requirements for memory tiers on the machine?  I guess you
>> want to put GPU attache HBM in a higher tier and put DRAM in a lower
>> tier.  So, cold HBM pages can be demoted to DRAM when there are memory
>> pressure on HBM?  This sounds reasonable from GPU point of view.
>
> Yes, that is what I would like to implement.
>
>> The above requirements may be satisfied via calculating abstract
>> distance based on access0 (or combined with access1).  But I suspect
>> this will be a general solution.  I guess that any memory devices that
>> are used mainly by the memory initiators other than CPUs want to put
>> themselves in a higher memory tier than DRAM, regardless of its
>> access0.
>
> Right. I'm still figuring out how ACPI HMAT fits together but that
> sounds reasonable.
>
>> One solution is to put GPU HBM in the highest memory tier (with smallest
>> abstract distance) always in GPU device driver regardless its HMAT
>> performance attributes.  Is it possible?
>
> It's certainly possible and easy enough to do, although I think it would
> be good to provide upper and lower bounds for HMAT derived adistances to
> make that easier. It does make me wonder what the point of HMAT is if we
> have to ignore it in some scenarios though. But perhaps I need to dig
> deeper into the GPU values to figure out how it can be applied correctly
> there.

In the original design (page 11 of [1]),

[1] 
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

the default memory tier hierarchy is based on the performance from CPU
point of view.  Then the abstract distance of a memory type (e.g., GPU
HBM) can be adjusted via a sysfs knob
(/abstract_distance_offset) based on the requirements of
GPU.

That's another possible solution.

--
Best Regards,
Huang, Ying




Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-08-21 Thread Huang, Ying
Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>> Hi, Alistair,
>>
>> Sorry for late response.  Just come back from vacation.
>
> Ditto for this response :-)
>
> I see Andrew has taken this into mm-unstable though, so my bad for not
> getting around to following all this up sooner.
>
>> Alistair Popple  writes:
>>
>>> "Huang, Ying"  writes:
>>>
>>>> Alistair Popple  writes:
>>>>
>>>>> "Huang, Ying"  writes:
>>>>>
>>>>>> Alistair Popple  writes:
>>>>>>
>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>> interface at the same time.
>>>>>>>
>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>> or something else.
>>>>>>
>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>
>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>> can use them too.
>>>>>
>>>>> Argh! So we could address my concerns by having drivers feed
>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>> have the notifier chains return the raw performance data from which the
>>>>> abstract distance is derived.
>>>>
>>>> Now, memory device drivers only need a general interface to get the
>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>> more interfaces, we can add them.  For example, the interface you
>>>> suggested above.
>>>
>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>> distance, it's a meaningless number. The only reason they care about it
>>> is so they can pass it to alloc_memory_type():
>>>
>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>
>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>> and the calculation of abstract distance should be done there. That
>>> resovles the issues about how drivers are supposed to devine adistance
>>> and also means that when CDAT is added we don't have to duplicate the
>>> calculation code.
>>
>> In the current design, the abstract distance is the key concept of
>> memory types and memory tiers.  And it is used as interface to allocate
>> memory types.  This provides more flexibility than some other interfaces
>> (e.g. read/write bandwidth/latency).  For example, in current
>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
>> to support some systems now.  On a system without HMAT/CDAT, it's
>> possible to calculate abstract distance from ACPI SLIT, although this is
>> quite limited.  I'm not sure whether all systems will provide read/write
>> bandwith/latency data for all memory devices.
>>
>> HMAT and CDAT or some other mechanisms may provide the read/write
>> bandwidth/latency data to be used to calculate abstract distance.  For
>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>> from read/write bandwith/latency to the abstract distance.  Can this
>> solve your concerns about the consistency among algorithms?  If so, we
>> can do that when we add the second algorithm that needs that.
>
> I guess it would address my concerns if we did that now. I don't see why
> we need to wait for a second implementation for that though - the whole
> series seems to be built around adding a framework for supporting
> multiple algorithms even though only one exists. So I think we should
> support that fully, or simplfy the whole thing and just assume the only
> thing that exists is HMAT and get rid of the general interface until a
> second algorithm comes along.

We will need a general interface even for a single algorithm implementation,
because it's not good to make a dax subsystem driver (dax/kmem) depend on an
ACPI subsystem driver (acpi/hmat).  We need some general interface at the
subsystem level (memory tier here) between them.
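
Concretely, the decoupling boils down to a small notifier-based interface in
mm/memory-tiers.c, roughly along these lines (a sketch; the registration
function and the HMAT callback name are assumptions):

        /* provider side (e.g. acpi/hmat): register an adistance algorithm */
        static struct notifier_block hmat_adist_nb = {
                .notifier_call = hmat_calculate_adistance,
        };
        register_mt_adistance_algorithm(&hmat_adist_nb);

        /* user side (e.g. dax/kmem): ask for a node's abstract distance */
        int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
        mt_calc_adistance(numa_node, &adist);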

Best Regards,
Huang, Ying



Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-08-14 Thread Huang, Ying
"Verma, Vishal L"  writes:

> On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote:
>> Vishal Verma  writes:
>>
>> >
>> > @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
>> >  }
>> >  EXPORT_SYMBOL(try_offline_node);
>> >
>> > -static int __ref try_remove_memory(u64 start, u64 size)
>> > +static void __ref __try_remove_memory(int nid, u64 start, u64 size,
>> > +struct vmem_altmap *altmap)
>> >  {
>> > -   struct vmem_altmap mhp_altmap = {};
>> > -   struct vmem_altmap *altmap = NULL;
>> > -   unsigned long nr_vmemmap_pages;
>> > -   int rc = 0, nid = NUMA_NO_NODE;
>> > +   /* remove memmap entry */
>> > +   firmware_map_remove(start, start + size, "System RAM");
>>
>> If mhp_supports_memmap_on_memory(), we will call
>> firmware_map_add_hotplug() for whole range.  But here we may call
>> firmware_map_remove() for part of range.  Is it OK?
>>
>
> Good point, this is a discrepancy in the add vs remove path. Can the
> firmware memmap entries be moved up a bit in the add path, and is it
> okay to create these for each memblock? Or should these be for the
> whole range? I'm not familiar with the implications. (I've left it as
> is for v3 for now, but depending on the direction I can update in a
> future rev).

Cced more firmware map developers and maintainers.

Per my understanding, we should create one firmware memmap entry for
each memblock.
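
A sketch of what that could look like on the add path (hypothetical; it
reuses the same memblock loop that creates the altmaps):

        for (cur_start = start; cur_start < start + size;
             cur_start += memblock_size)
                firmware_map_add_hotplug(cur_start, cur_start + memblock_size,
                                         "System RAM");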

--
Best Regards,
Huang, Ying



Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-08-14 Thread Huang, Ying
"Verma, Vishal L"  writes:

> On Mon, 2023-07-24 at 11:16 +0800, Huang, Ying wrote:
>> "Aneesh Kumar K.V"  writes:
>> >
>> > > @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid,
>> > > struct resource *res, mhp_t mhp_flags)
>> > > /*
>> > >  * Self hosted memmap array
>> > >  */
>> > > -   if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> > > -   if (!mhp_supports_memmap_on_memory(size)) {
>> > > -   ret = -EINVAL;
>> > > +   if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
>> > > +   mhp_supports_memmap_on_memory(memblock_size)) {
>> > > +   for (cur_start = start; cur_start < start + size;
>> > > +cur_start += memblock_size) {
>> > > +   ret = add_memory_create_devices(nid, group, cur_start,
>> > > +   memblock_size,
>> > > +   mhp_flags);
>> > > +   if (ret)
>> > > +   goto error;
>> > > +   }
>> >
>> > We should handle the below error details here.
>> >
>> > 1) If we hit an error after some blocks got added, should we
>> > iterate over rest of the dev_dax->nr_range.
>> > 2) With some blocks added if we return a failure here, we remove
>> > the
>> > resource in dax_kmem. Is that ok?
>> >
>> > IMHO error handling with partial creation of memory blocks in a
>> > resource range should be
>> > documented with this change.
>>
>> Or, should we remove all added memory blocks upon error?
>>
> I didn't address these in v3 - I wasn't sure how we'd proceed here.
> Something obviously went very wrong and I'd imagine it is okay if this
> memory is unusable as a result.
>
> What would removing the blocks we added look like? Just call
> try_remove_memory() from the error path in add_memory_resource()? (for
> a range of [start, cur_start) ?

I guess that we can just keep the original behavior.  Originally, if
something went wrong, arch_remove_memory() and remove_memory_block() (in
create_memory_block_devices()) would be called for the entire added memory
range.  So, we should do that here too?

--
Best Regards,
Huang, Ying



Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT

2023-08-11 Thread Huang, Ying
Hi, Rao,

Bharata B Rao  writes:

> On 24-Jul-23 11:28 PM, Andrew Morton wrote:
>> On Fri, 21 Jul 2023 14:15:31 +1000 Alistair Popple  
>> wrote:
>> 
>>> Thanks for this Huang, I had been hoping to take a look at it this week
>>> but have run out of time. I'm keen to do some testing with it as well.
>> 
>> Thanks.  I'll queue this in mm-unstable for some testing.  Detailed
>> review and testing would be appreciated.
>
> I gave this series a try on a 2P system with 2 CXL cards. I don't trust the
> bandwidth and latency numbers reported by HMAT here, but FWIW, this patchset
> puts the CXL nodes on a lower tier than DRAM nodes.

Thank you very much!

Can I add your "Tested-by" for the series?

--
Best Regards,
Huang, Ying



Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-08-10 Thread Huang, Ying
Hi, Alistair,

Sorry for the late response.  I just came back from vacation.

Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>> Alistair Popple  writes:
>>
>>> "Huang, Ying"  writes:
>>>
>>>> Alistair Popple  writes:
>>>>
>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>> interface at the same time.
>>>>>
>>>>> How would that work in practice though? The abstract distance as far as
>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>> or something else.
>>>>
>>>> Only if different algorithms follow the same basic principle.  For
>>>> example, the abstract distance of default DRAM nodes are fixed
>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>> in linear direct proportion to the memory latency and inversely
>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>> bandwidth of default DRAM nodes as base.
>>>>
>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>> some other methods to report the raw memory latency and bandwidth, we
>>>> can use them too.
>>>
>>> Argh! So we could address my concerns by having drivers feed
>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>> have the notifier chains return the raw performance data from which the
>>> abstract distance is derived.
>>
>> Now, memory device drivers only need a general interface to get the
>> abstract distance from the NUMA node ID.  In the future, if they need
>> more interfaces, we can add them.  For example, the interface you
>> suggested above.
>
> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
> distance, it's a meaningless number. The only reason they care about it
> is so they can pass it to alloc_memory_type():
>
> struct memory_dev_type *alloc_memory_type(int adistance)
>
> Instead alloc_memory_type() should be taking bandwidth/latency numbers
> and the calculation of abstract distance should be done there. That
> resolves the issues about how drivers are supposed to divine adistance
> and also means that when CDAT is added we don't have to duplicate the
> calculation code.

In the current design, the abstract distance is the key concept of
memory types and memory tiers.  And it is used as interface to allocate
memory types.  This provides more flexibility than some other interfaces
(e.g. read/write bandwidth/latency).  For example, in current
dax/kmem.c, if HMAT isn't available in the system, the default abstract
distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
to support some systems now.  On a system without HMAT/CDAT, it's
possible to calculate abstract distance from ACPI SLIT, although this is
quite limited.  I'm not sure whether all systems will provide read/write
bandwidth/latency data for all memory devices.

HMAT and CDAT or some other mechanisms may provide the read/write
bandwidth/latency data to be used to calculate abstract distance.  For
them, we can provide a shared implementation in mm/memory-tiers.c to map
from read/write bandwidth/latency to the abstract distance.  Can this
solve your concerns about the consistency among algorithms?  If so, we
can do that when we add the second algorithm that needs that.

--
Best Regards,
Huang, Ying



Re: [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators()

2023-08-10 Thread Huang, Ying
Hi, Jonathan,

Thanks for review!

Jonathan Cameron  writes:

> On Fri, 21 Jul 2023 09:29:30 +0800
> Huang Ying  wrote:
>
>> Previously, in hmat_register_target_initiators(), the performance
>> attributes are calculated and the corresponding sysfs links and files
>> are created too.  Which is called during memory onlining.
>> 
>> But now, to calculate the abstract distance of a memory target before
>> memory onlining, we need to calculate the performance attributes for
>> a memory target without creating sysfs links and files.
>> 
>> To do that, hmat_register_target_initiators() is refactored to make it
>> possible to calculate performance attributes separately.
>> 
>> Signed-off-by: "Huang, Ying" 
>> Cc: Aneesh Kumar K.V 
>> Cc: Wei Xu 
>> Cc: Alistair Popple 
>> Cc: Dan Williams 
>> Cc: Dave Hansen 
>> Cc: Davidlohr Bueso 
>> Cc: Johannes Weiner 
>> Cc: Jonathan Cameron 
>> Cc: Michal Hocko 
>> Cc: Yang Shi 
>> Cc: Rafael J Wysocki 
>
> Unfortunately I don't think I still have the tables I used to test the
> generic initiator and won't get time to generate them all again in
> next few weeks.  So just a superficial review for now.
> I 'think' the cleanup looks good but the original code was rather fiddly
> so I'm not 100% sure nothing is missed.
>
> One comment inline on the fact the list is now sorted twice.
>
>
>> ---
>>  drivers/acpi/numa/hmat.c | 81 +++-
>>  1 file changed, 30 insertions(+), 51 deletions(-)
>> 
>> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> index bba268ecd802..2dee0098f1a9 100644
>> --- a/drivers/acpi/numa/hmat.c
>> +++ b/drivers/acpi/numa/hmat.c
>> @@ -582,28 +582,25 @@ static int initiators_to_nodemask(unsigned long 
>> *p_nodes)
>>  return 0;
>>  }
>>  
>> -static void hmat_register_target_initiators(struct memory_target *target)
>> +static void hmat_update_target_attrs(struct memory_target *target,
>> + unsigned long *p_nodes, int access)
>>  {
>> -static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
>>  struct memory_initiator *initiator;
>> -unsigned int mem_nid, cpu_nid;
>> +unsigned int cpu_nid;
>>  struct memory_locality *loc = NULL;
>>  u32 best = 0;
>> -bool access0done = false;
>>  int i;
>>  
>> -mem_nid = pxm_to_node(target->memory_pxm);
>> +bitmap_zero(p_nodes, MAX_NUMNODES);
>>  /*
>> - * If the Address Range Structure provides a local processor pxm, link
>> + * If the Address Range Structure provides a local processor pxm, set
>>   * only that one. Otherwise, find the best performance attributes and
>> - * register all initiators that match.
>> + * collect all initiators that match.
>>   */
>>  if (target->processor_pxm != PXM_INVAL) {
>>  cpu_nid = pxm_to_node(target->processor_pxm);
>> -register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
>> -access0done = true;
>> -if (node_state(cpu_nid, N_CPU)) {
>> -register_memory_node_under_compute_node(mem_nid, 
>> cpu_nid, 1);
>> +if (access == 0 || node_state(cpu_nid, N_CPU)) {
>> +set_bit(target->processor_pxm, p_nodes);
>>  return;
>>  }
>>  }
>> @@ -617,47 +614,10 @@ static void hmat_register_target_initiators(struct 
>> memory_target *target)
>>   * We'll also use the sorting to prime the candidate nodes with known
>>   * initiators.
>>   */
>> -bitmap_zero(p_nodes, MAX_NUMNODES);
>>  list_sort(NULL, &initiators, initiator_cmp);
>>  if (initiators_to_nodemask(p_nodes) < 0)
>>      return;
>
> One result of this refactor is that a few things run twice, that previously 
> only ran once
> like this list_sort()
> Not necessarily a problem though as probably fairly cheap.

Yes.  The original code sorts once for each target, but that appears to be
unnecessary too.  We can sort the initiators list when adding a new item to
it in alloc_memory_initiator().  If necessary, I can add an additional patch
to do that.  But as you said, it may be unnecessary because the sort should
be fairly cheap.
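
A minimal sketch of such a sorted insert (hypothetical; it assumes the
existing initiator_cmp() comparator and the global "initiators" list):

static void add_initiator_sorted(struct memory_initiator *new)
{
        struct list_head *prev = &initiators;
        struct memory_initiator *pos;

        /* keep the list sorted at insertion time instead of
         * sorting it once per target */
        list_for_each_entry(pos, &initiators, node) {
                if (initiator_cmp(NULL, &new->node, &pos->node) < 0)
                        break;
                prev = &pos->node;
        }
        list_add(&new->node, prev);
}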

--
Best Regards,
Huang, Ying

>>  
>> -if (!access0done) {
>> -for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
>> -loc = localities_types[i];
>> -if

Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-26 Thread Huang, Ying
Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>> Alistair Popple  writes:
>>
>>> "Huang, Ying"  writes:
>>>
>>>>>> And, I don't think that we are forced to use the general notifier
>>>>>> chain interface in all memory device drivers.  If the memory device
>>>>>> driver has better understanding of the memory device, it can use other
>>>>>> way to determine abstract distance.  For example, a CXL memory device
>>>>>> driver can identify abstract distance by itself.  While other memory
>>>>>> device drivers can use the general notifier chain interface at the
>>>>>> same time.
>>>>>
>>>>> Whilst I think personally I would find that flexibility useful I am
>>>>> concerned it means every driver will just end up divining it's own
>>>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>>>> would kind of defeat the purpose of it all then.
>>>>
>>>> But we have no way to enforce that too.
>>>
>>> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
>>> we can influence it. If drivers can easily ignore the notifier chain and
>>> do their own thing that's what will happen.
>>
>> IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
>> general interface we provided.  Anyway, we should try to make HMAT/CDAT
>> works well, so drivers want to use them :-)
>
> Exactly :-)
>
>>>>>> While other memory device drivers can use the general notifier chain
>>>>>> interface at the same time.
>>>
>>> How would that work in practice though? The abstract distance as far as
>>> I can tell doesn't have any meaning other than establishing preferences
>>> for memory demotion order. Therefore all calculations are relative to
>>> the rest of the calculations on the system. So if a driver does it's own
>>> thing how does it choose a sensible distance? IHMO the value here is in
>>> coordinating all that through a standard interface, whether that is HMAT
>>> or something else.
>>
>> Only if different algorithms follow the same basic principle.  For
>> example, the abstract distance of default DRAM nodes are fixed
>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>> in linear direct proportion to the memory latency and inversely
>> proportional to the memory bandwidth.  Use the memory latency and
>> bandwidth of default DRAM nodes as base.
>>
>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>> some other methods to report the raw memory latency and bandwidth, we
>> can use them too.
>
> Argh! So we could address my concerns by having drivers feed
> latency/bandwidth numbers into a standard calculation algorithm right?
> Ie. Rather than having drivers calculate abstract distance themselves we
> have the notifier chains return the raw performance data from which the
> abstract distance is derived.

Now, memory device drivers only need a general interface to get the
abstract distance from the NUMA node ID.  In the future, if they need
more interfaces, we can add them.  For example, the interface you
suggested above.

--
Best Regards,
Huang, Ying



Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-26 Thread Huang, Ying
Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>>>> The other way (suggested by this series) is to make dax/kmem call a
>>>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>>>> device and calculate the distance if the type is correct for them.  I
>>>> don't think that it's good to make dax/kem to know every possible
>>>> types of memory devices.
>>>
>>> Do we expect there to be lots of different types of memory devices
>>> sharing a common dax/kmem driver though? Must admit I'm coming from a
>>> GPU background where we'd expect each type of device to have it's own
>>> driver anyway so wasn't expecting different types of memory devices to
>>> be handled by the same driver.
>>
>> Now, dax/kmem.c is used for
>>
>> - PMEM (Optane DCPMM, or AEP)
>> - CXL.mem
>> - HBM (attached to CPU)
>
> Thanks a lot for the background! I will admit to having a fairly narrow
> focus here.
>
>>>> And, I don't think that we are forced to use the general notifier
>>>> chain interface in all memory device drivers.  If the memory device
>>>> driver has better understanding of the memory device, it can use other
>>>> way to determine abstract distance.  For example, a CXL memory device
>>>> driver can identify abstract distance by itself.  While other memory
>>>> device drivers can use the general notifier chain interface at the
>>>> same time.
>>>
>>> Whilst I think personally I would find that flexibility useful I am
>>> concerned it means every driver will just end up divining it's own
>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>> would kind of defeat the purpose of it all then.
>>
>> But we have no way to enforce that too.
>
> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
> we can influence it. If drivers can easily ignore the notifier chain and
> do their own thing that's what will happen.

IMHO, the same applies to both: enforcing that HMAT/CDAT/etc. are correct
and enforcing that drivers use the general interface we provide.  Anyway, we
should try to make HMAT/CDAT work well, so that drivers want to use them :-)

>>>> While other memory device drivers can use the general notifier chain
>>>> interface at the same time.
>
> How would that work in practice though? The abstract distance as far as
> I can tell doesn't have any meaning other than establishing preferences
> for memory demotion order. Therefore all calculations are relative to
> the rest of the calculations on the system. So if a driver does its own
> thing how does it choose a sensible distance? IMHO the value here is in
> coordinating all that through a standard interface, whether that is HMAT
> or something else.

Only if different algorithms follow the same basic principle.  For
example, the abstract distance of the default DRAM nodes is fixed
(MEMTIER_ADISTANCE_DRAM).  The abstract distance of a memory device is
in direct proportion to its memory latency and inversely proportional
to its memory bandwidth, using the memory latency and bandwidth of the
default DRAM nodes as the base.

HMAT and CDAT report the raw memory latency and bandwidth.  If there are
some other methods to report the raw memory latency and bandwidth, we
can use them too.
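
As a sketch, the scaling described above amounts to something like the
following (the struct and field names here are illustrative assumptions):

/* derive a target node's abstract distance from its latency and
 * bandwidth relative to the default DRAM nodes */
static int perf_to_adistance(struct node_perf *perf,
                             struct node_perf *dram)
{
        return MEMTIER_ADISTANCE_DRAM *
                (perf->read_latency + perf->write_latency) /
                (dram->read_latency + dram->write_latency) *
                (dram->read_bandwidth + dram->write_bandwidth) /
                (perf->read_bandwidth + perf->write_bandwidth);
}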

--
Best Regards,
Huang, Ying



Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-26 Thread Huang, Ying
Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>> Hi, Alistair,
>>
>> Thanks a lot for comments!
>>
>> Alistair Popple  writes:
>>
>>> Huang Ying  writes:
>>>
>>>> The abstract distance may be calculated by various drivers, such as
>>>> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
>>>> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
>>>> users and the providers, the abstract distance calculation algorithms
>>>> management mechanism is implemented in this patch.  It provides
>>>> interface for the providers to register the implementation, and
>>>> interface for the users.
>>>
>>> I wonder if we need this level of decoupling though? It seems to me like
>>> it would be simpler and better for drivers to calculate the abstract
>>> distance directly themselves by calling the desired algorithm (eg. ACPI
>>> HMAT) and pass this when creating the nodes rather than having a
>>> notifier chain.
>>
>> Per my understanding, ACPI HMAT and memory device drivers (such as
>> dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
>> good to call functions across subsystems directly.  So, I think it's
>> better to use a general subsystem: memory-tier.c to decouple them.  If
>> it turns out that a notifier chain is unnecessary, we can use some
>> function pointers instead.
>>
>>> At the moment it seems we've only identified two possible algorithms
>>> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
>>> of those to fallback to the other based on priority, so why not just
>>> have drivers call the correct algorithm directly?
>>
>> For example, we have a system with PMEM (persistent memory, Optane
>> DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
>> via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
>> and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
>> types of the device and call corresponding algorithms.
>
> Yes, that is what I was thinking.
>
>> The other way (suggested by this series) is to make dax/kmem call a
>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>> device and calculate the distance if the type is correct for them.  I
>> don't think that it's good to make dax/kem to know every possible
>> types of memory devices.
>
> Do we expect there to be lots of different types of memory devices
> sharing a common dax/kmem driver though? Must admit I'm coming from a
> GPU background where we'd expect each type of device to have it's own
> driver anyway so wasn't expecting different types of memory devices to
> be handled by the same driver.

Now, dax/kmem.c is used for

- PMEM (Optane DCPMM, or AEP)
- CXL.mem
- HBM (attached to CPU)

I understand that for a CXL GPU driver it's OK to call some CXL CDAT
helper to identify the abstract distance of the memory attached to the
GPU, because there are no cross-subsystem function calls involved.  But
it doesn't look very clean to call from dax/kmem.c into CXL CDAT,
because that is a cross-subsystem function call.

>>>> Multiple algorithm implementations can cooperate via calculating
>>>> abstract distance for different memory nodes.  The preference of
>>>> algorithm implementations can be specified via
>>>> priority (notifier_block.priority).
>>>
>>> How/what decides the priority though? That seems like something better
>>> decided by a device driver than the algorithm driver IMHO.
>>
>> Do we need the memory device driver specific priority?  Or we just share
>> a common priority?  For example, the priority of CXL CDAT is always
>> higher than that of ACPI HMAT?  Or architecture specific?
>
> Ok, thanks. Having read the above I think the priority is
> unimportant. Algorithms can either decide to return a distance and
> NOTIFY_STOP_MASK if they can calculate a distance or NOTIFY_DONE if they
> can't for a specific device.

Yes.  In most cases, there are no overlaps among algorithms.
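For illustration, a provider hooked into the chain could look roughly
like the sketch below.  Everything named example_* is a placeholder I
made up; only the NOTIFY_DONE / NOTIFY_STOP convention reflects the
proposed interface.

#include <linux/memory-tiers.h>
#include <linux/module.h>
#include <linux/notifier.h>

/* Placeholder helpers standing in for a real HMAT/CDAT backend. */
static bool example_have_perf_data(int nid)
{
	return false;				/* placeholder */
}

static int example_perf_to_adistance(int nid)
{
	return MEMTIER_ADISTANCE_DRAM * 2;	/* placeholder */
}

static int example_calc_adistance(struct notifier_block *self,
				  unsigned long nid, void *data)
{
	int *adist = data;

	/* Not a node we know about: let the next algorithm try. */
	if (!example_have_perf_data(nid))
		return NOTIFY_DONE;

	*adist = example_perf_to_adistance(nid);
	/* Result produced: stop the chain, no priority games needed. */
	return NOTIFY_STOP;
}

static struct notifier_block example_adist_nb = {
	.notifier_call = example_calc_adistance,
};

static int __init example_adist_init(void)
{
	return register_mt_adistance_algorithm(&example_adist_nb);
}
module_init(example_adist_init);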

>> And, I don't think that we are forced to use the general notifier
>> chain interface in all memory device drivers.  If the memory device
>> driver has better understanding of the memory device, it can use other
>> way to determine abstract distance.  For example, a CXL memory device
>> driver can identify abstract distance by itself.  While other memory
>> device drivers can use the general notifier chain interface at the
>> same time.
>
> Whilst I think personally I would find that flexibility useful I am
> concerned it means every driver will just end up divining it's own
> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
> would kind of defeat the purpose of it all then.

But we have no way to enforce that, either.

--
Best Regards,
Huang, Ying



Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface

2023-07-25 Thread Huang, Ying
Alistair Popple  writes:

> Huang Ying  writes:
>
>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>> used for slow memory type in kmem driver.  This limits the usage of
>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>> memory).
>>
>> So, we use the general abstract distance calculation mechanism in kmem
>> drivers to get more accurate abstract distance on systems with proper
>> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>> fallback only.
>>
>> Now, multiple memory types may be managed by kmem.  These memory types
>> are put into the "kmem_memory_types" list and protected by
>> kmem_memory_type_lock.
>
> See below but I wonder if kmem_memory_types could be a common helper
> rather than kdax specific?
>
>> Signed-off-by: "Huang, Ying" 
>> Cc: Aneesh Kumar K.V 
>> Cc: Wei Xu 
>> Cc: Alistair Popple 
>> Cc: Dan Williams 
>> Cc: Dave Hansen 
>> Cc: Davidlohr Bueso 
>> Cc: Johannes Weiner 
>> Cc: Jonathan Cameron 
>> Cc: Michal Hocko 
>> Cc: Yang Shi 
>> Cc: Rafael J Wysocki 
>> ---
>>  drivers/dax/kmem.c   | 54 +++-
>>  include/linux/memory-tiers.h |  2 ++
>>  mm/memory-tiers.c|  2 +-
>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 898ca9505754..837165037231 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>  struct resource *res[];
>>  };
>>  
>> -static struct memory_dev_type *dax_slowmem_type;
>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>> +static LIST_HEAD(kmem_memory_types);
>> +
>> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
>> +{
>> +bool found = false;
>> +struct memory_dev_type *mtype;
>> +
>> +mutex_lock(&kmem_memory_type_lock);
>> +list_for_each_entry(mtype, &kmem_memory_types, list) {
>> +if (mtype->adistance == adist) {
>> +found = true;
>> +break;
>> +}
>> +}
>> +if (!found) {
>> +mtype = alloc_memory_type(adist);
>> +if (!IS_ERR(mtype))
>> +list_add(&mtype->list, &kmem_memory_types);
>> +}
>> +mutex_unlock(&kmem_memory_type_lock);
>> +
>> +return mtype;
>> +}
>> +
>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  {
>>  struct device *dev = &dev_dax->dev;
>>  unsigned long total_len = 0;
>>  struct dax_kmem_data *data;
>> +struct memory_dev_type *mtype;
>>  int i, rc, mapped = 0;
>>  int numa_node;
>> +int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>  
>>  /*
>>   * Ensure good NUMA information for the persistent memory.
>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  return -EINVAL;
>>  }
>>  
>> +mt_calc_adistance(numa_node, &adist);
>> +mtype = kmem_find_alloc_memorty_type(adist);
>> +if (IS_ERR(mtype))
>> +return PTR_ERR(mtype);
>> +
>
> I wrote my own quick and dirty module to test this and wrote basically
> the same code sequence.
>
> I notice your using a list of memory types here though. I think it would
> be nice to have a common helper that other users could call to do the
> mt_calc_adistance() / kmem_find_alloc_memory_type() /
> init_node_memory_type() sequence and cleanup as my naive approach would
> result in a new memory_dev_type per device even though adist might be
> the same. A common helper would make it easy to de-dup those.

If it's useful, we can move kmem_find_alloc_memory_type() to
memory-tier.c after some revision.  But I would prefer to do that once
we have a second user.  What do you think about that?
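If such a second user materializes, a shared helper in memory-tier.c
could look roughly like the sketch below (illustrative only; the name
mt_find_alloc_memory_type() and the caller-supplied list head are my
assumptions, mirroring the kmem-local helper quoted above):

/*
 * Illustrative sketch for mm/memory-tiers.c, not a posted patch.
 * Callers serialize access to @memory_types themselves, as dax/kmem
 * does with kmem_memory_type_lock.
 */
struct memory_dev_type *mt_find_alloc_memory_type(int adist,
						  struct list_head *memory_types)
{
	struct memory_dev_type *mtype;

	list_for_each_entry(mtype, memory_types, list)
		if (mtype->adistance == adist)
			return mtype;

	mtype = alloc_memory_type(adist);
	if (!IS_ERR(mtype))
		list_add(&mtype->list, memory_types);

	return mtype;
}
EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);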

--
Best Regards,
Huang, Ying

>>  for (i = 0; i < dev_dax->nr_range; i++) {
>>  struct range range;
>>  
>> @@ -88,7 +119,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  return -EINVAL;
>>  }
>>  
>> -init_node_memory_type(numa_node, dax_slowmem_type);
>> +init_node_memory_type(numa_node, mtype);
>>  
>>  rc = -ENOMEM;
>>  data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
>> @@ -167,7 +198,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  err_res_name:
>>  kfr

Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT

2023-07-25 Thread Huang, Ying
Alistair Popple  writes:

> Huang Ying  writes:
>
>> A memory tiering abstract distance calculation algorithm based on ACPI
>> HMAT is implemented.  The basic idea is as follows.
>>
>> The performance attributes of system default DRAM nodes are recorded
>> as the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.
>> Then, the ratio of the abstract distance of a memory node (target) to
>> MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
>> attributes of the node to that of the default DRAM nodes.
>
> The problem I encountered here with the calculations is that HBM memory
> ended up in a lower-tiered node which isn't what I wanted (at least when
> that HBM is attached to a GPU say).

I have tested the series on a server machine with HBM (pure HBM, not
attached to a GPU), where HBM is placed in a higher tier than DRAM.

> I suspect this is because the calculations are based on the CPU
> point-of-view (access1) which still sees lower bandwidth to remote HBM
> than local DRAM, even though the remote GPU has higher bandwidth access
> to that memory. Perhaps we need to be considering access0 as well?
> Ie. HBM directly attached to a generic initiator should be in a higher
> tier regardless of CPU access characteristics?

What are your requirements for memory tiers on that machine?  I guess
you want to put GPU-attached HBM in a higher tier and DRAM in a lower
tier, so that cold HBM pages can be demoted to DRAM when there is
memory pressure on HBM?  This sounds reasonable from the GPU point of
view.

The above requirements may be satisfied by calculating the abstract
distance based on access0 (or combined with access1).  But I doubt this
will be a general solution.  I guess that any memory device that is
used mainly by memory initiators other than CPUs wants to put itself in
a higher memory tier than DRAM, regardless of its access0
characteristics.

One solution is to always put GPU HBM in the highest memory tier (with
the smallest abstract distance) in the GPU device driver, regardless of
its HMAT performance attributes.  Is that possible?

> That said I'm not entirely convinced the HMAT tables I'm testing against
> are accurate/complete.

--
Best Regards,
Huang, Ying



Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-24 Thread Huang, Ying
Hi, Alistair,

Thanks a lot for comments!

Alistair Popple  writes:

> Huang Ying  writes:
>
>> The abstract distance may be calculated by various drivers, such as
>> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
>> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
>> users and the providers, the abstract distance calculation algorithms
>> management mechanism is implemented in this patch.  It provides
>> interface for the providers to register the implementation, and
>> interface for the users.
>
> I wonder if we need this level of decoupling though? It seems to me like
> it would be simpler and better for drivers to calculate the abstract
> distance directly themselves by calling the desired algorithm (eg. ACPI
> HMAT) and pass this when creating the nodes rather than having a
> notifier chain.

Per my understanding, ACPI HMAT and memory device drivers (such as
dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
good to call functions across subsystems directly.  So, I think it's
better to use a general subsystem: memory-tier.c to decouple them.  If
it turns out that a notifier chain is unnecessary, we can use some
function pointers instead.

> At the moment it seems we've only identified two possible algorithms
> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
> of those to fallback to the other based on priority, so why not just
> have drivers call the correct algorithm directly?

For example, we have a system with PMEM (persistent memory, Optane
DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
types of the device and call corresponding algorithms.  The other way
(suggested by this series) is to make dax/kmem call a notifier chain,
then CXL CDAT or ACPI HMAT can identify the type of device and calculate
the distance if the type is correct for them.  I don't think that it's
good to make dax/kem to know every possible types of memory devices.

>> Multiple algorithm implementations can cooperate via calculating
>> abstract distance for different memory nodes.  The preference of
>> algorithm implementations can be specified via
>> priority (notifier_block.priority).
>
> How/what decides the priority though? That seems like something better
> decided by a device driver than the algorithm driver IMHO.

Do we need the memory device driver specific priority?  Or we just share
a common priority?  For example, the priority of CXL CDAT is always
higher than that of ACPI HMAT?  Or architecture specific?

And, I don't think that we are forced to use the general notifier chain
interface in all memory device drivers.  If the memory device driver has
better understanding of the memory device, it can use other way to
determine abstract distance.  For example, a CXL memory device driver
can identify abstract distance by itself.  While other memory device drivers
can use the general notifier chain interface at the same time.

--
Best Regards,
Huang, Ying



Re: [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()

2023-07-24 Thread Huang, Ying
Vishal Verma  writes:

> In preparation for dax drivers, which can be built as modules,
> to use this interface, export it with EXPORT_SYMBOL_GPL(). Add a #else
> case for the symbol for builds without CONFIG_MEMORY_HOTPLUG.
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  include/linux/memory_hotplug.h | 5 +
>  mm/memory_hotplug.c| 1 +
>  2 files changed, 6 insertions(+)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 013c69753c91..fc5da07ad011 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -355,6 +355,11 @@ extern int arch_create_linear_mapping(int nid, u64 
> start, u64 size,
> struct mhp_params *params);
>  void arch_remove_linear_mapping(u64 start, u64 size);
>  extern bool mhp_supports_memmap_on_memory(unsigned long size);
> +#else
> +static inline bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> + return false;
> +}
>  #endif /* CONFIG_MEMORY_HOTPLUG */

It appears that there is no user of mhp_supports_memmap_on_memory() that
can be compiled with !CONFIG_MEMORY_HOTPLUG.  Is the #else stub really
needed?

--
Best Regards,
Huang, Ying

>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 3f231cf1b410..e9bcacbcbae2 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1284,6 +1284,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
>  IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>  }
> +EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>  
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug



Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-07-23 Thread Huang, Ying
Vishal Verma  writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is currently
> restricted to 'memblock_size' chunks of memory being added. Adding a
> larger span of memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met,. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Suggested-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  mm/memory_hotplug.c | 154 
> +++-
>  1 file changed, 91 insertions(+), 63 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9bcacbcbae2..20456f0d28e6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1286,6 +1286,35 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  }
>  EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +  u64 start, u64 size, mhp_t mhp_flags)
> +{
> + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {};
> + int ret;
> +
> + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> + mhp_altmap.free = PHYS_PFN(size);
> + mhp_altmap.base_pfn = PHYS_PFN(start);
> + params.altmap = &mhp_altmap;
> + }
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, start, size, &params);
> + if (ret < 0)
> + return ret;
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
> +   group);
> + if (ret) {
> + arch_remove_memory(start, size, NULL);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1294,11 +1323,10 @@ EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>   */
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
> - struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> + unsigned long memblock_size = memory_block_size_bytes();
>   enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> - struct vmem_altmap mhp_altmap = {};
>   struct memory_group *group = NULL;
> - u64 start, size;
> + u64 start, size, cur_start;
>   bool new_node = false;
>   int ret;
>  
> @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid, struct 
> resource *res, mhp_t mhp_flags)
>   /*
>* Self hosted memmap array
>*/
> - if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> - if (!mhp_supports_memmap_on_memory(size)) {
> - ret = -EINVAL;
> + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> + mhp_supports_memmap_on_memory(memblock_size)) {
> + for (cur_start = start; cur_start < start + size;
> +  cur_start += memblock_size) {
> + ret = add_memory_create_devices(nid, group, cur_start,
> + memblock_size,
> + mhp_flags);
> + if (ret)
> + goto error;
> + }
> + } else {
> + ret = add_memory_create_devices(nid, group, start, size, 
> mhp_flags);
> + if (ret)
>   goto error;

Another way to organize the code is to use a different step
(memblock_size vs. size) in the "for" loop, as sketched below.

It's not strictly necessary in this patchset.  It appears that we cannot
create a 1GB mapping if we put the memmap on the memory now, right?  If
so, is it doable to support that by separating the creation of the
memory mapping from arch_add_memory()?
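A minimal sketch of that alternative, reusing the variables from the
hunk quoted above (whether to mask off MHP_MEMMAP_ON_MEMORY in the
fallback case is my assumption, not part of the patch):

	/* Illustrative sketch only, based on the hunk quoted above. */
	bool memmap_on_memory = (mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
				mhp_supports_memmap_on_memory(memblock_size);
	u64 step = memmap_on_memory ? memblock_size : size;

	for (cur_start = start; cur_start < start + size; cur_start += step) {
		ret = add_memory_create_devices(nid, group, cur_start, step,
						memmap_on_memory ?
						mhp_flags :
						mhp_flags & ~MHP_MEMMAP_ON_MEMORY);
		if (ret)
			goto error;
	}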

> -   

Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-07-23 Thread Huang, Ying
"Aneesh Kumar K.V"  writes:

> Vishal Verma  writes:
>
>> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is currently
>> restricted to 'memblock_size' chunks of memory being added. Adding a
>> larger span of memory precludes memmap_on_memory semantics.
>>
>> For users of hotplug such as kmem, large amounts of memory might get
>> added from the CXL subsystem. In some cases, this amount may exceed the
>> available 'main memory' to store the memmap for the memory being added.
>> In this case, it is useful to have a way to place the memmap on the
>> memory being added, even if it means splitting the addition into
>> memblock-sized chunks.
>>
>> Change add_memory_resource() to loop over memblock-sized chunks of
>> memory if caller requested memmap_on_memory, and if other conditions for
>> it are met,. Teach try_remove_memory() to also expect that a memory
>> range being removed might have been split up into memblock sized chunks,
>> and to loop through those as needed.
>>
>> Cc: Andrew Morton 
>> Cc: David Hildenbrand 
>> Cc: Oscar Salvador 
>> Cc: Dan Williams 
>> Cc: Dave Jiang 
>> Cc: Dave Hansen 
>> Cc: Huang Ying 
>> Suggested-by: David Hildenbrand 
>> Signed-off-by: Vishal Verma 
>> ---
>>  mm/memory_hotplug.c | 154 
>> +++-
>>  1 file changed, 91 insertions(+), 63 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index e9bcacbcbae2..20456f0d28e6 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1286,6 +1286,35 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>>  }
>>  EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>>  
>> +static int add_memory_create_devices(int nid, struct memory_group *group,
>> + u64 start, u64 size, mhp_t mhp_flags)
>> +{
>> +struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>> +struct vmem_altmap mhp_altmap = {};
>> +int ret;
>> +
>> +if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
>> +mhp_altmap.free = PHYS_PFN(size);
>> +mhp_altmap.base_pfn = PHYS_PFN(start);
>> +params.altmap = &mhp_altmap;
>> +}
>> +
>> +/* call arch's memory hotadd */
>> +ret = arch_add_memory(nid, start, size, &params);
>> +if (ret < 0)
>> +return ret;
>> +
>> +/* create memory block devices after memory was added */
>> +ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
>> +  group);
>> +if (ret) {
>> +arch_remove_memory(start, size, NULL);
>> +return ret;
>> +}
>> +
>> +return 0;
>> +}
>> +
>>  /*
>>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>>   * and online/offline operations (triggered e.g. by sysfs).
>> @@ -1294,11 +1323,10 @@ EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>>   */
>>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t 
>> mhp_flags)
>>  {
>> -struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>> +unsigned long memblock_size = memory_block_size_bytes();
>>  enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>> -struct vmem_altmap mhp_altmap = {};
>>  struct memory_group *group = NULL;
>> -u64 start, size;
>> +u64 start, size, cur_start;
>>  bool new_node = false;
>>  int ret;
>>  
>> @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid, struct 
>> resource *res, mhp_t mhp_flags)
>>  /*
>>   * Self hosted memmap array
>>   */
>> -if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> -if (!mhp_supports_memmap_on_memory(size)) {
>> -ret = -EINVAL;
>> +if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
>> +mhp_supports_memmap_on_memory(memblock_size)) {
>> +for (cur_start = start; cur_start < start + size;
>> + cur_start += memblock_size) {
>> +ret = add_memory_create_devices(nid, group, cur_start,
>> +memblock_size,
>> +mhp_flags);
>> +if (ret)
>> +goto error;
>> +}
>
> We should handle the below error details here. 
>
> 1) If we hit an error after some blocks got added, should we iterate over 
> rest of the dev_dax->nr_range.
> 2) With some blocks added if we return a failure here, we remove the
> resource in dax_kmem. Is that ok? 
>
> IMHO error handling with partial creation of memory blocks in a resource 
> range should be
> documented with this change.

Or, should we remove all added memory blocks upon error?
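If we decide to remove them, the unwind could be shaped roughly like
this sketch (illustrative only; it assumes the chunked loop quoted above
and mirrors add_memory_create_devices() in reverse):

/*
 * Illustrative sketch: on failure at cur_start, tear down the
 * memblock-sized chunks that were already added in this call.
 */
static void unwind_added_chunks(u64 start, u64 cur_start,
				unsigned long memblock_size)
{
	u64 undo;

	for (undo = start; undo < cur_start; undo += memblock_size) {
		remove_memory_block_devices(undo, memblock_size);
		arch_remove_memory(undo, memblock_size, NULL);
	}
}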

--
Best Regards,
Huang, Ying



[PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface

2023-07-20 Thread Huang Ying
Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
used for slow memory type in kmem driver.  This limits the usage of
kmem driver, for example, it cannot be used for HBM (high bandwidth
memory).

So, we use the general abstract distance calculation mechanism in kmem
drivers to get more accurate abstract distance on systems with proper
support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
fallback only.

Now, multiple memory types may be managed by kmem.  These memory types
are put into the "kmem_memory_types" list and protected by
kmem_memory_type_lock.

Signed-off-by: "Huang, Ying" 
Cc: Aneesh Kumar K.V 
Cc: Wei Xu 
Cc: Alistair Popple 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: Davidlohr Bueso 
Cc: Johannes Weiner 
Cc: Jonathan Cameron 
Cc: Michal Hocko 
Cc: Yang Shi 
Cc: Rafael J Wysocki 
---
 drivers/dax/kmem.c   | 54 +++-
 include/linux/memory-tiers.h |  2 ++
 mm/memory-tiers.c|  2 +-
 3 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 898ca9505754..837165037231 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -49,14 +49,40 @@ struct dax_kmem_data {
struct resource *res[];
 };
 
-static struct memory_dev_type *dax_slowmem_type;
+static DEFINE_MUTEX(kmem_memory_type_lock);
+static LIST_HEAD(kmem_memory_types);
+
+static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
+{
+   bool found = false;
+   struct memory_dev_type *mtype;
+
+   mutex_lock(&kmem_memory_type_lock);
+   list_for_each_entry(mtype, &kmem_memory_types, list) {
+   if (mtype->adistance == adist) {
+   found = true;
+   break;
+   }
+   }
+   if (!found) {
+   mtype = alloc_memory_type(adist);
+   if (!IS_ERR(mtype))
+   list_add(&mtype->list, &kmem_memory_types);
+   }
+   mutex_unlock(&kmem_memory_type_lock);
+
+   return mtype;
+}
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
	struct device *dev = &dev_dax->dev;
unsigned long total_len = 0;
struct dax_kmem_data *data;
+   struct memory_dev_type *mtype;
int i, rc, mapped = 0;
int numa_node;
+   int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
 
/*
 * Ensure good NUMA information for the persistent memory.
@@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
return -EINVAL;
}
 
+   mt_calc_adistance(numa_node, &adist);
+   mtype = kmem_find_alloc_memorty_type(adist);
+   if (IS_ERR(mtype))
+   return PTR_ERR(mtype);
+
for (i = 0; i < dev_dax->nr_range; i++) {
struct range range;
 
@@ -88,7 +119,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
return -EINVAL;
}
 
-   init_node_memory_type(numa_node, dax_slowmem_type);
+   init_node_memory_type(numa_node, mtype);
 
rc = -ENOMEM;
data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
@@ -167,7 +198,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 err_res_name:
kfree(data);
 err_dax_kmem_data:
-   clear_node_memory_type(numa_node, dax_slowmem_type);
+   clear_node_memory_type(numa_node, mtype);
return rc;
 }
 
@@ -219,7 +250,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 * for that. This implies this reference will be around
 * till next reboot.
 */
-   clear_node_memory_type(node, dax_slowmem_type);
+   clear_node_memory_type(node, NULL);
}
 }
 #else
@@ -251,12 +282,6 @@ static int __init dax_kmem_init(void)
if (!kmem_name)
return -ENOMEM;
 
-   dax_slowmem_type = alloc_memory_type(MEMTIER_DEFAULT_DAX_ADISTANCE);
-   if (IS_ERR(dax_slowmem_type)) {
-   rc = PTR_ERR(dax_slowmem_type);
-   goto err_dax_slowmem_type;
-   }
-
	rc = dax_driver_register(&device_dax_kmem_driver);
if (rc)
goto error_dax_driver;
@@ -264,18 +289,21 @@ static int __init dax_kmem_init(void)
return rc;
 
 error_dax_driver:
-   destroy_memory_type(dax_slowmem_type);
-err_dax_slowmem_type:
kfree_const(kmem_name);
return rc;
 }
 
 static void __exit dax_kmem_exit(void)
 {
+   struct memory_dev_type *mtype, *mtn;
+
	dax_driver_unregister(&device_dax_kmem_driver);
if (!any_hotremove_failed)
kfree_const(kmem_name);
-   destroy_memory_type(dax_slowmem_type);
+   list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
+   list_del(&mtype->list);
+   destroy_memory_type(mtype);
+   }
 }
 
 MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 9377239c8

[PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT

2023-07-20 Thread Huang Ying
A memory tiering abstract distance calculation algorithm based on ACPI
HMAT is implemented.  The basic idea is as follows.

The performance attributes of system default DRAM nodes are recorded
as the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.
Then, the ratio of the abstract distance of a memory node (target) to
MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
attributes of the node to that of the default DRAM nodes.

Signed-off-by: "Huang, Ying" 
Cc: Aneesh Kumar K.V 
Cc: Wei Xu 
Cc: Alistair Popple 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: Davidlohr Bueso 
Cc: Johannes Weiner 
Cc: Jonathan Cameron 
Cc: Michal Hocko 
Cc: Yang Shi 
Cc: Rafael J Wysocki 
---
 drivers/acpi/numa/hmat.c | 138 ++-
 include/linux/memory-tiers.h |   2 +
 mm/memory-tiers.c|   2 +-
 3 files changed, 140 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 2dee0098f1a9..306a912090f0 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include <linux/memory-tiers.h>
 
 static u8 hmat_revision;
 static int hmat_disable __initdata;
@@ -759,6 +760,137 @@ static int hmat_callback(struct notifier_block *self,
return NOTIFY_OK;
 }
 
+static int hmat_adistance_disabled;
+static struct node_hmem_attrs default_dram_attrs;
+
+static void dump_hmem_attrs(struct node_hmem_attrs *attrs)
+{
+   pr_cont("read_latency: %u, write_latency: %u, read_bandwidth: %u, 
write_bandwidth: %u\n",
+   attrs->read_latency, attrs->write_latency,
+   attrs->read_bandwidth, attrs->write_bandwidth);
+}
+
+static void disable_hmat_adistance_algorithm(void)
+{
+   hmat_adistance_disabled = true;
+}
+
+static int hmat_init_default_dram_attrs(void)
+{
+   struct memory_target *target;
+   struct node_hmem_attrs *attrs;
+   int nid, pxm;
+   int nid_dram = NUMA_NO_NODE;
+
+   if (default_dram_attrs.read_latency +
+   default_dram_attrs.write_latency != 0)
+   return 0;
+
+   if (!default_dram_type)
+   return -EIO;
+
+   for_each_node_mask(nid, default_dram_type->nodes) {
+   pxm = node_to_pxm(nid);
+   target = find_mem_target(pxm);
+   if (!target)
+   continue;
+   attrs = &target->hmem_attrs[1];
+   if (nid_dram == NUMA_NO_NODE) {
+   if (attrs->read_latency + attrs->write_latency == 0 ||
+   attrs->read_bandwidth + attrs->write_bandwidth == 
0) {
+   pr_info("hmat: invalid hmem attrs for default 
DRAM node: %d,\n",
+   nid);
+   pr_info("  ");
+   dump_hmem_attrs(attrs);
+   pr_info("  disable hmat based abstract distance 
algorithm.\n");
+   disable_hmat_adistance_algorithm();
+   return -EIO;
+   }
+   nid_dram = nid;
+   default_dram_attrs = *attrs;
+   continue;
+   }
+
+   /*
+* The performance of all default DRAM nodes is expected
+* to be same (that is, the variation is less than 10%).
+* And it will be used as base to calculate the abstract
+* distance of other memory nodes.
+*/
+   if (abs(attrs->read_latency - default_dram_attrs.read_latency) 
* 10 >
+   default_dram_attrs.read_latency ||
+   abs(attrs->write_latency - 
default_dram_attrs.write_latency) * 10 >
+   default_dram_attrs.write_latency ||
+   abs(attrs->read_bandwidth - 
default_dram_attrs.read_bandwidth) * 10 >
+   default_dram_attrs.read_bandwidth) {
+   pr_info("hmat: hmem attrs for DRAM nodes mismatch.\n");
+   pr_info("  node %d:", nid_dram);
+   dump_hmem_attrs(&default_dram_attrs);
+   pr_info("  node %d:", nid);
+   dump_hmem_attrs(attrs);
+   pr_info("  disable hmat based abstract distance 
algorithm.\n");
+   disable_hmat_adistance_algorithm();
+   return -EIO;
+   }
+   }
+
+   return 0;
+}
+
+static int hmat_calculate_adistance(struct notifier_block *self,
+   unsigned long nid, void *data)
+{
+   static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
+   struct memory_target *target;
+   struct node_hmem_attrs *attrs;
+   int *adist = data;
+   int pxm;
+
+  

[PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-20 Thread Huang Ying
The abstract distance may be calculated by various drivers, such as
ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
users and the providers, the abstract distance calculation algorithms
management mechanism is implemented in this patch.  It provides
interface for the providers to register the implementation, and
interface for the users.

Multiple algorithm implementations can cooperate via calculating
abstract distance for different memory nodes.  The preference of
algorithm implementations can be specified via
priority (notifier_block.priority).

Signed-off-by: "Huang, Ying" 
Cc: Aneesh Kumar K.V 
Cc: Wei Xu 
Cc: Alistair Popple 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: Davidlohr Bueso 
Cc: Johannes Weiner 
Cc: Jonathan Cameron 
Cc: Michal Hocko 
Cc: Yang Shi 
Cc: Rafael J Wysocki 
---
 include/linux/memory-tiers.h | 19 
 mm/memory-tiers.c| 59 
 2 files changed, 78 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index fc9647b1b4f9..c6429e624244 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include <linux/notifier.h>
 /*
  * Each tier cover a abstrace distance chunk size of 128
  */
@@ -36,6 +37,9 @@ struct memory_dev_type *alloc_memory_type(int adistance);
 void destroy_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
+int register_mt_adistance_algorithm(struct notifier_block *nb);
+int unregister_mt_adistance_algorithm(struct notifier_block *nb);
+int mt_calc_adistance(int node, int *adist);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -97,5 +101,20 @@ static inline bool node_is_toptier(int node)
 {
return true;
 }
+
+static inline int register_mt_adistance_algorithm(struct notifier_block *nb)
+{
+   return 0;
+}
+
+static inline int unregister_mt_adistance_algorithm(struct notifier_block *nb)
+{
+   return 0;
+}
+
+static inline int mt_calc_adistance(int node, int *adist)
+{
+   return NOTIFY_DONE;
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index a516e303e304..1e55fbe2ad51 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -105,6 +106,8 @@ static int top_tier_adistance;
 static struct demotion_nodes *node_demotion __read_mostly;
 #endif /* CONFIG_MIGRATION */
 
+static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
+
 static inline struct memory_tier *to_memory_tier(struct device *device)
 {
return container_of(device, struct memory_tier, dev);
@@ -592,6 +595,62 @@ void clear_node_memory_type(int node, struct 
memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+/**
+ * register_mt_adistance_algorithm() - Register memory tiering abstract 
distance algorithm
+ * @nb: The notifier block which describe the algorithm
+ *
+ * Return: 0 on success, errno on error.
+ *
+ * Every memory tiering abstract distance algorithm provider needs to
+ * register the algorithm with register_mt_adistance_algorithm().  To
+ * calculate the abstract distance for a specified memory node, the
+ * notifier function will be called unless some high priority
+ * algorithm has provided result.  The prototype of the notifier
+ * function is as follows,
+ *
+ *   int (*algorithm_notifier)(struct notifier_block *nb,
+ * unsigned long nid, void *data);
+ *
+ * Where "nid" specifies the memory node, "data" is the pointer to the
+ * returned abstract distance (that is, "int *adist").  If the
+ * algorithm provides the result, NOTIFY_STOP should be returned.
+ * Otherwise, return_value & %NOTIFY_STOP_MASK == 0 to allow the next
+ * algorithm in the chain to provide the result.
+ */
+int register_mt_adistance_algorithm(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mt_adistance_algorithms, nb);
+}
+EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm);
+
+/**
+ * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract 
distance algorithm
+ * @nb: the notifier block which describe the algorithm
+ *
+ * Return: 0 on success, errno on error.
+ */
+int unregister_mt_adistance_algorithm(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm);
+
+/**
+ * mt_calc_adistance() - Calculate abstract distance with registered algorithms
+ * @node: the node to calculate abstract distance for
+ * @adist: the retu

[PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators()

2023-07-20 Thread Huang Ying
Previously, in hmat_register_target_initiators(), the performance
attributes are calculated and the corresponding sysfs links and files
are created too.  Which is called during memory onlining.

But now, to calculate the abstract distance of a memory target before
memory onlining, we need to calculate the performance attributes for
a memory target without creating sysfs links and files.

To do that, hmat_register_target_initiators() is refactored to make it
possible to calculate performance attributes separately.

Signed-off-by: "Huang, Ying" 
Cc: Aneesh Kumar K.V 
Cc: Wei Xu 
Cc: Alistair Popple 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: Davidlohr Bueso 
Cc: Johannes Weiner 
Cc: Jonathan Cameron 
Cc: Michal Hocko 
Cc: Yang Shi 
Cc: Rafael J Wysocki 
---
 drivers/acpi/numa/hmat.c | 81 +++-
 1 file changed, 30 insertions(+), 51 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index bba268ecd802..2dee0098f1a9 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -582,28 +582,25 @@ static int initiators_to_nodemask(unsigned long *p_nodes)
return 0;
 }
 
-static void hmat_register_target_initiators(struct memory_target *target)
+static void hmat_update_target_attrs(struct memory_target *target,
+unsigned long *p_nodes, int access)
 {
-   static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
struct memory_initiator *initiator;
-   unsigned int mem_nid, cpu_nid;
+   unsigned int cpu_nid;
struct memory_locality *loc = NULL;
u32 best = 0;
-   bool access0done = false;
int i;
 
-   mem_nid = pxm_to_node(target->memory_pxm);
+   bitmap_zero(p_nodes, MAX_NUMNODES);
/*
-* If the Address Range Structure provides a local processor pxm, link
+* If the Address Range Structure provides a local processor pxm, set
 * only that one. Otherwise, find the best performance attributes and
-* register all initiators that match.
+* collect all initiators that match.
 */
if (target->processor_pxm != PXM_INVAL) {
cpu_nid = pxm_to_node(target->processor_pxm);
-   register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
-   access0done = true;
-   if (node_state(cpu_nid, N_CPU)) {
-   register_memory_node_under_compute_node(mem_nid, 
cpu_nid, 1);
+   if (access == 0 || node_state(cpu_nid, N_CPU)) {
+   set_bit(target->processor_pxm, p_nodes);
return;
}
}
@@ -617,47 +614,10 @@ static void hmat_register_target_initiators(struct 
memory_target *target)
 * We'll also use the sorting to prime the candidate nodes with known
 * initiators.
 */
-   bitmap_zero(p_nodes, MAX_NUMNODES);
	list_sort(NULL, &initiators, initiator_cmp);
if (initiators_to_nodemask(p_nodes) < 0)
return;
 
-   if (!access0done) {
-   for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
-   loc = localities_types[i];
-   if (!loc)
-   continue;
-
-   best = 0;
-   list_for_each_entry(initiator, &initiators, node) {
-   u32 value;
-
-   if (!test_bit(initiator->processor_pxm, 
p_nodes))
-   continue;
-
-   value = hmat_initiator_perf(target, initiator,
-   loc->hmat_loc);
-   if (hmat_update_best(loc->hmat_loc->data_type, 
value, ))
-   bitmap_clear(p_nodes, 0, 
initiator->processor_pxm);
-   if (value != best)
-   clear_bit(initiator->processor_pxm, 
p_nodes);
-   }
-   if (best)
-   hmat_update_target_access(target, 
loc->hmat_loc->data_type,
- best, 0);
-   }
-
-   for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
-   cpu_nid = pxm_to_node(i);
-   register_memory_node_under_compute_node(mem_nid, 
cpu_nid, 0);
-   }
-   }
-
-   /* Access 1 ignores Generic Initiators */
-   bitmap_zero(p_nodes, MAX_NUMNODES);
-   if (initiators_to_nodemask(p_nodes) < 0)
-   return;
-
for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
loc = localities_types[i];
if (!loc)
@@ -667,7 +627,7 @@ static void hmat_register_target_initiators(struct 
memory_target *target)
list_for_each_en

[PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT

2023-07-20 Thread Huang Ying
We have the explicit memory tiers framework to manage systems with
multiple types of memory, e.g., DRAM in DIMM slots and CXL memory
devices.  Where, same kind of memory devices will be grouped into
memory types, then put into memory tiers.  To describe the performance
of a memory type, abstract distance is defined.  Which is in direct
proportion to the memory latency and inversely proportional to the
memory bandwidth.  To keep the code as simple as possible, fixed
abstract distance is used in dax/kmem to describe slow memory such as
Optane DCPMM.

To support more memory types, in this series, we added the abstract
distance calculation algorithm management mechanism, provided a
algorithm implementation based on ACPI HMAT, and used the general
abstract distance calculation interface in dax/kmem driver.  So,
dax/kmem can support HBM (high bandwidth memory) in addition to the
original Optane DCPMM.

Changelog:

V1 (from RFC):

- Added some comments per Aneesh's comments, Thanks!

Best Regards,
Huang, Ying



Re: [PATCH] memory tier: rename destroy_memory_type() to put_memory_type()

2023-07-06 Thread Huang, Ying
Miaohe Lin  writes:

> It appears that destroy_memory_type() isn't a very good name because
> we usually will not free the memory_type here. So rename it to a more
> appropriate name i.e. put_memory_type().
>
> Suggested-by: Huang, Ying 
> Signed-off-by: Miaohe Lin 

LGTM, Thanks!

Reviewed-by: "Huang, Ying" 

> ---
>  drivers/dax/kmem.c   | 4 ++--
>  include/linux/memory-tiers.h | 4 ++--
>  mm/memory-tiers.c| 6 +++---
>  3 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 898ca9505754..c57acb73e3db 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -264,7 +264,7 @@ static int __init dax_kmem_init(void)
>   return rc;
>  
>  error_dax_driver:
> - destroy_memory_type(dax_slowmem_type);
> + put_memory_type(dax_slowmem_type);
>  err_dax_slowmem_type:
>   kfree_const(kmem_name);
>   return rc;
> @@ -275,7 +275,7 @@ static void __exit dax_kmem_exit(void)
>   dax_driver_unregister(_dax_kmem_driver);
>   if (!any_hotremove_failed)
>   kfree_const(kmem_name);
> - destroy_memory_type(dax_slowmem_type);
> + put_memory_type(dax_slowmem_type);
>  }
>  
>  MODULE_AUTHOR("Intel Corporation");
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index fc9647b1b4f9..437441cdf78f 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -33,7 +33,7 @@ struct memory_dev_type {
>  #ifdef CONFIG_NUMA
>  extern bool numa_demotion_enabled;
>  struct memory_dev_type *alloc_memory_type(int adistance);
> -void destroy_memory_type(struct memory_dev_type *memtype);
> +void put_memory_type(struct memory_dev_type *memtype);
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>  #ifdef CONFIG_MIGRATION
> @@ -68,7 +68,7 @@ static inline struct memory_dev_type *alloc_memory_type(int 
> adistance)
>   return NULL;
>  }
>  
> -static inline void destroy_memory_type(struct memory_dev_type *memtype)
> +static inline void put_memory_type(struct memory_dev_type *memtype)
>  {
>  
>  }
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 1719fa3bcf02..c49ab03f49b1 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -560,11 +560,11 @@ struct memory_dev_type *alloc_memory_type(int adistance)
>  }
>  EXPORT_SYMBOL_GPL(alloc_memory_type);
>  
> -void destroy_memory_type(struct memory_dev_type *memtype)
> +void put_memory_type(struct memory_dev_type *memtype)
>  {
>   kref_put(&memtype->kref, release_memtype);
>  }
> -EXPORT_SYMBOL_GPL(destroy_memory_type);
> +EXPORT_SYMBOL_GPL(put_memory_type);
>  
>  void init_node_memory_type(int node, struct memory_dev_type *memtype)
>  {
> @@ -586,7 +586,7 @@ void clear_node_memory_type(int node, struct 
> memory_dev_type *memtype)
>*/
>   if (!node_memory_types[node].map_count) {
>   node_memory_types[node].memtype = NULL;
> - destroy_memory_type(memtype);
> + put_memory_type(memtype);
>   }
>   mutex_unlock(_tier_lock);
>  }



Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

2023-06-16 Thread Huang, Ying
Vishal Verma  writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>
> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/kmem.c | 49 -
>  1 file changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   data->mgid = rc;
>  
>   for (i = 0; i < dev_dax->nr_range; i++) {
> + u64 cur_start, cur_len, remaining;
>   struct resource *res;
>   struct range range;
>  
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   res->flags = IORESOURCE_SYSTEM_RAM;
>  
>   /*
> -  * Ensure that future kexec'd kernels will not treat
> -  * this as RAM automatically.
> +  * Add memory in chunks of memory_block_size_bytes() so that
> +  * it is considered for MHP_MEMMAP_ON_MEMORY
> +  * @range has already been aligned to memory_block_size_bytes(),
> +  * so the following loop will always break it down cleanly.
>*/
> - rc = add_memory_driver_managed(data->mgid, range.start,
> - range_len(), kmem_name, MHP_NID_IS_MGID);
> + cur_start = range.start;
> + cur_len = memory_block_size_bytes();
> + remaining = range_len();
> + while (remaining) {
> + mhp_t mhp_flags = MHP_NID_IS_MGID;
>  
> - if (rc) {
> - dev_warn(dev, "mapping%d: %#llx-%#llx memory add 
> failed\n",
> - i, range.start, range.end);
> - remove_resource(res);
> - kfree(res);
> - data->res[i] = NULL;
> - if (mapped)
> - continue;
> - goto err_request_mem;
> + if (mhp_supports_memmap_on_memory(cur_len,
> +   MHP_MEMMAP_ON_MEMORY))
> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> + /*
> +  * Ensure that future kexec'd kernels will not treat
> +  * this as RAM automatically.
> +  */
> + rc = add_memory_driver_managed(data->mgid, cur_start,
> +cur_len, kmem_name,
> +mhp_flags);
> +
> + if (rc) {
> + dev_warn(dev,
> +  "mapping%d: %#llx-%#llx memory add 
> failed\n",
> +  i, cur_start, cur_start + cur_len - 1);
> + remove_resource(res);
> + kfree(res);
> + data->res[i] = NULL;
> + if (mapped)
> + continue;
> + goto err_request_mem;
> + }
> +
> + cur_start += cur_len;
> + remaining -= cur_len;
>   }
>   mapped++;
>   }

It appears that we need to hot-remove memory at the granularity of
memory_block_size_bytes() too, according to try_remove_memory().  If so,
wouldn't it be better to allocate one dax_kmem_data.res[] element per
memory block instead of per dax region?
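For illustration, the allocation in dev_dax_kmem_probe() would then be
sized by memory blocks rather than by ranges.  This is a sketch only;
the dax_kmem_range() helper and the nr_blocks accounting are my
assumptions, not code from the patch:

	/* Illustrative sketch: one res[] slot per memory block. */
	u64 block_size = memory_block_size_bytes();
	unsigned long nr_blocks = 0;
	int i;

	for (i = 0; i < dev_dax->nr_range; i++) {
		struct range range;

		/* dax_kmem_range() assumed to fill a block-aligned range */
		if (dax_kmem_range(dev_dax, i, &range))
			continue;
		nr_blocks += range_len(&range) / block_size;
	}

	data = kzalloc(struct_size(data, res, nr_blocks), GFP_KERNEL);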

Best Regards,
Huang, Ying



Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param

2023-06-16 Thread Huang, Ying
Hi, Vishal,

Thanks for your patch!

Vishal Verma  writes:

> For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the
> 'memmap_on_memory' module parameter was a hard requirement.
>
> In preparation for the dax/kmem driver to use memmap_on_memory
> semantics, arrange for the module parameter check to be bypassed via the
> appropriate mhp_flag.
>
> Recall that the kmem driver could contribute huge amounts of hotplugged
> memory originating from special purposes devices such as CXL memory
> expanders. In some cases memmap_on_memory may be the /only/ way this new
> memory can be hotplugged. Hence it makes sense for kmem to have a way to
> force memmap_on_memory without depending on a module param, if all the
> other conditions for it are met.
>
> The only other user of this interface is acpi/acpi_memoryhotplug.c,
> which only enables the mhp_flag if an initial
> mhp_supports_memmap_on_memory() test passes. Maintain the existing
> behavior and semantics for this by performing the initial check from
> acpi without the MHP_MEMMAP_ON_MEMORY flag, so its decision falls back
> to the module parameter.
>
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Oscar Salvador 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Signed-off-by: Vishal Verma 
> ---
>  include/linux/memory_hotplug.h |  2 +-
>  drivers/acpi/acpi_memhotplug.c |  2 +-
>  mm/memory_hotplug.c| 24 
>  3 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 9fcbf5706595..c9ddcd3cad70 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -358,7 +358,7 @@ extern struct zone *zone_for_pfn_range(int online_type, 
> int nid,
>  extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
> struct mhp_params *params);
>  void arch_remove_linear_mapping(u64 start, u64 size);
> -extern bool mhp_supports_memmap_on_memory(unsigned long size);
> +extern bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t 
> mhp_flags);
>  #endif /* CONFIG_MEMORY_HOTPLUG */
>  
>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 24f662d8bd39..119d3bb49753 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -211,7 +211,7 @@ static int acpi_memory_enable_device(struct 
> acpi_memory_device *mem_device)
>   if (!info->length)
>   continue;
>  
> - if (mhp_supports_memmap_on_memory(info->length))
> + if (mhp_supports_memmap_on_memory(info->length, 0))
>   mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>   result = __add_memory(mgid, info->start_addr, info->length,
> mhp_flags);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8e0fa209d533..bb3845830922 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1283,15 +1283,21 @@ static int online_memory_block(struct memory_block 
> *mem, void *arg)
>   return device_online(>dev);
>  }
>  
> -bool mhp_supports_memmap_on_memory(unsigned long size)
> +bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags)
>  {
>   unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
>   unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>   unsigned long remaining_size = size - vmemmap_size;
>  
>   /*
> -  * Besides having arch support and the feature enabled at runtime, we
> -  * need a few more assumptions to hold true:
> +  * The MHP_MEMMAP_ON_MEMORY flag indicates a caller that wants to force
> +  * memmap_on_memory (if other conditions are met), regardless of the
> +  * module parameter. drivers/dax/kmem.c is an example, where large
> +  * amounts of hotplug memory may come from, and the only option to
> +  * successfully online all of it is to place the memmap on this memory.
> +  *
> +  * Besides having arch support and the feature enabled at runtime or
> +  * via the mhp_flag, we need a few more assumptions to hold true:
>*
>* a) We span a single memory block: memory onlining/offlinin;g happens
>*in memory block granularity. We don't want the vmemmap of online
> @@ -1315,10 +1321,12 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>*   altmap as an alternative source of memory, and we do not 
> exactly
>*   populate a single PMD.
> 

Re: [PATCH v3 1/4] mm/swapfile: use percpu_ref to serialize against concurrent swapoff

2021-04-20 Thread Huang, Ying
a_race(!(si->flags & SWP_VALID)))
> - goto unlock_out;
> + if (!percpu_ref_tryget_live(&si->users))
> + goto out;
> + /*
> +  * Guarantee the si->users are checked before accessing other
> +  * fields of swap_info_struct.
> +  *
> +  * Paired with the spin_unlock() after setup_swap_info() in
> +  * enable_swap_info().
> +  */
> + smp_rmb();
>   offset = swp_offset(entry);
>   if (offset >= si->max)
> - goto unlock_out;
> + goto put_out;
>  
>   return si;
>  bad_nofile:
>   pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
>  out:
>   return NULL;
> -unlock_out:
> - rcu_read_unlock();
> +put_out:
> + percpu_ref_put(&si->users);
>   return NULL;
>  }
>  
> @@ -2466,7 +2475,7 @@ static void setup_swap_info(struct swap_info_struct *p, 
> int prio,
>  
>  static void _enable_swap_info(struct swap_info_struct *p)
>  {
> - p->flags |= SWP_WRITEOK | SWP_VALID;
> + p->flags |= SWP_WRITEOK;
> + atomic_long_add(p->pages, &nr_swap_pages);
>   total_swap_pages += p->pages;
>  
> @@ -2497,10 +2506,9 @@ static void enable_swap_info(struct swap_info_struct 
> *p, int prio,
>   spin_unlock(&p->lock);
>   spin_unlock(&swap_lock);
>   /*
> -  * Guarantee swap_map, cluster_info, etc. fields are valid
> -  * between get/put_swap_device() if SWP_VALID bit is set
> +  * Finished initialized swap device, now it's safe to reference it.

s/initialized/initializing/

Otherwise looks good to me!  Thanks!

Reviewed-by: "Huang, Ying" 

>*/
> - synchronize_rcu();
> + percpu_ref_resurrect(&p->users);
>   spin_lock(&swap_lock);
>   spin_lock(&p->lock);
>   _enable_swap_info(p);
> @@ -2616,16 +2624,16 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
> specialfile)
>  
>   reenable_swap_slots_cache_unlock();
>  
> - spin_lock(&swap_lock);
> - spin_lock(&p->lock);
> - p->flags &= ~SWP_VALID; /* mark swap device as invalid */
> - spin_unlock(&p->lock);
> - spin_unlock(&swap_lock);
>   /*
> -  * wait for swap operations protected by get/put_swap_device()
> -  * to complete
> +  * Wait for swap operations protected by get/put_swap_device()
> +  * to complete.
> +  *
> +  * We need synchronize_rcu() here to protect the accessing to
> +  * the swap cache data structure.
>*/
> + percpu_ref_kill(&p->users);
>   synchronize_rcu();
> + wait_for_completion(&p->comp);
>  
>   flush_work(&p->discard_work);
>  
> @@ -2857,6 +2865,12 @@ static struct swap_info_struct *alloc_swap_info(void)
>   if (!p)
>   return ERR_PTR(-ENOMEM);
>  
> + if (percpu_ref_init(&p->users, swap_users_ref_free,
> + PERCPU_REF_INIT_DEAD, GFP_KERNEL)) {
> + kvfree(p);
> + return ERR_PTR(-ENOMEM);
> + }
> +
>   spin_lock(&swap_lock);
>   for (type = 0; type < nr_swapfiles; type++) {
>   if (!(swap_info[type]->flags & SWP_USED))
> @@ -2864,6 +2878,7 @@ static struct swap_info_struct *alloc_swap_info(void)
>   }
>   if (type >= MAX_SWAPFILES) {
>   spin_unlock(&swap_lock);
> + percpu_ref_exit(&p->users);
>   kvfree(p);
>   return ERR_PTR(-EPERM);
>   }
> @@ -2891,9 +2906,13 @@ static struct swap_info_struct *alloc_swap_info(void)
>   plist_node_init(&p->avail_lists[i], 0);
>   p->flags = SWP_USED;
>   spin_unlock(&swap_lock);
> - kvfree(defer);
> + if (defer) {
> + percpu_ref_exit(&defer->users);
> + kvfree(defer);
> + }
>   spin_lock_init(&p->lock);
>   spin_lock_init(&p->cont_lock);
> + init_completion(&p->comp);
>  
>   return p;
>  }


Re: [PATCH v3 4/4] mm/shmem: fix shmem_swapin() race with swapoff

2021-04-20 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> - -
> shmem_swapin
>   swap_cluster_readahead
> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
>   swapoff
> percpu_ref_kill(&p->users)
> synchronize_rcu()
> wait_for_completion

I don't think the above 3 lines are relevant for the race.

> ..
> si->swap_file = NULL;
> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>
> Close this race window by using get/put_swap_device() to guard against
> concurrent swapoff.
>
> Fixes: 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or 
> not")
> Signed-off-by: Miaohe Lin 
> ---
>  mm/shmem.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 26c76b13ad23..936ba5595297 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct 
> vm_area_struct *vma)
>  static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>   struct shmem_inode_info *info, pgoff_t index)
>  {
> + struct swap_info_struct *si;
>   struct vm_area_struct pvma;
>   struct page *page;
>   struct vm_fault vmf = {
>   .vma = &pvma,
>   };
>  
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(swap);

Better to put get/put_swap_device() in shmem_swapin_page(); that makes it
possible for us to remove get/put_swap_device() in lookup_swap_cache().
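
Just to illustrate (a rough, untested sketch; the guarded helper below is
hypothetical and the arguments are abbreviated from the surrounding shmem
code):

static struct page *shmem_swapin_page_guarded(swp_entry_t swap, gfp_t gfp,
					      struct shmem_inode_info *info,
					      pgoff_t index)
{
	struct swap_info_struct *si;
	struct page *page;

	/* Prevent swapoff from happening to us for the whole swapin path. */
	si = get_swap_device(swap);
	if (unlikely(!si))
		return NULL;

	/* The swap cache lookup and the readahead are covered by one guard. */
	page = lookup_swap_cache(swap, NULL, 0);
	if (!page)
		page = shmem_swapin(swap, gfp, info, index);

	put_swap_device(si);
	return page;
}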

Best Regards,
Huang, Ying

> + if (unlikely(!si))
> + return NULL;
>   shmem_pseudo_vma_init(&pvma, info, index);
>   page = swap_cluster_readahead(swap, gfp, &vmf);
>   shmem_pseudo_vma_destroy(&pvma);
> + put_swap_device(si);
>  
>   return page;
>  }


Re: [PATCH v3 3/4] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-20 Thread Huang, Ying
Miaohe Lin  writes:

> The non_swap_entry() was used for working with VMA based swap readahead
> via commit ec560175c0b6 ("mm, swap: VMA based swap readahead").

At that time, the non_swap_entry() check was necessary because the
function was called before that check in do_swap_page().

> Then it's
> moved to swap_ra_info() since commit eaf649ebc3ac ("mm: swap: clean up swap
> readahead").

After that, the non_swap_entry() check became unnecessary, because
swap_ra_info() is called only after non_swap_entry() has already been
checked.  The leftover check makes the code confusing.

> But this makes the code confusing. The non_swap_entry() check
> looks racy because while we released the pte lock, somebody else might have
> faulted in this pte. So we should check whether it's swap pte first to
> guard against such race or swap_type will be unexpected.

The race isn't important because it will not cause any problem.

Best Regards,
Huang, Ying

> But the swap_entry
> isn't used in this function and we will have enough checking when we really
> operate the PTE entries later. So checking for non_swap_entry() is not
> really needed here and should be removed to avoid confusion.
>
> Signed-off-by: Miaohe Lin 
> ---
>  mm/swap_state.c | 6 --
>  1 file changed, 6 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 272ea2108c9d..df5405384520 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -721,7 +721,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   unsigned long ra_val;
> - swp_entry_t entry;
>   unsigned long faddr, pfn, fpfn;
>   unsigned long start, end;
>   pte_t *pte, *orig_pte;
> @@ -739,11 +738,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>  
>   faddr = vmf->address;
>   orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> - entry = pte_to_swp_entry(*pte);
> - if ((unlikely(non_swap_entry(entry {
> - pte_unmap(orig_pte);
> - return;
> - }
>  
>   fpfn = PFN_DOWN(faddr);
>   ra_val = GET_SWAP_RA_VAL(vma);


Re: [PATCH v3 2/4] swap: fix do_swap_page() race with swapoff

2021-04-20 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> - -
> do_swap_page
>   if (data_race(si->flags & SWP_SYNCHRONOUS_IO)
>   swap_readpage
> if (data_race(sis->flags & SWP_FS_OPS)) {
>   swapoff
> p->flags &= ~SWP_VALID;
> ..
> synchronize_rcu();
> ..

You have deleted SWP_VALID and RCU solution in 1/4, so please revise this.

> p->swap_file = NULL;
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[oops!]
>
> Note that for the pages that are swapped in through swap cache, this isn't
> an issue. Because the page is locked, and the swap entry will be marked
> with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
> unlocked.
>
> Using current get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take really long
> time. And this race may not be really pernicious because swapoff is usually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window(as suggested by Huang, Ying).

This needs to be revised too.  Unless you squash 1/4 and 2/4.

> Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous 
> device")
> Reported-by: kernel test robot  (auto build test ERROR)
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h | 9 +
>  mm/memory.c  | 9 +
>  2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c9e7fea10b83..46d51d058d05 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -527,6 +527,15 @@ static inline struct swap_info_struct 
> *swp_swap_info(swp_entry_t entry)
>   return NULL;
>  }
>  
> +static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> + return NULL;
> +}
> +
> +static inline void put_swap_device(struct swap_info_struct *si)
> +{
> +}
> +
>  #define swap_address_space(entry)(NULL)
>  #define get_nr_swap_pages()  0L
>  #define total_swap_pages 0L
> diff --git a/mm/memory.c b/mm/memory.c
> index 27014c3bde9f..7a2fe12cf641 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
>   swp_entry_t entry;
>   pte_t pte;
>   int locked;
> @@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   goto out;
>   }
>  
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(entry);

There's

struct swap_info_struct *si = swp_swap_info(entry);

in do_swap_page(), you can remove that.
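
I.e. on top of this patch, something like (untested sketch, diff-style):

 	if (!page) {
-		struct swap_info_struct *si = swp_swap_info(entry);
-
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {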

Best Regards,
Huang, Ying

> + if (unlikely(!si))
> + goto out;
>  
>   delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
>   page = lookup_swap_cache(entry, vma, vmf->address);
> @@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
> + if (si)
> + put_swap_device(si);
>   return ret;
>  out_nomap:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   unlock_page(swapcache);
>   put_page(swapcache);
>   }
> + if (si)
> + put_swap_device(si);
>   return ret;
>  }


Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/19 15:09, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> On 2021/4/19 10:48, Huang, Ying wrote:
>>>> Miaohe Lin  writes:
>>>>
>>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>>> patch adds the percpu_ref support for swap.
>>>>>
>>>>> Signed-off-by: Miaohe Lin 
>>>>> ---
>>>>>  include/linux/swap.h |  3 +++
>>>>>  mm/swapfile.c| 33 +
>>>>>  2 files changed, 32 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index 144727041e78..8be36eb58b7a 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>   * The in-memory structure used to track swap areas.
>>>>>   */
>>>>>  struct swap_info_struct {
>>>>> + struct percpu_ref users;/* serialization against concurrent 
>>>>> swapoff */
>>>>
>>>> The comments aren't general enough.  We use this to check whether the
>>>> swap device has been fully initialized, etc. May be something as below?
>>>>
>>>> /* indicate and keep swap device valid */
>>>
>>> Looks good.
>>>
>>>>
>>>>>   unsigned long   flags;  /* SWP_USED etc: see above */
>>>>>   signed shortprio;   /* swap priority of this type */
>>>>>   struct plist_node list; /* entry in swap_active_head */
>>>>> @@ -260,6 +261,8 @@ struct swap_info_struct {
>>>>>   struct block_device *bdev;  /* swap device or bdev of swap file */
>>>>>   struct file *swap_file; /* seldom referenced */
>>>>>   unsigned int old_block_size;/* seldom referenced */
>>>>> + bool ref_initialized;   /* seldom referenced */
>>>>> + struct completion comp; /* seldom referenced */
>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>   unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>>>>   atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>> index 149e77454e3c..66515a3a2824 100644
>>>>> --- a/mm/swapfile.c
>>>>> +++ b/mm/swapfile.c
>>>>> @@ -39,6 +39,7 @@
>>>>>  #include 
>>>>>  #include 
>>>>>  #include 
>>>>> +#include 
>>>>>  
>>>>>  #include 
>>>>>  #include 
>>>>> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct 
>>>>> *work)
>>>>>   spin_unlock(>lock);
>>>>>  }
>>>>>  
>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>>> +{
>>>>> + struct swap_info_struct *si;
>>>>> +
>>>>> + si = container_of(ref, struct swap_info_struct, users);
>>>>> + complete(>comp);
>>>>> +}
>>>>> +
>>>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>>  {
>>>>>   struct swap_cluster_info *ci = si->cluster_info;
>>>>> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct 
>>>>> swap_info_struct *p, int prio,
>>>>>* Guarantee swap_map, cluster_info, etc. fields are valid
>>>>>* between get/put_swap_device() if SWP_VALID bit is set
>>>>>*/
>>>>> - synchronize_rcu();
>>>>
>>>> You cannot remove this without changing get/put_swap_device().  It's
>>>> better to squash at least PATCH 1-2.
>>>
>>> Will squash PATCH 1-2. Thanks.
>>>
>>>>
>>>>> + percpu_ref_resurrect(>users);
>>>>>   spin_lock(_lock);
>>>>>   spin_lock(>lock);
>>>>>   _enable_swap_info(p);
>>>>> @@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>>>>> specialfile)
>>>>>   p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>>>>>   spin_unlock(>lock);
>>>>>   spin_unlock(_lock);
>>>>> +
>>>>> + percpu_ref_kill(>users);
>>>>

Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

2021-04-19 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/19 15:04, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> On 2021/4/19 10:15, Huang, Ying wrote:
>>>> Miaohe Lin  writes:
>>>>
>>>>> When I was investigating the swap code, I found the below possible race
>>>>> window:
>>>>>
>>>>> CPU 1   CPU 2
>>>>> -   -
>>>>> shmem_swapin
>>>>>   swap_cluster_readahead
>>>>> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
>>>>> swapoff
>>>>>   si->flags &= ~SWP_VALID;
>>>>>   ..
>>>>>   synchronize_rcu();
>>>>>   ..
>>>>
>>>> You have removed these code in the previous patches of the series.  And
>>>> they are not relevant in this patch.
>>>
>>> Yes, I should change these. Thanks.
>>>
>>>>
>>>>>   si->swap_file = NULL;
>>>>> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>>>>>
>>>>> Close this race window by using get/put_swap_device() to guard against
>>>>> concurrent swapoff.
>>>>>
>>>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>>>>
>>>> No.  This isn't the commit that introduces the race condition.  Please
>>>> recheck your git blame result.
>>>>
>>>
>>> I think it is really hard to find the exact commit. I used git blame and found
>>> this race has existed since this code was introduced. Any suggestion?
>>> Thanks.
>> 
>> I think the commit that introduces the race condition is commit
>> 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or
>> not")
>> 
>
> Thanks.
> The commit log only describes one race condition, and for that one this
> should be the correct Fixes tag. But there are still many other race
> conditions inside swap_cluster_readahead, such as swap_readpage() called
> from swap_cluster_readahead. This tag could not cover all the race windows.

No, swap_readpage() in swap_cluster_readahead() is OK, because
__read_swap_cache_async() is called before it, so the swap entry will
be marked with SWAP_HAS_CACHE and the page will be locked.

Best Regards,
Huang, Ying

>> Best Regards,
>> Huang, Ying
>> 
>>>> Best Regards,
>>>> Huang, Ying
>>>>
>>>>> Signed-off-by: Miaohe Lin 
>>>>> ---
>>>>>  mm/shmem.c | 6 ++
>>>>>  1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>> index 26c76b13ad23..936ba5595297 100644
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct 
>>>>> vm_area_struct *vma)
>>>>>  static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>>>>>   struct shmem_inode_info *info, pgoff_t index)
>>>>>  {
>>>>> + struct swap_info_struct *si;
>>>>>   struct vm_area_struct pvma;
>>>>>   struct page *page;
>>>>>   struct vm_fault vmf = {
>>>>>   .vma = ,
>>>>>   };
>>>>>  
>>>>> + /* Prevent swapoff from happening to us. */
>>>>> + si = get_swap_device(swap);
>>>>> + if (unlikely(!si))
>>>>> + return NULL;
>>>>>   shmem_pseudo_vma_init(, info, index);
>>>>>   page = swap_cluster_readahead(swap, gfp, );
>>>>>   shmem_pseudo_vma_destroy();
>>>>> + put_swap_device(si);
>>>>>  
>>>>>   return page;
>>>>>  }
>>>> .
>>>>
>> .
>> 


Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/19 10:48, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>> patch adds the percpu_ref support for swap.
>>>
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  include/linux/swap.h |  3 +++
>>>  mm/swapfile.c| 33 +
>>>  2 files changed, 32 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 144727041e78..8be36eb58b7a 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>   * The in-memory structure used to track swap areas.
>>>   */
>>>  struct swap_info_struct {
>>> +   struct percpu_ref users;/* serialization against concurrent 
>>> swapoff */
>> 
>> The comments aren't general enough.  We use this to check whether the
>> swap device has been fully initialized, etc. May be something as below?
>> 
>> /* indicate and keep swap device valid */
>
> Looks good.
>
>> 
>>> unsigned long   flags;  /* SWP_USED etc: see above */
>>> signed shortprio;   /* swap priority of this type */
>>> struct plist_node list; /* entry in swap_active_head */
>>> @@ -260,6 +261,8 @@ struct swap_info_struct {
>>> struct block_device *bdev;  /* swap device or bdev of swap file */
>>> struct file *swap_file; /* seldom referenced */
>>> unsigned int old_block_size;/* seldom referenced */
>>> +   bool ref_initialized;   /* seldom referenced */
>>> +   struct completion comp; /* seldom referenced */
>>>  #ifdef CONFIG_FRONTSWAP
>>> unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>> atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 149e77454e3c..66515a3a2824 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -39,6 +39,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  #include 
>>>  #include 
>>> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct *work)
>>> spin_unlock(>lock);
>>>  }
>>>  
>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>> +{
>>> +   struct swap_info_struct *si;
>>> +
>>> +   si = container_of(ref, struct swap_info_struct, users);
>>> +   complete(>comp);
>>> +}
>>> +
>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  {
>>> struct swap_cluster_info *ci = si->cluster_info;
>>> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct swap_info_struct 
>>> *p, int prio,
>>>  * Guarantee swap_map, cluster_info, etc. fields are valid
>>>  * between get/put_swap_device() if SWP_VALID bit is set
>>>  */
>>> -   synchronize_rcu();
>> 
>> You cannot remove this without changing get/put_swap_device().  It's
>> better to squash at least PATCH 1-2.
>
> Will squash PATCH 1-2. Thanks.
>
>> 
>>> +   percpu_ref_resurrect(>users);
>>> spin_lock(_lock);
>>> spin_lock(>lock);
>>> _enable_swap_info(p);
>>> @@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>>> specialfile)
>>> p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>>> spin_unlock(>lock);
>>> spin_unlock(_lock);
>>> +
>>> +   percpu_ref_kill(>users);
>>> /*
>>> -* wait for swap operations protected by get/put_swap_device()
>>> -* to complete
>>> +* We need synchronize_rcu() here to protect the accessing
>>> +* to the swap cache data structure.
>>>  */
>>> synchronize_rcu();
>>> +   /*
>>> +* Wait for swap operations protected by get/put_swap_device()
>>> +* to complete.
>>> +*/
>> 
>> I think the comments (after some revision) can be moved before
>> percpu_ref_kill().  The synchronize_rcu() comments can be merged.
>> 
>
> Ok.
>
>>> +   wait_for_completion(>comp);
>>>  
>>> flush_work(>discard_work);
>>>  
>>> @@ -3132,7 +3148,7 @@

Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

2021-04-19 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/19 10:15, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1   CPU 2
>>> -   -
>>> shmem_swapin
>>>   swap_cluster_readahead
>>> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
>>> swapoff
>>>   si->flags &= ~SWP_VALID;
>>>   ..
>>>   synchronize_rcu();
>>>   ..
>> 
>> You have removed these code in the previous patches of the series.  And
>> they are not relevant in this patch.
>
> Yes, I should change these. Thanks.
>
>> 
>>>   si->swap_file = NULL;
>>> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>>>
>>> Close this race window by using get/put_swap_device() to guard against
>>> concurrent swapoff.
>>>
>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>> 
>> No.  This isn't the commit that introduces the race condition.  Please
>> recheck your git blame result.
>> 
>
> I think it is really hard to find the exact commit. I used git blame and found
> this race has existed since this code was introduced. Any suggestion?
> Thanks.

I think the commit that introduces the race condition is commit
8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or
not")

Best Regards,
Huang, Ying

>> Best Regards,
>> Huang, Ying
>> 
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  mm/shmem.c | 6 ++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 26c76b13ad23..936ba5595297 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct 
>>> vm_area_struct *vma)
>>>  static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>>> struct shmem_inode_info *info, pgoff_t index)
>>>  {
>>> +   struct swap_info_struct *si;
>>> struct vm_area_struct pvma;
>>> struct page *page;
>>> struct vm_fault vmf = {
>>> .vma = ,
>>> };
>>>  
>>> +   /* Prevent swapoff from happening to us. */
>>> +   si = get_swap_device(swap);
>>> +   if (unlikely(!si))
>>> +   return NULL;
>>> shmem_pseudo_vma_init(, info, index);
>>> page = swap_cluster_readahead(swap, gfp, );
>>> shmem_pseudo_vma_destroy();
>>> +   put_swap_device(si);
>>>  
>>> return page;
>>>  }
>> .
>> 


Re: [PATCH v2 2/5] mm/swapfile: use percpu_ref to serialize against concurrent swapoff

2021-04-18 Thread Huang, Ying
Miaohe Lin  writes:

> Use percpu_ref to serialize against concurrent swapoff. Also remove the
> SWP_VALID flag because it's used together with RCU solution.
>
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h |  3 +--
>  mm/swapfile.c| 43 +--
>  2 files changed, 18 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 8be36eb58b7a..993693b38109 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -177,7 +177,6 @@ enum {
>   SWP_PAGE_DISCARD = (1 << 10),   /* freed swap page-cluster discards */
>   SWP_STABLE_WRITES = (1 << 11),  /* no overwrite PG_writeback pages */
>   SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
> - SWP_VALID   = (1 << 13),/* swap is valid to be operated on? */
>   /* add others here before... */
>   SWP_SCANNING= (1 << 14),/* refcount in scan_swap_map */
>  };
> @@ -514,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>  
>  static inline void put_swap_device(struct swap_info_struct *si)
>  {
> - rcu_read_unlock();
> + percpu_ref_put(&si->users);
>  }
>  
>  #else /* CONFIG_SWAP */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 66515a3a2824..90e197bc2eeb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1279,18 +1279,12 @@ static unsigned char __swap_entry_free_locked(struct 
> swap_info_struct *p,
>   * via preventing the swap device from being swapoff, until
>   * put_swap_device() is called.  Otherwise return NULL.
>   *
> - * The entirety of the RCU read critical section must come before the
> - * return from or after the call to synchronize_rcu() in
> - * enable_swap_info() or swapoff().  So if "si->flags & SWP_VALID" is
> - * true, the si->map, si->cluster_info, etc. must be valid in the
> - * critical section.
> - *
>   * Notice that swapoff or swapoff+swapon can still happen before the
> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
> - * in put_swap_device() if there isn't any other way to prevent
> - * swapoff, such as page lock, page table lock, etc.  The caller must
> - * be prepared for that.  For example, the following situation is
> - * possible.
> + * percpu_ref_tryget_live() in get_swap_device() or after the
> + * percpu_ref_put() in put_swap_device() if there isn't any other way
> + * to prevent swapoff, such as page lock, page table lock, etc.  The
> + * caller must be prepared for that.  For example, the following
> + * situation is possible.
>   *
>   *   CPU1CPU2
>   *   do_swap_page()
> @@ -1318,21 +1312,24 @@ struct swap_info_struct *get_swap_device(swp_entry_t 
> entry)
>   si = swp_swap_info(entry);
>   if (!si)
>   goto bad_nofile;
> -
> - rcu_read_lock();
> - if (data_race(!(si->flags & SWP_VALID)))
> - goto unlock_out;
> + if (!percpu_ref_tryget_live(&si->users))
> + goto out;
> + /*
> +  * Guarantee we will not reference uninitialized fields
> +  * of swap_info_struct.
> +  */

/*
 * Guarantee the si->users are checked before accessing other fields of
 * swap_info_struct.
*/

> + smp_rmb();

Usually, smp_rmb() needs to be paired with smp_wmb(), and some comments are
needed for that.  Here smp_rmb() is paired with the spin_unlock() after
setup_swap_info() in enable_swap_info().

>   offset = swp_offset(entry);
>   if (offset >= si->max)
> - goto unlock_out;
> + goto put_out;
>  
>   return si;
>  bad_nofile:
>   pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
>  out:
>   return NULL;
> -unlock_out:
> - rcu_read_unlock();
> +put_out:
> + percpu_ref_put(&si->users);
>   return NULL;
>  }
>  
> @@ -2475,7 +2472,7 @@ static void setup_swap_info(struct swap_info_struct *p, 
> int prio,
>  
>  static void _enable_swap_info(struct swap_info_struct *p)
>  {
> - p->flags |= SWP_WRITEOK | SWP_VALID;
> + p->flags |= SWP_WRITEOK;
>   atomic_long_add(p->pages, &nr_swap_pages);
>   total_swap_pages += p->pages;
>  
> @@ -2507,7 +2504,7 @@ static void enable_swap_info(struct swap_info_struct 
> *p, int prio,
>   spin_unlock(&swap_lock);
>   /*
>* Guarantee swap_map, cluster_info, etc. fields are valid
> -  * between get/put_swap_device() if SWP_VALID bit is set
> +  * between get/put_swap_device().
>*/

The comments need to be revised.  Something like below?

/* Finished initialize

Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-18 Thread Huang, Ying
Miaohe Lin  writes:

> We will use percpu-refcount to serialize against concurrent swapoff. This
> patch adds the percpu_ref support for swap.
>
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h |  3 +++
>  mm/swapfile.c| 33 +
>  2 files changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 144727041e78..8be36eb58b7a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>   * The in-memory structure used to track swap areas.
>   */
>  struct swap_info_struct {
> + struct percpu_ref users;/* serialization against concurrent 
> swapoff */

The comments aren't general enough.  We use this to check whether the
swap device has been fully initialized, etc. May be something as below?

/* indicate and keep swap device valid */

>   unsigned long   flags;  /* SWP_USED etc: see above */
>   signed shortprio;   /* swap priority of this type */
>   struct plist_node list; /* entry in swap_active_head */
> @@ -260,6 +261,8 @@ struct swap_info_struct {
>   struct block_device *bdev;  /* swap device or bdev of swap file */
>   struct file *swap_file; /* seldom referenced */
>   unsigned int old_block_size;/* seldom referenced */
> + bool ref_initialized;   /* seldom referenced */
> + struct completion comp; /* seldom referenced */
>  #ifdef CONFIG_FRONTSWAP
>   unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>   atomic_t frontswap_pages;   /* frontswap pages in-use counter */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 149e77454e3c..66515a3a2824 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -39,6 +39,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct *work)
>   spin_unlock(&si->lock);
>  }
>  
> +static void swap_users_ref_free(struct percpu_ref *ref)
> +{
> + struct swap_info_struct *si;
> +
> + si = container_of(ref, struct swap_info_struct, users);
> + complete(&si->comp);
> +}
> +
>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>  {
>   struct swap_cluster_info *ci = si->cluster_info;
> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct swap_info_struct 
> *p, int prio,
>* Guarantee swap_map, cluster_info, etc. fields are valid
>* between get/put_swap_device() if SWP_VALID bit is set
>*/
> - synchronize_rcu();

You cannot remove this without changing get/put_swap_device().  It's
better to squash at least PATCH 1-2.

> + percpu_ref_resurrect(&p->users);
>   spin_lock(&swap_lock);
>   spin_lock(&p->lock);
>   _enable_swap_info(p);
> @@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
> specialfile)
>   p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>   spin_unlock(&p->lock);
>   spin_unlock(&swap_lock);
> +
> + percpu_ref_kill(&p->users);
>   /*
> -  * wait for swap operations protected by get/put_swap_device()
> -  * to complete
> +  * We need synchronize_rcu() here to protect the accessing
> +  * to the swap cache data structure.
>*/
>   synchronize_rcu();
> + /*
> +  * Wait for swap operations protected by get/put_swap_device()
> +  * to complete.
> +  */

I think the comments (after some revision) can be moved before
percpu_ref_kill().  The synchronize_rcu() comments can be merged.
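
Something like this (sketch only, comments still to be polished):

	/*
	 * Prevent new get_swap_device() callers from taking a reference,
	 * then wait for operations protected by get/put_swap_device()
	 * to complete.
	 */
	percpu_ref_kill(&p->users);
	/*
	 * Also wait for an RCU grace period to protect readers of the
	 * swap cache data structures.
	 */
	synchronize_rcu();
	wait_for_completion(&p->comp);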

> + wait_for_completion(&p->comp);
>  
>   flush_work(&p->discard_work);
>  
> @@ -3132,7 +3148,7 @@ static bool swap_discardable(struct swap_info_struct 
> *si)
>  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  {
>   struct swap_info_struct *p;
> - struct filename *name;
> + struct filename *name = NULL;
>   struct file *swap_file = NULL;
>   struct address_space *mapping;
>   int prio;
> @@ -3163,6 +3179,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
> specialfile, int, swap_flags)
>  
>   INIT_WORK(&p->discard_work, swap_discard_work);
>  
> + if (!p->ref_initialized) {

I don't think it's necessary to add another flag p->ref_initialized.  We
can distinguish newly allocated and reused swap_info_struct in 
alloc_swap_info().
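
For example, something like below (sketch, untested): only the freshly
allocated structure needs percpu_ref_init(), and a reused slot already
carries an initialized ref from its previous life.

	/* right after the kvzalloc() of p in alloc_swap_info() */
	if (percpu_ref_init(&p->users, swap_users_ref_free,
			    PERCPU_REF_INIT_DEAD, GFP_KERNEL)) {
		kvfree(p);
		return ERR_PTR(-ENOMEM);
	}

	/*
	 * ... later, if an existing free slot is reused, the freshly
	 * allocated structure is discarded, so tear its ref down again
	 */
	if (defer) {
		percpu_ref_exit(&defer->users);
		kvfree(defer);
	}
	init_completion(&p->comp);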

Best Regards,
Huang, Ying

> + error = percpu_ref_init(&p->users, swap_users_ref_free,
> + PERCPU_REF_INIT_DEAD, GFP_KERNEL);
> + if (unlikely(error))
> + goto bad_swap;
> + init_completion(&p->comp);
> + p->ref_initialized = true;
> + }
> +
>   name = getname(specialfile);
>   if (IS_ERR(name)) {
>   error = PTR_ERR(name);


Re: [PATCH v2 3/5] swap: fix do_swap_page() race with swapoff

2021-04-18 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> - -
> do_swap_page

This is OK for swap cache cases.  So

  if (data_race(si->flags & SWP_SYNCHRONOUS_IO))

should be shown here.

>   swap_readpage(skip swap cache case)
> if (data_race(sis->flags & SWP_FS_OPS)) {
>   swapoff
> p->flags &= ~SWP_VALID;
> ..
> synchronize_rcu();
> ..
> p->swap_file = NULL;
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[oops!]
>
> Note that for the pages that are swapped in through swap cache, this isn't
> an issue. Because the page is locked, and the swap entry will be marked
> with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
> unlocked.
>
> Using current get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take really long
> time. And this race may not be really pernicious because swapoff is usually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window(as suggested by Huang, Ying).

I still suggest squashing PATCH 1-3, or at least PATCH 1-2.  That will
change the relevant code together and make it easier to review.

Best Regards,
Huang, Ying

> Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous 
> device")
> Reported-by: kernel test robot  (auto build test ERROR)
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h | 9 +
>  mm/memory.c  | 9 +
>  2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 993693b38109..523c2411a135 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -528,6 +528,15 @@ static inline struct swap_info_struct 
> *swp_swap_info(swp_entry_t entry)
>   return NULL;
>  }
>  
> +static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> + return NULL;
> +}
> +
> +static inline void put_swap_device(struct swap_info_struct *si)
> +{
> +}
> +
>  #define swap_address_space(entry)(NULL)
>  #define get_nr_swap_pages()  0L
>  #define total_swap_pages 0L
> diff --git a/mm/memory.c b/mm/memory.c
> index 27014c3bde9f..7a2fe12cf641 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
>   swp_entry_t entry;
>   pte_t pte;
>   int locked;
> @@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   goto out;
>   }
>  
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(entry);
> + if (unlikely(!si))
> + goto out;
>  
>   delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
>   page = lookup_swap_cache(entry, vma, vmf->address);
> @@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
> + if (si)
> + put_swap_device(si);
>   return ret;
>  out_nomap:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   unlock_page(swapcache);
>   put_page(swapcache);
>   }
> + if (si)
> + put_swap_device(si);
>   return ret;
>  }


Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

2021-04-18 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1   CPU 2
> -   -
> shmem_swapin
>   swap_cluster_readahead
> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
> swapoff
>   si->flags &= ~SWP_VALID;
>   ..
>   synchronize_rcu();
>   ..

You have removed these code in the previous patches of the series.  And
they are not relevant in this patch.

>   si->swap_file = NULL;
> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>
> Close this race window by using get/put_swap_device() to guard against
> concurrent swapoff.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

No.  This isn't the commit that introduces the race condition.  Please
recheck your git blame result.

Best Regards,
Huang, Ying

> Signed-off-by: Miaohe Lin 
> ---
>  mm/shmem.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 26c76b13ad23..936ba5595297 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct 
> vm_area_struct *vma)
>  static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>   struct shmem_inode_info *info, pgoff_t index)
>  {
> + struct swap_info_struct *si;
>   struct vm_area_struct pvma;
>   struct page *page;
>   struct vm_fault vmf = {
>   .vma = &pvma,
>   };
>  
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(swap);
> + if (unlikely(!si))
> + return NULL;
>   shmem_pseudo_vma_init(&pvma, info, index);
>   page = swap_cluster_readahead(swap, gfp, &vmf);
>   shmem_pseudo_vma_destroy(&pvma);
> + put_swap_device(si);
>  
>   return page;
>  }


Re: [PATCH v2 4/5] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-18 Thread Huang, Ying
Miaohe Lin  writes:

> While we released the pte lock, somebody else might faulted in this pte.
> So we should check whether it's swap pte first to guard against such race
> or swp_type would be unexpected. But the swap_entry isn't used in this
> function and we will have enough checking when we really operate the PTE
> entries later. So checking for non_swap_entry() is not really needed here
> and should be removed to avoid confusion.

Please rephrase the change log to describe why we have the code and why
it's unnecessary now.  You can dig through the git history via git-blame
to find that out.

The patch itself looks good to me.

Best Regards,
Huang, Ying

> Signed-off-by: Miaohe Lin 
> ---
>  mm/swap_state.c | 6 --
>  1 file changed, 6 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 272ea2108c9d..df5405384520 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -721,7 +721,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   unsigned long ra_val;
> - swp_entry_t entry;
>   unsigned long faddr, pfn, fpfn;
>   unsigned long start, end;
>   pte_t *pte, *orig_pte;
> @@ -739,11 +738,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>  
>   faddr = vmf->address;
>   orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> - entry = pte_to_swp_entry(*pte);
> - if ((unlikely(non_swap_entry(entry {
> - pte_unmap(orig_pte);
> - return;
> - }
>  
>   fpfn = PFN_DOWN(faddr);
>   ra_val = GET_SWAP_RA_VAL(vma);


Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-16 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/15 22:31, Dennis Zhou wrote:
>> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
>>> Dennis Zhou  writes:
>>>
>>>> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>>>>> Dennis Zhou  writes:
>>>>>
>>>>>> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>>>>>>> Dennis Zhou  writes:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>
>>>>>>>>>> On 2021/4/14 9:17, Huang, Ying wrote:
>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>
>>>>>>>>>>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>>>>>>>>>>> "Huang, Ying"  writes:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We will use percpu-refcount to serialize against concurrent 
>>>>>>>>>>>>>>> swapoff. This
>>>>>>>>>>>>>>> patch adds the percpu_ref support for later fixup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Miaohe Lin 
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>  include/linux/swap.h |  2 ++
>>>>>>>>>>>>>>>  mm/swapfile.c| 25 ++---
>>>>>>>>>>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>>>>>>>>>>> index 144727041e78..849ba5265c11 100644
>>>>>>>>>>>>>>> --- a/include/linux/swap.h
>>>>>>>>>>>>>>> +++ b/include/linux/swap.h
>>>>>>>>>>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>>>>>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>>  struct swap_info_struct {
>>>>>>>>>>>>>>> +   struct percpu_ref users;/* serialization 
>>>>>>>>>>>>>>> against concurrent swapoff */
>>>>>>>>>>>>>>> unsigned long   flags;  /* SWP_USED etc: see 
>>>>>>>>>>>>>>> above */
>>>>>>>>>>>>>>> signed shortprio;   /* swap priority of 
>>>>>>>>>>>>>>> this type */
>>>>>>>>>>>>>>> struct plist_node list; /* entry in 
>>>>>>>>>>>>>>> swap_active_head */
>>>>>>>>>>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>>>>>>>>>>> struct block_device *bdev;  /* swap device or bdev 
>>>>>>>>>>>>>>> of swap file */
>>>>>>>>>>>>>>> struct file *swap_file; /* seldom referenced */
>>>>>>>>>>>>>>> unsigned int old_block_size;/* seldom referenced */
>>>>>>>>>>>>>>> +   struct completion comp; /* seldom referenced */
>>>>>>>>>>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>>>>>>>>>> unsigned long *frontswap_map;   /* frontswap in-use, 
>>>>>>>>>>>>>>> one bit per page */
>>>>>>>>>>>>>>> atomic_t frontswap_pages;   /* frontswap pages 
>>>>>>>>>>>>>>> in-use counter */
>>>>>>>>>>>

Re: [RFC PATCH] percpu_ref: Make percpu_ref_tryget*() ACQUIRE operations

2021-04-16 Thread Huang, Ying
Kent Overstreet  writes:

> On Thu, Apr 15, 2021 at 09:42:56PM -0700, Paul E. McKenney wrote:
>> On Tue, Apr 13, 2021 at 10:47:03AM +0800, Huang Ying wrote:
>> > One typical use case of percpu_ref_tryget() family functions is as
>> > follows,
>> > 
>> >   if (percpu_ref_tryget(&p->ref)) {
>> >  /* Operate on the other fields of *p */
>> >   }
>> > 
>> > The refcount needs to be checked before operating on the other fields
>> > of the data structure (*p), otherwise, the values gotten from the
>> > other fields may be invalid or inconsistent.  To guarantee the correct
>> > memory ordering, percpu_ref_tryget*() needs to be the ACQUIRE
>> > operations.
>> 
>> I am not seeing the need for this.
>> 
>> If __ref_is_percpu() returns true, then the overall count must be non-zero
>> and there will be an RCU grace period between now and the time that this
>> count becomes zero.  For the calls to __ref_is_percpu() enclosed within
>> rcu_read_lock() and rcu_read_unlock(), the grace period will provide
>> the needed ordering.  (See the comment header for the synchronize_rcu()
>> function.)
>> 
>> Otherwise, when __ref_is_percpu() returns false, its caller does a
>> value-returning atomic read-modify-write operation, which provides
>> full ordering.

Hi, Paul,

Yes, for the cases you described (from non-zero to 0), the current code
works well; no changes are needed.

>> Either way, the required acquire semantics (and more) are already
>> provided, and in particular, this analysis covers the percpu_ref_tryget()
>> you call out above.
>> 
>> Or am I missing something subtle here?
>
> I think you're right, but some details about the race we're concerned about
> would be helpful. Are we concerned about seeing values from after the ref has
> hit 0? In that case I agree with Paul. Or is the concern about seeing values
> from before a transition from 0 to nonzero?

Hi, Kent,

Yes, that's exactly what I'm concerned about.  In the swap code, we may get
a pointer to a data structure (swap_info_struct) when its refcount is 0
(not fully initialized), and we cannot access the other fields of the
data structure until its refcount becomes non-zero (fully initialized).
So the ordering must be guaranteed between checking the refcount and
accessing the other fields of the data structure.
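
To make it concrete, the pattern in the swap code is roughly as below
(simplified sketch, not the exact code):

	/* swapon side, in enable_swap_info(), simplified: */
	spin_lock(&swap_lock);
	setup_swap_info(p, prio, swap_map, cluster_info);	/* init fields */
	spin_unlock(&swap_lock);		/* RELEASE: publish the fields */
	percpu_ref_resurrect(&p->users);	/* now readers may take a ref */

	/* reader side, in get_swap_device(), simplified: */
	if (!percpu_ref_tryget_live(&si->users))
		return NULL;
	/*
	 * Without an ACQUIRE here, the loads below could be reordered
	 * before the refcount check and observe pre-initialization values.
	 */
	smp_rmb();
	/* ... now it is safe to read si->swap_map, si->cluster_info, etc. */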

I have discussed with Dennis Zhou about this in another thread too,

https://lore.kernel.org/lkml/87o8egp1bk@yhuang6-desk1.ccr.corp.intel.com/
https://lore.kernel.org/lkml/yhholuiar3qj1...@google.com/

He thinks the use case of the swap code isn't typical, so he prefers to
deal with it in the swap code, such as by adding an smp_rmb() after
percpu_ref_tryget_live(), etc.

So, if the transition from 0 to non-zero isn't a concern in most other
use cases, I am fine with dealing with it in the swap code.

> That wasn't a concern when I wrote
> the code for the patterns of use I had in mind, but Tejun's done some stuff 
> with
> the code since.
>
> Huang, can you elaborate?

Best Regards,
Huang, Ying


Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-15 Thread Huang, Ying
Dennis Zhou  writes:

> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
>> Dennis Zhou  writes:
>> 
>> > On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>> >> Dennis Zhou  writes:
>> >> 
>> >> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>> >> >> Dennis Zhou  writes:
>> >> >> 
>> >> >> > Hello,
>> >> >> >
>> >> >> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> >> >> >> Miaohe Lin  writes:
>> >> >> >> 
>> >> >> >> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> >> >> >> Miaohe Lin  writes:
>> >> >> >> >> 
>> >> >> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
>> >> >> >> >>>> "Huang, Ying"  writes:
>> >> >> >> >>>>
>> >> >> >> >>>>> Miaohe Lin  writes:
>> >> >> >> >>>>>
>> >> >> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
>> >> >> >> >>>>>> swapoff. This
>> >> >> >> >>>>>> patch adds the percpu_ref support for later fixup.
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Signed-off-by: Miaohe Lin 
>> >> >> >> >>>>>> ---
>> >> >> >> >>>>>>  include/linux/swap.h |  2 ++
>> >> >> >> >>>>>>  mm/swapfile.c| 25 ++---
>> >> >> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> >> >>>>>> index 144727041e78..849ba5265c11 100644
>> >> >> >> >>>>>> --- a/include/linux/swap.h
>> >> >> >> >>>>>> +++ b/include/linux/swap.h
>> >> >> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>> >> >> >> >>>>>>   * The in-memory structure used to track swap areas.
>> >> >> >> >>>>>>   */
>> >> >> >> >>>>>>  struct swap_info_struct {
>> >> >> >> >>>>>> +struct percpu_ref users;/* serialization 
>> >> >> >> >>>>>> against concurrent swapoff */
>> >> >> >> >>>>>>  unsigned long   flags;  /* SWP_USED etc: see 
>> >> >> >> >>>>>> above */
>> >> >> >> >>>>>>  signed shortprio;   /* swap priority of 
>> >> >> >> >>>>>> this type */
>> >> >> >> >>>>>>  struct plist_node list; /* entry in 
>> >> >> >> >>>>>> swap_active_head */
>> >> >> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>> >> >> >> >>>>>>  struct block_device *bdev;  /* swap device or bdev 
>> >> >> >> >>>>>> of swap file */
>> >> >> >> >>>>>>  struct file *swap_file; /* seldom referenced */
>> >> >> >> >>>>>>  unsigned int old_block_size;/* seldom referenced */
>> >> >> >> >>>>>> +struct completion comp; /* seldom referenced */
>> >> >> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
>> >> >> >> >>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, 
>> >> >> >> >>>>>> one bit per page */
>> >> >> >> >>>>>>  atomic_t frontswap_pages;   /* frontswap pages 
>> >> >> >> >>>>>> in-use counter */
>> >> >> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> >> >>>

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-15 Thread Huang, Ying
Yu Zhao  writes:

> On Wed, Apr 14, 2021 at 9:00 PM Andi Kleen  wrote:
>>
>> > We fall back to the rmap when it's obviously not smart to do so. There
>> > is still a lot of room for improvement in this function though, i.e.,
>> > it should be per VMA and NUMA aware.
>>
>> Okay so it's more a question to tune the cross over heuristic. That
>> sounds much easier than replacing everything.
>>
>> Of course long term it might be a problem to maintain too many
>> different ways to do things, but I suppose short term it's a reasonable
>> strategy.
>
> Hi Rik, Ying,
>
> Sorry for being persistent. I want to make sure we are on the same page:
>
> Page table scanning doesn't replace the existing rmap walk. It is
> complementary and only happens when it is likely that most of the
> pages on a system under pressure have been referenced, i.e., out of
> *inactive* pages, by definition of the existing implementation. Under
> such a condition, scanning *active* pages one by one with the rmap is
> likely to cost more than scanning them all at once via page tables.
> When we evict *inactive* pages, we still use the rmap and share a
> common path with the existing code.
>
> Page table scanning falls back to the rmap walk if the page tables of
> a process are apparently sparse, i.e., rss < size of the page tables.
>
> I should have clarified this at the very beginning of the discussion.
> But it has become so natural to me and I assumed we'd all see it this
> way.
>
> Your concern regarding the NUMA optimization is still valid, and it's
> a high priority.

Hi, Yu,

In general, I think it's a good idea to combine page table scanning
and rmap scanning in page reclaim.  For example, when the working set
transitions, we can take advantage of the fast page table scanning to
identify the new working set quickly, while we can fall back to rmap
scanning if the page table scanning doesn't help.

Best Regards,
Huang, Ying


Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count

2021-04-15 Thread Huang, Ying
"Zi Yan"  writes:

> On 13 Apr 2021, at 23:00, Huang, Ying wrote:
>
>> Yang Shi  writes:
>>
>>> The generic migration path will check refcount, so no need check refcount 
>>> here.
>>> But the old code actually prevents from migrating shared THP (mapped by 
>>> multiple
>>> processes), so bail out early if mapcount is > 1 to keep the behavior.
>>
What prevents us from migrating shared THP?  If nothing does, why not just
remove the old refcount check?
>
> If two or more processes are in different NUMA nodes, a THP shared by them 
> can be
> migrated back and forth between NUMA nodes, which is quite costly. Unless we 
> have
> a better way of figuring out a good location for such pages to reduce the 
> number
> of migration, it might be better not to move them, right?
>

A mechanism has already been provided in should_numa_migrate_memory() to
distinguish shared pages from private pages.  Have you found that it
doesn't work well in some situations?

The multiple threads in one process that run on different NUMA nodes
may share pages too.  So excluding pages shared by multiple processes
isn't a good solution.

Best Regards,
Huang, Ying


Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-14 Thread Huang, Ying
Dennis Zhou  writes:

> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>> Dennis Zhou  writes:
>> 
>> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>> >> Dennis Zhou  writes:
>> >> 
>> >> > Hello,
>> >> >
>> >> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> >> >> Miaohe Lin  writes:
>> >> >> 
>> >> >> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> >> >> Miaohe Lin  writes:
>> >> >> >> 
>> >> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
>> >> >> >>>> "Huang, Ying"  writes:
>> >> >> >>>>
>> >> >> >>>>> Miaohe Lin  writes:
>> >> >> >>>>>
>> >> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
>> >> >> >>>>>> swapoff. This
>> >> >> >>>>>> patch adds the percpu_ref support for later fixup.
>> >> >> >>>>>>
>> >> >> >>>>>> Signed-off-by: Miaohe Lin 
>> >> >> >>>>>> ---
>> >> >> >>>>>>  include/linux/swap.h |  2 ++
>> >> >> >>>>>>  mm/swapfile.c| 25 ++---
>> >> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>> >> >> >>>>>>
>> >> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> >>>>>> index 144727041e78..849ba5265c11 100644
>> >> >> >>>>>> --- a/include/linux/swap.h
>> >> >> >>>>>> +++ b/include/linux/swap.h
>> >> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>> >> >> >>>>>>   * The in-memory structure used to track swap areas.
>> >> >> >>>>>>   */
>> >> >> >>>>>>  struct swap_info_struct {
>> >> >> >>>>>> +   struct percpu_ref users;/* serialization 
>> >> >> >>>>>> against concurrent swapoff */
>> >> >> >>>>>> unsigned long   flags;  /* SWP_USED etc: see 
>> >> >> >>>>>> above */
>> >> >> >>>>>> signed shortprio;   /* swap priority of 
>> >> >> >>>>>> this type */
>> >> >> >>>>>> struct plist_node list; /* entry in 
>> >> >> >>>>>> swap_active_head */
>> >> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>> >> >> >>>>>> struct block_device *bdev;  /* swap device or bdev 
>> >> >> >>>>>> of swap file */
>> >> >> >>>>>> struct file *swap_file; /* seldom referenced */
>> >> >> >>>>>> unsigned int old_block_size;/* seldom referenced */
>> >> >> >>>>>> +   struct completion comp; /* seldom referenced */
>> >> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
>> >> >> >>>>>> unsigned long *frontswap_map;   /* frontswap in-use, 
>> >> >> >>>>>> one bit per page */
>> >> >> >>>>>> atomic_t frontswap_pages;   /* frontswap pages 
>> >> >> >>>>>> in-use counter */
>> >> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> >>>>>> index 149e77454e3c..724173cd7d0c 100644
>> >> >> >>>>>> --- a/mm/swapfile.c
>> >> >> >>>>>> +++ b/mm/swapfile.c
>> >> >> >>>>>> @@ -39,6 +39,7 @@
>> >> >> >>>>>>  #include 
>> >> >> >>>>>>  #include 
>> >> >> >>>>>>  #include 
>> >> >> >>>>>> +#include 
>> >> >> >>>>>>  
>&

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Huang, Ying
Yu Zhao  writes:

> On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying  wrote:
>>
>> Yu Zhao  writes:
>>
>> > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel  wrote:
>> >>
>> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
>> >> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> >> >
>> >> > > The initial posting of this patchset did no better, in fact it did
>> >> > > a bit
>> >> > > worse. Performance dropped to the same levels and kswapd was using
>> >> > > as
>> >> > > much CPU as before, but on top of that we also got excessive
>> >> > > swapping.
>> >> > > Not at a high rate, but 5-10MB/sec continually.
>> >> > >
>> >> > > I had some back and forths with Yu Zhao and tested a few new
>> >> > > revisions,
>> >> > > and the current series does much better in this regard. Performance
>> >> > > still dips a bit when page cache fills, but not nearly as much, and
>> >> > > kswapd is using less CPU than before.
>> >> >
>> >> > Profiles would be interesting, because it sounds to me like reclaim
>> >> > *might* be batching page cache removal better (e.g. fewer, larger
>> >> > batches) and so spending less time contending on the mapping tree
>> >> > lock...
>> >> >
>> >> > IOWs, I suspect this result might actually be a result of less lock
>> >> > contention due to a change in batch processing characteristics of
>> >> > the new algorithm rather than it being a "better" algorithm...
>> >>
>> >> That seems quite likely to me, given the issues we have
>> >> had with virtual scan reclaim algorithms in the past.
>> >
>> > Hi Rik,
>> >
>> > Let paste the code so we can move beyond the "batching" hypothesis:
>> >
>> > static int __remove_mapping(struct address_space *mapping, struct page
>> > *page,
>> > bool reclaimed, struct mem_cgroup 
>> > *target_memcg)
>> > {
>> > unsigned long flags;
>> > int refcount;
>> > void *shadow = NULL;
>> >
>> > BUG_ON(!PageLocked(page));
>> > BUG_ON(mapping != page_mapping(page));
>> >
>> > xa_lock_irqsave(&mapping->i_pages, flags);
>> >
>> >> SeongJae, what is this algorithm supposed to do when faced
>> >> with situations like this:
>> >
>> > I'll assume the questions were directed at me, not SeongJae.
>> >
>> >> 1) Running on a system with 8 NUMA nodes, and
>> >> memory
>> >>pressure in one of those nodes.
>> >> 2) Running PostgresQL or Oracle, with hundreds of
>> >>processes mapping the same (very large) shared
>> >>memory segment.
>> >>
>> >> How do you keep your algorithm from falling into the worst
>> >> case virtual scanning scenarios that were crippling the
>> >> 2.4 kernel 15+ years ago on systems with just a few GB of
>> >> memory?
>> >
>> > There is a fundamental shift: that time we were scanning for cold pages,
>> > and nowadays we are scanning for hot pages.
>> >
>> > I'd be surprised if scanning for cold pages didn't fall apart, because it'd
>> > find most of the entries accessed, if they are present at all.
>> >
>> > Scanning for hot pages, on the other hand, is way better. Let me just
>> > reiterate:
>> > 1) It will not scan page tables from processes that have been sleeping
>> >since the last scan.
>> > 2) It will not scan PTE tables under non-leaf PMD entries that do not
>> >have the accessed bit set, when
>> >CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
>> > 3) It will not zigzag between the PGD table and the same PMD or PTE
>> >table spanning multiple VMAs. In other words, it finishes all the
>> >VMAs with the range of the same PMD or PTE table before it returns
>> >to the PGD table. This optimizes workloads that have large numbers
>> >of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
>> >
>> > So the cost is roughly proportional to the number of referenced pages it
>> > discovers. If there is no memory pressure, no scanning at all. For a system
>> > under heavy memory pressure, most of the

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Huang, Ying
Yu Zhao  writes:

> On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel  wrote:
>>
>> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
>> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> >
>> > > The initial posting of this patchset did no better, in fact it did
>> > > a bit
>> > > worse. Performance dropped to the same levels and kswapd was using
>> > > as
>> > > much CPU as before, but on top of that we also got excessive
>> > > swapping.
>> > > Not at a high rate, but 5-10MB/sec continually.
>> > >
>> > > I had some back and forths with Yu Zhao and tested a few new
>> > > revisions,
>> > > and the current series does much better in this regard. Performance
>> > > still dips a bit when page cache fills, but not nearly as much, and
>> > > kswapd is using less CPU than before.
>> >
>> > Profiles would be interesting, because it sounds to me like reclaim
>> > *might* be batching page cache removal better (e.g. fewer, larger
>> > batches) and so spending less time contending on the mapping tree
>> > lock...
>> >
>> > IOWs, I suspect this result might actually be a result of less lock
>> > contention due to a change in batch processing characteristics of
>> > the new algorithm rather than it being a "better" algorithm...
>>
>> That seems quite likely to me, given the issues we have
>> had with virtual scan reclaim algorithms in the past.
>
> Hi Rik,
>
> Let paste the code so we can move beyond the "batching" hypothesis:
>
> static int __remove_mapping(struct address_space *mapping, struct page
> *page,
> bool reclaimed, struct mem_cgroup *target_memcg)
> {
> unsigned long flags;
> int refcount;
> void *shadow = NULL;
>
> BUG_ON(!PageLocked(page));
> BUG_ON(mapping != page_mapping(page));
>
> xa_lock_irqsave(&mapping->i_pages, flags);
>
>> SeongJae, what is this algorithm supposed to do when faced
>> with situations like this:
>
> I'll assume the questions were directed at me, not SeongJae.
>
>> 1) Running on a system with 8 NUMA nodes, and
>> memory
>>pressure in one of those nodes.
>> 2) Running PostgresQL or Oracle, with hundreds of
>>processes mapping the same (very large) shared
>>memory segment.
>>
>> How do you keep your algorithm from falling into the worst
>> case virtual scanning scenarios that were crippling the
>> 2.4 kernel 15+ years ago on systems with just a few GB of
>> memory?
>
> There is a fundamental shift: that time we were scanning for cold pages,
> and nowadays we are scanning for hot pages.
>
> I'd be surprised if scanning for cold pages didn't fall apart, because it'd
> find most of the entries accessed, if they are present at all.
>
> Scanning for hot pages, on the other hand, is way better. Let me just
> reiterate:
> 1) It will not scan page tables from processes that have been sleeping
>since the last scan.
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>have the accessed bit set, when
>CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 3) It will not zigzag between the PGD table and the same PMD or PTE
>table spanning multiple VMAs. In other words, it finishes all the
>VMAs with the range of the same PMD or PTE table before it returns
>to the PGD table. This optimizes workloads that have large numbers
>of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
>
> So the cost is roughly proportional to the number of referenced pages it
> discovers. If there is no memory pressure, no scanning at all. For a system
> under heavy memory pressure, most of the pages are referenced (otherwise
> why would it be under memory pressure?), and if we use the rmap, we need to
> scan a lot of pages anyway. Why not just scan them all?

This may not be the case.  For rmap scanning, it's possible to scan only
a small portion of memory.  But with page table scanning, you need to
scan almost all of it (I understand you have the optimizations listed
above).  As Rik showed in the test case above, there may be memory
pressure on only one of the 8 NUMA nodes (because of NUMA binding?).
Then rmap scanning only needs to scan pages in that node, while page
table scanning may need to scan pages in the other nodes too.

Best Regards,
Huang, Ying

> This way you save a
> lot because of batching (now it's time to talk about batching). Besides,
> page tables have far better memory locality than the rmap. For the shared
> memory example you gave, the rmap need

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Dennis Zhou  writes:

> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>> Dennis Zhou  writes:
>> 
>> > Hello,
>> >
>> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> >> Miaohe Lin  writes:
>> >> 
>> >> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> >> Miaohe Lin  writes:
>> >> >> 
>> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
>> >> >>>> "Huang, Ying"  writes:
>> >> >>>>
>> >> >>>>> Miaohe Lin  writes:
>> >> >>>>>
>> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
>> >> >>>>>> swapoff. This
>> >> >>>>>> patch adds the percpu_ref support for later fixup.
>> >> >>>>>>
>> >> >>>>>> Signed-off-by: Miaohe Lin 
>> >> >>>>>> ---
>> >> >>>>>>  include/linux/swap.h |  2 ++
>> >> >>>>>>  mm/swapfile.c| 25 ++---
>> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>> >> >>>>>>
>> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >>>>>> index 144727041e78..849ba5265c11 100644
>> >> >>>>>> --- a/include/linux/swap.h
>> >> >>>>>> +++ b/include/linux/swap.h
>> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>> >> >>>>>>   * The in-memory structure used to track swap areas.
>> >> >>>>>>   */
>> >> >>>>>>  struct swap_info_struct {
>> >> >>>>>> +  struct percpu_ref users;/* serialization against 
>> >> >>>>>> concurrent swapoff */
>> >> >>>>>>unsigned long   flags;  /* SWP_USED etc: see above */
>> >> >>>>>>signed shortprio;   /* swap priority of this type */
>> >> >>>>>>struct plist_node list; /* entry in swap_active_head */
>> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>> >> >>>>>>struct block_device *bdev;  /* swap device or bdev of swap 
>> >> >>>>>> file */
>> >> >>>>>>struct file *swap_file; /* seldom referenced */
>> >> >>>>>>unsigned int old_block_size;/* seldom referenced */
>> >> >>>>>> +  struct completion comp; /* seldom referenced */
>> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
>> >> >>>>>>unsigned long *frontswap_map;   /* frontswap in-use, one bit 
>> >> >>>>>> per page */
>> >> >>>>>>atomic_t frontswap_pages;   /* frontswap pages in-use 
>> >> >>>>>> counter */
>> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >>>>>> index 149e77454e3c..724173cd7d0c 100644
>> >> >>>>>> --- a/mm/swapfile.c
>> >> >>>>>> +++ b/mm/swapfile.c
>> >> >>>>>> @@ -39,6 +39,7 @@
>> >> >>>>>>  #include 
>> >> >>>>>>  #include 
>> >> >>>>>>  #include 
>> >> >>>>>> +#include 
>> >> >>>>>>  
>> >> >>>>>>  #include 
>> >> >>>>>>  #include 
>> >> >>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct 
>> >> >>>>>> work_struct *work)
>> >> >>>>>>spin_unlock(>lock);
>> >> >>>>>>  }
>> >> >>>>>>  
>> >> >>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>> >> >>>>>> +{
>> >> >>>>>> +  struct swap_info_struct *si;
>> >> >>>>>> +
>> >> >>>>>> +  si = container_of(ref, struct swap_info_struct, users);
>> >> >>>>>> +  complete(>comp);
>> &

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Dennis Zhou  writes:

> Hello,
>
> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> Miaohe Lin  writes:
>> >> 
>> >>> On 2021/4/12 15:24, Huang, Ying wrote:
>> >>>> "Huang, Ying"  writes:
>> >>>>
>> >>>>> Miaohe Lin  writes:
>> >>>>>
>> >>>>>> We will use percpu-refcount to serialize against concurrent swapoff. 
>> >>>>>> This
>> >>>>>> patch adds the percpu_ref support for later fixup.
>> >>>>>>
>> >>>>>> Signed-off-by: Miaohe Lin 
>> >>>>>> ---
>> >>>>>>  include/linux/swap.h |  2 ++
>> >>>>>>  mm/swapfile.c| 25 ++---
>> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>> >>>>>>
>> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >>>>>> index 144727041e78..849ba5265c11 100644
>> >>>>>> --- a/include/linux/swap.h
>> >>>>>> +++ b/include/linux/swap.h
>> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>> >>>>>>   * The in-memory structure used to track swap areas.
>> >>>>>>   */
>> >>>>>>  struct swap_info_struct {
>> >>>>>> + struct percpu_ref users;/* serialization against 
>> >>>>>> concurrent swapoff */
>> >>>>>>   unsigned long   flags;  /* SWP_USED etc: see above */
>> >>>>>>   signed shortprio;   /* swap priority of this type */
>> >>>>>>   struct plist_node list; /* entry in swap_active_head */
>> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>> >>>>>>   struct block_device *bdev;  /* swap device or bdev of swap 
>> >>>>>> file */
>> >>>>>>   struct file *swap_file; /* seldom referenced */
>> >>>>>>   unsigned int old_block_size;/* seldom referenced */
>> >>>>>> + struct completion comp; /* seldom referenced */
>> >>>>>>  #ifdef CONFIG_FRONTSWAP
>> >>>>>>   unsigned long *frontswap_map;   /* frontswap in-use, one bit 
>> >>>>>> per page */
>> >>>>>>   atomic_t frontswap_pages;   /* frontswap pages in-use 
>> >>>>>> counter */
>> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >>>>>> index 149e77454e3c..724173cd7d0c 100644
>> >>>>>> --- a/mm/swapfile.c
>> >>>>>> +++ b/mm/swapfile.c
>> >>>>>> @@ -39,6 +39,7 @@
>> >>>>>>  #include 
>> >>>>>>  #include 
>> >>>>>>  #include 
>> >>>>>> +#include 
>> >>>>>>  
>> >>>>>>  #include 
>> >>>>>>  #include 
>> >>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct 
>> >>>>>> *work)
>> >>>>>>   spin_unlock(>lock);
>> >>>>>>  }
>> >>>>>>  
>> >>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>> >>>>>> +{
>> >>>>>> + struct swap_info_struct *si;
>> >>>>>> +
>> >>>>>> + si = container_of(ref, struct swap_info_struct, users);
>> >>>>>> + complete(>comp);
>> >>>>>> + percpu_ref_exit(>users);
>> >>>>>
>> >>>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
>> >>>>> get_swap_device(), better to add comments there.
>> >>>>
>> >>>> I just noticed that the comments of percpu_ref_tryget_live() says,
>> >>>>
>> >>>>  * This function is safe to call as long as @ref is between init and 
>> >>>> exit.
>> >>>>
>> >>>> While we need to call get_swap_device() almost at any time, so it's
>> >>>> b

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/13 9:27, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1   CPU 2
>>> -   -
>>> do_swap_page
>>>   synchronous swap_readpage
>>> alloc_page_vma
>>> swapoff
>>>   release swap_file, bdev, or ...
>>>   swap_readpage
>>> check sis->flags is ok
>>>   access swap_file, bdev...[oops!]
>>> si->flags = 0
>>>
>>> Using current get/put_swap_device() to guard against concurrent swapoff for
>>> swap_readpage() looks terrible because swap_readpage() may take really long
>>> time. And this race may not be really pernicious because swapoff is usually
>>> done when system shutdown only. To reduce the performance overhead on the
>>> hot-path as much as possible, it appears we can use the percpu_ref to close
>>> this race window(as suggested by Huang, Ying).
>>>
>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>> 
>> This isn't the commit that introduces the race.  You can use `git blame`
>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>> 
>
> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race 
> between
> swapoff and some swap operations"). And I think this commit does not fix the 
> race
> condition completely, so I reuse the Fixes tag inside it.
>
>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>> picture.
>> 
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  include/linux/swap.h |  2 +-
>>>  mm/memory.c  | 10 ++
>>>  mm/swapfile.c| 28 +++-
>>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 849ba5265c11..9066addb57fd 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>  
>>>  static inline void put_swap_device(struct swap_info_struct *si)
>>>  {
>>> -   rcu_read_unlock();
>>> +   percpu_ref_put(&si->users);
>>>  }
>>>  
>>>  #else /* CONFIG_SWAP */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index cc71a445c76c..8543c47b955c 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  {
>>> struct vm_area_struct *vma = vmf->vma;
>>> struct page *page = NULL, *swapcache;
>>> +   struct swap_info_struct *si = NULL;
>>> swp_entry_t entry;
>>> pte_t pte;
>>> int locked;
>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>  
>>>
>> 
>> I suggest to add comments here as follows (words copy from Matthew Wilcox)
>> 
>>  /* Prevent swapoff from happening to us */
>
> Ok.
>
>> 
>>> +   si = get_swap_device(entry);
>>> +   /* In case we raced with swapoff. */
>>> +   if (unlikely(!si))
>>> +   goto out;
>>> +
>> 
>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>> now.  We can remove several get/put_swap_device() for function called by
>> do_swap_page().  That can be another optimization patch.
>
> I tried to remove several get/put_swap_device() from functions called
> by do_swap_page() before I sent this series. But it seems they have
> other callers without proper get/put_swap_device().

Then we need to revise those callers instead.  Anyway, that can be
another series.

Best Regards,
Huang, Ying


Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count

2021-04-13 Thread Huang, Ying
Yang Shi  writes:

> The generic migration path will check refcount, so no need check refcount 
> here.
> But the old code actually prevents from migrating shared THP (mapped by 
> multiple
> processes), so bail out early if mapcount is > 1 to keep the behavior.

What prevents us from migrating shared THP?  If nothing does, why not
just remove the old refcount check entirely?

Best Regards,
Huang, Ying

> Signed-off-by: Yang Shi 
> ---
>  mm/migrate.c | 16 
>  1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a72994c68ec6..dc7cc7f3a124 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2067,6 +2067,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, 
> struct page *page)
>  
>   VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
>  
> + /* Do not migrate THP mapped by multiple processes */
> + if (PageTransHuge(page) && page_mapcount(page) > 1)
> + return 0;
> +
>   /* Avoid migrating to a node that is nearly full */
>   if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
>   return 0;
> @@ -2074,18 +2078,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, 
> struct page *page)
>   if (isolate_lru_page(page))
>   return 0;
>  
> - /*
> -  * migrate_misplaced_transhuge_page() skips page migration's usual
> -  * check on page_count(), so we must do it here, now that the page
> -  * has been isolated: a GUP pin, or any other pin, prevents migration.
> -  * The expected page count is 3: 1 for page's mapcount and 1 for the
> -  * caller's pin and 1 for the reference taken by isolate_lru_page().
> -  */
> - if (PageTransHuge(page) && page_count(page) != 3) {
> - putback_lru_page(page);
> - return 0;
> - }
> -
>   page_lru = page_is_file_lru(page);
>   mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>   thp_nr_pages(page));


Re: [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling

2021-04-13 Thread Huang, Ying
Yang Shi  writes:

> When the THP NUMA fault support was added THP migration was not supported yet.
> So the ad hoc THP migration was implemented in NUMA fault handling.  Since 
> v4.14
> THP migration has been supported so it doesn't make too much sense to still 
> keep
> another THP migration implementation rather than using the generic migration
> code.
>
> This patch reworked the NUMA fault handling to use generic migration 
> implementation
> to migrate misplaced page.  There is no functional change.
>
> After the refactor the flow of NUMA fault handling looks just like its
> PTE counterpart:
>   Acquire ptl
>   Prepare for migration (elevate page refcount)
>   Release ptl
>   Isolate page from lru and elevate page refcount
>   Migrate the misplaced THP
>
> If migration is failed just restore the old normal PMD.
>
> In the old code anon_vma lock was needed to serialize THP migration
> against THP split, but since then the THP code has been reworked a lot,
> it seems anon_vma lock is not required anymore to avoid the race.
>
> The page refcount elevation when holding ptl should prevent from THP
> split.
>
> Use migrate_misplaced_page() for both base page and THP NUMA hinting
> fault and remove all the dead and duplicate code.
>
> Signed-off-by: Yang Shi 
> ---
>  include/linux/migrate.h |  23 --
>  mm/huge_memory.c| 143 ++--
>  mm/internal.h   |  18 
>  mm/migrate.c| 177 
>  4 files changed, 77 insertions(+), 284 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..163d6f2b03d1 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -95,14 +95,9 @@ static inline void __ClearPageMovable(struct page *page)
>  #endif
>  
>  #ifdef CONFIG_NUMA_BALANCING
> -extern bool pmd_trans_migrating(pmd_t pmd);
>  extern int migrate_misplaced_page(struct page *page,
> struct vm_area_struct *vma, int node);
>  #else
> -static inline bool pmd_trans_migrating(pmd_t pmd)
> -{
> - return false;
> -}
>  static inline int migrate_misplaced_page(struct page *page,
>struct vm_area_struct *vma, int node)
>  {
> @@ -110,24 +105,6 @@ static inline int migrate_misplaced_page(struct page 
> *page,
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> - struct vm_area_struct *vma,
> - pmd_t *pmd, pmd_t entry,
> - unsigned long address,
> - struct page *page, int node);
> -#else
> -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> - struct vm_area_struct *vma,
> - pmd_t *pmd, pmd_t entry,
> - unsigned long address,
> - struct page *page, int node)
> -{
> - return -EAGAIN;
> -}
> -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
> -
> -
>  #ifdef CONFIG_MIGRATION
>  
>  /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 35cac4aeaf68..94981907fd4c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   pmd_t pmd = vmf->orig_pmd;
> - struct anon_vma *anon_vma = NULL;
> + pmd_t oldpmd;

nit: the usage of oldpmd and pmd in the function appears inconsistent.
How about making oldpmd always equal vmf->orig_pmd, and making pmd the
changed one?

Best Regards,
Huang, Ying

>   struct page *page;
>   unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> - int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
> + int page_nid = NUMA_NO_NODE;
>   int target_nid, last_cpupid = -1;
> - bool page_locked;
>   bool migrated = false;
> - bool was_writable;
> + bool was_writable = pmd_savedwrite(pmd);
>   int flags = 0;
>  
>   vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> - if (unlikely(!pmd_same(pmd, *vmf->pmd)))
> - goto out_unlock;
> -
> - /*
> -  * If there are potential migrations, wait for completion and retry
> -  * without disrupting NUMA hinting information. Do not relock and
> -  * check_same as the page may no longer be mapped.
> -  */
> - if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
> - page = pmd

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/14 9:17, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>> "Huang, Ying"  writes:
>>>>
>>>>> Miaohe Lin  writes:
>>>>>
>>>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>>>> patch adds the percpu_ref support for later fixup.
>>>>>>
>>>>>> Signed-off-by: Miaohe Lin 
>>>>>> ---
>>>>>>  include/linux/swap.h |  2 ++
>>>>>>  mm/swapfile.c| 25 ++---
>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index 144727041e78..849ba5265c11 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>   */
>>>>>>  struct swap_info_struct {
>>>>>> +struct percpu_ref users;/* serialization against 
>>>>>> concurrent swapoff */
>>>>>>  unsigned long   flags;  /* SWP_USED etc: see above */
>>>>>>  signed shortprio;   /* swap priority of this type */
>>>>>>  struct plist_node list; /* entry in swap_active_head */
>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>>  struct block_device *bdev;  /* swap device or bdev of swap 
>>>>>> file */
>>>>>>  struct file *swap_file; /* seldom referenced */
>>>>>>  unsigned int old_block_size;/* seldom referenced */
>>>>>> +struct completion comp; /* seldom referenced */
>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, one bit 
>>>>>> per page */
>>>>>>  atomic_t frontswap_pages;   /* frontswap pages in-use 
>>>>>> counter */
>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>>> index 149e77454e3c..724173cd7d0c 100644
>>>>>> --- a/mm/swapfile.c
>>>>>> +++ b/mm/swapfile.c
>>>>>> @@ -39,6 +39,7 @@
>>>>>>  #include 
>>>>>>  #include 
>>>>>>  #include 
>>>>>> +#include 
>>>>>>  
>>>>>>  #include 
>>>>>>  #include 
>>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct 
>>>>>> *work)
>>>>>>  spin_unlock(>lock);
>>>>>>  }
>>>>>>  
>>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>>>> +{
>>>>>> +struct swap_info_struct *si;
>>>>>> +
>>>>>> +si = container_of(ref, struct swap_info_struct, users);
>>>>>> +complete(>comp);
>>>>>> +percpu_ref_exit(>users);
>>>>>
>>>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
>>>>> get_swap_device(), better to add comments there.
>>>>
>>>> I just noticed that the comments of percpu_ref_tryget_live() says,
>>>>
>>>>  * This function is safe to call as long as @ref is between init and exit.
>>>>
>>>> While we need to call get_swap_device() almost at any time, so it's
>>>> better to avoid to call percpu_ref_exit() at all.  This will waste some
>>>> memory, but we need to follow the API definition to avoid potential
>>>> issues in the long term.
>>>
>>> I have to admit that I'am not really familiar with percpu_ref. So I read the
>>> implementation code of the percpu_ref and found percpu_ref_tryget_live() 
>>> could
>>> be called after exit now. But you're right we need to follow the API 
>>> definition
>>> to avoid potential issues in the long term.
>>>
>>>>
>>>> And we need to call percpu_ref_init() before insert the swap_info_struct
>>>> into the swap_info[].
>>>
>>> If we remove the call to percpu_ref_exit(), we should not use 
>>&

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/12 15:24, Huang, Ying wrote:
>> "Huang, Ying"  writes:
>> 
>>> Miaohe Lin  writes:
>>>
>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>> patch adds the percpu_ref support for later fixup.
>>>>
>>>> Signed-off-by: Miaohe Lin 
>>>> ---
>>>>  include/linux/swap.h |  2 ++
>>>>  mm/swapfile.c| 25 ++---
>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 144727041e78..849ba5265c11 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>   * The in-memory structure used to track swap areas.
>>>>   */
>>>>  struct swap_info_struct {
>>>> +  struct percpu_ref users;/* serialization against concurrent 
>>>> swapoff */
>>>>unsigned long   flags;  /* SWP_USED etc: see above */
>>>>signed shortprio;   /* swap priority of this type */
>>>>struct plist_node list; /* entry in swap_active_head */
>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>struct block_device *bdev;  /* swap device or bdev of swap file */
>>>>struct file *swap_file; /* seldom referenced */
>>>>unsigned int old_block_size;/* seldom referenced */
>>>> +  struct completion comp; /* seldom referenced */
>>>>  #ifdef CONFIG_FRONTSWAP
>>>>unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>>>atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index 149e77454e3c..724173cd7d0c 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -39,6 +39,7 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  
>>>>  #include 
>>>>  #include 
>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct 
>>>> *work)
>>>>spin_unlock(>lock);
>>>>  }
>>>>  
>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>> +{
>>>> +  struct swap_info_struct *si;
>>>> +
>>>> +  si = container_of(ref, struct swap_info_struct, users);
>>>> +  complete(>comp);
>>>> +  percpu_ref_exit(>users);
>>>
>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
>>> get_swap_device(), better to add comments there.
>> 
>> I just noticed that the comments of percpu_ref_tryget_live() says,
>> 
>>  * This function is safe to call as long as @ref is between init and exit.
>> 
>> While we need to call get_swap_device() almost at any time, so it's
>> better to avoid to call percpu_ref_exit() at all.  This will waste some
>> memory, but we need to follow the API definition to avoid potential
>> issues in the long term.
>
> I have to admit that I'm not really familiar with percpu_ref. So I read the
> implementation code of the percpu_ref and found percpu_ref_tryget_live() could
> be called after exit now. But you're right we need to follow the API 
> definition
> to avoid potential issues in the long term.
>
>> 
>> And we need to call percpu_ref_init() before insert the swap_info_struct
>> into the swap_info[].
>
> If we remove the call to percpu_ref_exit(), we should not use 
> percpu_ref_init()
> here because *percpu_ref->data is assumed to be NULL* in percpu_ref_init() 
> while
> this is not the case as we do not call percpu_ref_exit(). Maybe 
> percpu_ref_reinit()
> or percpu_ref_resurrect() will do the work.
>
> One more thing, how could I distinguish the killed percpu_ref from newly 
> allocated one?
> It seems percpu_ref_is_dying is only safe to call when @ref is between init 
> and exit.
> Maybe I could do this in alloc_swap_info()?

Yes.  In alloc_swap_info(), you can distinguish a newly allocated
swap_info_struct from a reused one.
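
For illustration only, a minimal sketch of that idea, assuming the
percpu_ref is initialized as dead at allocation time and resurrected
later in enable_swap_info().  This is not the posted patch, just one
possible shape of alloc_swap_info():

  /*
   * Sketch only: percpu_ref_init() is called once, when the
   * swap_info_struct is first allocated, with PERCPU_REF_INIT_DEAD so
   * the ref starts killed.  A reused slot already carries an
   * initialized (killed) ref, so only the extra allocation is dropped.
   */
  static struct swap_info_struct *alloc_swap_info(void)
  {
          struct swap_info_struct *p, *defer = NULL;
          unsigned int type;

          p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
          if (!p)
                  return ERR_PTR(-ENOMEM);

          if (percpu_ref_init(&p->users, swap_users_ref_free,
                              PERCPU_REF_INIT_DEAD, GFP_KERNEL)) {
                  kvfree(p);
                  return ERR_PTR(-ENOMEM);
          }

          spin_lock(&swap_lock);
          for (type = 0; type < nr_swapfiles; type++)
                  if (!(swap_info[type]->flags & SWP_USED))
                          break;
          if (type < nr_swapfiles) {
                  /* Reused slot: it already has an initialized percpu_ref. */
                  defer = p;
                  p = swap_info[type];
          }
          /* ... the rest of alloc_swap_info() stays unchanged ... */
          spin_unlock(&swap_lock);
          if (defer) {
                  percpu_ref_exit(&defer->users);
                  kvfree(defer);
          }
          return p;
  }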

>> 
>>>> +}
>>>> +
>>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>  {
>>>>struct swap_cluster_info *ci = si->cluster_info;
>>>> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct 
&

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Huang, Ying
Tim Chen  writes:

> On 4/12/21 6:27 PM, Huang, Ying wrote:
>
>> 
>> This isn't the commit that introduces the race.  You can use `git blame`
>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>> 
>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>> picture.
>
> I'll suggest make fix to do_swap_page race with get/put_swap_device
> as a first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
> be combined together.

The original get/put_swap_device() uses rcu_read_lock()/rcu_read_unlock().
I don't think it's good to wrap swap_readpage() with it.  After all, some
complex operations are done in swap_readpage(), including
blk_io_schedule().

Best Regards,
Huang, Ying


Re: [RFC] mm: activate access-more-than-once page via NUMA balancing

2021-04-12 Thread Huang, Ying
Yu Zhao  writes:

> On Fri, Mar 26, 2021 at 12:21 AM Huang, Ying  wrote:
>>
>> Mel Gorman  writes:
>>
>> > On Thu, Mar 25, 2021 at 12:33:45PM +0800, Huang, Ying wrote:
>> >> > I caution against this patch.
>> >> >
>> >> > It's non-deterministic for a number of reasons. As it requires NUMA
>> >> > balancing to be enabled, the pageout behaviour of a system changes when
>> >> > NUMA balancing is active. If this led to pages being artificially and
>> >> > inappropriately preserved, NUMA balancing could be disabled for the
>> >> > wrong reasons.  It only applies to pages that have no target node so
>> >> > memory policies affect which pages are activated differently. Similarly,
>> >> > NUMA balancing does not scan all VMAs and some pages may never trap a
>> >> > NUMA fault as a result. The timing of when an address space gets scanned
>> >> > is driven by the locality of pages and so the timing of page activation
>> >> > potentially becomes linked to whether pages are local or need to migrate
>> >> > (although not right now for this patch as it only affects pages with a
>> >> > target nid of NUMA_NO_NODE). In other words, changes in NUMA balancing
>> >> > that affect migration potentially affect the aging rate.  Similarly,
>> >> > the activate rate of a process with a single thread and multiple threads
>> >> > potentially have different activation rates.
>> >> >
>> >> > Finally, the NUMA balancing scan algorithm is sub-optimal. It 
>> >> > potentially
>> >> > scans the entire address space even though only a small number of pages
>> >> > are scanned. This is particularly problematic when a process has a lot
>> >> > of threads because threads are redundantly scanning the same regions. If
>> >> > NUMA balancing ever introduced range tracking of faulted pages to limit
>> >> > how much scanning it has to do, it would inadvertently cause a change in
>> >> > page activation rate.
>> >> >
>> >> > NUMA balancing is about page locality, it should not get conflated with
>> >> > page aging.
>> >>
>> >> I understand your concerns about binding the NUMA balancing and page
>> >> reclaiming.  The requirement of the page locality and page aging is
>> >> different, so the policies need to be different.  This is the wrong part
>> >> of the patch.
>> >>
>> >> From another point of view, it's still possible to share some underlying
>> >> mechanisms (and code) between them.  That is, scanning the page tables
>> >> to make pages unaccessible and capture the page accesses via the page
>> >> fault.
>> >
>> > Potentially yes but not necessarily recommended for page aging. NUMA
>> > balancing has to be careful about the rate it scans pages to avoid
>> > excessive overhead so it's driven by locality. The scanning happens
>> > within a tasks context so during that time, the task is not executing
>> > its normal work and it incurs the overhead for faults. Generally, this
>> > is not too much overhead because pages get migrated locally, the scan
>> > rate drops and so does the overhead.
>> >
>> > However, if you want to drive page aging, that is constant so the rate
>> > could not be easily adapted in a way that would be deterministic.
>> >
>> >> Now these page accessing information is used for the page
>> >> locality.  Do you think it's a good idea to use these information for
>> >> the page aging too (but with a different policy as you pointed out)?
>> >>
>> >
>> > I'm not completely opposed to it but I think the overhead it would
>> > introduce could be severe. Worse, if a workload fits in memory and there
>> > is limited to no memory pressure, it's all overhead for no gain. Early
>> > generations of NUMA balancing had to find a balance to sure the gains
>> > from locality exceeded the cost of measuring locality and doing the same
>> > for page aging in some ways is even more challenging.
>>
>> Yes.  I will think more about it from the overhead vs. gain point of
>> view.  Thanks a lot for your sharing on that.
>>
>> >> From yet another point of view :-), in current NUMA balancing
>> >> implementation, it's assumed that the node private pages can fit in the
>> >> accessing node.  But this m

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-04-12 Thread Huang, Ying
Yu Zhao  writes:

> On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying  wrote:
>>
>> Yu Zhao  writes:
>>
>> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
>> >> Yu Zhao  writes:
>> >>
>> >> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> >> >> Yu Zhao  writes:
>> >> >>
>> >> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> >> >> > The scanning overhead is only one of the two major problems of the
>> >> >> > current page reclaim. The other problem is the granularity of the
>> >> >> > active/inactive (sizes). We stopped using them in making job
>> >> >> > scheduling decision a long time ago. I know another large internet
>> >> >> > company adopted a similar approach as ours, and I'm wondering how
>> >> >> > everybody else is coping with the discrepancy from those counters.
>> >> >>
>> >> >> From intuition, the scanning overhead of full page table scanning
>> >> >> appears higher than that of rmap scanning for a small portion of
>> >> >> system memory.  But from your words, you think the reality is the
>> >> >> reverse?  If others are concerned about the overhead too, I think you
>> >> >> will need to prove that the overhead of page table scanning isn't
>> >> >> higher, or is even lower, with more data and theory.
>> >> >
>> >> > There is a misunderstanding here. I never said anything about full
>> >> > page table scanning. And this is not how it's done in this series
>> >> > either. I guess the misunderstanding has something to do with the cold
>> >> > memory tracking you are thinking about?
>> >>
>> >> If my understanding were correct, from the following code path in your
>> >> patch 10/14,
>> >>
>> >> age_active_anon
>> >>   age_lru_gens
>> >> try_walk_mm_list
>> >>   walk_mm_list
>> >> walk_mm
>> >>
>> >> So, in kswapd(), the page tables of many processes may be scanned
>> >> fully.  If the number of processes that are active are high, the
>> >> overhead may be high too.
>> >
>> > That's correct. Just in case we have different definitions of what we
>> > call "full":
>> >
>> >   I understand it as the full range of the address space of a process
>> >   that was loaded by switch_mm() at least once since the last scan.
>> >   This is not the case because we don't scan the full range -- we skip
>> >   holes and VMAs that are unevictable, as well as PTE tables that have
>> >   no accessed entries on x86_64, by should_skip_vma() and
>> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
>> >
>> >   If you are referring to the full range of PTE tables that have at
>> >   least one accessed entry, i.e., other 511 are not none  but have not
>> >   been accessed either since the last scan on x86_64, then yes, you
>> >   are right again :) This is the worse case scenario.
>>
>> OK.  So there's no fundamental difference between us on this.
>>
>> >> > This series uses page tables to discover page accesses when a system
>> >> > has run out of inactive pages. Under such a situation, the system is
>> >> > very likely to have a lot of page accesses, and using the rmap is
>> >> > likely to cost a lot more because its poor memory locality compared
>> >> > with page tables.
>> >>
>> >> This is the theory.  Can you verify this with more data?  Including the
>> >> CPU cycles or time spent scanning page tables?
>> >
>> > Yes, I'll be happy to do so as I should, because page table scanning
>> > is counterintuitive. Let me add more theory in case it's still unclear
>> > to others.
>> >
>> > From my understanding, the two fundamental questions we need to
>> > consider in terms of page reclaim are:
>> >
>> >   What are the sizes of hot clusters (spatial locality) should we
>> >   expect under memory pressure?
>> >
>> >   On smaller systems with 4GB memory, our observations are that the
>> >   average size of hot clusters found during each scan is 32KB. On
>> >   larger systems with hundreds of gig

Re: [PATCH v1 00/14] Multigenerational LRU

2021-04-12 Thread Huang, Ying
om_kill /proc/vmstat
>   oom_kill 81
>
>   The test process context (the multigenerational LRU):
> 33.12%  sparse   page_vma_mapped_walk
> 10.70%  sparse   walk_pud_range
>  9.64%  sparse   page_counter_try_charge
>  6.63%  sparse   propagate_protected_usage
>  4.43%  sparse   native_queued_spin_lock_slowpath
>  3.85%  sparse   page_counter_uncharge
>  3.71%  sparse   irqentry_exit_to_user_mode
>  2.16%  sparse   _raw_spin_lock
>  1.83%  sparse   unmap_page_range
>  1.82%  sparse   shrink_slab
>
>   CPU % (direct reclaim vs the rest): 47% vs 53%
>   # grep oom_kill /proc/vmstat
>   oom_kill 80
>
> I also compared other numbers from /proc/vmstat. They do not provide
> any additional insight than the profiles, so I will just omit them
> here.
>
> The following optimizations and the stats measuring their efficacies
> explain why the multigenerational LRU did not perform worse:
>
>   Optimization 1: take advantage of the scheduling information.
> # of active processes   270
> # of inactive processes 105
>
>   Optimization 2: take the advantage of the accessed bit on non-leaf
>   PMD entries.
> # of old non-leaf PMD entries   30523335
> # of young non-leaf PMD entries 1358400
>
> These stats are not currently included. But I will add them to the
> debugfs interface in the next version coming soon. And I will also add
> another optimization for Android. It reduces zigzags when there are
> many single-page VMAs, i.e., not returning to the PGD table for each
> of such VMAs. Just a heads-up.
>
> The rmap, on the other hand, had to
>   1) lock each (shmem) page it scans
>   2) go through five levels of page tables for each page, even though
>   some of them have the same LCAs
> during the test. The second part is worse given that I have 5 levels
> of page tables configured.
>
> Any additional benchmarks you would suggest? Thanks.

Hi, Yu,

Thanks for your data.

In addition to the data you measured above, is it possible for you to
measure some raw data?  For example, how many CPU cycles does it take to
scan all pages in the system?  For page table scanning, the page tables
of all processes will be scanned.  For rmap scanning, all pages on the
LRU will be scanned.  And we can do that with different parameters, for
example, shared vs. non-shared and sparse vs. dense.  Then we can get an
idea of how fast the page table scanning can be.

Best Regards,
Huang, Ying


[RFC PATCH] percpu_ref: Make percpu_ref_tryget*() ACQUIRE operations

2021-04-12 Thread Huang Ying
One typical use case of percpu_ref_tryget() family functions is as
follows,

  if (percpu_ref_tryget(&p->ref)) {
          /* Operate on the other fields of *p */
  }

The refcount needs to be checked before operating on the other fields
of the data structure (*p), otherwise, the values gotten from the
other fields may be invalid or inconsistent.  To guarantee the correct
memory ordering, percpu_ref_tryget*() needs to be the ACQUIRE
operations.

This function implements that via using smp_load_acquire() in
__ref_is_percpu() to read the percpu pointer.

Signed-off-by: "Huang, Ying" 
Cc: Tejun Heo 
Cc: Kent Overstreet 
Cc: "Paul E. McKenney" 
Cc: Roman Gushchin 
Cc: Ming Lei 
Cc: Al Viro 
Cc: Miaohe Lin 
---
 include/linux/percpu-refcount.h | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 16c35a728b4c..9838f7ea4bf1 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -165,13 +165,13 @@ static inline bool __ref_is_percpu(struct percpu_ref *ref,
 * !__PERCPU_REF_ATOMIC, which may be set asynchronously, and then
 * used as a pointer.  If the compiler generates a separate fetch
 * when using it as a pointer, __PERCPU_REF_ATOMIC may be set in
-* between contaminating the pointer value, meaning that
-* READ_ONCE() is required when fetching it.
+* between contaminating the pointer value, smp_load_acquire()
+* will prevent this.
 *
-* The dependency ordering from the READ_ONCE() pairs
+* The dependency ordering from the smp_load_acquire() pairs
 * with smp_store_release() in __percpu_ref_switch_to_percpu().
 */
-   percpu_ptr = READ_ONCE(ref->percpu_count_ptr);
+   percpu_ptr = smp_load_acquire(&ref->percpu_count_ptr);
 
/*
 * Theoretically, the following could test just ATOMIC; however,
@@ -231,6 +231,9 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
  * Returns %true on success; %false on failure.
  *
  * This function is safe to call as long as @ref is between init and exit.
+ *
+ * This function is an ACQUIRE operation, that is, all memory operations
+ * after will appear to happen after checking the refcount.
  */
 static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
  unsigned long nr)
@@ -260,6 +263,9 @@ static inline bool percpu_ref_tryget_many(struct percpu_ref 
*ref,
  * Returns %true on success; %false on failure.
  *
  * This function is safe to call as long as @ref is between init and exit.
+ *
+ * This function is an ACQUIRE operation, that is, all memory operations
+ * after will appear to happen after checking the refcount.
  */
 static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 {
@@ -280,6 +286,9 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
  * percpu_ref_tryget_live().
  *
  * This function is safe to call as long as @ref is between init and exit.
+ *
+ * This function is an ACQUIRE operation, that is, all memory operations
+ * after will appear to happen after checking the refcount.
  */
 static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 {
-- 
2.30.2



Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-12 Thread Huang, Ying
Tim Chen  writes:

> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
>>> The top tier memory used is reported in
>>>
>>> memory.toptier_usage_in_bytes
>>>
>>> The amount of top tier memory usable by each cgroup without
>>> triggering page reclaim is controlled by the
>>>
>>> memory.toptier_soft_limit_in_bytes 
>> 
>
> Michal,
>
> Thanks for your comments.  I would like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the 
> tiered memory between the cgroups. 
>
> A typical use case may be a system with two sets of tasks.
> One set of tasks is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> where latency and performance are not critical.   In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.  
>
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... 
> t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier. 
> We envision for this top tier memory t0 the following knobs and counters 
> in the cgroup memory controller
>
> memory_t0.current Current usage of tier 0 memory by the cgroup.
>
> memory_t0.min If tier 0 memory used by the cgroup falls below this low
>   boundary, the memory will not be subjected to demotion
>   to lower tiers to free up memory at tier 0.  
>
> memory_t0.low Above this boundary, the tier 0 memory will be subjected
>   to demotion.  The demotion pressure will be proportional
>   to the overage.
>
> memory_t0.high    If tier 0 memory used by the cgroup exceeds this high
>   boundary, allocation of tier 0 memory by the cgroup will
>   be throttled. The tier 0 memory used by this cgroup
>   will also be subjected to heavy demotion.

I don't think we really need throttling here, because we can fall back
to allocating memory from t1.  That will not cause something like I/O
device bandwidth saturation.

Best Regards,
Huang, Ying

> memory_t0.max This will be a hard usage limit of tier 0 memory on the 
> cgroup.
>
> If needed, memory_t[12...].current/min/low/high for additional tiers can be 
> added.
> This follows closely with the design of the general memory controller 
> interface.  
>
> Will such an interface looks sane and acceptable with everyone?
>
> The patch set I posted is meant to be a straw man cgroup v1 implementation
> and I readily admit that it falls short of the eventual functionality 
> we want to achieve.  It is meant to solicit feedback from everyone on how the 
> tiered
> memory management should work.
>
>> Are you trying to say that soft limit acts as some sort of guarantee?
>
> No, the soft limit does not offer a guarantee.  It only serves to keep the
> usage of the top tier memory in the vicinity of the soft limits.
>
>> Does that mean that if the memcg is under memory pressure top tiear
>> memory is opted out from any reclaim if the usage is not in excess?
>
> In the prototype implementation, regular memory reclaim is still in effect
> if we are under heavy memory pressure. 
>
>> 
>> From you previous email it sounds more like the limit is evaluated on
>> the global memory pressure to balance specific memcgs which are in
>> excess when trying to reclaim/demote a toptier numa node.
>
> On a top tier node, if the free memory on the node falls below a percentage, 
> then
> we will start to reclaim/demote from the node.
>
>> 
>> Soft limit reclaim has several problems. Those are historical and
>> therefore the behavior cannot be changed. E.g. go after the biggest
>> excessed memcg (with priority 0 - aka potential full LRU scan) and then
>> continue with a normal reclaim. This can be really disruptive to the top
>> user.
>
> Thanks for pointing out these problems with soft limit explicitly.
>
>> 
>> So you can likely define a more sane semantic. E.g. push back memcgs
>> proporitional to their excess but then we have two different soft limits
>> behavior which is bad as well. I am not really sure there is a sensible
>> way out by (ab)using soft limit here.
>> 
>> Also I am not really sure how this is going to be used in practice.
>> There is no

Re: [PATCH 5/5] mm/swap_state: fix swap_cluster_readahead() race with swapoff

2021-04-12 Thread Huang, Ying
Miaohe Lin  writes:

> swap_cluster_readahead() could race with swapoff and might dereference
> si->swap_file after it's released by swapoff. Close this race window by
> using get/put_swap_device() pair.

I think we should fix the callers instead to reduce the overhead.  Now,
do_swap_page() has been fixed.  We need to fix shmem_swapin().
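
For illustration, a minimal sketch of the caller-side fix, assuming the
caller pins the swap device around the readahead call so the check
inside swap_cluster_readahead() becomes unnecessary; the wrapper name
below is made up and this is not the real shmem code:

  /* Hypothetical caller-side guard; not the real shmem_swapin(). */
  static struct page *swapin_readahead_guarded(swp_entry_t entry, gfp_t gfp_mask,
                                               struct vm_fault *vmf)
  {
          struct swap_info_struct *si;
          struct page *page;

          si = get_swap_device(entry);    /* prevent concurrent swapoff */
          if (unlikely(!si))
                  return NULL;
          page = swap_cluster_readahead(entry, gfp_mask, vmf);
          put_swap_device(si);
          return page;
  }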

Best Regards,
Huang, Ying

> Signed-off-by: Miaohe Lin 
> ---
>  mm/swap_state.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3bf0d0c297bc..eba6b0cf6cf9 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -626,12 +626,17 @@ struct page *swap_cluster_readahead(swp_entry_t entry, 
> gfp_t gfp_mask,
>   unsigned long offset = entry_offset;
>   unsigned long start_offset, end_offset;
>   unsigned long mask;
> - struct swap_info_struct *si = swp_swap_info(entry);
> + struct swap_info_struct *si;
>   struct blk_plug plug;
>   bool do_poll = true, page_allocated;
>   struct vm_area_struct *vma = vmf->vma;
>   unsigned long addr = vmf->address;
>  
> + si = get_swap_device(entry);
> + /* In case we raced with swapoff. */
> + if (!si)
> + return NULL;
> +
>   mask = swapin_nr_pages(offset) - 1;
>   if (!mask)
>   goto skip;
> @@ -673,7 +678,9 @@ struct page *swap_cluster_readahead(swp_entry_t entry, 
> gfp_t gfp_mask,
>  
>   lru_add_drain();/* Push any new pages onto the LRU now */
>  skip:
> - return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
> + page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
> + put_swap_device(si);
> + return page;
>  }
>  
>  int init_swap_address_space(unsigned int type, unsigned long nr_pages)


Re: [PATCH 3/5] mm/swap_state: fix get_shadow_from_swap_cache() race with swapoff

2021-04-12 Thread Huang, Ying
Miaohe Lin  writes:

> The function get_shadow_from_swap_cache() can race with swapoff, though
> it's only called by do_swap_page() now.
>
> Fixes: aae466b0052e ("mm/swap: implement workingset detection for anonymous 
> LRU")
> Signed-off-by: Miaohe Lin 

This is unnecessary.  The only caller has already guaranteed that the
swap device cannot go away due to swapoff.

Best Regards,
Huang, Ying

> ---
>  mm/swap_state.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 272ea2108c9d..709c260d644a 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -83,11 +83,14 @@ void show_swap_cache_info(void)
>  
>  void *get_shadow_from_swap_cache(swp_entry_t entry)
>  {
> - struct address_space *address_space = swap_address_space(entry);
> - pgoff_t idx = swp_offset(entry);
> + struct swap_info_struct *si;
>   struct page *page;
>  
> - page = xa_load(&address_space->i_pages, idx);
> + si = get_swap_device(entry);
> + if (!si)
> + return NULL;
> + page = xa_load(&swap_address_space(entry)->i_pages, swp_offset(entry));
> + put_swap_device(si);
>   if (xa_is_value(page))
>   return page;
>   return NULL;


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-12 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> - -
> do_swap_page
>   synchronous swap_readpage
> alloc_page_vma
>   swapoff
> release swap_file, bdev, or ...
>   swap_readpage
>   check sis->flags is ok
> access swap_file, bdev...[oops!]
>   si->flags = 0
>
> Using current get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take really long
> time. And this race may not be really pernicious because swapoff is usually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window(as suggested by Huang, Ying).
>
> Fixes: 235b62176712 ("mm/swap: add cluster lock")

This isn't the commit that introduces the race.  You can use `git blame`
find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
swap: skip swapcache for swapin of synchronous device".

And I suggest to merge 1/5 and 2/5 to make it easy to get the full
picture.

> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h |  2 +-
>  mm/memory.c  | 10 ++
>  mm/swapfile.c| 28 +++-
>  3 files changed, 22 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 849ba5265c11..9066addb57fd 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>  
>  static inline void put_swap_device(struct swap_info_struct *si)
>  {
> - rcu_read_unlock();
> + percpu_ref_put(&si->users);
>  }
>  
>  #else /* CONFIG_SWAP */
> diff --git a/mm/memory.c b/mm/memory.c
> index cc71a445c76c..8543c47b955c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
>   swp_entry_t entry;
>   pte_t pte;
>   int locked;
> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   }
>  
>

I suggest adding a comment here as follows (words copied from Matthew Wilcox):

/* Prevent swapoff from happening to us */

> + si = get_swap_device(entry);
> + /* In case we raced with swapoff. */
> + if (unlikely(!si))
> + goto out;
> +

Because we wrap the whole of do_swap_page() with get/put_swap_device()
now, we can remove several get/put_swap_device() calls from functions
called by do_swap_page().  That can be another optimization patch.

>   delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>   page = lookup_swap_cache(entry, vma, vmf->address);
>   swapcache = page;
> @@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
> + if (si)
> + put_swap_device(si);
>   return ret;
>  out_nomap:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   unlock_page(swapcache);
>   put_page(swapcache);
>   }
> + if (si)
> + put_swap_device(si);
>   return ret;
>  }
>  
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 724173cd7d0c..01032c72ceae 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct 
> swap_info_struct *p,
>   * via preventing the swap device from being swapoff, until
>   * put_swap_device() is called.  Otherwise return NULL.
>   *
> - * The entirety of the RCU read critical section must come before the
> - * return from or after the call to synchronize_rcu() in
> - * enable_swap_info() or swapoff().  So if "si->flags & SWP_VALID" is
> - * true, the si->map, si->cluster_info, etc. must be valid in the
> - * critical section.
> - *
>   * Notice that swapoff or swapoff+swapon can still happen before the
> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
> - * in put_swap_device() if there isn't any other way to prevent
> - * swapoff, such as page lock, page table lock, etc.  The caller must
> - * be prepared for that.  For example, the following situation is
> - * possible.
> + * percpu_ref_tryget_live() in get_s

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-12 Thread Huang, Ying
"Huang, Ying"  writes:

> Miaohe Lin  writes:
>
>> We will use percpu-refcount to serialize against concurrent swapoff. This
>> patch adds the percpu_ref support for later fixup.
>>
>> Signed-off-by: Miaohe Lin 
>> ---
>>  include/linux/swap.h |  2 ++
>>  mm/swapfile.c| 25 ++---
>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 144727041e78..849ba5265c11 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>   * The in-memory structure used to track swap areas.
>>   */
>>  struct swap_info_struct {
>> +struct percpu_ref users;/* serialization against concurrent 
>> swapoff */
>>  unsigned long   flags;  /* SWP_USED etc: see above */
>>  signed shortprio;   /* swap priority of this type */
>>  struct plist_node list; /* entry in swap_active_head */
>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>  struct block_device *bdev;  /* swap device or bdev of swap file */
>>  struct file *swap_file; /* seldom referenced */
>>  unsigned int old_block_size;/* seldom referenced */
>> +struct completion comp; /* seldom referenced */
>>  #ifdef CONFIG_FRONTSWAP
>>  unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>  atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 149e77454e3c..724173cd7d0c 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -39,6 +39,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>>  spin_unlock(&si->lock);
>>  }
>>  
>> +static void swap_users_ref_free(struct percpu_ref *ref)
>> +{
>> +struct swap_info_struct *si;
>> +
>> +si = container_of(ref, struct swap_info_struct, users);
>> +complete(&si->comp);
>> +percpu_ref_exit(&si->users);
>
> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
> get_swap_device(), better to add comments there.

I just noticed that the comment of percpu_ref_tryget_live() says,

 * This function is safe to call as long as @ref is between init and exit.

However, we need to call get_swap_device() at almost any time, so it's
better to avoid calling percpu_ref_exit() at all.  This will waste some
memory, but we need to follow the API definition to avoid potential
issues in the long term.

And we need to call percpu_ref_init() before inserting the
swap_info_struct into swap_info[].

>> +}
>> +
>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>  {
>>  struct swap_cluster_info *ci = si->cluster_info;
>> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct 
>> *p, int prio,
>>   * Guarantee swap_map, cluster_info, etc. fields are valid
>>   * between get/put_swap_device() if SWP_VALID bit is set
>>   */
>> -synchronize_rcu();
>> +percpu_ref_reinit(&p->users);
>
> Although the effect is same, I think it's better to use
> percpu_ref_resurrect() here to improve code readability.

Please check the original commit description for commit eb085574a752 "mm,
swap: fix race between swapoff and some swap operations" and the
discussion email thread below again,

https://lore.kernel.org/linux-mm/20171219053650.gb7...@linux.vnet.ibm.com/

I found that the synchronize_rcu() here is to avoid calling smp_rmb() or
smp_load_acquire() in get_swap_device().  Now we will use
percpu_ref_tryget_live() in get_swap_device(), so we will need to add
the necessary memory barrier, or make sure percpu_ref_tryget_live() has
ACQUIRE semantics.  Per my understanding, we need to change
percpu_ref_tryget_live() for that.
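
For illustration, a minimal sketch of what get_swap_device() could look
like on top of percpu_ref_tryget_live(); the explicit smp_rmb() below
stands in for the ACQUIRE semantics discussed here, and the error paths
are simplified:

  /* Sketch only, not the final implementation. */
  struct swap_info_struct *get_swap_device(swp_entry_t entry)
  {
          struct swap_info_struct *si;
          unsigned long offset;

          if (!entry.val)
                  return NULL;
          si = swp_swap_info(entry);
          if (!si)
                  return NULL;
          if (!percpu_ref_tryget_live(&si->users))
                  return NULL;
          /*
           * Order the check of si->users before the later reads of
           * si->swap_map, si->cluster_info, etc.  This barrier could be
           * dropped if percpu_ref_tryget_live() were an ACQUIRE operation.
           */
          smp_rmb();
          offset = swp_offset(entry);
          if (offset >= si->max) {
                  percpu_ref_put(&si->users);
                  return NULL;
          }
          return si;
  }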

>>  spin_lock(&swap_lock);
>>  spin_lock(&p->lock);
>>  _enable_swap_info(p);
>> @@ -2621,11 +2631,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>> specialfile)
>>  p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>>  spin_unlock(&p->lock);
>>  spin_unlock(&swap_lock);
>> +
>> +percpu_ref_kill(&p->users);
>>  /*
>>   * wait for swap operations protected by get/put_swap_device()
>>   * to complete
>>   */
>> -synchronize_rcu();
>> +wait_for_completion(&p->comp);
>
> Better to move percpu_ref_kill() after the comments.  And maybe revise
> the comments.

After reading the original commit description as above, I found that we
need synchronize_rcu() here to protect access to the swap cache data
structure.  Because there's a call_rcu() during percpu_ref_kill(), it
appears OK to keep the synchronize_rcu() here.  And we need to revise
the comments to make it clear what is protected by which operation.

Best Regards,
Huang, Ying

[snip]


Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-11 Thread Huang, Ying
Miaohe Lin  writes:

> We will use percpu-refcount to serialize against concurrent swapoff. This
> patch adds the percpu_ref support for later fixup.
>
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h |  2 ++
>  mm/swapfile.c| 25 ++---
>  2 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 144727041e78..849ba5265c11 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>   * The in-memory structure used to track swap areas.
>   */
>  struct swap_info_struct {
> + struct percpu_ref users;/* serialization against concurrent 
> swapoff */
>   unsigned long   flags;  /* SWP_USED etc: see above */
>   signed shortprio;   /* swap priority of this type */
>   struct plist_node list; /* entry in swap_active_head */
> @@ -260,6 +261,7 @@ struct swap_info_struct {
>   struct block_device *bdev;  /* swap device or bdev of swap file */
>   struct file *swap_file; /* seldom referenced */
>   unsigned int old_block_size;/* seldom referenced */
> + struct completion comp; /* seldom referenced */
>  #ifdef CONFIG_FRONTSWAP
>   unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>   atomic_t frontswap_pages;   /* frontswap pages in-use counter */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 149e77454e3c..724173cd7d0c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -39,6 +39,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>   spin_unlock(&si->lock);
>  }
>  
> +static void swap_users_ref_free(struct percpu_ref *ref)
> +{
> + struct swap_info_struct *si;
> +
> + si = container_of(ref, struct swap_info_struct, users);
> + complete(&si->comp);
> + percpu_ref_exit(&si->users);

Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
get_swap_device(), better to add comments there.

> +}
> +
>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>  {
>   struct swap_cluster_info *ci = si->cluster_info;
> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct 
> *p, int prio,
>* Guarantee swap_map, cluster_info, etc. fields are valid
>* between get/put_swap_device() if SWP_VALID bit is set
>*/
> - synchronize_rcu();
> + percpu_ref_reinit(&p->users);

Although the effect is same, I think it's better to use
percpu_ref_resurrect() here to improve code readability.

>   spin_lock(&swap_lock);
>   spin_lock(&p->lock);
>   _enable_swap_info(p);
> @@ -2621,11 +2631,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
> specialfile)
>   p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>   spin_unlock(&p->lock);
>   spin_unlock(&swap_lock);
> +
> + percpu_ref_kill(&p->users);
>   /*
>* wait for swap operations protected by get/put_swap_device()
>* to complete
>*/
> - synchronize_rcu();
> + wait_for_completion(&p->comp);

Better to move percpu_ref_kill() after the comments.  And maybe revise
the comments.

>  
>   flush_work(&p->discard_work);
>  
> @@ -3132,7 +3144,7 @@ static bool swap_discardable(struct swap_info_struct 
> *si)
>  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  {
>   struct swap_info_struct *p;
> - struct filename *name;
> + struct filename *name = NULL;
>   struct file *swap_file = NULL;
>   struct address_space *mapping;
>   int prio;
> @@ -3163,6 +3175,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
> specialfile, int, swap_flags)
>  
>   INIT_WORK(&p->discard_work, swap_discard_work);
>  
> + init_completion(&p->comp);
> + error = percpu_ref_init(&p->users, swap_users_ref_free,
> + PERCPU_REF_INIT_DEAD, GFP_KERNEL);
> + if (unlikely(error))
> + goto bad_swap;
> +
>   name = getname(specialfile);
>   if (IS_ERR(name)) {
>   error = PTR_ERR(name);
> @@ -3356,6 +3374,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
> specialfile, int, swap_flags)
>  bad_swap_unlock_inode:
>   inode_unlock(inode);
>  bad_swap:
> + percpu_ref_exit(&p->users);

Usually resources are freed in the reverse order of their allocation.
So, if there's no special reason, please follow that rule.
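
For illustration, the usual pattern looks like the hypothetical sketch
below; the foo/alloc_*/free_* names are made up and only show the shape
of the error unwinding.

int setup_example(struct foo *f)
{
	int err;

	err = alloc_a(f);		/* allocated first ... */
	if (err)
		return err;
	err = alloc_b(f);
	if (err)
		goto free_a;
	err = alloc_c(f);
	if (err)
		goto free_b;
	return 0;

free_b:
	free_b(f);			/* ... freed in reverse order */
free_a:
	free_a(f);
	return err;
}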

Best Regards,
Huang, Ying

>   free_percpu(p->percpu_cluster);
>   p->percpu_cluster = NULL;
>   free_percpu(p->cluster_next_cpu);


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-11 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/10 1:17, Tim Chen wrote:
>> 
>> 
>> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>>> On 2021/4/9 5:34, Tim Chen wrote:
>>>>
>>>>
>>>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>>>> When I was investigating the swap code, I found the below possible race
>>>>> window:
>>>>>
>>>>> CPU 1 CPU 2
>>>>> - -
>>>>> do_swap_page
>>>>>   synchronous swap_readpage
>>>>> alloc_page_vma
>>>>>   swapoff
>>>>> release swap_file, bdev, or ...
>>>>
>>>
>>> Many thanks for quick review and reply!
>>>
>>>> Perhaps I'm missing something.  The release of swap_file, bdev etc
>>>> happens after we have cleared the SWP_VALID bit in si->flags in 
>>>> destroy_swap_extents
>>>> if I read the swapoff code correctly.
>>> Agree. Let's look this more close:
>>> CPU1CPU2
>>> -   -
>>> swap_readpage
>>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>> swapoff
>>>   p->swap_file 
>>> = NULL;
>>> struct file *swap_file = sis->swap_file;
>>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>>   ...
>>>   p->flags = 0;
>>> ...
>>>
>>> Does this make sense for you?
>> 
>> p->swapfile = NULL happens after the 
>> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence 
>> in swapoff().
>> 
>> So I don't think the sequence you illustrated on CPU2 is in the right order.
>> That said, without get_swap_device/put_swap_device in swap_readpage, you 
>> could
>> potentially blow pass synchronize_rcu() on CPU2 and causes a problem.  so I 
>> think
>> the problematic race looks something like the following:
>> 
>> 
>> CPU1 CPU2
>> --
>> swap_readpage
>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>  swapoff
>>p->flags = &= 
>> ~SWP_VALID;
>>..
>>
>> synchronize_rcu();
>>..
>>p->swap_file 
>> = NULL;
>> struct file *swap_file = sis->swap_file;
>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>        ...
>> ...
>> 
>
> Agree. This is also what I meant to illustrate. And you provide a better one. 
> Many thanks!

For the pages that are swapped in through swap cache.  That isn't an
issue.  Because the page is locked, the swap entry will be marked with
SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
unlocked.

So the race is for the fast path as follows,

if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1)

I found it in your original patch description.  But please make it more
explicit to reduce the potential confusion.
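
For reference, a simplified sketch of how the fast path could be guarded
with get/put_swap_device(); this illustrates the idea only and omits the
swap cache and PTE handling details of the real do_swap_page().

	si = get_swap_device(entry);
	if (!si)
		goto out;			/* raced with swapoff */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1) {
		/* skip the swap cache: read the page synchronously */
		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
		if (page) {
			__SetPageLocked(page);
			set_page_private(page, entry.val);
			swap_readpage(page, true);
		}
	} else {
		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
	}
	/* ... pte_same() check, map or bail out ... */
	put_swap_device(si);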

Best Regards,
Huang, Ying


Re: [PATCH 4/5] mm/swap_state: fix potential faulted in race in swap_ra_info()

2021-04-11 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/9 16:50, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> While we released the pte lock, somebody else might have faulted in this pte.
>>> So we should check whether it's a swap pte first to guard against such a race,
>>> or swp_type would be unexpected. And we can also avoid some unnecessary
>>> readahead cpu cycles.
>>>
>>> Fixes: ec560175c0b6 ("mm, swap: VMA based swap readahead")
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  mm/swap_state.c | 13 +
>>>  1 file changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>>> index 709c260d644a..3bf0d0c297bc 100644
>>> --- a/mm/swap_state.c
>>> +++ b/mm/swap_state.c
>>> @@ -724,10 +724,10 @@ static void swap_ra_info(struct vm_fault *vmf,
>>>  {
>>> struct vm_area_struct *vma = vmf->vma;
>>> unsigned long ra_val;
>>> -   swp_entry_t entry;
>>> +   swp_entry_t swap_entry;
>>> unsigned long faddr, pfn, fpfn;
>>> unsigned long start, end;
>>> -   pte_t *pte, *orig_pte;
>>> +   pte_t *pte, *orig_pte, entry;
>>> unsigned int max_win, hits, prev_win, win, left;
>>>  #ifndef CONFIG_64BIT
>>> pte_t *tpte;
>>> @@ -742,8 +742,13 @@ static void swap_ra_info(struct vm_fault *vmf,
>>>  
>>> faddr = vmf->address;
>>> orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
>>> -   entry = pte_to_swp_entry(*pte);
>>> -   if ((unlikely(non_swap_entry(entry {
>>> +   entry = *pte;
>>> +   if (unlikely(!is_swap_pte(entry))) {
>>> +   pte_unmap(orig_pte);
>>> +   return;
>>> +   }
>>> +   swap_entry = pte_to_swp_entry(entry);
>>> +   if ((unlikely(non_swap_entry(swap_entry {
>>> pte_unmap(orig_pte);
>>> return;
>>> }
>> 
>> This isn't a real issue.  entry or swap_entry isn't used in this
>
> Agree. It seems the entry or swap_entry here is just used to check whether
> the pte is still a valid swap entry.

If you check the git history, you will find that the check used to be
necessary, because the function was called earlier in do_swap_page() at
that time.

Best Regards,
Huang, Ying


Re: [PATCH 4/5] mm/swap_state: fix potential faulted in race in swap_ra_info()

2021-04-09 Thread Huang, Ying
Miaohe Lin  writes:

> While we released the pte lock, somebody else might have faulted in this pte.
> So we should check whether it's a swap pte first to guard against such a race,
> or swp_type would be unexpected. And we can also avoid some unnecessary
> readahead cpu cycles.
>
> Fixes: ec560175c0b6 ("mm, swap: VMA based swap readahead")
> Signed-off-by: Miaohe Lin 
> ---
>  mm/swap_state.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 709c260d644a..3bf0d0c297bc 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -724,10 +724,10 @@ static void swap_ra_info(struct vm_fault *vmf,
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   unsigned long ra_val;
> - swp_entry_t entry;
> + swp_entry_t swap_entry;
>   unsigned long faddr, pfn, fpfn;
>   unsigned long start, end;
> - pte_t *pte, *orig_pte;
> + pte_t *pte, *orig_pte, entry;
>   unsigned int max_win, hits, prev_win, win, left;
>  #ifndef CONFIG_64BIT
>   pte_t *tpte;
> @@ -742,8 +742,13 @@ static void swap_ra_info(struct vm_fault *vmf,
>  
>   faddr = vmf->address;
>   orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> - entry = pte_to_swp_entry(*pte);
> - if ((unlikely(non_swap_entry(entry {
> + entry = *pte;
> + if (unlikely(!is_swap_pte(entry))) {
> + pte_unmap(orig_pte);
> + return;
> + }
> + swap_entry = pte_to_swp_entry(entry);
> + if ((unlikely(non_swap_entry(swap_entry {
>   pte_unmap(orig_pte);
>   return;
>   }

This isn't a real issue: entry or swap_entry isn't used in this
function, and we do enough checking when we really operate on the PTE
entries later.  But I admit it's confusing.  So I suggest just removing
the check.  We will check it when necessary.
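
In other words, the suggestion amounts to something like the sketch
below, assuming the later readahead loop keeps validating the PTE
entries it actually reads.

	faddr = vmf->address;
	orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
	/* no is_swap_pte()/non_swap_entry() check here; entries are checked later */
	fpfn = PFN_DOWN(faddr);
	ra_val = GET_SWAP_RA_VAL(vma);
	pfn = PFN_DOWN(SWAP_RA_ADDR(ra_val));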

Best Regards,
Huang, Ying


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-08 Thread Huang, Ying
Yang Shi  writes:

> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt  wrote:
>>
>> Hi Tim,
>>
>> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen  wrote:
>> >
>> > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
>> > others NUMA wise, but a byte of media has about the same cost whether it
>> > is close or far.  But, with new memory tiers such as Persistent Memory
>> > (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
>> > PMEM.
>> >
>> > The fast/expensive memory lives in the top tier of the memory hierachy.
>> >
>> > Previously, the patchset
>> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
>> > https://lore.kernel.org/linux-mm/20210401183216.443c4...@viggo.jf.intel.com/
>> > provides a mechanism to demote cold pages from DRAM node into PMEM.
>> >
>> > And the patchset
>> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory 
>> > tiering system
>> > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.hu...@intel.com/
>> > provides a mechanism to promote hot pages in PMEM to the DRAM node
>> > leveraging autonuma.
>> >
>> > The two patchsets together keep the hot pages in DRAM and colder pages
>> > in PMEM.
>>
>> Thanks for working on this as this is becoming more and more important
>> particularly in the data centers where memory is a big portion of the
>> cost.
>>
>> I see you have responded to Michal and I will add my more specific
>> response there. Here I wanted to give my high level concern regarding
>> using v1's soft limit like semantics for top tier memory.
>>
>> This patch series aims to distribute/partition top tier memory between
>> jobs of different priorities. We want high priority jobs to have
>> preferential access to the top tier memory and we don't want low
>> priority jobs to hog the top tier memory.
>>
>> Using v1's soft limit like behavior can potentially cause high
>> priority jobs to stall to make enough space on top tier memory on
>> their allocation path and I think this patchset is aiming to reduce
>> that impact by making kswapd do that work. However I think the more
>> concerning issue is the low priority job hogging the top tier memory.
>>
>> The possible ways the low priority job can hog the top tier memory are
>> by allocating non-movable memory or by mlocking the memory. (Oh there
>> is also pinning the memory but I don't know if there is a user api to
>> pin memory?) For the mlocked memory, you need to either modify the
>> reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.

To optimize the page placement of a process between DRAM and PMEM, we
want to place the hot pages in DRAM and the cold pages in PMEM.  But the
memory access pattern changes over time, so we need to migrate pages
between DRAM and PMEM to adapt to those changes.

To avoid hot pages always being pinned in PMEM, one way is to online
the PMEM as movable zones.  If so, and if the low priority jobs are
restricted by cpuset to allocate from PMEM only, we may fail to run
quite a few workloads, as discussed in the following thread,

https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.t...@intel.com/

>>
>> Basically I am saying we should put the upfront control (limit) on the
>> usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about in LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have some potential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.

Per my understanding, memory.low/min are for memory protection rather
than memory limiting; memory.high is for memory limiting.

Best Regards,
Huang, Ying


Re: [PATCH -V2] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-04-08 Thread Huang, Ying
Mel Gorman  writes:

> On Fri, Apr 02, 2021 at 04:27:17PM +0800, Huang Ying wrote:
>> With NUMA balancing, in hint page fault handler, the faulting page
>> will be migrated to the accessing node if necessary.  During the
>> migration, TLB will be shot down on all CPUs that the process has run
>> on recently.  Because in the hint page fault handler, the PTE will be
>> made accessible before the migration is tried.  The overhead of TLB
>> shooting down can be high, so it's better to be avoided if possible.
>> In fact, if we delay mapping the page until migration, that can be
>> avoided.  This is what this patch doing.
>> 
>> 
>>
>
> Thanks, I think this is ok for Andrew to pick up to see if anything
> bisects to this commit but it's a low risk.
>
> Reviewed-by: Mel Gorman 
>
> More notes;
>
> This is not a universal win given that not all workloads exhibit the
> pattern where accesses occur in parallel threads between when a page
> is marked accessible and when it is migrated. The impact of the patch
> appears to be neutral for those workloads. For workloads that do exhibit
> the pattern, there is a small gain with a reduction in interrupts as
> advertised unlike v1 of the patch. Further tests are running to confirm
> the reduction is in TLB shootdown interrupts but I'm reasonably confident
> that will be the case. Gains are typically small and the load described in
> the changelog appears to be a best case scenario but a 1-5% gain in some
> other workloads is still an improvement. There is still the possibility
> that some workloads will unnecessarily stall as a result of the patch
> for slightly longer periods of time but that is a relatively low risk
> and will be difficult to detect. If I'm wrong, a bisection will find it.

Hi, Mel,

Thanks!

Hi, Andrew,

I found that V2 does not apply on top of the latest mmotm, so I sent V3
as follows, in case you need it.

https://lore.kernel.org/lkml/20210408132236.1175607-1-ying.hu...@intel.com/

Best Regards,
Huang, Ying


[PATCH -V3] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-04-08 Thread Huang Ying
With NUMA balancing, in hint page fault handler, the faulting page
will be migrated to the accessing node if necessary.  During the
migration, TLB will be shot down on all CPUs that the process has run
on recently.  Because in the hint page fault handler, the PTE will be
made accessible before the migration is tried.  The overhead of TLB
shooting down can be high, so it's better to be avoided if possible.
In fact, if we delay mapping the page until migration, that can be
avoided.  This is what this patch does.

For multi-threaded applications, it's possible that a page is
accessed by multiple threads almost at the same time.  In the original
implementation, because the first thread will install the accessible
PTE before migrating the page, the other threads may access the page
directly before the page is made inaccessible again during migration.
While with the patch, the second thread will go through the page fault
handler too. And because of the PageLRU() checking in the following
code path,

  migrate_misplaced_page()
numamigrate_isolate_page()
  isolate_lru_page()

the migrate_misplaced_page() will return 0, and the PTE will be made
accessible in the second thread.

This will introduce a little more overhead.  But we think the
possibility for a page to be accessed by multiple threads at the
same time is low, and the overhead difference isn't too large.  If
this becomes a problem in some workloads, we need to consider how to
reduce the overhead.
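
A simplified sketch of the resulting control flow follows (details
differ slightly from the actual hunk further below; the PTL is held
except around the migration call):

	old_pte = ptep_get(vmf->pte);		/* the PTE stays inaccessible for now */
	pte = pte_modify(old_pte, vma->vm_page_prot);
	page = vm_normal_page(vma, vmf->address, pte);
	if (!page || PageCompound(page))
		goto out_map;			/* nothing migratable: map it now */
	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid, &flags);
	if (target_nid == NUMA_NO_NODE)
		goto out_map;			/* no better node: map it now */
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	if (migrate_misplaced_page(page, vma, target_nid)) {
		page_nid = target_nid;
		flags |= TNF_MIGRATED;
	} else {
		flags |= TNF_MIGRATE_FAIL;	/* e.g. a second thread isolated it first */
		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
		spin_lock(vmf->ptl);
		goto out_map;			/* make the PTE accessible after all */
	}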

To test the patch, we run a test case as follows on a 2-socket Intel
server (1 NUMA node per socket) with 128GB DRAM (64GB per socket).

1. Run a memory eater on NUMA node 1 to use 40GB memory before running
   pmbench.

2. Run pmbench (normal accessing pattern) with 8 processes, and 8
   threads per process, so there are 64 threads in total.  The
   working-set size of each process is 8960MB, so the total working-set
   size is 8 * 8960MB = 70GB.  The CPU of all pmbench processes is bound
   to node 1.  The pmbench processes will access some DRAM on node 0.

3. After the pmbench processes run for 10 seconds, kill the memory
   eater.  Now, some pages will be migrated from node 0 to node 1 via
   NUMA balancing.

Test results show that, with the patch, the pmbench throughput (page
accesses/s) increases by 5.5%.  The number of TLB shootdown interrupts
is reduced by 98% (from ~4.7e7 to ~9.7e5), with about 9.2e6 pages
(35.8GB) migrated.  From the perf profile, the CPU cycles spent by
try_to_unmap() and its callees drop from 6.02% to 0.47%.  That is, the
CPU cycles spent on TLB shootdown decrease greatly.

Signed-off-by: "Huang, Ying" 
Reviewed-by: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Peter Xu 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: "Matthew Wilcox" 
Cc: Will Deacon 
Cc: Michel Lespinasse 
Cc: Arjun Roy 
Cc: "Kirill A. Shutemov" 
---
 mm/memory.c | 54 +++--
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index cc71a445c76c..7e9d4e55089c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4159,29 +4159,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
goto out;
}
 
-   /*
-* Make it present again, depending on how arch implements
-* non-accessible ptes, some can allow access by kernel mode.
-*/
-   old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+   /* Get the normal PTE  */
+   old_pte = ptep_get(vmf->pte);
pte = pte_modify(old_pte, vma->vm_page_prot);
-   pte = pte_mkyoung(pte);
-   if (was_writable)
-   pte = pte_mkwrite(pte);
-   ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
-   update_mmu_cache(vma, vmf->address, vmf->pte);
 
page = vm_normal_page(vma, vmf->address, pte);
-   if (!page) {
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return 0;
-   }
+   if (!page)
+   goto out_map;
 
/* TODO: handle PTE-mapped THP */
-   if (PageCompound(page)) {
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return 0;
-   }
+   if (PageCompound(page))
+   goto out_map;
 
/*
 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -4191,7 +4179,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 * pte_dirty has unpredictable behaviour between PTE scan updates,
 * background writeback, dirty balancing and application behaviour.
 */
-   if (!pte_write(pte))
+   if (!was_writable)
flags |= TNF_NO_GROUP;
 
/*
@@ -4205,23 +4193,45 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
&flags);
-   pte_

[PATCH -V2] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-04-02 Thread Huang Ying
With NUMA balancing, in hint page fault handler, the faulting page
will be migrated to the accessing node if necessary.  During the
migration, TLB will be shot down on all CPUs that the process has run
on recently.  Because in the hint page fault handler, the PTE will be
made accessible before the migration is tried.  The overhead of TLB
shooting down can be high, so it's better to be avoided if possible.
In fact, if we delay mapping the page until migration, that can be
avoided.  This is what this patch does.

For multi-threaded applications, it's possible that a page is
accessed by multiple threads almost at the same time.  In the original
implementation, because the first thread will install the accessible
PTE before migrating the page, the other threads may access the page
directly before the page is made inaccessible again during migration.
While with the patch, the second thread will go through the page fault
handler too. And because of the PageLRU() checking in the following
code path,

  migrate_misplaced_page()
numamigrate_isolate_page()
  isolate_lru_page()

the migrate_misplaced_page() will return 0, and the PTE will be made
accessible in the second thread.

This will introduce a little more overhead.  But we think the
possibility for a page to be accessed by multiple threads at the
same time is low, and the overhead difference isn't too large.  If
this becomes a problem in some workloads, we need to consider how to
reduce the overhead.

To test the patch, we run a test case as follows on a 2-socket Intel
server (1 NUMA node per socket) with 128GB DRAM (64GB per socket).

1. Run a memory eater on NUMA node 1 to use 40GB memory before running
   pmbench.

2. Run pmbench (normal accessing pattern) with 8 processes, and 8
   threads per process, so there are 64 threads in total.  The
   working-set size of each process is 8960MB, so the total working-set
   size is 8 * 8960MB = 70GB.  The CPU of all pmbench processes is bound
   to node 1.  The pmbench processes will access some DRAM on node 0.

3. After the pmbench processes run for 10 seconds, kill the memory
   eater.  Now, some pages will be migrated from node 0 to node 1 via
   NUMA balancing.

Test results show that, with the patch, the pmbench throughput (page
accesses/s) increases by 5.5%.  The number of TLB shootdown interrupts
is reduced by 98% (from ~4.7e7 to ~9.7e5), with about 9.2e6 pages
(35.8GB) migrated.  From the perf profile, the CPU cycles spent by
try_to_unmap() and its callees drop from 6.02% to 0.47%.  That is, the
CPU cycles spent on TLB shootdown decrease greatly.

Signed-off-by: "Huang, Ying" 
Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Peter Xu 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: "Matthew Wilcox" 
Cc: Will Deacon 
Cc: Michel Lespinasse 
Cc: Arjun Roy 
Cc: "Kirill A. Shutemov" 
---
 mm/memory.c | 54 +++--
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d3273bd69dbb..a00b39e81a25 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4148,29 +4148,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
goto out;
}
 
-   /*
-* Make it present again, Depending on how arch implementes non
-* accessible ptes, some can allow access by kernel mode.
-*/
-   old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+   /* Get the normal PTE  */
+   old_pte = ptep_get(vmf->pte);
pte = pte_modify(old_pte, vma->vm_page_prot);
-   pte = pte_mkyoung(pte);
-   if (was_writable)
-   pte = pte_mkwrite(pte);
-   ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
-   update_mmu_cache(vma, vmf->address, vmf->pte);
 
page = vm_normal_page(vma, vmf->address, pte);
-   if (!page) {
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return 0;
-   }
+   if (!page)
+   goto out_map;
 
/* TODO: handle PTE-mapped THP */
-   if (PageCompound(page)) {
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return 0;
-   }
+   if (PageCompound(page))
+   goto out_map;
 
/*
 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -4180,7 +4168,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 * pte_dirty has unpredictable behaviour between PTE scan updates,
 * background writeback, dirty balancing and application behaviour.
 */
-   if (!pte_write(pte))
+   if (!was_writable)
flags |= TNF_NO_GROUP;
 
/*
@@ -4194,23 +4182,45 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
&flags);
-   pte_

Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-03-31 Thread Huang, Ying
Mel Gorman  writes:

> On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
>> Mel Gorman  writes:
>> 
>> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>> >> For NUMA balancing, in hint page fault handler, the faulting page will
>> >> be migrated to the accessing node if necessary.  During the migration,
>> >> TLB will be shot down on all CPUs that the process has run on
>> >> recently.  Because in the hint page fault handler, the PTE will be
>> >> made accessible before the migration is tried.  The overhead of TLB
>> >> shooting down is high, so it's better to be avoided if possible.  In
>> >> fact, if we delay mapping the page in PTE until migration, that can be
>> >> avoided.  This is what this patch doing.
>> >> 
>> >
>> > Why would the overhead be high? It was previously inaccessibly so it's
>> > only parallel accesses making forward progress that trigger the need
>> > for a flush.
>> 
>> Sorry, I don't understand this.  Although the page is inaccessible, the
>> threads may access other pages, so TLB flushing is still necessary.
>> 
>
> You assert the overhead of TLB shootdown is high and yes, it can be
> very high but you also said "the benchmark score has no visible changes"
> indicating the TLB shootdown cost is not a major problem for the workload.
> It does not mean we should ignore it though.
>
>> > 
>> >
>> > If migration is attempted, then the time until the migration PTE is
>> > created is variable. The page has to be isolated from the LRU so there
>> > could be contention on the LRU lock, a new page has to be allocated and
>> > that allocation potentially has to enter the page allocator slow path
>> > etc. During that time, parallel threads make forward progress but with
>> > the patch, multiple threads potentially attempt the allocation and fail
>> > instead of doing real work.
>> 
>> If my understanding of the code were correct, only the first thread will
>> attempt the isolation and allocation.  Because TestClearPageLRU() is
>> called in
>> 
>>   migrate_misplaced_page()
>> numamigrate_isolate_page()
>>   isolate_lru_page()
>> 
>> And migrate_misplaced_page() will return 0 immediately if
>> TestClearPageLRU() returns false.  Then the second thread will make the
>> page accessible and make forward progress.
>> 
>
> Ok, that's true. While additional work is done, the cost is reasonably
> low -- lower than I initially imagined and with fewer side-effects.
>
>> But there's still some timing difference between the original and
>> patched kernel.  We have several choices to reduce the difference.
>> 
>> 1. Check PageLRU() with PTL held in do_numa_page()
>> 
>> If PageLRU() return false, do_numa_page() can make the page accessible
>> firstly.  So the second thread will make the page accessible earlier.
>> 
>> 2. Try to lock the page with PTL held in do_numa_page()
>> 
>> If the try-locking succeeds, it's the first thread, so it can delay
>> mapping.  If try-locking fails, it may be the second thread, so it will
>> make the page accessible firstly.  We need to teach
>> migrate_misplaced_page() to work with the page locked.  This will
>> enlarge the duration that the page is locked.  Is it a problem?
>> 
>> 3. Check page_count() with PTL held in do_numa_page()
>> 
>> The first thread will call get_page() in numa_migrate_prep().  So if the
>> second thread can detect that, it can make the page accessible firstly.
>> The difficulty is that it appears hard to identify the expected
>> page_count() for the file pages.  For anonymous pages, that is much
>> easier, so at least if a page passes the following test, we can delay
>> mapping,
>> 
>> PageAnon(page) && page_count(page) == page_mapcount(page) + 
>> !!PageSwapCache(page)
>> 
>> This will disable the optimization for the file pages.  But it may be
>> good enough?
>> 
>> Which one do you think is better?  Maybe the first one is good enough?
>> 
>
> The first one is probably the most straight-forward but it's more
> important to figure out why interrupts were higher with at least one
> workload when the exact opposite is expected. Investigating which of
> options 1-3 are best and whether it's worth the duplicated check could
> be done as a separate patch.
>
>> > You should consider the following question -- is the potential saving
>> > of an IPI transmission enough to offset the

Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault

2021-03-31 Thread Huang, Ying
Mel Gorman  writes:

> On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>> For NUMA balancing, in hint page fault handler, the faulting page will
>> be migrated to the accessing node if necessary.  During the migration,
>> TLB will be shot down on all CPUs that the process has run on
>> recently.  Because in the hint page fault handler, the PTE will be
>> made accessible before the migration is tried.  The overhead of TLB
>> shooting down is high, so it's better to be avoided if possible.  In
>> fact, if we delay mapping the page in PTE until migration, that can be
>> avoided.  This is what this patch doing.
>> 
>
> Why would the overhead be high? It was previously inaccessibly so it's
> only parallel accesses making forward progress that trigger the need
> for a flush.

Sorry, I don't understand this.  Although the page is inaccessible, the
threads may access other pages, so TLB flushing is still necessary.

> As your change notes -- "The benchmark score has no visible
> changes". The patch was neither a win nor a loss for your target workload
> but there are more fundamental issues to consider.
>
>> We have tested the patch with the pmbench memory accessing benchmark
>> on a 2-socket Intel server, and found that the number of the TLB
>> shooting down IPI reduces up to 99% (from ~6.0e6 to ~2.3e4) if NUMA
>> balancing is triggered (~8.8e6 pages migrated).  The benchmark score
>> has no visible changes.
>> 
>> Known issues:
>> 
>> For the multiple threads applications, it's possible that the page is
>> accessed by 2 threads almost at the same time.  In the original
>> implementation, the second thread may go accessing the page directly
>> because the first thread has installed the accessible PTE.  While with
>> this patch, there will be a window that the second thread will find
>> the PTE is still inaccessible.  But the difference between the
>> accessible window is small.  Because the page will be made
>> inaccessible soon for migrating.
>> 
>
> If multiple threads trap the hinting fault, only one potentially attempts
> a migration as the others observe the PTE has changed when the PTL is
> acquired and return to userspace. Such threads then have a short window to
> make progress before the PTE *potentially* becomes a migration PTE and
> during that window, the parallel access may not need the page any more
> and never stall on the migration.

Yes.

> That migration PTE may never be created if migrate_misplaced_page
> chooses to ignore the PTE in which case there is minimal disruption.

Yes.  And in the patched kernel, if numa_migrate_prep() returns
NUMA_NO_NODE or migrate_misplaced_page() returns 0, the PTE will be made
accessible too.

> If migration is attempted, then the time until the migration PTE is
> created is variable. The page has to be isolated from the LRU so there
> could be contention on the LRU lock, a new page has to be allocated and
> that allocation potentially has to enter the page allocator slow path
> etc. During that time, parallel threads make forward progress but with
> the patch, multiple threads potentially attempt the allocation and fail
> instead of doing real work.

If my understanding of the code were correct, only the first thread will
attempt the isolation and allocation.  Because TestClearPageLRU() is
called in

  migrate_misplaced_page()
numamigrate_isolate_page()
  isolate_lru_page()

And migrate_misplaced_page() will return 0 immediately if
TestClearPageLRU() returns false.  Then the second thread will make the
page accessible and make forward progress.

But there's still some timing difference between the original and
patched kernel.  We have several choices to reduce the difference.

1. Check PageLRU() with PTL held in do_numa_page()

If PageLRU() returns false, do_numa_page() can make the page accessible
first.  So the second thread will make the page accessible earlier.

2. Try to lock the page with PTL held in do_numa_page()

If the try-locking succeeds, it's the first thread, so it can delay
mapping.  If try-locking fails, it may be the second thread, so it will
make the page accessible firstly.  We need to teach
migrate_misplaced_page() to work with the page locked.  This will
enlarge the duration that the page is locked.  Is it a problem?

3. Check page_count() with PTL held in do_numa_page()

The first thread will call get_page() in numa_migrate_prep().  So if the
second thread can detect that, it can make the page accessible firstly.
The difficulty is that it appears hard to identify the expected
page_count() for the file pages.  For anonymous pages, that is much
easier, so at least if a page passes the following test, we can delay
mapping,

PageAnon(page) && page_count(page) 
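
Of the three, the first is the simplest.  A minimal sketch, assuming it
is added under the PTL in do_numa_page() before the migration attempt
(variable names follow the patch):

	if (!PageLRU(page))
		goto out_map;		/* isolation would fail anyway: map the PTE now */
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	migrated = migrate_misplaced_page(page, vma, target_nid);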

Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage

2021-03-30 Thread Huang, Ying
Yu Zhao  writes:

> On Mon, Mar 29, 2021 at 9:44 PM Huang, Ying  wrote:
>>
>> Miaohe Lin  writes:
>>
>> > On 2021/3/30 9:57, Huang, Ying wrote:
>> >> Hi, Miaohe,
>> >>
>> >> Miaohe Lin  writes:
>> >>
>> >>> Hi all,
>> >>> I am investigating the swap code, and I found the below possible race 
>> >>> window:
>> >>>
>> >>> CPU 1   CPU 2
>> >>> -   -
>> >>> do_swap_page
>> >>>   skip swapcache case (synchronous swap_readpage)
>> >>> alloc_page_vma
>> >>> swapoff
>> >>>   release swap_file, 
>> >>> bdev, or ...
>> >>>   swap_readpage
>> >>> check sis->flags is ok
>> >>>   access swap_file, bdev or ...[oops!]
>> >>> si->flags = 0
>> >>>
>> >>> The swapcache case is ok because swapoff will wait on the page_lock of 
>> >>> swapcache page.
>> >>> Is this will really happen or Am I miss something ?
>> >>> Any reply would be really grateful. Thanks! :)
>> >>
>> >> This appears possible.  Even for swapcache case, we can't guarantee the
>> >
>> > Many thanks for reply!
>> >
>> >> swap entry gotten from the page table is always valid too.  The
>> >
>> > The page table may change at any time. And we may thus do some useless 
>> > work.
>> > But the pte_same() check could handle these races correctly if these do not
>> > result in oops.
>> >
>> >> underlying swap device can be swapped off at the same time.  So we use
>> >> get/put_swap_device() for that.  Maybe we need similar stuff here.
>> >
>> > Using get/put_swap_device() to guard against swapoff for swap_readpage() 
>> > sounds
>> > really bad as swap_readpage() may take really long time. Also such race 
>> > may not be
>> > really hurtful because swapoff is usually done when system shutdown only.
>> > I can not figure some simple and stable stuff out to fix this. Any 
>> > suggestions or
>> > could anyone help get rid of such race?
>>
>> Some reference counting on the swap device can prevent swap device from
>> swapping-off.  To reduce the performance overhead on the hot-path as
>> much as possible, it appears we can use the percpu_ref.
>
> Hi,
>
> I've been seeing crashes when testing the latest kernels with
>   stress-ng --class vm -a 20 -t 600s --temp-path /tmp
>
> I haven't had time to look into them yet:
>
> DEBUG_VM:
>   BUG: unable to handle page fault for address: 905c33c9a000
>   Call Trace:
>get_swap_pages+0x278/0x590
>get_swap_page+0x1ab/0x280
>add_to_swap+0x7d/0x130
>shrink_page_list+0xf84/0x25f0
>reclaim_pages+0x313/0x430
>madvise_cold_or_pageout_pte_range+0x95c/0xaa0

If my understanding is correct, two bugs are reported, one above and
one below?  If so, and the one above is reported first, can you share
the full bug message reported in dmesg?

Can you convert the call trace to source lines?  And can you share the
commit of the kernel, or the full kconfig, so I can build it myself?

Best Regards,
Huang, Ying

> KASAN:
>   ==
>   BUG: KASAN: slab-out-of-bounds in __frontswap_store+0xc9/0x2e0
>   Read of size 8 at addr 88901f646f18 by task stress-ng-mrema/31329
>   CPU: 2 PID: 31329 Comm: stress-ng-mrema Tainted: G SI  L
> 5.12.0-smp-DEV #2
>   Call Trace:
>dump_stack+0xff/0x165
>print_address_description+0x81/0x390
>__kasan_report+0x154/0x1b0
>? __frontswap_store+0xc9/0x2e0
>? __frontswap_store+0xc9/0x2e0
>kasan_report+0x47/0x60
>kasan_check_range+0x2f3/0x340
>__kasan_check_read+0x11/0x20
>__frontswap_store+0xc9/0x2e0
>swap_writepage+0x52/0x80
>pageout+0x489/0x7f0
>shrink_page_list+0x1b11/0x2c90
>reclaim_pages+0x6ca/0x930
>madvise_cold_or_pageout_pte_range+0x1260/0x13a0
>
>   Allocated by task 16813:
>kasan_kmalloc+0xb0/0xe0
>__kasan_kmalloc+0x9/0x10
>__kmalloc_node+0x52/0x70
>kvmalloc_node+0x50/0x90
>__se_sys_swapon+0x353a/0x4860
>__x64_sys_swapon+0x5b/0x70
>
>   The buggy address belongs to the object at 889

Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage

2021-03-29 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/3/30 9:57, Huang, Ying wrote:
>> Hi, Miaohe,
>> 
>> Miaohe Lin  writes:
>> 
>>> Hi all,
>>> I am investigating the swap code, and I found the below possible race 
>>> window:
>>>
>>> CPU 1   CPU 2
>>> -   -
>>> do_swap_page
>>>   skip swapcache case (synchronous swap_readpage)
>>> alloc_page_vma
>>> swapoff
>>>   release swap_file, 
>>> bdev, or ...
>>>   swap_readpage
>>> check sis->flags is ok
>>>   access swap_file, bdev or ...[oops!]
>>> si->flags = 0
>>>
>>> The swapcache case is ok because swapoff will wait on the page_lock of 
>>> swapcache page.
>>> Is this will really happen or Am I miss something ?
>>> Any reply would be really grateful. Thanks! :)
>> 
>> This appears possible.  Even for swapcache case, we can't guarantee the
>
> Many thanks for reply!
>
>> swap entry gotten from the page table is always valid too.  The
>
> The page table may change at any time. And we may thus do some useless work.
> But the pte_same() check could handle these races correctly if these do not
> result in oops.
>
>> underlying swap device can be swapped off at the same time.  So we use
>> get/put_swap_device() for that.  Maybe we need similar stuff here.
>
> Using get/put_swap_device() to guard against swapoff for swap_readpage() 
> sounds
> really bad as swap_readpage() may take really long time. Also such race may 
> not be
> really hurtful because swapoff is usually done when system shutdown only.
> I can not figure some simple and stable stuff out to fix this. Any 
> suggestions or
> could anyone help get rid of such race?

Some reference counting on the swap device can prevent the swap device
from being swapped off.  To reduce the performance overhead on the hot
path as much as possible, it appears we can use percpu_ref.
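
A sketch of the percpu_ref lifecycle this would imply (an assumption at
this point; the callback and field names are only placeholders):

	/* swapon: initialize dead, make it live only after swap_map etc. are set up */
	percpu_ref_init(&si->users, swap_users_ref_free, PERCPU_REF_INIT_DEAD, GFP_KERNEL);
	...
	percpu_ref_resurrect(&si->users);

	/* hot path, e.g. in get_swap_device(): */
	if (!percpu_ref_tryget_live(&si->users))
		return NULL;		/* swapoff in progress */
	...
	percpu_ref_put(&si->users);

	/* swapoff: stop new users, then wait for the existing ones */
	percpu_ref_kill(&si->users);
	wait_for_completion(&si->comp);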

Best Regards,
Huang, Ying


Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage

2021-03-29 Thread Huang, Ying
Hi, Miaohe,

Miaohe Lin  writes:

> Hi all,
> I am investigating the swap code, and I found the below possible race window:
>
> CPU 1 CPU 2
> - -
> do_swap_page
>   skip swapcache case (synchronous swap_readpage)
> alloc_page_vma
>   swapoff
> release swap_file, 
> bdev, or ...
>   swap_readpage
>   check sis->flags is ok
> access swap_file, bdev or ...[oops!]
>   si->flags = 0
>
> The swapcache case is ok because swapoff will wait on the page_lock of 
> swapcache page.
> Is this will really happen or Am I miss something ?
> Any reply would be really grateful. Thanks! :)

This appears possible.  Even in the swapcache case, we can't guarantee
that the swap entry read from the page table is always valid either; the
underlying swap device can be swapped off at the same time.  So we use
get/put_swap_device() for that.  Maybe we need similar stuff here.
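
For example, the synchronous read path could be guarded roughly like
this sketch:

	si = get_swap_device(entry);
	if (!si)
		return;			/* swapoff already released the device */
	swap_readpage(page, true);
	put_swap_device(si);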

Best Regards,
Huang, Ying

