"Ho-Ren (Jack) Chuang" <horenchu...@bytedance.com> writes:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
> <jonathan.came...@huawei.com> wrote:
>>
>> On Fri, 5 Apr 2024 00:07:06 +0000
>> "Ho-Ren (Jack) Chuang" <horenchu...@bytedance.com> wrote:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> > default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> > initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang <horenchu...@bytedance.com>
>> > Signed-off-by: Hao Xiang <hao.xi...@bytedance.com>
>> > Reviewed-by: "Huang, Ying" <ying.hu...@intel.com>
>>
>> Hi - one remaining question. Why can't we delay init for all nodes
>> to either drivers or your fallback late_initcall code.
>> It would be nice to reduce possible code paths.
>
> I try not to change too much of the existing code structure in
> this patchset.
>
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
>
> After tier_mem(), hmat_init() is called, which requires registering
> `default_dram_type` info. This is when `default_dram_type` is needed.
> However, it is indeed possible to postpone the latter part,
> set_node_memory_tier(), to late_initcall(). So, memory_tier_init() can
> indeed be split into two parts, and the latter part can be moved to
> late_initcall() to be processed together.

I don't think that it's good to move all memory_tier initialization in
drivers to late_initcall().  It's natural to keep them at
device_initcall() level.

If so, we can allocate default_dram_type in memory_tier_init(), and call
set_node_memory_tier() only in memory_tier_late_init().  We can call
memory_tier_late_init() at device_initcall() level too.

--
Best Regards,
Huang, Ying

> Doing this all memory-type drivers have to call late_initcall() to
> register a memory tier. I’m not sure how many they are?
>
> What do you guys think?
>
>>
>> Jonathan
>>
>>
>> > ---
>> >  mm/memory-tiers.c | 94 +++++++++++++++++++++++++++++++++++------------
>> >  1 file changed, 70 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 516b144fd45a..6632102bd5c9 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c
>>
>>
>>
>> > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
>> >  	 * For now we can have 4 faster memory tiers with smaller adistance
>> >  	 * than default DRAM tier.
>> >  	 */
>> > -	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
>> > +	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
>> > +							&default_memory_types);
>> >  	if (IS_ERR(default_dram_type))
>> >  		panic("%s() failed to allocate default DRAM tier\n",
>> >  		      __func__);
>> >
>> > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
>> >  	 * types assigned.
>> >  	 */
>> >  	for_each_node_state(node, N_MEMORY) {
>> > +		if (!node_state(node, N_CPU))
>> > +			/*
>> > +			 * Defer memory tier initialization on
>> > +			 * CPUless numa nodes. These will be initialized
>> > +			 * after firmware and devices are initialized.
>>
>> Could the comment also say why we can't defer them all?
>>
>> (In an odd coincidence we have a similar issue for some CPU hotplug
>> related bring up where review feedback was move all cases later).
>>
>> > +			 */
>> > +			continue;
>> > +
>> >  		memtier = set_node_memory_tier(node);
>> >  		if (IS_ERR(memtier))
>> >  			/*
>>
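
For readers following the thread, the two-pass ordering discussed above can
be modelled in a few lines of userspace C. This is only an illustrative
sketch, not code from the patch: `struct sketch_node`, `sketch_early_init()`,
`sketch_late_init()` and `SKETCH_ADIST_DRAM` are hypothetical names, and the
numeric distance is an arbitrary placeholder. It assumes the early pass
handles only nodes with CPUs, and the late pass skips nodes that were already
set up (for example by a driver), uses HMAT-derived data when present, and
otherwise falls back to the default DRAM value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical placeholder; not the kernel's MEMTIER_ADISTANCE_DRAM. */
#define SKETCH_ADIST_DRAM 576
#define SKETCH_ADIST_UNSET (-1)

/* Hypothetical model of a NUMA node; not a kernel structure. */
struct sketch_node {
	bool has_memory;	/* models the N_MEMORY node state */
	bool has_cpu;		/* models the N_CPU node state */
	bool has_hmat;		/* firmware supplied performance data */
	int hmat_adist;		/* distance derived from that data */
	int adist;		/* assigned distance, or SKETCH_ADIST_UNSET */
};

/* Early pass: nodes with CPUs get the default DRAM distance. */
static void sketch_early_init(struct sketch_node *nodes, size_t n)
{
	assert(nodes != NULL);

	for (size_t i = 0; i < n; i++) {
		nodes[i].adist = SKETCH_ADIST_UNSET;
		if (!nodes[i].has_memory || !nodes[i].has_cpu)
			continue;	/* defer CPU-less nodes */
		nodes[i].adist = SKETCH_ADIST_DRAM;
	}
}

/*
 * Late pass: runs once firmware data is available. Nodes that already
 * have a distance (early pass or a driver) are left untouched.
 */
static void sketch_late_init(struct sketch_node *nodes, size_t n)
{
	assert(nodes != NULL);

	for (size_t i = 0; i < n; i++) {
		if (!nodes[i].has_memory ||
		    nodes[i].adist != SKETCH_ADIST_UNSET)
			continue;
		/* No HMAT data: fall back to the default DRAM distance. */
		nodes[i].adist = nodes[i].has_hmat ? nodes[i].hmat_adist
						   : SKETCH_ADIST_DRAM;
	}
}
```

Calling `sketch_early_init()` and then `sketch_late_init()` on the same array
mirrors the split under discussion: the first call leaves CPU-less nodes
unassigned, and the second fills them in without disturbing nodes that were
configured in between.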