[responding to the questions raised here before reviewing the patch...]

Koralahalli Channabasappa, Smita wrote:
> Hi Alejandro,
> 
> On 1/23/2026 3:59 AM, Alejandro Lucero Palau wrote:
> > 
> > On 1/22/26 04:55, Smita Koralahalli wrote:
> >> The current probe time ownership check for Soft Reserved memory based
> >> solely on CXL window intersection is insufficient. dax_hmem probing is not
> >> always guaranteed to run after CXL enumeration and region assembly, which
> >> can lead to incorrect ownership decisions before the CXL stack has
> >> finished publishing windows and assembling committed regions.
> >>
> >> Introduce deferred ownership handling for Soft Reserved ranges that
> >> intersect CXL windows at probe time by scheduling deferred work from
> >> dax_hmem and waiting for the CXL stack to complete enumeration and region
> >> assembly before deciding ownership.
> >>
> >> Evaluate ownership of Soft Reserved ranges based on CXL region
> >> containment.
> >>
> >>     - If all Soft Reserved ranges are fully contained within committed CXL
> >>       regions, DROP handling Soft Reserved ranges from dax_hmem and allow
> >>       dax_cxl to bind.
> >>
> >>     - If any Soft Reserved range is not fully claimed by a committed
> >>       CXL region, tear down all CXL regions and REGISTER the Soft Reserved
> >>       ranges with dax_hmem instead.
> > 
> > 
> > I was not sure if I was understanding this properly, but after looking
> > at the code I think I do ... but then I do not understand the reasoning
> > behind it. If I'm right, there could be two devices and therefore two
> > different soft reserved ranges, with one getting an automatic cxl region
> > for its whole range and the other not, and the outcome would be the
> > first one getting its region removed and added to hmem. Maybe I'm
> > missing something obvious, but why? If there is a good reason, I think
> > it should be documented in the commit message and somewhere else.
> 
> Yeah, if I understood Dan correctly, that's exactly the intended behavior.
> 
> I'm trying to restate the "why" behind this based on Dan's earlier
> guidance. Please correct me if I'm misrepresenting it, Dan.
> 
> The policy is meant to be coarse: If all SR ranges that intersect CXL 
> windows are fully contained by committed CXL regions, then we have high 
> confidence that the platform descriptions line up and CXL owns the memory.
> 
> If any SR range that intersects a CXL window is not fully covered by
> committed regions, then we treat that as unexpected platform shenanigans.
> In that situation the intent is to give up on CXL entirely for those SR
> ranges, because partial ownership becomes ambiguous.
> 
> This is why the fallback is global and not per range. The goal is to
> leave no room for mixed configurations where some SR goes to CXL and
> some SR goes to HMEM. Any mismatch should push the platform issue back
> to the vendor to fix the description (ideally preserving the simplifying
> assumption of a 1:1 correlation between CXL Regions and SR).
> 
> Thanks for pointing this out. I will update the "why" in the next revision.

You have it right. This is mostly a policy to save debug sanity and
share the compatibility pain. You either always get everything the BIOS
put into the memory map, or you get the fully enlightened CXL world.
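
To make the shape of that all-or-nothing check concrete, here is a rough
sketch of the single containment pass (cxl_region_contains_soft_reserve(),
cxl_region_teardown_all(), walk_hmem_resources() and dax_cxl_mode are
names from the patch; the iterator and the mode values are made up for
illustration, this is not the implementation):

static void evaluate_soft_reserve_ownership(void)
{
        struct resource *res;
        bool all_contained = true;

        /* hypothetical iterator over the deferred Soft Reserved ranges */
        for_each_deferred_soft_reserve(res)
                if (!cxl_region_contains_soft_reserve(res))
                        all_contained = false;

        if (all_contained) {
                /* DROP: dax_hmem steps aside and dax_cxl binds */
                dax_cxl_mode = DAX_CXL_MODE_DROP;
                return;
        }

        /*
         * REGISTER: any mismatch means giving up on CXL for all of the
         * Soft Reserved ranges, not just the offending one.
         */
        cxl_region_teardown_all();
        dax_cxl_mode = DAX_CXL_MODE_REGISTER;
        /* ... then walk_hmem_resources(.., hmem_register_device) */
}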

When accelerator memory enters the mix, it does require an opt-in/out of
this scheme. Either the device completely opts out of this HMEM fallback
mechanism by marking the memory as Reserved (the dominant preference),
or it arranges for CXL accelerator drivers to be present at boot if they
want to interoperate with this fallback. Some folks want the fallback:
https://lpc.events/event/19/contributions/2064/

> > I also have problems understanding the concurrency when handling the
> > global dax_cxl_mode variable. It is modified inside process_defer_work(),
> > which I think can have different instances for different devices
> > executed concurrently on different cores/workers (the system_wq used is
> > not ordered). If I'm right, race conditions are likely.

It only works as a single queue of regions. One sync point to say "all
collected regions are routed into the dax_hmem or dax_cxl bucket".

> Yeah, this is something I spent some time thinking about. My rationale
> for not having it, and where I'm still unsure:
> 
> My assumption was that after wait_for_device_probe(), CXL topology 
> discovery and region commit are complete and stable.

...or more specifically, any CXL region discovery after that point is a
typical runtime dynamic discovery event that is not subject to any
deferral.

> And each deferred worker should observe the same CXL state and
> therefore compute the same final policy (either DROP or REGISTER).

The expectation is one queue, one event that takes the rwsem and
dispositions all present regions relative to the initial soft-reserve
memory map.

> Also, I was assuming that even if multiple process_defer_work()
> instances run, the operations they perform are effectively safe to
> repeat, though I'm not sure about this.

I think something is wrong if the workqueue runs more than once. It is
just a place to wait for initial device probe to complete and then fix up
all the regions (allow dax_region registration to proceed) that were
waiting for that.
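
Roughly this shape, with only one instance ever queued for the single
hmem_platform device (illustrative only, not the patch verbatim):

static void process_defer_work(struct work_struct *work)
{
        /*
         * Single sync point: let initial device probe, and with it CXL
         * enumeration and region assembly, settle before deciding.
         */
        wait_for_device_probe();

        /*
         * One pass that dispositions every deferred Soft Reserved range
         * into the dax_hmem or dax_cxl bucket (see the sketch above).
         */
        evaluate_soft_reserve_ownership();
}

/* one global hmem_platform device => at most one queued work item */
static DECLARE_WORK(hmem_defer_work, process_defer_work);

Anything that shows up after that single pass goes through the normal
runtime discovery path rather than a second run of this worker.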

> cxl_region_teardown_all(): this ultimately triggers the
> devm_release_action(... unregister_region ...) path. My expectation was
> that these devm actions are single-shot per device lifecycle, so
> repeated teardown attempts should become no-ops.

Not no-ops, right? The definition of a devm_action is that they always
fire at device_del(). There is no facility to device_del() a device
twice.
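
For reference, the generic devres semantics being relied on here
(schematic, not patch code; 'teardown_action' is a placeholder):

/* registered once, e.g. at region creation time */
devm_add_action_or_reset(dev, teardown_action, data);

/*
 * Runs the action immediately and removes it from the devres list; if
 * it is never released explicitly it fires exactly once, at
 * device_del() time.
 */
devm_release_action(dev, teardown_action, data);

/*
 * A second devm_release_action() for the same action/data is not a
 * silent no-op: the devres lookup fails and it WARNs.
 */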

> cxl_region_teardown_all() ultimately leads to cxl_decoder_detach(), 
> which takes "cxl_rwsem.region". That should serialize decoder detach and 
> region teardown.
> 
> bus_rescan_devices(&cxl_bus_type): I assumed repeated rescans during
> boot are fine, as the rescan path will simply rediscover already-present
> devices.

The rescan path likely needs some logic to give up on CXL region
autodiscovery for devices that failed their memmap compatibility check.
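
Purely as a hypothetical illustration of what that could look like in
the endpoint probe path (cxl_mem_probe() and to_cxl_memdev() are
existing names; the flag and its placement are invented):

static int cxl_mem_probe(struct device *dev)
{
        struct cxl_memdev *cxlmd = to_cxl_memdev(dev);

        /*
         * Hypothetical flag recorded when this device's Soft Reserved
         * range failed the memmap compatibility check, so a later
         * bus_rescan_devices() does not re-attempt region autodiscovery.
         */
        if (cxlmd->memmap_compat_failed)
                return -ENXIO;

        /* ... existing probe and region autodiscovery path ... */
        return 0;
}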

> walk_hmem_resources(.., hmem_register_device): in the DROP case, I
> thought running the walk multiple times is safe because devm-managed
> platform devices and memregion allocations should prevent duplicate
> lifetime issues.
> 
> So, even if multiple process_defer_work() instances execute 
> concurrently, the CXL operations involved in containment evaluation 
> (cxl_region_contains_soft_reserve()) and teardown are already guarded.
> 
> But I'm still trying to understand: is bus_rescan_devices(&cxl_bus_type)
> unsafe when invoked concurrently?

It already races today between natural bus enumeration and the
cxl_bus_rescan() call from cxl_acpi. So it needs to be ok; it is
naturally synchronized by the region's device_lock and the regions' rwsem.

> Or is the primary issue that dax_cxl_mode is a global updated from one 
> context and read from others, and should be synchronized even if the 
> computed final value will always be the same?

There is only one global hmem_platform device, so only one potential
item in this workqueue.
