[responding to the questions raised here before reviewing the patch...]
Koralahalli Channabasappa, Smita wrote:
Hi Alejandro,
On 1/23/2026 3:59 AM, Alejandro Lucero Palau wrote:
On 1/22/26 04:55, Smita Koralahalli wrote:
The current probe-time ownership check for Soft Reserved memory, based
solely on CXL window intersection, is insufficient. dax_hmem probing is
not always guaranteed to run after CXL enumeration and region assembly,
which can lead to incorrect ownership decisions before the CXL stack has
finished publishing windows and assembling committed regions.
Introduce deferred ownership handling for Soft Reserved ranges that
intersect CXL windows at probe time by scheduling deferred work from
dax_hmem and waiting for the CXL stack to complete enumeration and region
assembly before deciding ownership.
Evaluate ownership of Soft Reserved ranges based on CXL region
containment:
- If all Soft Reserved ranges are fully contained within committed CXL
regions, DROP handling of Soft Reserved ranges from dax_hmem and allow
dax_cxl to bind.
- If any Soft Reserved range is not fully claimed by a committed CXL
region, tear down all CXL regions and REGISTER the Soft Reserved
ranges with dax_hmem instead.
I was not sure I was understanding this properly, but after looking at
the code I think I do... but then I do not understand the reasoning
behind it. If I'm right, there could be two devices, and therefore
different soft reserved ranges, with one getting an automatic cxl region
for all of its range and the other without; the outcome would be the
first one getting its region removed and added to hmem. Maybe I'm
missing something obvious but, why? If there is a good reason, I think
it should be documented in the commit message and somewhere else.
Yeah, if I understood Dan correctly, that's exactly the intended behavior.
I'm trying to restate the "why" behind this based on Dan's earlier
guidance. Please correct me if I'm misrepresenting it, Dan.
The policy is meant to be coarse: if all SR ranges that intersect CXL
windows are fully contained by committed CXL regions, then we have high
confidence that the platform descriptions line up and CXL owns the
memory. If any SR range that intersects a CXL window is not fully
covered by committed regions, then we treat that as unexpected platform
shenanigans. In that situation the intent is to give up on CXL entirely
for those SR ranges, because partial ownership becomes ambiguous.
This is why the fallback is global and not per range. The goal is to
leave no room for mixed configurations: some SR to CXL, some SR to HMEM.
Any mismatch should push the platform issue back to the vendor to fix
the description (ideally preserving the simplifying assumption of a 1:1
correlation between CXL Regions and SR).
Thanks for pointing this out. I will update the why in the next revision.
You have it right. This is mostly a policy to save debug sanity and
share the compatibility pain. You either always get everything the BIOS
put into the memory map, or you get the fully enlightened CXL world.
When accelerator memory enters the mix it does require an opt-in/out of
this scheme. Either the device completely opts out of this HMEM fallback
mechanism by marking the memory as Reserved (the dominant preference),
or it arranges for CXL accelerator drivers to be present at boot if they
want to interoperate with this fallback. Some folks want the fallback:
https://lpc.events/event/19/contributions/2064/
Yeah, this is something I spent some time thinking about. My rationale
behind not having it, and where I'm still unsure:
My assumption was that after wait_for_device_probe(), CXL topology
discovery and region commit are complete and stable.
...or more specifically, any CXL region discovery after that point is a
typical runtime dynamic discovery event that is not subject to any
deferral.
And each deferred worker should observe the same CXL state and
therefore compute the same final policy (either DROP or REGISTER).
The expectation is one queue, one event that takes the rwsem and
dispositions all present regions relative to the initial soft-reserved
memory map.
Also, I was assuming that even if multiple process_defer_work()
instances run, the operations they perform are effectively safe to
repeat... though I'm not sure about this.
I think something is wrong if the workqueue runs more than once. It is
just a place to wait for initial device probe to complete and then fixup
all the regions (allow dax_region registration to proceed) that were
waiting for that.
cxl_region_teardown_all(): this ultimately triggers the
devm_release_action(... unregister_region ...) path. My expectation was
that these devm actions are single-shot per device lifecycle, so
repeated teardown attempts should become no-ops.
Not no-ops, right? The definition of a devm action is that it always
fires at device_del(). There is no facility to device_del() a device
twice.
cxl_region_teardown_all() ultimately leads to cxl_decoder_detach(),
which takes "cxl_rwsem.region". That should serialize decoder detach and
region teardown.
bus_rescan_devices(&cxl_bus_type): I assumed repeated rescans during
boot are fine, as the rescan path will simply rediscover already present
devices.
The rescan path likely needs some logic to give up on CXL region
autodiscovery for devices that failed their memmap compatibility check.
walk_hmem_resources(.., hmem_register_device): in the DROP case, I
thought running the walk multiple times is safe because devm-managed
platform devices and memregion allocations should prevent duplicate
lifetime issues.
So, even if multiple process_defer_work() instances execute
concurrently, the CXL operations involved in containment evaluation
(cxl_region_contains_soft_reserve()) and teardown are already guarded.
But I'm still trying to understand whether
bus_rescan_devices(&cxl_bus_type) is safe when invoked concurrently.
It already races today between natural bus enumeration and the
cxl_bus_rescan() call from cxl_acpi. So it needs to be ok; it is
naturally synchronized by the region's device_lock and the regions'
rwsem.
Or is the primary issue that dax_cxl_mode is a global updated from one
context and read from others, and should be synchronized even if the
computed final value will always be the same?
There is only one global hmem_platform device, so only one potential
item in this workqueue.