On 10/1/25 18:15, Tomasz Wolski wrote:
[responding to the questions raised here before reviewing the patch...]

Koralahalli Channabasappa, Smita wrote:
Hi Alejandro,

On 1/23/2026 3:59 AM, Alejandro Lucero Palau wrote:
On 1/22/26 04:55, Smita Koralahalli wrote:
The current probe-time ownership check for Soft Reserved memory, based
solely on CXL window intersection, is insufficient. dax_hmem probing is
not always guaranteed to run after CXL enumeration and region assembly,
which can lead to incorrect ownership decisions before the CXL stack has
finished publishing windows and assembling committed regions.

Introduce deferred ownership handling for Soft Reserved ranges that
intersect CXL windows at probe time by scheduling deferred work from
dax_hmem and waiting for the CXL stack to complete enumeration and region
assembly before deciding ownership.

Evaluate ownership of Soft Reserved ranges based on CXL region
containment.

      - If all Soft Reserved ranges are fully contained within committed
        CXL regions, DROP handling of Soft Reserved ranges in dax_hmem
        and allow dax_cxl to bind.

      - If any Soft Reserved range is not fully claimed by a committed
        CXL region, tear down all CXL regions and REGISTER the Soft
        Reserved ranges with dax_hmem instead.
I was not sure if I was understanding this properly, but after looking
at the code I think I do ... but then I do not understand the reason
behind it. If I'm right, there could be two devices and therefore
different soft reserved ranges, with one getting an automatic cxl region
covering all of its range and the other not, and the outcome would be
the first one getting its region removed and added to hmem as well.
Maybe I'm missing something obvious but, why? If there is a good reason,
I think it should be documented in the commit and somewhere else.
Yeah, if I understood Dan correctly, that's exactly the intended behavior.

I'm trying to restate the "why" behind this based on Dan's earlier
guidance. Please correct me if I'm misrepresenting it, Dan.

The policy is meant to be coarse: If all SR ranges that intersect CXL
windows are fully contained by committed CXL regions, then we have high
confidence that the platform descriptions line up and CXL owns the memory.

If any SR range that intersects a CXL window is not fully covered by
committed regions, then we treat that as unexpected platform shenanigans.
In that situation the intent is to give up on CXL entirely for those SR
ranges, because partial ownership becomes ambiguous.

This is why the fallback is global and not per range. The goal is to
leave no room for mixed "some SR to CXL, some SR to HMEM" configurations.
Any mismatch should push the platform issue back to the vendor to fix
the description (ideally preserving the simplifying assumption of a 1:1
correlation between CXL regions and SR ranges).
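
Concretely, the intended decision has roughly this shape. The helper
names (process_defer_work(), cxl_region_contains_soft_reserve(),
cxl_region_teardown_all(), walk_hmem_resources(), hmem_register_device,
dax_cxl_mode) come from this discussion; the iterator and the mode
constants below are placeholders, not the literal patch code:

static void process_defer_work(struct work_struct *work)
{
        struct resource *res;

        /* let CXL enumeration and region assembly settle first */
        wait_for_device_probe();

        for_each_soft_reserved(res) {   /* placeholder iterator */
                if (cxl_region_contains_soft_reserve(res))
                        continue;
                /* an uncovered SR range: give up on CXL entirely */
                dax_cxl_mode = DAX_CXL_MODE_REGISTER;
                cxl_region_teardown_all();
                walk_hmem_resources(.., hmem_register_device);
                return;
        }

        /* every SR range fully contained: let dax_cxl bind */
        dax_cxl_mode = DAX_CXL_MODE_DROP;
}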

Thanks for pointing this out. I will update the "why" in the next revision.
You have it right. This is mostly a policy to save debug sanity and
share the compatibility pain. You either always get everything the BIOS
put into the memory map, or you get the fully enlightened CXL world.

When accelerator memory enters the mix it does require an opt-in/out of
this scheme. Either the device completely opts out of this HMEM fallback
mechanism by marking the memory as Reserved (the dominant preference),
or it arranges for CXL accelerator drivers to be present at boot if they
want to interoperate with this fallback. Some folks want the fallback:
https://lpc.events/event/19/contributions/2064/

I will take a look at this presentation, but I think there could be
another option where accelerator information is obtained by the kernel
during PCI enumeration, and this functionality then uses that
information to skip the ranges allocated to accelerators. Forcing
accelerator drivers to be built into the kernel would go against what
distributions currently and widely do with initramfs. Not sure if some
current "early" stubs could be used for this, though, but the
information needs to be collected before this code does the checks.


I also have problems understanding the concurrency when handling the
global dax_cxl_mode variable. It is modified inside process_defer_work(),
which I think can have different instances for different devices
executing concurrently on different cores/workers (the system_wq used is
not ordered). If I'm right, race conditions are likely.
It only works as a single queue of regions. One sync point to say "all
collected regions are routed into the dax_hmem or dax_cxl bucket".

That is how I think it should work: handling all the soft reserved
ranges vs. regions in one code execution. But that is not the case. More
on that later.


Yeah, this is something I spent some time thinking about. My rationale
for not having it, and where I'm still unsure:

My assumption was that after wait_for_device_probe(), CXL topology
discovery and region commit are complete and stable.
...or more specifically, any CXL region discovery after that point is a
typical runtime dynamic discovery event that is not subject to any
deferral.

And each deferred worker should observe the same CXL state and
therefore compute the same final policy (either DROP or REGISTER).
The expectation is one queue, one event that takes the rwsem and
dispositions all present regions relative to the initial soft-reserved
memory map.

Also, I was assuming that even if multiple process_defer_work()
instances run, the operations they perform are effectively safe to
repeat... though I'm not sure about this.
I think something is wrong if the workqueue runs more than once. It is
just a place to wait for initial device probe to complete and then fixup
all the regions (allow dax_region registration to proceed) that were
waiting for that.

cxl_region_teardown_all(): this ultimately triggers the
devm_release_action(... unregister_region ...) path. My expectation was
that these devm actions are single-shot per device lifecycle, so
repeated teardown attempts should become no-ops.
Not no-ops, right? The definition of a devm_action is that it always
fires at device_del(). There is no facility to device_del() a device
twice.
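
For reference, the devm action lifecycle with the unregister_region
example from above (generic devm API, illustrative usage only):

/* queue the action: it fires once, at driver unbind / device_del() */
devm_add_action_or_reset(dev, unregister_region, cxlr);

/*
 * ... or fire it early: this runs unregister_region(cxlr) now and
 * removes it from the devres list, so it cannot fire again later.
 */
devm_release_action(dev, unregister_region, cxlr);

/*
 * Releasing the same action a second time is not a no-op: the entry
 * is already gone and the failed lookup WARNs.
 */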

cxl_region_teardown_all() ultimately leads to cxl_decoder_detach(),
which takes "cxl_rwsem.region". That should serialize decoder detach and
region teardown.
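
i.e., the serialization point presumably has roughly this shape (a
sketch of the locking, assuming the rwsem guard idiom; not the literal
function):

static int cxl_decoder_detach(struct cxl_region *cxlr, ...)
{
        /* detach and teardown paths funnel through this write lock */
        guard(rwsem_write)(&cxl_rwsem.region);
        ...
}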

bus_rescan_devices(&cxl_bus_type): I assumed repeated rescans during
boot are fine, as the rescan path will simply rediscover already-present
devices...
The rescan path likely needs some logic to give up on CXL region
autodiscovery for devices that failed their memmap compatibility check.

walk_hmem_resources(.., hmem_register_device): in the DROP case, I
thought running the walk multiple times is safe because devm-managed
platform devices and memregion allocations should prevent duplicate
lifetime issues.

So, even if multiple process_defer_work() instances execute
concurrently, the CXL operations involved in containment evaluation
(cxl_region_contains_soft_reserve()) and teardown are already guarded.

But I'm still trying to understand: is bus_rescan_devices(&cxl_bus_type)
not safe when invoked concurrently?
It already races today between natural bus enumeration and the
cxl_bus_rescan() call from cxl_acpi. So it needs to be OK; it is
naturally synchronized by the region's device_lock and the regions'
rwsem.

Or is the primary issue that dax_cxl_mode is a global updated from one
context and read from others, and should be synchronized even if the
computed final value will always be the same?
There is only one global hmem_platform device, so only one potential
item in this workqueue.

Well, I do not think so.


hmem_init() in drivers/dax/hmem/device.c walks IORESOURCE_MEM looking
for IORES_DESC_SOFT_RESERVED descriptors and calls hmem_register_one()
for each of them. That leads to creating an hmem_platform platform
device (no typo, just emphasizing this is a platform device with the
name hmem_platform), so there would be as many hmem_platform devices as
descriptors found.
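
For reference, that walk looks roughly like this upstream (paraphrasing
from memory, so double-check against the tree):

static __init int hmem_register_one(struct resource *res, void *data)
{
        hmem_register_resource(phys_to_target_node(res->start), res);
        return 0;
}

static __init int hmem_init(void)
{
        walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
                        IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
        return 0;
}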


Then each hmem_platform probe() will create an hmem platform device, and
a work will be scheduled with this specific hmem platform device as its
argument. So each work will check only the specific ranges of its own
pdev and not all of them. The check can result in a different value
being assigned to dax_cxl_mode, leading to potential race conditions
between concurrent workers and also potentially leaving soft reserved
ranges without either a dax or an hmem device.
Hi Alejandro,

Isn't the below check in __hmem_register_resource() ensuring that only
one hmem_platform device can be created? The first resource would create
the platform device and set the platform_initialized bool to true:

static void __hmem_register_resource(int target_nid, struct resource *res)
..
        if (platform_initialized)
                return;
..


Oops. Silly me. So focused on platform_device_alloc() ...


So, the concurrency problem when updating dax_cxl_mode does not exist,
as all potential concurrent workers use the single hmem_platform device.
But I think (or maybe I'm having a really bad day) the hmem devices are
indeed created based on those soft reserved ranges when walking
hmem_active ... the work is scheduled as many times as there are hmem
devices, and maybe there could be a problem with concurrent work items
invoking cxl_region_teardown_all().
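
To illustrate that remaining concern (hypothetical shapes, not the patch
code), compare one work item per hmem device against a single global one:

/* per-device: each probe schedules its own item on system_wq, and
 * nothing orders the resulting workers relative to each other */
INIT_WORK(&data->work, process_defer_work);
schedule_work(&data->work);

/* single global item: schedule_work() while it is still pending is a
 * no-op, so concurrent scheduling attempts coalesce into one run */
static DECLARE_WORK(defer_work, process_defer_work);
schedule_work(&defer_work);

With the per-device shape, cxl_region_teardown_all() can indeed end up
being invoked from several workers at once.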


Anyways, thank you for pointing out what I could not see.


Thanks,
Tomasz

