>> [responding to the questions raised here before reviewing the patch...]
>>
>> Koralahalli Channabasappa, Smita wrote:
>>> Hi Alejandro,
>>>
>>> On 1/23/2026 3:59 AM, Alejandro Lucero Palau wrote:
>>>> On 1/22/26 04:55, Smita Koralahalli wrote:
>>>>> The current probe-time ownership check for Soft Reserved memory, based
>>>>> solely on CXL window intersection, is insufficient. dax_hmem probing is
>>>>> not always guaranteed to run after CXL enumeration and region assembly,
>>>>> which can lead to incorrect ownership decisions before the CXL stack has
>>>>> finished publishing windows and assembling committed regions.
>>>>>
>>>>> Introduce deferred ownership handling for Soft Reserved ranges that
>>>>> intersect CXL windows at probe time by scheduling deferred work from
>>>>> dax_hmem and waiting for the CXL stack to complete enumeration and region
>>>>> assembly before deciding ownership.
>>>>>
>>>>> Evaluate ownership of Soft Reserved ranges based on CXL region
>>>>> containment.
>>>>>
>>>>> - If all Soft Reserved ranges are fully contained within committed
>>>>>   CXL regions, DROP handling of Soft Reserved ranges from dax_hmem
>>>>>   and allow dax_cxl to bind.
>>>>>
>>>>> - If any Soft Reserved range is not fully claimed by a committed CXL
>>>>>   region, tear down all CXL regions and REGISTER the Soft Reserved
>>>>>   ranges with dax_hmem instead.
>>>>
>>>> I was not sure if I was understanding this properly, but after looking
>>>> at the code I think I do ... but then I do not understand the reason
>>>> behind it. If I'm right, there could be two devices, and therefore
>>>> different soft reserved ranges, with one getting an automatic cxl region
>>>> covering its whole range and the other not, and the outcome would be the
>>>> first one getting its region removed and added to hmem. Maybe I'm missing
>>>> something obvious but, why? If there is a good reason, I think it should
>>>> be documented in the commit message and somewhere else.
>>> Yeah, if I understood Dan correctly, that's exactly the intended behavior.
>>>
>>> I'm trying to restate the "why" behind this based on Dan's earlier
>>> guidance. Please correct me if I'm misrepresenting it, Dan.
>>>
>>> The policy is meant to be coarse: If all SR ranges that intersect CXL
>>> windows are fully contained by committed CXL regions, then we have high
>>> confidence that the platform descriptions line up and CXL owns the memory.
>>>
>>> If any SR range that intersects a CXL window is not fully covered by
>>> committed regions then we treat that as unexpected platform shenanigans.
>>> In that situation the intent is to give up on CXL entirely for those SR
>>> ranges because partial ownership becomes ambiguous.
>>>
>>> This is why the fallback is global and not per range. The goal is to
>>> leave no room for mixed configurations (some SR to CXL, some SR to HMEM).
>>> Any mismatch should push the platform issue back to the vendor to fix
>>> the description (ideally preserving the simplifying assumption of a 1:1
>>> correlation between CXL Regions and SR).
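>>>
>>> To make the policy concrete, the deferred work ends up making a
>>> decision of roughly this shape (an illustrative sketch, not the patch
>>> verbatim; for_each_soft_reserved_range() and the DAX_CXL_MODE_* names
>>> are stand-ins):
>>>
>>> static void evaluate_soft_reserve_ownership(void)
>>> {
>>> 	struct resource *res;
>>> 	bool all_contained = true;
>>>
>>> 	/* every SR range intersecting a CXL window must be fully
>>> 	 * contained by a committed region for CXL to keep it */
>>> 	for_each_soft_reserved_range(res)
>>> 		if (!cxl_region_contains_soft_reserve(res))
>>> 			all_contained = false;
>>>
>>> 	if (all_contained) {
>>> 		/* DROP: dax_hmem stands down and dax_cxl binds */
>>> 		dax_cxl_mode = DAX_CXL_MODE_DROP;
>>> 	} else {
>>> 		/* REGISTER: any mismatch gives up on CXL entirely */
>>> 		cxl_region_teardown_all();
>>> 		dax_cxl_mode = DAX_CXL_MODE_REGISTER;
>>> 	}
>>> }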
>>>
>>> Thanks for pointing this out. I will update the why in the next revision.
>> You have it right. This is mostly a policy to save debug sanity and
>> share the compatibility pain. You either always get everything the BIOS
>> put into the memory map, or you get the fully enlightened CXL world.
>>
>> When accelerator memory enters the mix it does require an opt-in/out of
>> this scheme. Either the device completely opts out of this HMEM fallback
>> mechanism by marking the memory as Reserved (the dominant preference),
>> or it arranges for CXL accelerator drivers to be present at boot if they
>> want to interoperate with this fallback. Some folks want the fallback:
>> https://lpc.events/event/19/contributions/2064/
>
>
>I will take a look at this presentation, but I think there could be
>another option where accelerator information is obtained during pci
>enumeration by the kernel, and this functionality then uses that
>information to skip the ranges allocated to them. Forcing accelerator
>drivers to be compiled into the kernel would go against what
>distributions currently and widely do with initramfs. Not sure if some
>current "early" stubs could be used for this, though, but the
>information needs to be collected before this code does the checks.
>
>
>>>> I also have problems understanding the concurrency when handling the
>>>> global dax_cxl_mode variable. It is modified inside process_defer_work(),
>>>> which I think can have different instances for different devices
>>>> executed concurrently on different cores/workers (the system_wq used is
>>>> not ordered). If I'm right, race conditions are likely.
>> It only works as a single queue of regions. One sync point to say "all
>> collected regions are routed into the dax_hmem or dax_cxl bucket".
>
>
>That is how I think it should work, handling all the soft reserved
>ranges versus regions in one code execution. But that is not the case.
>More on that below.
>
>
>>> Yeah, this is something I spent some time thinking about. My rationale
>>> behind not having it, and where I'm still unsure:
>>>
>>> My assumption was that after wait_for_device_probe(), CXL topology
>>> discovery and region commit are complete and stable.
>> ...or more specifically, any CXL region discovery after that point is a
>> typical runtime dynamic discovery event that is not subject to any
>> deferral.
>>
>>> And each deferred worker should observe the same CXL state and
>>> therefore compute the same final policy (either DROP or REGISTER).
>> The expectation is one queue, one event that takes the rwsem and
>> dispositions all present regions relative to the initial soft-reserve
>> memory map.
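>>
>> Something of this shape (a sketch; the single global DECLARE_WORK()
>> item is the point, evaluate_soft_reserve_ownership() stands in for the
>> containment evaluation above):
>>
>> static void process_defer_work(struct work_struct *work)
>> {
>> 	/* wait for initial (boot time) device probing to settle... */
>> 	wait_for_device_probe();
>>
>> 	/* ...then route all collected regions into the dax_hmem or
>> 	 * dax_cxl bucket in one pass */
>> 	evaluate_soft_reserve_ownership();
>> }
>> static DECLARE_WORK(defer_work, process_defer_work);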
>>
>>> Also, I was assuming that even if multiple process_defer_work()
>>> instances run, the operations they perform are effectively safe to
>>> repeat... though I'm not sure about this.
>> I think something is wrong if the workqueue runs more than once. It is
>> just a place to wait for initial device probe to complete and then fixup
>> all the regions (allow dax_region registration to proceed) that were
>> waiting for that.
>>
>>> cxl_region_teardown_all(): this ultimately triggers the
>>> devm_release_action(... unregister_region ...) path. My expectation was
>>> that these devm actions are single-shot per device lifecycle, so
>>> repeated teardown attempts should become no-ops.
>> Not no-ops, right? The definition of a devm action is that it always
>> fires at device_del(). There is no facility to device_del() a device
>> twice.
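>>
>> To illustrate (a sketch): devm_release_action() both fires and removes
>> the action, so a second teardown attempt has nothing left to release
>> (and WARNs rather than nooping):
>>
>> 	/* fires at device_del() unless released early */
>> 	rc = devm_add_action_or_reset(dev, unregister_region, cxlr);
>> 	if (rc)
>> 		return rc;
>> 	...
>> 	/* early teardown: runs unregister_region() now and drops it
>> 	 * from the devres list, so it cannot fire a second time */
>> 	devm_release_action(dev, unregister_region, cxlr);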
>>
>>> cxl_region_teardown_all() ultimately leads to cxl_decoder_detach(),
>>> which takes "cxl_rwsem.region". That should serialize decoder detach and
>>> region teardown.
>>>
>>> bus_rescan_devices(&cxl_bus_type): I assumed repeated rescans during
>>> boot are fine, as the rescan path will simply rediscover already
>>> present devices.
>> The rescan path likely needs some logic to give up on CXL region
>> autodiscovery for devices that failed their memmap compatibility check.
>>
>>> walk_hmem_resources(.., hmem_register_device): in the DROP case, I
>>> thought running the walk multiple times is safe because devm-managed
>>> platform devices and memregion allocations should prevent duplicate
>>> lifetime issues.
>>>
>>> So, even if multiple process_defer_work() instances execute
>>> concurrently, the CXL operations involved in containment evaluation
>>> (cxl_region_contains_soft_reserve()) and teardown are already guarded.
>>>
>>> But I'm still trying to understand whether bus_rescan_devices(&cxl_bus_type)
>>> is safe when invoked concurrently?
>> It already races today between natural bus enumeration and the
>> cxl_bus_rescan() call from cxl_acpi. So it needs to be ok; it is
>> naturally synchronized by the region's device_lock and the regions'
>> rwsem.
>>
>>> Or is the primary issue that dax_cxl_mode is a global updated from one
>>> context and read from others, and should be synchronized even if the
>>> computed final value will always be the same?
>> There is only one global hmem_platform device, so only one potential
>> item in this workqueue.
>
>
>Well, I do not think so.
>
>
>hmem_init() in drivers/dax/hmem/device.c walks IORESOURCE_MEM looking
>for IORES_DESC_SOFT_RESERVED descriptors and calls hmem_register_one()
>for each of them. That leads to creating an hmem_platform platform
>device (no typo, just emphasizing this is a platform device with the
>name hmem_platform), so there will be as many hmem_platform devices as
>descriptors found.
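>
>For reference, the walk is roughly this (quoting from memory, not
>verbatim):
>
>static __init int hmem_init(void)
>{
>	walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
>			IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
>	return 0;
>}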
>
>
>Then each hmem_platform probe() will create an hmem platform device,
>where a work will be scheduled passing this specific hmem platform
>device as argument. So each work will check the specific ranges of its
>own pdev and not all of them. The check can result in a different value
>assigned to dax_cxl_mode, leading to potential race conditions between
>concurrent workers and also potentially leaving soft reserved ranges
>without either a dax or an hmem device.
Hi Alejandro,

Isn't the check below in __hmem_register_resource() ensuring that only
one hmem_platform device can be created? The first resource would create
the platform device and set the platform_initialized bool to true:
static void __hmem_register_resource(int target_nid, struct resource *res)
{
	...
	if (platform_initialized)
		return;
	...
}
Thanks,
Tomasz