On 2017/9/6 10:12, Jerome Glisse wrote:
> On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
>> On 2017/9/6 2:54, Ross Zwisler wrote:
>>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>>>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>>>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>>>>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>>>>>>>> Unlike unaddressable memory, coherent device memory has a real
>>>>>>>> resource associated with it on the system (as the CPU can
>>>>>>>> address it). Add a new helper to hotplug such memory within the
>>>>>>>> HMM framework.
>>>>>>>
>>>>>>> Got a new question: coherent device (e.g. CCIX) memory is likely
>>>>>>> to be reported to the OS through ACPI and recognized as a NUMA
>>>>>>> memory node. Then how can that memory be captured and managed by
>>>>>>> the HMM framework?
>>>>>>
>>>>>> The only platform that has such memory today is powerpc, and there
>>>>>> it is not reported as regular memory by the firmware, hence why
>>>>>> they need this helper.
>>>>>>
>>>>>> I don't think anyone has defined anything yet for x86 and ACPI. As
>>>>>> this is
>>>>>
>>>>> Not yet, but the ACPI spec now has the Heterogeneous Memory
>>>>> Attribute Table (HMAT) defined in ACPI 6.2.
>>>>> The HMAT can cover CPU-addressable memory types (though not
>>>>> non-cache-coherent on-device memory).
>>>>>
>>>>> Ross from Intel has already done some work on this, see:
>>>>> https://lwn.net/Articles/724562/
>>>>>
>>>>> arm64 supports ACPI as well, and there will likely be more devices
>>>>> of this kind once CCIX is out (which should be very soon if it
>>>>> stays on schedule).
>>>>
>>>> HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy"
>>>> memory, ie when you have several kinds of memory, each with
>>>> different characteristics:
>>>>  - HBM: very fast (latency) and high bandwidth, non-persistent,
>>>>    somewhat small (ie a few gigabytes)
>>>>  - Persistent memory: slower (both latency and bandwidth), big
>>>>    (terabytes)
>>>>  - DDR (good old memory): characteristics are between HBM and
>>>>    persistent memory
>>>>
>>>> So AFAICT this has nothing to do with what HMM is for, ie device
>>>> memory. Note that device memory can have a memory hierarchy of its
>>>> own (HBM, GDDR and maybe even persistent memory).
>>>>
>>>>>> memory on a PCIE-like interface, I don't expect it to be reported
>>>>>> as a NUMA memory node but as an io range like any regular PCIE
>>>>>> resource. The device driver, through capability flags, would then
>>>>>> figure out if the link between the device and the CPU is CCIX
>>>>>> capable; if so, it can use this helper to hotplug it as device
>>>>>> memory.
>>>>>
>>>>> From my point of view, cache-coherent device memory will be popular
>>>>> soon and will be reported through ACPI/UEFI. Extending NUMA policy
>>>>> still sounds more reasonable to me.
>>>>
>>>> Cache-coherent devices will be reported through the standard
>>>> mechanisms defined by the bus standard they are using. To my
>>>> knowledge all the standards are either on top of PCIE or similar to
>>>> PCIE.
>>>>
>>>> It is true that on many platforms the PCIE resource is
>>>> managed/initialized by the BIOS (UEFI), but that is platform
>>>> specific. In some cases we reprogram what the BIOS picked.
>>>>
>>>> So like I was saying, I don't expect the BIOS/UEFI to report device
>>>> memory as regular memory. It will be reported as a regular PCIE
>>>> resource, and then the device driver will be able to determine
>>>> through some flags whether the link between the CPU(s) and the
>>>> device is cache coherent or not. At that point the device driver can
>>>> register it with the HMM helper.
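To check my understanding of that flow, a minimal sketch of such a probe
path, assuming the hmm_devmem_add_resource() helper added by this patch.
FOO_MEM_BAR, foo_link_is_cache_coherent() and foo_devmem_ops are made-up
placeholders for the device's BAR, its capability-flag check and its
struct hmm_devmem_ops callbacks; error handling is trimmed:

#include <linux/hmm.h>
#include <linux/pci.h>

static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct resource *res = &pdev->resource[FOO_MEM_BAR];
	struct hmm_devmem *devmem;

	/* Capability flags tell us whether the link is cache coherent. */
	if (!foo_link_is_cache_coherent(pdev))
		return -ENODEV;

	/*
	 * Hotplug the CPU-addressable device memory through the HMM
	 * framework instead of onlining it as regular system memory.
	 */
	devmem = hmm_devmem_add_resource(&foo_devmem_ops, &pdev->dev, res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	return 0;
}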
>>>> The whole NUMA discussion has happened several times in the past; I
>>>> suggest looking through the mm list archives for it. But it was
>>>> ruled out for several reasons. Off the top of my head:
>>>>  - people hate CPU-less nodes, and device memory is inherently
>>>>    CPU-less
>>>
>>> With the introduction of the HMAT in ACPI 6.2, one of the things that
>>> was added was the ability to have an ACPI proximity domain that isn't
>>> associated with a CPU. This can be seen in the changes in the text of
>>> the "Proximity Domain" field in table 5-73, which describes the
>>> "Memory Affinity Structure". One of the major features of the HMAT
>>> was the separation of "initiator" proximity domains (CPUs, devices
>>> that initiate memory transfers) and "target" proximity domains
>>> (memory regions, be they attached to a CPU or some other device).
>>>
>>> ACPI proximity domains map directly to Linux NUMA nodes, so I think
>>> we're already in a place where we have to support CPU-less NUMA
>>> nodes.
>>>
>>>>  - device drivers want total control over memory, and thus to be
>>>>    isolated from mm mechanisms, and doing all those special cases
>>>>    was not welcome
>>>
>>> I agree that the kernel doesn't have enough information to be able to
>>> accurately handle all the use cases for the various types of
>>> heterogeneous memory. The goal of my HMAT enabling is to allow that
>>> memory to be reserved from kernel use via the "Reservation Hint" in
>>> the HMAT's Memory Subsystem Address Range Structure, then provide
>>> userspace with enough information to be able to distinguish between
>>> the various types of memory in the system so it can allocate &
>>> utilize it appropriately.
>>
>> Does this mean it requires a userspace memory management library to
>> handle all the alloc/free/defragment work? But then how would the
>> virtual <-> physical address mapping be done from userspace?
>
> For HMM, each process gives hints (somewhat similar to mbind) for
> ranges of virtual addresses to the device kernel driver (through some
> API like OpenCL or CUDA for GPUs, for instance), all of this being
> device-driver-specific ioctls.
>
> The kernel device driver has an overall view of all the processes that
> use the device and of the memory advice each of them gave. From that
> information the kernel device driver decides what parts of each process
> address space to migrate to device memory.

Oh, I mean CDM-HMM. I'm fine with HMM.
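To make sure I follow the hint model, would it look roughly like the
sketch below? Every name here is hypothetical, only to illustrate the
kind of driver-specific uapi you describe; it is not an existing
interface:

#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical advice values, loosely mirroring mbind() policies. */
#define FOO_MADV_PREFER_DEVICE	1	/* candidate for device memory */
#define FOO_MADV_PREFER_SYSTEM	2	/* keep in regular system memory */

/* One hint covers a range of the caller's virtual address space. */
struct foo_mem_hint {
	__u64 addr;	/* start of the virtual address range */
	__u64 len;	/* length of the range in bytes */
	__u32 advice;	/* one of FOO_MADV_* */
	__u32 reserved;	/* padding, must be zero */
};

#define FOO_IOC_MEM_HINT	_IOW('F', 0x01, struct foo_mem_hint)

The driver would then record these hints per process and, with its
global view of everyone using the device, decide which ranges to
actually migrate.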
> This is obviously dynamic and likely to change over the process
> lifetime.
>
> My understanding is that HMAT wants a similar API to allow a process to
> give direction on where each range of virtual addresses should be
> allocated. It is expected that most

Right, but it is not clear who should manage the physical memory
allocation and set up the page table mappings. A new driver or the
kernel?

> software can easily infer what parts of its address space will need
> more bandwidth or smaller latency versus what parts are sparsely
> accessed ...
>
> For HMAT I think the first targets are HBM and persistent memory, and
> device memory might be added later if that makes sense.

Okay, so there are two potential ways to handle CPU-addressable
cache-coherent device memory (or CPU-less NUMA memory, or "target
domain" memory in the ACPI spec)?
 1. CDM-HMM
 2. HMAT

--
Regards,
Bob Liu