Hi folks:

Seeking community input on an initial design for Intel Resource Director
Technology (RDT), in particular Cache Allocation Technology (CAT), in OpenStack
Nova, to protect workloads from co-resident noisy neighbors and ensure quality
of service (QoS).

1. What is Cache Allocation Technology (CAT)?
Intel's Resource Director Technology (RDT) [1] is an umbrella of hardware
support to facilitate the monitoring and reservation of shared resources such
as cache, memory and network bandwidth towards obtaining Quality of Service.
RDT enables fine-grained control of resources, which is particularly valuable
in cloud environments to meet Service Level Agreements while increasing
resource utilization through sharing. CAT is the part of RDT that concerns
itself with reserving a portion of the last level cache for a process or group
of processes, with further fine-grained control over how much is used for code
versus data. Consider a single processor composed of 4 cores and its cache
hierarchy: the L1 cache is split into instruction and data caches, and the L2
cache is next in speed to L1. The L1 and L2 caches are per core. The Last Level
Cache (LLC) is shared among all cores. With CAT on currently available
hardware, the LLC can be partitioned on a per-process (virtual machine,
container, or normal application) or per-process-group basis.


Libvirt and OpenStack [2] already support monitoring cache occupancy (CMT),
memory bandwidth usage local to a processor socket (MBM_local), and total
memory bandwidth usage across all processor sockets (MBM_total) for a process
or process group.


2. How CAT works
To learn more about CAT please refer to the Intel Software Developer's Manual,
volume 3B, sections 17.16 and 17.17 [3]. Linux kernel support for CAT is
expected in release 4.10 and is documented at [4].
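
To give a feel for the granularity involved: CAT allocations are expressed as
cache bit masks (CBMs), where each set bit reserves one slice ("way") of the
LLC. A small arithmetic sketch, using the figures from the capabilities example
in section 3.1 below (56320 KiB per bank with a 2816 KiB minimum allocation,
i.e. 20 ways); real way counts vary by CPU:

# Illustrative arithmetic only: how a cache bit mask (CBM) maps to an LLC share.
# Figures come from the example capabilities in section 3.1; hardware varies.

LLC_SIZE_KIB = 56320
NUM_WAYS = LLC_SIZE_KIB // 2816   # 20 ways in this example

def cbm_size_kib(cbm: int) -> int:
    """Cache reserved by a CBM: one LLC way per set bit."""
    return bin(cbm).count("1") * LLC_SIZE_KIB // NUM_WAYS

print(cbm_size_kib(0b1))     # 2816 KiB  (one way, the minimum allocation)
print(cbm_size_kib(0b1111))  # 11264 KiB (four contiguous ways)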


3. Libvirt Interface

Libvirt support for CAT is underway, with the patch currently at revision 7.

Interface changes in libvirt:

3.1 The capabilities XML has been extended to reveal cache information

<cache>
     <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
       <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
     </bank>
     <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
       <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
     </bank>
</cache>

The new `cache` XML element shows that the host has two banks of type L3, or
Last Level Cache (LLC), one per processor socket. The cache type is l3, its
size is 56320 KiB, and the cpus attribute indicates the physical CPUs
associated with each bank, here '0-21,44-65' and '22-43,66-87' respectively.

The control element shows that the bank is controlled at scope L3, with a
minimum possible allocation of 2816 KiB and 2816 KiB that needs to remain
reserved.

If the host has CDP (Code and Data Prioritization) enabled, the L3 cache is
divided into code (L3CODE) and data (L3DATA).

The control element will then be extended to:
...
 <control min='2816' reserved='2816' unit='KiB' scope='L3CODE'/>
 <control min='2816' reserved='2816' unit='KiB' scope='L3DATA'/>
...

The L3CODE and L3DATA scopes show that cache can be allocated for code and data
usage respectively; they share the same underlying amount of L3 cache.
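
As an illustration of how a management layer such as Nova could consume this
information, here is a minimal parsing sketch against the example XML above
(the element and attribute names come from the capabilities snippet; everything
else is just for the example):

# Minimal sketch: parse the <cache> section of the capabilities XML shown above.
import xml.etree.ElementTree as ET

CAPS_CACHE_XML = """
<cache>
  <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
    <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
  </bank>
  <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
    <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
  </bank>
</cache>
"""

def parse_cache_banks(xml_text):
    banks = []
    for bank in ET.fromstring(xml_text).findall('bank'):
        banks.append({'id': int(bank.get('id')),
                      'type': bank.get('type'),
                      'size_kib': int(bank.get('size')),
                      'cpus': bank.get('cpus'),
                      'controls': [c.attrib for c in bank.findall('control')]})
    return banks

for b in parse_cache_banks(CAPS_CACHE_XML):
    print(b['id'], b['size_kib'], b['cpus'], [c['scope'] for c in b['controls']])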

3.2 The domain XML has been extended to include a new cachetune element

<cputune>
   <vcpupin vcpu='0' cpuset='0'/>
   <vcpupin vcpu='1' cpuset='1'/>
   <vcpupin vcpu='2' cpuset='22'/>
   <vcpupin vcpu='3' cpuset='23'/>
   <cachetune id='0' host_id='0' type='l3' size='2816' unit='KiB' vcpus='0,1'/>
   <cachetune id='1' host_id='1' type='l3' size='2816' unit='KiB' vcpus='2,3'/>
...
</cputune>

This means the guest will have vcpus 0 and 1 running on the host's socket 0,
with 2816 KiB of cache exclusively allocated to them, and vcpus 2 and 3 running
on the host's socket 1, with 2816 KiB of cache exclusively allocated to them.

Here we need to make sure vcpus 0 and 1 are pinned to pCPUs of socket 0, per
the capabilities entry
 <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
and that vcpus 2 and 3 are pinned to pCPUs of socket 1, per
 <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>.

3.3 Libvirt workflow for CAT

  1.  Create the qemu process and get its PID(s).
  2.  Define a new resource control domain, also known as a Class of Service
(CLOS), under /sys/fs/resctrl and set the desired Cache Bit Mask (CBM) derived
from the cachetune settings in the libvirt domain XML, in addition to updating
the default schemata of the host (a rough sketch of these operations follows
below).
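
To make step 2 a bit more concrete, here is a rough sketch of the resctrl
operations involved (not code from the patch; it assumes resctrl is already
mounted at /sys/fs/resctrl, requires root, omits updating the default
schemata, and the group name, masks and PID are placeholders):

# Rough sketch of the resctrl side of step 2 (placeholder values, run as root).
import os

RESCTRL = '/sys/fs/resctrl'

def create_clos(group, cbm_per_bank, pids):
    """Create a resource control group (CLOS), set its L3 schemata, attach PIDs."""
    path = os.path.join(RESCTRL, group)
    os.makedirs(path, exist_ok=True)

    # e.g. {0: '3', 1: 'f'} -> "L3:0=3;1=f"  (hex cache bit masks per cache id)
    schemata = 'L3:' + ';'.join('%s=%s' % (b, m)
                                for b, m in sorted(cbm_per_bank.items()))
    with open(os.path.join(path, 'schemata'), 'w') as f:
        f.write(schemata + '\n')

    # Attach the qemu PID(s) so the allocation applies to the guest, one per write.
    for pid in pids:
        with open(os.path.join(path, 'tasks'), 'w') as f:
            f.write(str(pid))

# create_clos('instance-0001', {0: '3', 1: 'f'}, [12345])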

4. Proposed Nova Changes

  1.  Get host cache capabilities from libvirt and extend the compute node's
fields accordingly.
  2.  Add a new scheduler filter and weigher to help place the requested guest
on a suitable host (a rough sketch of such a filter follows at the end of this
section).
  3.  Extend the flavor's (and image metadata) extra spec fields:

We need to specify NUMA settings for NUMA hosts if we want to enable CAT; see
[5] to learn more about NUMA.
In the flavor, we can have:

vcpus=8
mem=4
hw:numa_nodes=2         // number of NUMA nodes to expose to the guest
hw:numa_cpus.0=0,1,2,3,4,5
hw:numa_cpus.1=6,7
hw:numa_mem.0=3072
hw:numa_mem.1=1024
//  newly added in this proposal
hw:cache_banks=2        // number of cache banks to allocate to the guest (can
be less than the number of NUMA nodes)
hw:cache_type.0=l3      // cache bank type: l3, or l3_c+d (l3 code + l3 data)
hw:cache_type.1=l3_c+d  // cache bank type: l3, or l3_c+d (l3 code + l3 data)
hw:cache_vcpus.0=0,1    // vcpu list on each cache bank; can be empty
hw:cache_vcpus.1=6,7
hw:cache_l3.0=2816      // cache size in KiB
hw:cache_l3_code.1=2816
hw:cache_l3_data.1=2816

Here the user can see clearly which vCPUs will benefit from cache allocation.
A cache bank works together with a NUMA cell: the cache is allocated on a
physical CPU socket, but the cache bank itself is a logical concept. Each cache
bank allocates cache for a list of vCPUs, and the vCPUs in that list should be
grouped on the same NUMA cell.

In addition, Nova will generate the corresponding <cachetune> elements in the
libvirt domain XML; see 3.2 for details.

This will allocate 2 cache banks from the host's cache banks and associate
vcpus with them.
In the example, the guest will have vcpus 0 and 1 running on socket 0 of the
host with 2816 KiB of cache for exclusive use, and vcpus 6 and 7 running on
socket 1 of the host with 2816 KiB of L3 code cache and 2816 KiB of L3 data
cache allocated.
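
To make the mapping from the extra specs to the <cachetune> elements of 3.2
concrete, an illustrative sketch (the helper name and intermediate data layout
are assumptions for this email, not proposed Nova code):

# Illustrative only: turn the proposed hw:cache_* extra specs into the per-bank
# requests a <cachetune> element would carry (see 3.2).

extra_specs = {
    'hw:cache_banks': '2',
    'hw:cache_type.0': 'l3',     'hw:cache_vcpus.0': '0,1',
    'hw:cache_l3.0': '2816',
    'hw:cache_type.1': 'l3_c+d', 'hw:cache_vcpus.1': '6,7',
    'hw:cache_l3_code.1': '2816', 'hw:cache_l3_data.1': '2816',
}

def cachetune_requests(specs):
    requests = []
    for bank in range(int(specs.get('hw:cache_banks', '0'))):
        req = {'id': bank,
               'type': specs['hw:cache_type.%d' % bank],
               'vcpus': specs.get('hw:cache_vcpus.%d' % bank, '')}
        if req['type'] == 'l3':
            req['size_kib'] = int(specs['hw:cache_l3.%d' % bank])
        else:  # l3_c+d: separate code and data allocations
            req['code_kib'] = int(specs['hw:cache_l3_code.%d' % bank])
            req['data_kib'] = int(specs['hw:cache_l3_data.%d' % bank])
        requests.append(req)
    return requests

print(cachetune_requests(extra_specs))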

If a NUMA cell were to contain multiple CPU sockets (this is rare), we would
adjust the NUMA vCPU placement policy to ensure that vCPUs and the cache
allocated to them are co-located on the same socket.


  *   We can define fewer cache banks than NUMA cells on a host with multiple
NUMA cells.
  *   No cache_vcpus parameter needs to be specified if no reservation is
desired.

NOTE: the cache allocation for a guest is in isolated/exclusive mode.
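
Finally, the rough filter sketch promised in item 2 above. This is only an
illustration of the idea: the free_cache_kib field on the host state is
hypothetical and would depend on how the compute node fields are extended in
item 1, and the weigher is omitted entirely.

# Very rough sketch of a possible scheduler filter (item 2 above). The
# 'free_cache_kib' host state field is hypothetical, not an existing Nova field.
from nova.scheduler import filters


class CacheBankFilter(filters.BaseHostFilter):
    """Reject hosts that cannot satisfy the guest's hw:cache_* reservation."""

    def host_passes(self, host_state, spec_obj):
        specs = spec_obj.flavor.extra_specs
        banks = int(specs.get('hw:cache_banks', 0))
        if banks == 0:
            return True   # no cache reservation requested

        requested_kib = 0
        for bank in range(banks):
            for key in ('hw:cache_l3.%d' % bank,
                        'hw:cache_l3_code.%d' % bank,
                        'hw:cache_l3_data.%d' % bank):
                requested_kib += int(specs.get(key, 0))

        # Hypothetical field reported by the compute node (see item 1 above).
        free_kib = getattr(host_state, 'free_cache_kib', 0)
        return free_kib >= requested_kib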

References

[1] 
http://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html
[2] https://blueprints.launchpad.net/nova/+spec/support-perf-event
[3] 
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
[4] 
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/Documentation/x86/intel_rdt_ui.txt?h=x86/cache
[5] 
https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html


Best Regards

Eli Qiao (乔立勇), OpenStack Core team, OTC Intel.
--
