+Tony and acknowledging him.
On Sun, 23 Aug 2015, Vikas Shivappa wrote:
This document proposes an alternative interface for Intel cache
allocation, compared to the cgroup interface in the current patchset:
http://marc.info/?l=linux-kernel&m=143889814520578

More info about cache allocation can be found in the Intel SDM, June 2015,
volume 3, section 17.16.

Design overview:
---------------

The OS maintains a mapping between task_struct and the class of service it
belongs to. This is done by adding a new field 'closid' to task_struct.
Each closid is mapped to a unique capacity bit mask (cbm), which indicates
the cache capacity associated with the closid. During scheduling, the
kernel writes this closid into IA32_PQOS_MSR to tell the hardware which
class of service (CLOS) the task belongs to.

It makes the following changes to the current patch series:

- Add a kernel mode API to control cache allocations from within the OS.
- Do not use cgroup; instead expose controls through sysfs in the
  /sys/kernel directory for the administrator to configure the cache
  allocations.
- Optionally, add a control that lets a process change its own cache
  allocation within the allocations defined by the administrator.

The use cases targeted are mainly server clusters, cloud and container
based services, and HPC workloads. Users of cloud or containers get a
VM/container to run their workloads, and it is most appropriate to set up
static cache allocations for units like VMs/containers. Many of the
container based products, like Rancher/stackengine etc., are docker based
and allocate/manage resources through a centralized
orchestration/deployment tool. Containers are quickly picking up in usage
given the ease of deploying new containers and of scaling. These cache
alloc interfaces try to build a framework so that use cases like cloud and
container based services can easily adapt.

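Going back to the design overview above, the closid bookkeeping could be
modelled roughly as below. This is only a user-space sketch, not kernel
code: struct model_task, model_cbm_table, model_pqos_msr and
model_sched_in() are hypothetical names, the mask values are made up, and
the "MSR" is a plain variable standing in for the register write done at
schedule time.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_CLOSID 3	/* hypothetical; the real count comes from hardware */

/* Stand-in for the per-task 'closid' field added to task_struct. */
struct model_task {
	int pid;
	uint32_t closid;
};

/* closid -> capacity bit mask (cbm); the values are illustrative only. */
static const uint32_t model_cbm_table[MODEL_MAX_CLOSID] = {
	0xFFFFF,	/* closid 0: default class, full mask */
	0x0000F,	/* closid 1 */
	0x000F0,	/* closid 2 */
};

/* Plain variable standing in for the register that holds the closid. */
static uint32_t model_pqos_msr;

/* On schedule-in, record the incoming task's closid in the "MSR". */
static void model_sched_in(const struct model_task *next)
{
	assert(next->closid < MODEL_MAX_CLOSID);
	model_pqos_msr = next->closid;
}

/* The cbm that applies to whatever task is currently running. */
static uint32_t model_active_cbm(void)
{
	return model_cbm_table[model_pqos_msr];
}
```

Calling model_sched_in() for a task with closid 1 makes
model_active_cbm() return that class's mask; switching back to a closid 0
task restores the default mask.
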
Apps are restricted in how far they can control their own cache
allocations, because cache is a resource orders of magnitude scarcer than
others like memory, and it would quickly run out if apps naturally tried
to use more of it to increase their own performance.

kernel mode API:
---------------

enum cache_resource {
	l3_shared,
};

struct cache_alloc_config {
	u32 max_cbm;
	u32 max_closid;
	unsigned long cache_size;
	int cdp_mode;
};

struct clos_cbm_table {
	unsigned long l3_cbm;
	unsigned int clos_refcnt;
};

void cache_alloc_get_info(enum cache_resource cr,
			  struct cache_alloc_config *config);

This returns the cache allocation configuration information along with the
cache size. Additional capabilities can be reported, for example whether
the current mode is code data prioritization (supporting both
icache/dcache) or legacy cache alloc.

int cache_alloc_set_cdpmode(bool setcdp);

By default cdp (code data prioritization, which supports allocating code
and data separately instead of a common cache allocation) is not enabled;
it can be set/reset with this API. Enabling cdp would reset all the
capacity bit masks and reduce the number of CLOSids to half. With cdp
enabled, the cbm can be extended to represent data and code capacity
masks (by having two u32).

void cache_alloc_get_cbm_table(struct clos_cbm_table *cctable, int size);

Returns the mapping of the current closids to the capacity bit masks.

u32 cache_alloc_set_cbm(u32 preferred_closid, u32 cbm);

This reconfigures the capacity bitmask (cbm) for a preferred closid. If
the cbm is already present in the table, the closid that already holds it
is returned. That way each unique cbm has one closid.

sysfs interface
---------------

This exposes files changeable by root in the /sys/kernel/cache_alloc
directory.

clos_cbm_table :

Reading - this shows the max_cbm and the current snapshot of the clos_cbm
table.

Writing - the user can write the 'preferred closid' and 'cbm' to change an
existing entry in the set of CLOS configs.
If the user writes a bitmask that already exists, the output indicates
which closid has that cbm.

$ echo <closid> <cbm> > /sys/kernel/cache_alloc/clos_cbm_table

Alternatively, instead of clos_cbm_table, a directory could be created for
each clos, with a file 'cbm' in each directory.

add_task :

Write only. Changes the closid of any task by writing the 'pid' and
'closid'.

eg:
$ echo <pid> <closid> > /sys/kernel/cache_alloc/add_task

threshold_clos :

Can have two values, 'lowest' and 'all'. Defaults to 'lowest'. When it is
'lowest', a process can change its own closid to a different closid, but
the new closid has to have the lowest capacity bitmask among all the
bitmasks. When it is 'all', the process can change to any closid. The
interface is described below.

cdp_enable :

Takes 1/0; the default is 0. Used to set cdp mode.

$ ls /sys/kernel/cache_alloc
add_task  threshold_clos  cdp_enable  clos0/  clos1/  ...  closn/

The closid of a task can be viewed in the /proc/<tid>/ stats. Tasks have
closid 0 by default and inherit the parent's closid upon fork.

prctl/syscall interface for process to change cache alloc
---------------------------------------------------------

This lets a process change its own cache allocation. However, the amount
of change that can be made is limited. This is because L3 cache is a very
limited/scarce resource and can easily be exhausted by the first few
processes requesting more cache. It also leaves one centralized entity, or
a system-controlled mechanism usable only by the administrator, with
greater control over the cache allocation, which is more useful in the
scenarios described above.

struct cat_config {
	u32 max_cbm;
	u32 max_clos;
	unsigned long chunk_size;
	int any_clos_allowed;
};

void cat_get_current_config(struct cat_config *config,
			    struct clos_cbm_table *cctable);

This returns the max clos and cbm length and the current mappings of the
closid and the capacity masks.
It also returns the chunk_size, which specifies the size of cache capacity
that corresponds to one bit of the cbm. any_clos_allowed will be true if
threshold_clos is set to 'all'.

prctl(PR_SET_CLOSID, <new_closid>, ... );

Cache can be allocated in terms of bytes or percentages using this
interface. One can get the chunk size from the APIs and then easily
convert the required size to a mask using

	bitmask length = (size required / chunk size).

The bitmask also gives the flexibility to have exclusive, completely
overlapping or partially overlapping cache areas, which can be adjusted
based on the requirements of the workloads.
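
The size-to-mask conversion above could look roughly like the sketch
below. It is an illustration under assumptions, not part of the proposed
API: the helper names are hypothetical, the length is rounded up (the
formula above does not say how to round), the mask is assumed to be a
contiguous run of bits, and "completely overlapping" is read here as one
mask lying entirely inside the other.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of cbm bits needed to cover size_required bytes.
 * Rounds up so the request is not under-provisioned (an assumption).
 */
static unsigned int cbm_len_for_size(unsigned long size_required,
				     unsigned long chunk_size)
{
	assert(chunk_size > 0);
	return (unsigned int)((size_required + chunk_size - 1) / chunk_size);
}

/*
 * Build a contiguous mask of 'len' bits starting at bit 'shift'.
 * Returns 0 if the mask would be empty or would not fit in max_len bits.
 */
static uint32_t make_cbm(unsigned int len, unsigned int shift,
			 unsigned int max_len)
{
	if (len == 0 || max_len > 32 || len > max_len || shift > max_len - len)
		return 0;
	if (len == 32)
		return UINT32_MAX;	/* shift is necessarily 0 here */
	return ((UINT32_C(1) << len) - 1) << shift;
}

enum cbm_overlap { CBM_EXCLUSIVE, CBM_PARTIAL, CBM_COMPLETE };

/* Classify how two capacity bit masks share cache areas. */
static enum cbm_overlap classify_overlap(uint32_t a, uint32_t b)
{
	uint32_t common = a & b;

	if (common == 0)
		return CBM_EXCLUSIVE;	/* no shared bits */
	if (common == a || common == b)
		return CBM_COMPLETE;	/* one mask contains the other */
	return CBM_PARTIAL;
}
```

For example, with a chunk_size of 1 MB, a 2.5 MB request rounds up to a
3-bit mask; placing it at bit 0 and another 4-bit mask at bit 4 gives two
exclusive cache areas.
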