On Thu, Mar 26, 2015 at 10:29:27PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 26, 2015 at 11:38:59AM -0700, Vikas Shivappa wrote:
> > 
> > Hello Marcelo,
> 
> Hi Vikas,
> 
> > On Wed, 25 Mar 2015, Marcelo Tosatti wrote:
> > 
> > >On Thu, Mar 12, 2015 at 04:16:07PM -0700, Vikas Shivappa wrote:
> > >>This patch adds a description of Cache allocation technology, an overview
> > >>of the kernel implementation, and usage of the CAT cgroup interface.
> > >>
> > >>Signed-off-by: Vikas Shivappa <vikas.shiva...@linux.intel.com>
> > >>---
> > >> Documentation/cgroups/rdt.txt | 183
> > >> ++++++++++++++++++++++++++++++++++++++++++
> > >> 1 file changed, 183 insertions(+)
> > >> create mode 100644 Documentation/cgroups/rdt.txt
> > >>
> > >>diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> > >>new file mode 100644
> > >>index 0000000..98eb4b8
> > >>--- /dev/null
> > >>+++ b/Documentation/cgroups/rdt.txt
> > >>@@ -0,0 +1,183 @@
> > >>+        RDT
> > >>+        ---
> > >>+
> > >>+Copyright (C) 2014 Intel Corporation
> > >>+Written by vikas.shiva...@linux.intel.com
> > >>+(based on contents and format from cpusets.txt)
> > >>+
> > >>+CONTENTS:
> > >>+=========
> > >>+
> > >>+1. Cache Allocation Technology
> > >>+  1.1 What is RDT and CAT ?
> > >>+  1.2 Why is CAT needed ?
> > >>+  1.3 CAT implementation overview
> > >>+  1.4 Assignment of CBM and CLOS
> > >>+  1.5 Scheduling and Context Switch
> > >>+2. Usage Examples and Syntax
> > >>+
> > >>+1. Cache Allocation Technology (CAT)
> > >>+====================================
> > >>+
> > >>+1.1 What is RDT and CAT
> > >>+-----------------------
> > >>+
> > >>+CAT is a part of Resource Director Technology (RDT), or Platform Shared
> > >>+Resource Control, which provides support to control platform shared
> > >>+resources like cache. Currently cache is the only resource that is
> > >>+supported in RDT.
> > >>+More information can be found in the Intel SDM, section 17.15.
> > >>+
> > >>+Cache Allocation Technology provides a way for the software (OS/VMM)
> > >>+to restrict cache allocation to a defined 'subset' of the cache, which
> > >>+may overlap with other 'subsets'. This feature is used when allocating
> > >>+a line in the cache, i.e. when pulling new data into the cache.
> > >>+The hardware is programmed via MSRs.
> > >>+
> > >>+The different cache subsets are identified by a CLOS identifier (class
> > >>+of service) and each CLOS has a CBM (cache bit mask). The CBM is a
> > >>+contiguous set of bits which defines the amount of cache resource that
> > >>+is available to each 'subset'.
> > >>+
> > >>+1.2 Why is CAT needed
> > >>+---------------------
> > >>+
> > >>+CAT enables more cache resources to be made available to higher
> > >>+priority applications based on guidance from the execution
> > >>+environment.
> > >>+
> > >>+The architecture also allows these subsets to be changed dynamically
> > >>+at runtime, to further optimize the performance of the higher priority
> > >>+application with minimal degradation to the low priority app.
> > >>+Additionally, resources can be rebalanced for system throughput
> > >>+benefit. (Refer to Section 17.15 in the Intel SDM.)
> > >>+
> > >>+This technique may be useful in managing large computer systems with
> > >>+a large LLC, for example large servers running instances of web
> > >>+servers or database servers. In such complex systems, these subsets
> > >>+can be used for more careful placement of the available cache
> > >>+resources.
> > >>+
> > >>+The CAT kernel patch provides a basic kernel framework for users
> > >>+to implement such cache subsets.
> > >>+
> > >>+1.3 CAT implementation Overview
> > >>+-------------------------------
> > >>+
> > >>+The kernel implements a cgroup subsystem to support cache allocation.
> > >>+
> > >>+Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping.
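The CBM semantics described in the patch text can be illustrated with a small sketch (the helper name is invented for illustration, not part of the patch): each set bit in a valid, contiguous CBM grants one equal slice of the LLC.

```python
def cbm_share(cbm: int, max_cbm_bits: int) -> float:
    """Fraction of the LLC a given CBM grants (each bit = one equal slice)."""
    if cbm <= 0 or cbm >= (1 << max_cbm_bits):
        raise ValueError("CBM out of range")
    lowest = cbm & -cbm                  # isolate the lowest set bit
    if cbm & (cbm + lowest):             # non-zero => gap in the run of bits
        raise ValueError("CBM must be a contiguous run of set bits")
    return bin(cbm).count("1") / max_cbm_bits

# The document's example: with 16 CBM bits, 0xf is 4 bits => one quarter.
```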
> > >>+A CLOS (Class of Service) is represented by a CLOSid. The CLOSid is
> > >>+internal to the kernel and not exposed to the user. Each cgroup has
> > >>+one CBM and represents just one cache 'subset'.
> > >>+
> > >>+The cgroup follows the cgroup hierarchy; mkdir and adding tasks to
> > >>+the cgroup never fail. When a child cgroup is created it inherits the
> > >>+CLOSid and the CBM from its parent. When a user changes the default
> > >>+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> > >>+in use before. Changing the 'cbm' may fail with -ENOSPC once the
> > >>+kernel runs out of the maximum number of CLOSids it can support.
> > >>+The user can create as many cgroups as desired, but having different
> > >>+CBMs at the same time is restricted by the maximum number of CLOSids
> > >>+(multiple cgroups can have the same CBM).
> > >>+The kernel maintains a CLOSid <-> CBM mapping which keeps a reference
> > >>+counter for each cgroup using a CLOSid.
> > >>+
> > >>+The tasks in the cgroup get to fill the part of the LLC represented
> > >>+by the cgroup's 'cbm' file.
> > >>+
> > >>+The root directory has all available bits set in its 'cbm' file by
> > >>+default.
> > >>+
> > >>+1.4 Assignment of CBM, CLOS
> > >>+---------------------------
> > >>+
> > >>+The 'cbm' needs to be a subset of the parent node's 'cbm'.
> > >>+Any contiguous subset of these bits (with a minimum of 2 bits) may be
> > >>+set to indicate the cache mapping desired. The 'cbm' between two
> > >>+directories can overlap. The 'cbm' represents the cache 'subset' of
> > >>+the CAT cgroup. For example: on a system with 16 max CBM bits, if the
> > >>+directory has the least significant 4 bits set in its 'cbm' file
> > >>+(meaning the 'cbm' is 0xf), it is allocated the right quarter of the
> > >>+last level cache, which means the tasks belonging to this CAT cgroup
> > >>+can use that right quarter of the cache to fill.
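The refcounted CLOSid <-> CBM mapping the patch describes can be modeled roughly as follows (class and method names are invented for illustration; the kernel's actual data structures differ): cgroups with the same CBM share a CLOSid, and changing a 'cbm' fails with -ENOSPC once every hardware CLOSid is taken.

```python
import errno

class ClosMap:
    """Toy model of the refcounted CLOSid <-> CBM mapping (illustrative only)."""

    def __init__(self, max_closids: int):
        self.free_ids = set(range(max_closids))  # hardware CLOSids still unused
        self.by_cbm = {}                         # cbm -> [closid, refcount]

    def get(self, cbm: int) -> int:
        """Return a CLOSid for this CBM, sharing one if the CBM is in use."""
        if cbm in self.by_cbm:
            self.by_cbm[cbm][1] += 1
            return self.by_cbm[cbm][0]
        if not self.free_ids:
            # mirrors the -ENOSPC failure once all CLOSids are taken
            raise OSError(errno.ENOSPC, "out of CLOSids")
        closid = min(self.free_ids)
        self.free_ids.remove(closid)
        self.by_cbm[cbm] = [closid, 1]
        return closid

    def put(self, cbm: int) -> None:
        """Drop one reference; recycle the CLOSid when no cgroup uses the CBM."""
        entry = self.by_cbm[cbm]
        entry[1] -= 1
        if entry[1] == 0:
            self.free_ids.add(entry[0])
            del self.by_cbm[cbm]
```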
> > >>+If it
> > >>+has the most significant 8 bits set, it would be allocated the left
> > >>+half of the cache (8 bits out of 16 represents 50%).
> > >>+
> > >>+The cache portion defined in the CBM file is available to all tasks
> > >>+within the cgroup to fill, and these tasks are not allowed to
> > >>+allocate space in other parts of the cache.
> > >
> > >Is there a reason to expose the hardware interface rather
> > >than ratios to userspace ?
> > >
> > >Say, i'd like to allocate 20% of L3 cache to cgroup A,
> > >80% to cgroup B.
> > >
> > >Well, you'd have to expose the shared percentages between
> > >any two cgroups (that information is there in the
> > >cbm bitmaps, but not in "ratios").
> > >
> > >One problem i see with exposing cbm bitmasks is that on hardware
> > >updates that change cache size or bitmask length, userspace must
> > >recalculate the bitmaps.
> > >
> > >Another is that it is vendor dependent, while ratios (plus shared
> > >information for two given cgroups) are not.
> > >
> > 
> > Agree that this interface does not give options to directly allocate
> > in terms of percentage. But note that specifying bitmasks allows the
> > user to allocate overlapping cache areas, and also, since we use
> > cgroups, we naturally follow the cgroup hierarchy. The user should be
> > able to convert the bitmasks into the intended percentage or size
> > values based on the other available cache size info in hooks like
> > cpuinfo.
> > 
> > We discussed more on this before in the older patches and here is
> > one thread where we discussed it, for your reference -
> > http://marc.info/?l=linux-kernel&m=142482002022543&w=2
> > 
> > Thanks,
> > Vikas
> 
> I can't find any discussion relating to exposing the CBM interface
> directly to userspace in that thread ?
> 
> Cpu.shares is written in ratio form, which is much more natural.
> Do you see any advantage in maintaining the
> 
> (ratio -> cbm bitmasks)
> 
> translation in userspace rather than in the kernel ?
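The (ratio -> cbm bitmask) translation under discussion is small wherever it lives; a userspace sketch might look like the following (the helper name and the round-to-nearest-bit policy are arbitrary choices, not anything from the patch):

```python
def ratio_to_cbm(percent: float, max_cbm_bits: int, min_bits: int = 2) -> int:
    """Map a desired cache percentage to a contiguous low-order CBM.

    Precision is limited to whole CBM bits, and the hardware minimum of
    2 contiguous bits is enforced, so very small ratios are rounded up.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    nbits = round(max_cbm_bits * percent / 100)
    nbits = max(min_bits, min(nbits, max_cbm_bits))
    return (1 << nbits) - 1
```

With 16 CBM bits, 25% maps to 0xf (the quarter-cache example from the document), and anything below two bits' worth still gets the 2-bit minimum; that rounding is exactly the precision loss a ratio interface would hide from the user.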
> 
> What about something like:
> 
> 
>       root cgroup
>        /        \
>       /          \
>      /            \
>  cgroupA-80    cgroupB-30
> 
> 
> So that whatever exceeds 100% is the ratio of cache
> shared at that level (cgroups A and B share 10% of the cache
> at that level).
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
> 
> cpu — the cpu.shares parameter determines the share of CPU resources
> available to each process in all cgroups. Setting the parameter to 250,
> 250, and 500 in the finance, sales, and engineering cgroups respectively
> means that processes started in these groups will split the resources
> with a 1:1:2 ratio. Note that when a single process is running, it
> consumes as much CPU as necessary no matter which cgroup it is placed
> in. The CPU limitation only comes into effect when two or more processes
> compete for CPU resources.
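The cpu.shares arithmetic quoted above (250:250:500 splitting 1:1:2) is a plain normalization, and a ratio-based cache interface would work the same way; as a sketch:

```python
def share_fractions(shares: dict) -> dict:
    """Fraction of the contended resource each cgroup receives,
    given cpu.shares-style weights (only meaningful under contention)."""
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}
```

For the Red Hat example, `share_fractions({"finance": 250, "sales": 250, "engineering": 500})` yields 25%, 25%, and 50% respectively.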
Vikas,

I see the following resource specifications from the POV of a
user/admin:

1) Ratios.

X%/Y%, as discussed above.

2) Specific kilobyte values.

In accord with the rest of cgroups, allow specific kilobyte
specification. See limit_in_bytes, for example, from
https://www.kernel.org/doc/Documentation/cgroups/memory.txt

Of course you would have to convert to way units, but i see two
use-cases here:

- User wants an application to not reclaim more than a given number
  of kilobytes of LLC cache.
- User wants an application to be guaranteed a given amount of
  kilobytes of LLC, even across processor changes.

Again, some precision is lost with LLC.

3) Per-CPU differentiation

The current patchset deals with the following use-case suboptimally:

	CPU1-4		CPU5-8
	 die1		 die2

* Task groupA is isolated to CPU5-8 (die2).
* Task groupA has 50% of the cache reserved.
* Task groupB can reclaim into 50% of the cache.
* Task groupB can reclaim into 100% of the cache of die1.

I suppose this is a common scenario which is not handled by the current
patchset (you would have task groupB use only 50% of the cache of
die1).
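The kilobyte-to-way-unit conversion mentioned in point (2) could be sketched like this (the helper name and the cache geometry in the example are illustrative; a real implementation would read the geometry from CPUID):

```python
import math

def kbytes_to_ways(kbytes: int, cache_size_kb: int, num_ways: int) -> int:
    """Round a kilobyte request up to whole cache ways (the CBM granularity)."""
    way_kb = cache_size_kb / num_ways     # size of one way in KB
    ways = math.ceil(kbytes / way_kb)     # this rounding is where precision is lost
    return max(1, min(ways, num_ways))    # clamp to what the hardware offers

# e.g. a 20480 KB (20 MB), 20-way LLC has 1024 KB ways, so a 2000 KB
# guarantee needs 2 ways, i.e. 2 CBM bits.
```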