intel_rdt: Cache Allocation documentation and cgroup usage guide

Marcelo Tosatti Thu, 30 Jul 2015 13:10:05 -0700

On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> 
> 
> Marcelo,
> 
> 
> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >
> >How about this:
> >
> >desiredclos (closid  p1  p2  p3  p4)
> >             1       1   0   0   0
> >             2       0   0   0   1
> >             3       0   1   1   0
> 
> #1 Currently in the rdt cgroup, the root cgroup always has all the
> bits set and can't be changed (because the cgroup hierarchy would by
> default make this have all bits set, as all the children need to have
> a subset of the root's bitmask). So if the user creates a cgroup and
> does not put any task in it, the tasks in the root cgroup could still
> be using that part of the cache. That's the reason I say we can't have
> really 'exclusive' masks.
> 
> Or in other words - there is always a desired clos (0) which has all
> parts set which acts like a default pool.
> 
> Also the parts can overlap.  Please apply this for all the below
> comments which will change the way they work.
> 
> >
> >p means part.
> 
> I am assuming p = (a contiguous cache capacity bit mask)


Yes.

> >closid 1 is an exclusive cgroup.
> >closid 2 is a "cache hog" class.
> >closid 3 is "default closid".
> >
> >Desiredclos is what user has specified.
> >
> >Transition 1: desiredclos --> effectiveclos
> >Clean all bits of unused closid's
> >(that must be updated whenever a
> >closid1 cgroup goes from empty->nonempty
> >and vice-versa).
> >
> >effectiveclos (closid  p1  p2  p3  p4)
> >               1       0   0   0   0
> >               2       0   0   0   1
> >               3       0   1   1   0
> 
> >
> >Transition 2: effectiveclos --> expandedclos
> >expandedclos (closid  p1  p2  p3  p4)
> >              1       0   0   0   0
> >              2       0   0   0   1
> >              3       1   1   1   0
> >Then you have different inplacecos for each
> >CPU (see pseudo-code below):
> >
> >On the following events.
> >
> >- task migration to new pCPU:
> >- task creation:
> >
> >     id = smp_processor_id();
> >     for (part = desiredclos.p1; ...; part++)
> >             /* if my cosid is set and any other
> >                cosid is clear, for the part,
> >                synchronize desiredclos --> inplacecos */
> >             if (part[mycosid] == 1 &&
> >                 part[any_othercosid] == 0)
> >                     wrmsr(part, desiredclos);
> >
> 
> Currently the root cgroup would have all the bits set which will act
> like a default cgroup where all the otherwise unused parts (assuming
> they are a set of contiguous cache capacity bits) will be used.
> 
> Otherwise the question is in the expandedclos - who decides to
> expand the closx parts to include some of the unused parts.. - that
> could just be a default root always ?

Right, so the problem is for certain closid's you might never want 
to expand (because doing so would cause data to be cached in a
cache way which might have high eviction rate in the future).
See the example from Will.

But for the default cache (that is, "unclassified applications")
I suppose it is beneficial to expand in most cases, that is, to
use the maximum amount of cache irrespective of eviction rate, which
is the behaviour that exists now without CAT.

So perhaps a new flag "expand=y/n" can be added to the cgroup 
directories... What do you say?
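
As a rough illustration only (userspace Python, not kernel code), the two
transitions quoted above combined with the proposed expand flag could be
modelled like this. The function names and the dict-of-lists layout are
made up for the sketch; the closids and parts are the ones from the tables
above, with closid 1's cgroup assumed empty and only closid 3 allowed to
expand.

```python
# Illustrative model: each closid maps to a list of 0/1 flags, one per cache part.

def effective_clos(desired, nonempty):
    """Transition 1: clear all parts of closids whose cgroup has no tasks."""
    return {
        closid: list(parts) if closid in nonempty else [0] * len(parts)
        for closid, parts in desired.items()
    }


def expanded_clos(effective, expand):
    """Transition 2: hand parts that no closid uses to closids with expand=y."""
    n_parts = len(next(iter(effective.values())))
    unused = [
        all(parts[i] == 0 for parts in effective.values())
        for i in range(n_parts)
    ]
    return {
        closid: [
            1 if (parts[i] or (unused[i] and closid in expand)) else 0
            for i in range(n_parts)
        ]
        for closid, parts in effective.items()
    }


desired = {1: [1, 0, 0, 0], 2: [0, 0, 0, 1], 3: [0, 1, 1, 0]}

# Assumption for this sketch: closid 1's cgroup is empty, closid 3 has expand=y.
effective = effective_clos(desired, nonempty={2, 3})
expanded = expanded_clos(effective, expand={3})

for closid, parts in expanded.items():
    print(closid, parts)
```

With expand=n on closid 2 (the cache hog), its mask stays at p4 no matter
how many parts become free.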

Userspace representation of CAT
-------------------------------

Usage model:
1) Measure application performance without L3 cache reservation.
2) Measure application performance with an L3 cache reservation of
X cache ways, varying X until the desired performance is attained.

Requirements:
1) Persistency of CLOS configuration across hardware. On migration
of an operating system or application between different hardware
systems, we'd like the following to be maintained:
        - exclusive number of bytes (*) reserved to a certain CLOSid.
        - shared number of bytes (*) reserved between a certain group
          of CLOSids.

(*) For both code and data, rounded down or up to a multiple of the
cache way size.

2) Reasoning:
Different CBMs might be necessary on different hardware platforms
to specify the same CLOS configuration in terms of exclusive and
shared number of bytes (rounded to whole cache ways).
For example, because other hardware entities allocate into certain parts
of the L3 cache, it might be necessary to relocate the CBM to achieve
the same CLOS configuration.
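
To make the reasoning concrete, here is a small Python sketch of how the
same byte reservation could turn into different CBMs on two platforms.
Everything in it is an assumption for illustration: the helper names, the
way sizes, the way counts, and the idea that ways taken by other hardware
entities are described by a "busy" mask.

```python
def bytes_to_ways(nbytes, way_size, round_down=False):
    """Convert a byte reservation to a whole number of cache ways."""
    if round_down:
        return nbytes // way_size
    return -(-nbytes // way_size)  # ceiling division (round up)


def place_cbm(n_ways, total_ways, busy_mask):
    """Return the lowest contiguous n_ways-bit mask avoiding busy_mask, or None."""
    if n_ways <= 0:
        return None
    run = (1 << n_ways) - 1
    for shift in range(total_ways - n_ways + 1):
        candidate = run << shift
        if candidate & busy_mask == 0:
            return candidate
    return None


RESERVATION = 2 << 20  # 2M exclusive bytes, the same on both platforms

# Hypothetical platform A: 20 ways of 1M each, nothing else using the cache.
ways_a = bytes_to_ways(RESERVATION, 1 << 20)
cbm_a = place_cbm(ways_a, 20, busy_mask=0)

# Hypothetical platform B: 16 ways of 512K each, lowest two ways used by
# another hardware entity.
ways_b = bytes_to_ways(RESERVATION, 512 << 10)
cbm_b = place_cbm(ways_b, 16, busy_mask=0b11)

print(f"platform A: {ways_a} ways, CBM {cbm_a:#x}")
print(f"platform B: {ways_b} ways, CBM {cbm_b:#x}")
```

The two masks differ in both width and position, yet both stand for the
same 2M exclusive reservation, which is why a byte-based interface
survives migration where a raw CBM does not.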

3) Proposed format:

sharedregionK.exclusive - Number of exclusive cache bytes reserved for
                        shared region.
sharedregionK.excl_data - Number of exclusive cache data bytes reserved for
                        shared region.
sharedregionK.excl_code - Number of exclusive cache code bytes reserved for
                        shared region.
sharedregionK.round_down - Round the respective byte count down to a
                        multiple of the cache way size (default is round up).
sharedregionK.expand - y/n - Expand shared region to more cache ways
                        when available (default N).

cgroupN.exclusive - Number of exclusive L3 cache bytes reserved
                    for cgroup.
cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved
                    for cgroup.
cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved
                    for cgroup.
cgroupN.round_down - Round the respective byte count down to a
                    multiple of the cache way size (default is round up).
cgroupN.expand - y/n - Expand cgroup to more cache ways when
                    available (default N).
cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared
regions)

Example 1:
One application with 2M exclusive cache, two applications
with 1M exclusive each, sharing an expansive shared region of 1M.

cgroup1.exclusive = 2M

sharedregion1.exclusive = 1M
sharedregion1.expand = Y

cgroup2.exclusive = 1M
cgroup2.shared = sharedregion1

cgroup3.exclusive = 1M
cgroup3.shared = sharedregion1
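
For what it's worth, the format is easy to consume from userspace. The
Python sketch below parses Example 1 and counts the cache ways it would
need. The parser, the binary meaning of the K/M suffixes, and the 1M way
size are all assumptions made for the illustration, not part of the
proposal.

```python
def parse_size(text):
    """Parse sizes such as '2M' or '512K' into bytes (binary multiples assumed)."""
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def parse_config(text):
    """Parse 'group.attribute = value' lines into {group: {attribute: value}}."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        group, attr = key.split(".", 1)
        config.setdefault(group, {})[attr] = value
    return config


EXAMPLE_1 = """
cgroup1.exclusive = 2M

sharedregion1.exclusive = 1M
sharedregion1.expand = Y

cgroup2.exclusive = 1M
cgroup2.shared = sharedregion1

cgroup3.exclusive = 1M
cgroup3.shared = sharedregion1
"""

WAY_SIZE = 1 << 20  # hypothetical way size of 1M

config = parse_config(EXAMPLE_1)
# Default rounding in the proposal is round up, hence the ceiling division.
ways = {
    group: -(-parse_size(attrs["exclusive"]) // WAY_SIZE)
    for group, attrs in config.items()
}

print(ways)
print("total ways needed:", sum(ways.values()))
```

On a platform with a different way size only WAY_SIZE changes; the
configuration text stays the same.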

Example 2:
3 high performance applications running, one of which is a cache hog
with no cache locality.

cgroup1.exclusive = 8M
cgroup2.exclusive = 8M

cgroup3.exclusive = 512K
cgroup3.round_down = Y

In all cases the default cgroup (which requires no explicit 
specification) is expansive and uses the remaining cache 
ways, including the ways shared by other hardware entities.
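
A sketch of how Example 2 and the implicit default cgroup might map to
masks, again in Python and again under made-up assumptions: a 20M L3 with
40 ways of 512K, helper names invented for the illustration, and a naive
allocator that hands out consecutive ways and gives the default cgroup
whatever is left.

```python
def ways_for(nbytes, way_size, round_down=False):
    """Convert a byte reservation to whole cache ways (round up by default)."""
    if round_down:
        return nbytes // way_size
    return -(-nbytes // way_size)


def allocate(requests, total_ways):
    """Assign consecutive ways to each request; the default gets the remainder."""
    masks = {}
    next_way = 0
    for name, n_ways in requests.items():
        if next_way + n_ways > total_ways:
            raise ValueError(f"not enough cache ways for {name}")
        masks[name] = ((1 << n_ways) - 1) << next_way
        next_way += n_ways
    reserved = 0
    for mask in masks.values():
        reserved |= mask
    # The default cgroup needs no specification: it uses every remaining way.
    masks["default"] = ((1 << total_ways) - 1) & ~reserved
    return masks


WAY_SIZE = 512 << 10  # hypothetical: 512K per way
TOTAL_WAYS = 40       # hypothetical: 20M L3

requests = {
    "cgroup1": ways_for(8 << 20, WAY_SIZE),
    "cgroup2": ways_for(8 << 20, WAY_SIZE),
    "cgroup3": ways_for(512 << 10, WAY_SIZE, round_down=True),
}
masks = allocate(requests, TOTAL_WAYS)

for name, mask in masks.items():
    print(f"{name}: {mask:#012x}")
```

Note that with a way size larger than 512K, round_down = Y would leave
cgroup3 with zero ways, so a real implementation would need a policy for
that case.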

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
