Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-08-04 Thread Vikas Shivappa



On Tue, 28 Jul 2015, Peter Zijlstra wrote:


On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:

Please edit this document to have consistent spacing. It's really hard to
read this. Every time I spot a misplaced space my brain stumbles and I
need to restart.


Will fix all the spacing and other indentation issues mentioned.
Thanks for pointing them all out. Although the other documents I see don't have a
completely consistent format, which is what confused me, this format would be
better.



+
+The following considerations are done for the PQR MSR write so that it
+has minimal impact on scheduling hot path:
+- This path doesnt exist on any non-intel platforms.


!x86 I think you mean; it's entirely possible to have the code present
on AMD systems, for instance.


+- On Intel platforms, this would not exist by default unless CGROUP_RDT
+is enabled.


You can enable this just fine on AMD machines.


The cache alloc code is under CPU_SUP_INTEL ..
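For illustration, a minimal sketch of how the scheduling hot path can compile
away entirely when the feature is not built in. The config symbol follows the
CGROUP_RDT name used above; task_closid() and the per-cpu cpu_closid variable
are assumptions made for the sketch, not the actual patch:

    #ifdef CONFIG_CGROUP_RDT
    /* Only pay for the PQR write when cache allocation is built in
     * (which itself sits under CPU_SUP_INTEL). */
    static inline void intel_rdt_sched_in(struct task_struct *task)
    {
            unsigned int closid = task_closid(task);   /* assumed helper */

            /* skip the MSR write unless the CLOSid actually changes */
            if (closid != this_cpu_read(cpu_closid)) {
                    this_cpu_write(cpu_closid, closid);
                    /* low word = RMID (0 here), high word = CLOSid */
                    wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);
            }
    }
    #else
    static inline void intel_rdt_sched_in(struct task_struct *task) { }
    #endif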

Thanks,
Vikas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-08-03 Thread Vikas Shivappa


Hello Marcelo/Martin,

Like I mentioned, let me modify the documentation to better explain
the usage. Things like updating each package's bitmask are already in the patches.


Let's discuss offline, come up with a well-defined proposal for any changes, and
then update that in the next series. We seem to be just looping over the same items.


Thanks,
Vikas

On Mon, 3 Aug 2015, Marcelo Tosatti wrote:


On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote:

On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:

On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:


How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0


#1 Currently in the rdt cgroup , the root cgroup always has all the
bits set and cant be changed (because the cgroup hierarchy would by
default make this to have all bits as all the children need to have
a subset of the root's bitmask). So if the user creates a cgroup and
not put any task in it , the tasks in the root cgroup could be still
using that part of the cache. Thats the reason i say we can have
really 'exclusive' masks.

Or in other words - there is always a desired clos (0) which has all
parts set which acts like a default pool.

Also the parts can overlap.  Please apply this for all the below
comments which will change the way they work.



p means part.


I am assuming p = (a contiguous cache capacity bit mask)


Yes.


closid 1 is a exclusive cgroup.
closid 2 is a "cache hog" class.
closid 3 is "default closid".

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0




Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);
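A small self-contained C model of the pseudo-code above may make the two
transitions easier to follow. It is purely illustrative: the names mirror
desiredclos/effectiveclos/inplacecos, wrmsr() is replaced by a stub, and none
of this is the proposed kernel implementation.

    #include <stdbool.h>
    #include <stdio.h>

    #define NCLOS  4                  /* closids 0..3 (0 = default pool)   */
    #define NPARTS 4                  /* cache "parts" p1..p4              */

    static bool desired[NCLOS][NPARTS];    /* what the user configured     */
    static bool effective[NCLOS][NPARTS];  /* unused closids cleared       */
    static bool inplace[NPARTS][NCLOS];    /* what this CPU last wrote     */
    static bool closid_has_tasks[NCLOS];

    static void wrmsr_stub(int part)       /* stand-in for the real wrmsr  */
    {
        for (int c = 0; c < NCLOS; c++)
            inplace[part][c] = effective[c][part];
        printf("wrmsr: part %d synced\n", part);
    }

    /* Transition 1: clear every bit of closids that have no tasks. */
    static void compute_effective(void)
    {
        for (int c = 0; c < NCLOS; c++)
            for (int p = 0; p < NPARTS; p++)
                effective[c][p] = closid_has_tasks[c] && desired[c][p];
    }

    /* Task creation/migration: write a part only if my closid owns it and
     * no other closid claims it, as in the pseudo-code above. */
    static void sync_on_event(int mycosid)
    {
        for (int p = 0; p < NPARTS; p++) {
            bool other = false;

            for (int c = 0; c < NCLOS; c++)
                if (c != mycosid && effective[c][p])
                    other = true;

            if (effective[mycosid][p] && !other)
                wrmsr_stub(p);
        }
    }

    int main(void)
    {
        desired[2][3] = true;                  /* closid 2 -> p4           */
        desired[3][1] = desired[3][2] = true;  /* closid 3 -> p2, p3       */
        closid_has_tasks[2] = closid_has_tasks[3] = true;

        compute_effective();
        sync_on_event(2);          /* a closid-2 task lands on this CPU    */
        return 0;
    }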



Currently the root cgroup would have all the bits set which will act
like a default cgroup where all the otherwise unused parts (assuming
they are a set of contiguous cache capacity bits) will be used.

Otherwise the question is in the expandedclos - who decides to
expand the closx parts to include some of the unused parts.. - that
could just be a default root always ?


Right, so the problem is for certain closid's you might never want
to expand (because doing so would cause data to be cached in a
cache way which might have high eviction rate in the future).
See the example from Will.

But for the default cache (that is "unclassified applications"
i suppose it is beneficial to expand in most cases, that is,
use maximum amount of cache irrespective of eviction rate, which
is the behaviour that exists now without CAT).

So perhaps a new flag "expand=y/n" can be added to the cgroup
directories... What do you say?

Userspace representation of CAT
---

Usage model:
1) measure application performance without L3 cache reservation.
2) measure application perf with L3 cache reservation and
X number of cache ways until desired performance is attained.

Requirements:
1) Persistency of CLOS configuration across hardware. On migration
of operating system or application between different hardware
systems we'd like the following to be maintained:
  - exclusive number of bytes (*) reserved to a certain CLOSid.
  - shared number of bytes (*) reserved between a certain group
of CLOSid's.

For both code and data, rounded down or up in cache way size.

2) Reasoning:
Different CBM masks in different hardware platforms might be necessary
to specify the same CLOS configuration, in terms of exclusive number of
bytes and shared number of bytes. (cache-way rounded number of bytes).
For example, due to L3 allocation by other hardware entities in certain parts
of the cache it might be necessary to relocate CBM mask to achieve
the same CLOS configuration.

3) Proposed format:



Few questions from a random listener, I apologise if some of them are
in a wrong place due to me missing some information from past threads.

I'm not sure whether the following proposal to the format is the
internal structure 

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-08-03 Thread Marcelo Tosatti
On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote:
> On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
> >On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> >>
> >>
> >>Marcello,
> >>
> >>
> >>On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >>>
> >>>How about this:
> >>>
> >>>desiredclos (closid  p1  p2  p3 p4)
> >>>1   1   0   0  0
> >>>2   0   0   0  1
> >>>3   0   1   1  0
> >>
> >>#1 Currently in the rdt cgroup , the root cgroup always has all the
> >>bits set and cant be changed (because the cgroup hierarchy would by
> >>default make this to have all bits as all the children need to have
> >>a subset of the root's bitmask). So if the user creates a cgroup and
> >>not put any task in it , the tasks in the root cgroup could be still
> >>using that part of the cache. Thats the reason i say we can have
> >>really 'exclusive' masks.
> >>
> >>Or in other words - there is always a desired clos (0) which has all
> >>parts set which acts like a default pool.
> >>
> >>Also the parts can overlap.  Please apply this for all the below
> >>comments which will change the way they work.
> >>
> >>>
> >>>p means part.
> >>
> >>I am assuming p = (a contiguous cache capacity bit mask)
> >
> >Yes.
> >
> >>>closid 1 is a exclusive cgroup.
> >>>closid 2 is a "cache hog" class.
> >>>closid 3 is "default closid".
> >>>
> >>>Desiredclos is what user has specified.
> >>>
> >>>Transition 1: desiredclos --> effectiveclos
> >>>Clean all bits of unused closid's
> >>>(that must be updated whenever a
> >>>closid1 cgroup goes from empty->nonempty
> >>>and vice-versa).
> >>>
> >>>effectiveclos (closid  p1  p2  p3 p4)
> >>>  1   0   0   0  0
> >>>  2   0   0   0  1
> >>>  3   0   1   1  0
> >>
> >>>
> >>>Transition 2: effectiveclos --> expandedclos
> >>>expandedclos (closid  p1  p2  p3 p4)
> >>>  1   0   0   0  0
> >>>  2   0   0   0  1
> >>>  3   1   1   1  0
> >>>Then you have different inplacecos for each
> >>>CPU (see pseudo-code below):
> >>>
> >>>On the following events.
> >>>
> >>>- task migration to new pCPU:
> >>>- task creation:
> >>>
> >>>   id = smp_processor_id();
> >>>   for (part = desiredclos.p1; ...; part++)
> >>>   /* if my cosid is set and any other
> >>>  cosid is clear, for the part,
> >>>  synchronize desiredclos --> inplacecos */
> >>>   if (part[mycosid] == 1 &&
> >>>   part[any_othercosid] == 0)
> >>>   wrmsr(part, desiredclos);
> >>>
> >>
> >>Currently the root cgroup would have all the bits set which will act
> >>like a default cgroup where all the otherwise unused parts (assuming
> >>they are a set of contiguous cache capacity bits) will be used.
> >>
> >>Otherwise the question is in the expandedclos - who decides to
> >>expand the closx parts to include some of the unused parts.. - that
> >>could just be a default root always ?
> >
> >Right, so the problem is for certain closid's you might never want
> >to expand (because doing so would cause data to be cached in a
> >cache way which might have high eviction rate in the future).
> >See the example from Will.
> >
> >But for the default cache (that is "unclassified applications"
> >i suppose it is beneficial to expand in most cases, that is,
> >use maximum amount of cache irrespective of eviction rate, which
> >is the behaviour that exists now without CAT).
> >
> >So perhaps a new flag "expand=y/n" can be added to the cgroup
> >directories... What do you say?
> >
> >Userspace representation of CAT
> >---
> >
> >Usage model:
> >1) measure application performance without L3 cache reservation.
> >2) measure application perf with L3 cache reservation and
> >X number of cache ways until desired performance is attained.
> >
> >Requirements:
> >1) Persistency of CLOS configuration across hardware. On migration
> >of operating system or application between different hardware
> >systems we'd like the following to be maintained:
> >   - exclusive number of bytes (*) reserved to a certain CLOSid.
> >   - shared number of bytes (*) reserved between a certain group
> > of CLOSid's.
> >
> >For both code and data, rounded down or up in cache way size.
> >
> >2) Reasoning:
> >Different CBM masks in different hardware platforms might be necessary
> >to specify the same CLOS configuration, in terms of exclusive number of
> >bytes and shared number of bytes. (cache-way rounded number of bytes).
> >For example, due to L3 allocation by other hardware entities in certain parts
> >of the cache it might be necessary to relocate CBM mask to achieve
> >the same CLOS configuration.
> >
> >3) Proposed format:
> >
> 
> Few questions from a random listener, I apologise if some of them are
> in a wrong place due to me missing some information from past threads.
> 
> I'm not sure whether the following proposal to the format is the
> internal structure or what's going to be in cgroups.  If this is
> user-visible interface, I think it could be a little less detailed.

User visible interface. The idea is to have userspace code that performs

[ user visible specification ] <-> [ cbm bitmasks on present hardware
                                     platform ]

In systemd, probably (or whatever is between the user and the
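One way to read that split: the user-visible specification stays in bytes, and
a small piece of userspace maps it onto the CBM of whatever hardware it runs
on. A rough sketch of that mapping, purely illustrative (the L3 size and CBM
length would come from CPUID or sysfs, and where the mask is placed within the
cache is a separate policy question):

    #include <stdbool.h>
    #include <stdint.h>

    /* Map a byte-level reservation onto a contiguous CBM for the present
     * platform, rounding to cache-way granularity. */
    static uint32_t bytes_to_cbm(uint64_t bytes, uint64_t l3_size,
                                 unsigned cbm_len, bool round_down)
    {
        uint64_t way_bytes = l3_size / cbm_len;
        uint64_t ways = round_down ? bytes / way_bytes
                                   : (bytes + way_bytes - 1) / way_bytes;

        if (ways == 0)
            ways = 1;
        if (ways > cbm_len)
            ways = cbm_len;

        /* e.g. 4 MB on a 20 MB L3 with a 20-bit CBM -> 4 ways -> 0xf */
        return ways >= 32 ? ~0u : (uint32_t)((1u << ways) - 1);
    }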


Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-08-02 Thread Martin Kletzander

On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:

On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
>
>How about this:
>
>desiredclos (closid  p1  p2  p3 p4)
> 1   1   0   0  0
> 2   0   0   0  1
> 3   0   1   1  0

#1 Currently in the rdt cgroup , the root cgroup always has all the
bits set and cant be changed (because the cgroup hierarchy would by
default make this to have all bits as all the children need to have
a subset of the root's bitmask). So if the user creates a cgroup and
not put any task in it , the tasks in the root cgroup could be still
using that part of the cache. Thats the reason i say we can have
really 'exclusive' masks.

Or in other words - there is always a desired clos (0) which has all
parts set which acts like a default pool.

Also the parts can overlap.  Please apply this for all the below
comments which will change the way they work.

>
>p means part.

I am assuming p = (a contiguous cache capacity bit mask)


Yes.


>closid 1 is a exclusive cgroup.
>closid 2 is a "cache hog" class.
>closid 3 is "default closid".
>
>Desiredclos is what user has specified.
>
>Transition 1: desiredclos --> effectiveclos
>Clean all bits of unused closid's
>(that must be updated whenever a
>closid1 cgroup goes from empty->nonempty
>and vice-versa).
>
>effectiveclos (closid  p1  p2  p3 p4)
>   1   0   0   0  0
>   2   0   0   0  1
>   3   0   1   1  0

>
>Transition 2: effectiveclos --> expandedclos
>expandedclos (closid  p1  p2  p3 p4)
>   1   0   0   0  0
>   2   0   0   0  1
>   3   1   1   1  0
>Then you have different inplacecos for each
>CPU (see pseudo-code below):
>
>On the following events.
>
>- task migration to new pCPU:
>- task creation:
>
>id = smp_processor_id();
>for (part = desiredclos.p1; ...; part++)
>/* if my cosid is set and any other
>   cosid is clear, for the part,
>   synchronize desiredclos --> inplacecos */
>if (part[mycosid] == 1 &&
>part[any_othercosid] == 0)
>wrmsr(part, desiredclos);
>

Currently the root cgroup would have all the bits set which will act
like a default cgroup where all the otherwise unused parts (assuming
they are a set of contiguous cache capacity bits) will be used.

Otherwise the question is in the expandedclos - who decides to
expand the closx parts to include some of the unused parts.. - that
could just be a default root always ?


Right, so the problem is for certain closid's you might never want
to expand (because doing so would cause data to be cached in a
cache way which might have high eviction rate in the future).
See the example from Will.

But for the default cache (that is "unclassified applications"
i suppose it is beneficial to expand in most cases, that is,
use maximum amount of cache irrespective of eviction rate, which
is the behaviour that exists now without CAT).

So perhaps a new flag "expand=y/n" can be added to the cgroup
directories... What do you say?

Userspace representation of CAT
---

Usage model:
1) measure application performance without L3 cache reservation.
2) measure application perf with L3 cache reservation and
X number of cache ways until desired performance is attained.

Requirements:
1) Persistency of CLOS configuration across hardware. On migration
of operating system or application between different hardware
systems we'd like the following to be maintained:
   - exclusive number of bytes (*) reserved to a certain CLOSid.
   - shared number of bytes (*) reserved between a certain group
 of CLOSid's.

For both code and data, rounded down or up in cache way size.

2) Reasoning:
Different CBM masks in different hardware platforms might be necessary
to specify the same CLOS configuration, in terms of exclusive number of
bytes and shared number of bytes. (cache-way rounded number of bytes).
For example, due to L3 allocation by other hardware entities in certain parts
of the cache it might be necessary to relocate CBM mask to achieve
the same CLOS configuration.

3) Proposed format:



Few questions from a random listener, I apologise if some of them are
in a wrong place due to me missing some information from past threads.

I'm not sure whether the following proposal to the format is the
internal structure or what's going to be in cgroups.  If this is
user-visible interface, I think it could be a little less detailed.


sharedregionK.exclusive - Number of exclusive cache bytes reserved for
shared region.
sharedregionK.excl_data - Number of exclusive cache data bytes reserved for
shared region.
sharedregionK.excl_bytes - Number of exclusive cache code bytes reserved for
shared region.
sharedregionK.round_down - Round down to cache way bytes from respective number


Re: [summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Marcelo Tosatti
On Fri, Jul 31, 2015 at 09:41:58AM -0700, Vikas Shivappa wrote:
> 
> To summarize  the ever growing thread :
> 
> 1. the rdt_cgroup can be used to configure exclusive cache bitmaps
> for the child nodes which can be used for the scenarios which
> Marcello mentions.
> 
> simle examples which were mentioned :
> max bitmask length : 16 . hence full mask is 0xffff
> groupx_realtime - 0xff .
> group2_systemtraffic - 0xf. : put a lot of tasks from root node to
> here or which ever is offending and thrashing.
> groupy_ - 0x0f
> 
> Now the groupx has its own area of cache that can used by the
> realtime/(specific scenario) apps. Similarly configure any groupy.
> 
> 2. Can the maps can let you specify which cache ways ways the cache
> is allocated ? - No , this is implementation specific as mentioned
> in the SDM. So when we configure a mask , you really dont know which
> ways or which exact lines are used on which SKUs .. We may not see
> any use case as well which is needed for apps to allocate cache in
> specific areas and the h/w does not support this as well.

Ok, can you comment whether the userspace interface proposed addresses
all your use cases ?

> 3. Letting the user specify size in bytes instead of bitmap : we
> have already gone through this discussion in older versions. The
> user can simply check the size of the total cache and understand
> what map could be what size. I dont see a special need to specify an
> interface to enter the cache in bytes and then round off - user
> could instead use the roundoff values before hand or iow it
> automatically does when he specifies the bitmask.

When you move from processor A with CBM bitmask format X to hardware B
with CBM bitmask format Y, and the formats Y and X are different, you
have to manually adjust the format.
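A tiny worked example of that adjustment (the cache sizes and way counts below
are made up for illustration, not real SKUs): the same 4 MB request needs a
different CBM on each platform.

    #include <stdio.h>

    static unsigned cbm_for(unsigned long long bytes,
                            unsigned long long l3_size, unsigned ways)
    {
        unsigned long long per_way = l3_size / ways;
        unsigned n = (unsigned)((bytes + per_way - 1) / per_way);

        return n >= 32 ? ~0u : (1u << n) - 1;
    }

    int main(void)
    {
        unsigned long long req = 4ULL << 20;    /* 4 MB request            */

        /* 20 MB L3, 20-bit CBM: 1 MB per way -> 4 ways -> 0xf             */
        printf("0x%x\n", cbm_for(req, 20ULL << 20, 20));
        /* 32 MB L3, 16-bit CBM: 2 MB per way -> 2 ways -> 0x3             */
        printf("0x%x\n", cbm_for(req, 32ULL << 20, 16));
        return 0;
    }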

Please reply to the userspace proposal, the problem is very explicit
there.

> ex: find cache size from /proc/cpuinfo. - say 20MB
> bitmask max - 0xfffff.
> 
> This means the roundoff(chunk) size supported is only 1MB , so when
> you specify the mask say 0x3(2MB) thats already taken care of.
> Same applies to percentage - the masks automatically round off the percentage.
> 
> Please note that this is quite different from the way we can
> allocate memory in bytes and needs to be treated differently given
> that the hardware provides interface in a particular way.
> 
> 4. Letting the kernel automatically extend the bitmap may affect a
> lot of other things 

Let's talk about them. What other things?

> and will need a lot of heuristics - note that we
> have overlapping masks.

I proposed a way to avoid heuristics by exposing whether the cgroup is 
"expandable" or not and asked your input.

We really do not want to waste cache if we can avoid it.

> This interface lets the super-user control
> the cache allocation and it may be very confusing for the user if he
> has allocated a cache mask and suddenly from under the floor the
> kernel changes it.

Agree.

> 
> Thanks,
> Vikas
> 
> 
> On Fri, 31 Jul 2015, Marcelo Tosatti wrote:
> 
> >On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:
> >>
> >>
> >>On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
> >>
> >>>On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> 
> 
> Marcello,
> 
> 
> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >
> >How about this:
> >
> >desiredclos (closid  p1  p2  p3 p4)
> >  1   1   0   0  0
> >  2   0   0   0  1
> >  3   0   1   1  0
> 
> #1 Currently in the rdt cgroup , the root cgroup always has all the
> bits set and cant be changed (because the cgroup hierarchy would by
> default make this to have all bits as all the children need to have
> a subset of the root's bitmask). So if the user creates a cgroup and
> not put any task in it , the tasks in the root cgroup could be still
> using that part of the cache. Thats the reason i say we can have
> really 'exclusive' masks.
> 
> Or in other words - there is always a desired clos (0) which has all
> parts set which acts like a default pool.
> 
> Also the parts can overlap.  Please apply this for all the below
> comments which will change the way they work.
> >>>
> >>>
> 
> >
> >p means part.
> 
> I am assuming p = (a contiguous cache capacity bit mask)
> 
> >closid 1 is a exclusive cgroup.
> >closid 2 is a "cache hog" class.
> >closid 3 is "default closid".
> >
> >Desiredclos is what user has specified.
> >
> >Transition 1: desiredclos --> effectiveclos
> >Clean all bits of unused closid's
> >(that must be updated whenever a
> >closid1 cgroup goes from empty->nonempty
> >and vice-versa).
> >
> >effectiveclos (closid  p1  p2  p3 p4)
> >1   0   0   0  0
> >2   0   0   0  1
> >3   0   1   1  0
> 
> >
> >Transition 2: effectiveclos --> expandedclos

[summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Vikas Shivappa


To summarize the ever-growing thread:

1. The rdt_cgroup can be used to configure exclusive cache bitmaps for the child
nodes, which can be used for the scenarios which Marcelo mentions.


Simple examples which were mentioned:
max bitmask length: 16, hence full mask is 0xffff
groupx_realtime - 0xff
group2_systemtraffic - 0xf : put a lot of tasks from the root node here, or
whichever group is offending and thrashing.

groupy_mytraffic - 0x0f

Now groupx has its own area of cache that can be used by the realtime/(specific
scenario) apps. Similarly configure any groupy.
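The constraint behind these masks, discussed earlier in the thread, is that a
child's bitmask must be contiguous and a subset of its parent's. A sketch of
the checks a mask write would have to pass - illustrative only, not the
patch's actual code:

    #include <stdbool.h>
    #include <stdint.h>

    /* A CBM must be a single contiguous run of set bits. */
    static bool cbm_is_contiguous(uint32_t cbm)
    {
        if (cbm == 0)
            return false;
        /* drop trailing zeros; a single run then satisfies x & (x + 1) == 0 */
        uint32_t run = cbm >> __builtin_ctz(cbm);

        return (run & (run + 1)) == 0;
    }

    /* Valid child mask: contiguous and fully contained in the parent's mask. */
    static bool cbm_valid_child(uint32_t cbm, uint32_t parent)
    {
        return cbm_is_contiguous(cbm) && (cbm & ~parent) == 0;
    }

With the masks above, 0xff and 0x0f are both contiguous subsets of 0xffff, so
both are legal even though they overlap.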


2. Can the maps let you specify which cache ways the cache is allocated
in? - No, this is
implementation specific as mentioned in the SDM. So when we configure a mask,
you really don't know which ways or which exact lines are used on which SKUs.
We also may not see any use case
that needs apps to allocate cache in specific areas, and the h/w does
not support this either.


3. Letting the user specify size in bytes instead of a bitmap: we have already
gone through this discussion in older versions. The user can simply check the
size of the total cache and work out what size each map would be. I don't see a
special need for an interface that takes the cache size in bytes and then rounds
off - the user could instead apply the round-off values beforehand, or in other
words it happens automatically when he specifies the bitmask.


ex: find cache size from /proc/cpuinfo - say 20MB
bitmask max - 0xfffff

This means the roundoff (chunk) size supported is only 1MB, so when you specify
the mask, say 0x3 (2MB), that's already taken care of.

Same applies to percentage - the masks automatically round off the percentage.

Please note that this is quite different from the way we can allocate memory in 
bytes and needs to be treated differently given that the hardware provides interface 
in a particular way.


4. Letting the kernel automatically extend the bitmap may affect a lot of other
things and will need a lot of heuristics - note that we have overlapping masks.
This interface lets the super-user control the cache allocation, and it may be
very confusing for the user if he has allocated a cache mask and suddenly, from
under the floor, the kernel changes it.


Thanks,
Vikas


On Fri, 31 Jul 2015, Marcelo Tosatti wrote:


On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:



On Thu, 30 Jul 2015, Marcelo Tosatti wrote:


On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:


How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0


#1 Currently in the rdt cgroup , the root cgroup always has all the
bits set and cant be changed (because the cgroup hierarchy would by
default make this to have all bits as all the children need to have
a subset of the root's bitmask). So if the user creates a cgroup and
not put any task in it , the tasks in the root cgroup could be still
using that part of the cache. Thats the reason i say we can have
really 'exclusive' masks.

Or in other words - there is always a desired clos (0) which has all
parts set which acts like a default pool.

Also the parts can overlap.  Please apply this for all the below
comments which will change the way they work.







p means part.


I am assuming p = (a contiguous cache capacity bit mask)


closid 1 is a exclusive cgroup.
closid 2 is a "cache hog" class.
closid 3 is "default closid".

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0




Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);



Currently the root cgroup would have all the bits set which will act
like a default cgroup where all the otherwise unused parts (assuming
they are a set of contiguous cache capacity bits) will be used.


Right, but we don't want to place tasks in there in case one cgroup
wants exclusive cache access.

So whenever you want an exclusive cgroup you'd do:


Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Marcelo Tosatti
On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> > 
> > 
> > Marcello,
> > 
> > 
> > On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> > >
> > >How about this:
> > >
> > >desiredclos (closid  p1  p2  p3 p4)
> > >1   1   0   0  0
> > >2   0   0   0  1
> > >3   0   1   1  0
> > 
> > #1 Currently in the rdt cgroup , the root cgroup always has all the
> > bits set and cant be changed (because the cgroup hierarchy would by
> > default make this to have all bits as all the children need to have
> > a subset of the root's bitmask). So if the user creates a cgroup and
> > not put any task in it , the tasks in the root cgroup could be still
> > using that part of the cache. Thats the reason i say we can have
> > really 'exclusive' masks.
> > 
> > Or in other words - there is always a desired clos (0) which has all
> > parts set which acts like a default pool.
> > 
> > Also the parts can overlap.  Please apply this for all the below
> > comments which will change the way they work.
> > 
> > >
> > >p means part.
> > 
> > I am assuming p = (a contiguous cache capacity bit mask)
> 
> Yes.
> 
> > >closid 1 is a exclusive cgroup.
> > >closid 2 is a "cache hog" class.
> > >closid 3 is "default closid".
> > >
> > >Desiredclos is what user has specified.
> > >
> > >Transition 1: desiredclos --> effectiveclos
> > >Clean all bits of unused closid's
> > >(that must be updated whenever a
> > >closid1 cgroup goes from empty->nonempty
> > >and vice-versa).
> > >
> > >effectiveclos (closid  p1  p2  p3 p4)
> > >  1   0   0   0  0
> > >  2   0   0   0  1
> > >  3   0   1   1  0
> > 
> > >
> > >Transition 2: effectiveclos --> expandedclos
> > >expandedclos (closid  p1  p2  p3 p4)
> > >  1   0   0   0  0
> > >  2   0   0   0  1
> > >  3   1   1   1  0
> > >Then you have different inplacecos for each
> > >CPU (see pseudo-code below):
> > >
> > >On the following events.
> > >
> > >- task migration to new pCPU:
> > >- task creation:
> > >
> > >   id = smp_processor_id();
> > >   for (part = desiredclos.p1; ...; part++)
> > >   /* if my cosid is set and any other
> > >  cosid is clear, for the part,
> > >  synchronize desiredclos --> inplacecos */
> > >   if (part[mycosid] == 1 &&
> > >   part[any_othercosid] == 0)
> > >   wrmsr(part, desiredclos);
> > >
> > 
> > Currently the root cgroup would have all the bits set which will act
> > like a default cgroup where all the otherwise unused parts (assuming
> > they are a set of contiguous cache capacity bits) will be used.
> > 
> > Otherwise the question is in the expandedclos - who decides to
> > expand the closx parts to include some of the unused parts.. - that
> > could just be a default root always ?
> 
> Right, so the problem is for certain closid's you might never want 
> to expand (because doing so would cause data to be cached in a
> cache way which might have high eviction rate in the future).
> See the example from Will.
> 
> But for the default cache (that is "unclassified applications" 
> i suppose it is beneficial to expand in most cases, that is, 
> use maximum amount of cache irrespective of eviction rate, which 
> is the behaviour that exists now without CAT).
> 
> So perhaps a new flag "expand=y/n" can be added to the cgroup 
> directories... What do you say?
> 
> Userspace representation of CAT
> ---
> 
> Usage model:
> 1) measure application performance without L3 cache reservation.
> 2) measure application perf with L3 cache reservation and
> X number of cache ways until desired performance is attained.
> 
> Requirements:
> 1) Persistency of CLOS configuration across hardware. On migration
> of operating system or application between different hardware
> systems we'd like the following to be maintained:
> - exclusive number of bytes (*) reserved to a certain CLOSid.
> - shared number of bytes (*) reserved between a certain group
>   of CLOSid's.
> 
> For both code and data, rounded down or up in cache way size.
> 
> 2) Reasoning:
> Different CBM masks in different hardware platforms might be necessary
> to specify the same CLOS configuration, in terms of exclusive number of
> bytes and shared number of bytes. (cache-way rounded number of bytes).
> For example, due to L3 allocation by other hardware entities in certain parts
> of the cache it might be necessary to relocate CBM mask to achieve
> the same CLOS configuration.
> 
> 3) Proposed format:
> 
> sharedregionK.exclusive - Number of exclusive cache bytes reserved for 
>   shared region.
> sharedregionK.excl_data - Number of exclusive cache data bytes reserved for 
>   shared region.
> sharedregionK.excl_bytes - Number of exclusive cache code bytes reserved for 
>   shared region.
> sharedregionK.round_down - Round down to cache way bytes from respective
>   number specification (default is round up).
> sharedregionK.expand - y/n - Expand shared region to more cache ways
>   when available
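Read as a data structure, one sharedregionK entry might look roughly like the
sketch below. Field names follow the proposal above (the code-bytes field is
listed as excl_bytes there but describes code bytes, so the sketch calls it
excl_code); this is an illustration, not an agreed interface.

    #include <stdbool.h>
    #include <stdint.h>

    /* One shared region in the user-visible, byte-based specification. */
    struct shared_region_spec {
        uint64_t exclusive;   /* exclusive cache bytes for this region  */
        uint64_t excl_data;   /* exclusive data bytes (code/data split) */
        uint64_t excl_code;   /* exclusive code bytes                   */
        bool     round_down;  /* round down to cache-way size, else up  */
        bool     expand;      /* may grow into otherwise unused ways    */
    };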

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Marcelo Tosatti
On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:
> 
> 
> On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
> 
> >On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> >>
> >>
> >>Marcello,
> >>
> >>
> >>On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >>>
> >>>How about this:
> >>>
> >>>desiredclos (closid  p1  p2  p3 p4)
> >>>1   1   0   0  0
> >>>2   0   0   0  1
> >>>3   0   1   1  0
> >>
> >>#1 Currently in the rdt cgroup , the root cgroup always has all the
> >>bits set and cant be changed (because the cgroup hierarchy would by
> >>default make this to have all bits as all the children need to have
> >>a subset of the root's bitmask). So if the user creates a cgroup and
> >>not put any task in it , the tasks in the root cgroup could be still
> >>using that part of the cache. Thats the reason i say we can have
> >>really 'exclusive' masks.
> >>
> >>Or in other words - there is always a desired clos (0) which has all
> >>parts set which acts like a default pool.
> >>
> >>Also the parts can overlap.  Please apply this for all the below
> >>comments which will change the way they work.
> >
> >
> >>
> >>>
> >>>p means part.
> >>
> >>I am assuming p = (a contiguous cache capacity bit mask)
> >>
> >>>closid 1 is a exclusive cgroup.
> >>>closid 2 is a "cache hog" class.
> >>>closid 3 is "default closid".
> >>>
> >>>Desiredclos is what user has specified.
> >>>
> >>>Transition 1: desiredclos --> effectiveclos
> >>>Clean all bits of unused closid's
> >>>(that must be updated whenever a
> >>>closid1 cgroup goes from empty->nonempty
> >>>and vice-versa).
> >>>
> >>>effectiveclos (closid  p1  p2  p3 p4)
> >>>  1   0   0   0  0
> >>>  2   0   0   0  1
> >>>  3   0   1   1  0
> >>
> >>>
> >>>Transition 2: effectiveclos --> expandedclos
> >>>expandedclos (closid  p1  p2  p3 p4)
> >>>  1   0   0   0  0
> >>>  2   0   0   0  1
> >>>  3   1   1   1  0
> >>>Then you have different inplacecos for each
> >>>CPU (see pseudo-code below):
> >>>
> >>>On the following events.
> >>>
> >>>- task migration to new pCPU:
> >>>- task creation:
> >>>
> >>>   id = smp_processor_id();
> >>>   for (part = desiredclos.p1; ...; part++)
> >>>   /* if my cosid is set and any other
> >>>  cosid is clear, for the part,
> >>>  synchronize desiredclos --> inplacecos */
> >>>   if (part[mycosid] == 1 &&
> >>>   part[any_othercosid] == 0)
> >>>   wrmsr(part, desiredclos);
> >>>
> >>
> >>Currently the root cgroup would have all the bits set which will act
> >>like a default cgroup where all the otherwise unused parts (assuming
> >>they are a set of contiguous cache capacity bits) will be used.
> >
> >Right, but we don't want to place tasks in there in case one cgroup
> >wants exclusive cache access.
> >
> >So whenever you want an exclusive cgroup you'd do:
> >
> >create cgroup-exclusive; reserve desired part of the cache
> >for it.
> >create cgroup-default; reserved all cache minus that of cgroup-exclusive
> >for it.
> >
> >place tasks that belong to cgroup-exclusive into it.
> >place all other tasks (including init) into cgroup-default.
> >
> >Is that right?
> 
> Yes you could do that.
> 
> You can create cgroups to have masks which are exclusive in todays
> implementation, just that you could also created more cgroups to
> overlap the masks again.. iow we dont have an exclusive flag for the
> cgroup mask.
> Is that a common use case in the server environment that you need to
> prevent other cgroups from using a certain mask ? (since the root
> user should control these allocations .. he should know?)

Yes, there are two known use-cases that have this characteristic:

1) High performance numeric application which has been optimized
to a certain fraction of the cache.

2) Low latency application in multi-application OS.

For both cases exclusive cache access is wanted.
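Below is a rough usage sketch of the exclusive/default split described above.
The mount point, the attribute name "intel_rdt.l3_cbm" and the 16-bit masks
are assumptions made for the illustration, not taken from the patch; and, as
noted earlier in the thread, tasks left in the root cgroup can still touch the
"exclusive" ways because the root mask keeps all bits set.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_str(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, val, strlen(val)) < 0)
            perror(path);
        if (fd >= 0)
            close(fd);
    }

    int main(void)
    {
        /* 16-bit CBM: 4 ways for the exclusive group, 12 for the rest */
        mkdir("/sys/fs/cgroup/rdt/exclusive", 0755);
        mkdir("/sys/fs/cgroup/rdt/default", 0755);

        write_str("/sys/fs/cgroup/rdt/exclusive/intel_rdt.l3_cbm", "0xf000");
        write_str("/sys/fs/cgroup/rdt/default/intel_rdt.l3_cbm", "0x0fff");

        /* the latency-sensitive task goes into the exclusive group;   */
        /* every other pid would be moved into the default group.      */
        write_str("/sys/fs/cgroup/rdt/exclusive/tasks", "1234");
        return 0;
    }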

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Marcelo Tosatti
On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:
 
 
 On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
 
 On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
 
 
 Marcello,
 
 
 On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
 
 How about this:
 
 desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0
 
 #1 Currently in the rdt cgroup , the root cgroup always has all the
 bits set and cant be changed (because the cgroup hierarchy would by
 default make this to have all bits as all the children need to have
 a subset of the root's bitmask). So if the user creates a cgroup and
 not put any task in it , the tasks in the root cgroup could be still
 using that part of the cache. Thats the reason i say we can have
 really 'exclusive' masks.
 
 Or in other words - there is always a desired clos (0) which has all
 parts set which acts like a default pool.
 
 Also the parts can overlap.  Please apply this for all the below
 comments which will change the way they work.
 
 
 
 
 p means part.
 
 I am assuming p = (a contiguous cache capacity bit mask)
 
 closid 1 is a exclusive cgroup.
 closid 2 is a cache hog class.
 closid 3 is default closid.
 
 Desiredclos is what user has specified.
 
 Transition 1: desiredclos -- effectiveclos
 Clean all bits of unused closid's
 (that must be updated whenever a
 closid1 cgroup goes from empty-nonempty
 and vice-versa).
 
 effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0
 
 
 Transition 2: effectiveclos -- expandedclos
 expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
 Then you have different inplacecos for each
 CPU (see pseudo-code below):
 
 On the following events.
 
 - task migration to new pCPU:
 - task creation:
 
id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos -- inplacecos */
if (part[mycosid] == 1 
part[any_othercosid] == 0)
wrmsr(part, desiredclos);
 
 
 Currently the root cgroup would have all the bits set which will act
 like a default cgroup where all the otherwise unused parts (assuming
 they are a set of contiguous cache capacity bits) will be used.
 
 Right, but we don't want to place tasks in there in case one cgroup
 wants exclusive cache access.
 
 So whenever you want an exclusive cgroup you'd do:
 
 create cgroup-exclusive; reserve desired part of the cache
 for it.
 create cgroup-default; reserved all cache minus that of cgroup-exclusive
 for it.
 
 place tasks that belong to cgroup-exclusive into it.
 place all other tasks (including init) into cgroup-default.
 
 Is that right?
 
 Yes you could do that.
 
 You can create cgroups to have masks which are exclusive in todays
 implementation, just that you could also created more cgroups to
 overlap the masks again.. iow we dont have an exclusive flag for the
 cgroup mask.
 Is that a common use case in the server environment that you need to
 prevent other cgroups from using a certain mask ? (since the root
 user should control these allocations .. he should know?)

Yes, there are two known use-cases that have this characteristic:

1) High performance numeric application which has been optimized
to a certain fraction of the cache.

2) Low latency application in multi-application OS.

For both cases exclusive cache access is wanted.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Marcelo Tosatti
On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
 On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
  
  
  Marcello,
  
  
  On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
  
  How about this:
  
  desiredclos (closid  p1  p2  p3 p4)
  1   1   0   0  0
  2   0   0   0  1
  3   0   1   1  0
  
  #1 Currently in the rdt cgroup , the root cgroup always has all the
  bits set and cant be changed (because the cgroup hierarchy would by
  default make this to have all bits as all the children need to have
  a subset of the root's bitmask). So if the user creates a cgroup and
  not put any task in it , the tasks in the root cgroup could be still
  using that part of the cache. Thats the reason i say we can have
  really 'exclusive' masks.
  
  Or in other words - there is always a desired clos (0) which has all
  parts set which acts like a default pool.
  
  Also the parts can overlap.  Please apply this for all the below
  comments which will change the way they work.
  
  
  p means part.
  
  I am assuming p = (a contiguous cache capacity bit mask)
 
 Yes.
 
  closid 1 is a exclusive cgroup.
  closid 2 is a cache hog class.
  closid 3 is default closid.
  
  Desiredclos is what user has specified.
  
  Transition 1: desiredclos -- effectiveclos
  Clean all bits of unused closid's
  (that must be updated whenever a
  closid1 cgroup goes from empty-nonempty
  and vice-versa).
  
  effectiveclos (closid  p1  p2  p3 p4)
1   0   0   0  0
2   0   0   0  1
3   0   1   1  0
  
  
  Transition 2: effectiveclos -- expandedclos
  expandedclos (closid  p1  p2  p3 p4)
1   0   0   0  0
2   0   0   0  1
3   1   1   1  0
  Then you have different inplacecos for each
  CPU (see pseudo-code below):
  
  On the following events.
  
  - task migration to new pCPU:
  - task creation:
  
 id = smp_processor_id();
 for (part = desiredclos.p1; ...; part++)
 /* if my cosid is set and any other
cosid is clear, for the part,
synchronize desiredclos -- inplacecos */
 if (part[mycosid] == 1 
 part[any_othercosid] == 0)
 wrmsr(part, desiredclos);
  
  
  Currently the root cgroup would have all the bits set which will act
  like a default cgroup where all the otherwise unused parts (assuming
  they are a set of contiguous cache capacity bits) will be used.
  
  Otherwise the question is in the expandedclos - who decides to
  expand the closx parts to include some of the unused parts.. - that
  could just be a default root always ?
 
 Right, so the problem is for certain closid's you might never want 
 to expand (because doing so would cause data to be cached in a
 cache way which might have high eviction rate in the future).
 See the example from Will.
 
 But for the default cache (that is unclassified applications 
 i suppose it is beneficial to expand in most cases, that is, 
 use maximum amount of cache irrespective of eviction rate, which 
 is the behaviour that exists now without CAT).
 
 So perhaps a new flag expand=y/n can be added to the cgroup 
 directories... What do you say?
 
 Userspace representation of CAT
 ---
 
 Usage model:
 1) measure application performance without L3 cache reservation.
 2) measure application perf with L3 cache reservation and
 X number of cache ways until desired performance is attained.
 
 Requirements:
 1) Persistency of CLOS configuration across hardware. On migration
 of operating system or application between different hardware
 systems we'd like the following to be maintained:
 - exclusive number of bytes (*) reserved to a certain CLOSid.
 - shared number of bytes (*) reserved between a certain group
   of CLOSid's.
 
 For both code and data, rounded down or up in cache way size.
 
 2) Reasoning:
 Different CBM masks in different hardware platforms might be necessary
 to specify the same CLOS configuration, in terms of exclusive number of
 bytes and shared number of bytes. (cache-way rounded number of bytes).
 For example, due to L3 allocation by other hardware entities in certain parts
 of the cache it might be necessary to relocate CBM mask to achieve
 the same CLOS configuration.
 
 3) Proposed format:
 
 sharedregionK.exclusive - Number of exclusive cache bytes reserved for 
   shared region.
 sharedregionK.excl_data - Number of exclusive cache data bytes reserved for 
   shared region.
sharedregionK.excl_code - Number of exclusive cache code bytes reserved for 
   shared region.
 sharedregionK.round_down - Round down to cache way bytes from respective 
 number
specification (default is round up).
 sharedregionK.expand - y/n - Expand shared region to more cache ways
  when available (default N).

[summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Vikas Shivappa


To summarize the ever-growing thread:

1. the rdt_cgroup can be used to configure exclusive cache bitmaps for the child 
nodes which can be used for the scenarios which Marcello mentions.


simple examples which were mentioned:
max bitmask length : 16, hence full mask is 0xffff
groupx_realtime - 0xff .
group2_systemtraffic - 0xf. : put a lot of tasks from root node to here or whichever
is offending and thrashing.

groupy_mytraffic - 0x0f

Now the groupx has its own area of cache that can be used by the realtime/(specific 
scenario) apps. Similarly configure any groupy.
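
To make the example concrete, a minimal C sketch (illustration only, not kernel
code; the mask values are taken from the example above):

    /*
     * Sketch only: the group masks are plain bitmasks, so whether two
     * groups share cache capacity is simply whether their masks intersect.
     */
    #include <stdio.h>

    int main(void)
    {
            unsigned int full_mask = 0xffff;  /* 16-bit capacity bitmask */
            unsigned int groupx    = 0xff;    /* groupx_realtime         */
            unsigned int groupy    = 0x0f;    /* groupy_mytraffic        */

            if (groupx & groupy)
                    printf("groupx and groupy overlap on 0x%x\n",
                           groupx & groupy);

            /* capacity no bit of groupx touches, usable by other groups */
            printf("capacity outside groupx: 0x%x\n", full_mask & ~groupx);
            return 0;
    }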


2. Can the masks let you specify which cache ways the cache is allocated in?
- No, this is implementation specific as mentioned in the SDM. So when we
configure a mask, you really don't know which ways or which exact lines are
used on which SKUs. We may not see a use case which needs apps to allocate
cache in specific areas, and the h/w does not support this either.


3. Letting the user specify size in bytes instead of bitmap: we have already
gone through this discussion in older versions. The user can simply check the
size of the total cache and understand what mask could be what size. I don't see
a special need for an interface to enter the cache size in bytes and then round
off - the user could instead use the roundoff values beforehand, or iow it is
done automatically when he specifies the bitmask.


ex: find cache size from /proc/cpuinfo - say 20MB
bitmask max - 0xfffff (20 bits)

This means the roundoff(chunk) size supported is only 1MB, so when you specify
the mask, say 0x3 (2MB), that's already taken care of.

Same applies to percentage - the masks automatically round off the percentage.

Please note that this is quite different from the way we can allocate memory in 
bytes and needs to be treated differently given that the hardware provides interface 
in a particular way.
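
To make the roundoff concrete, a minimal C sketch (illustration only; no such
helper exists in the patches), assuming the 20MB L3 and 20-bit CBM from the
example above, so each bit stands for a 1MB chunk:

    #include <stdio.h>

    /* Sketch only: round a byte request up to whole chunks and build
     * the corresponding contiguous low-order bitmask. */
    static unsigned int bytes_to_cbm(unsigned long cache_bytes,
                                     unsigned int cbm_len,
                                     unsigned long req_bytes)
    {
            unsigned long chunk = cache_bytes / cbm_len;           /* 1MB here  */
            unsigned long bits  = (req_bytes + chunk - 1) / chunk; /* round up  */

            if (bits == 0)
                    bits = 1;
            if (bits > cbm_len)
                    bits = cbm_len;
            return (1u << bits) - 1;   /* contiguous mask starting at bit 0 */
    }

    int main(void)
    {
            /* a 2MB request rounds to two 1MB chunks, i.e. mask 0x3 */
            printf("0x%x\n", bytes_to_cbm(20ul << 20, 20, 2ul << 20));
            return 0;
    }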


4. Letting the kernel automatically extend the bitmap may affect a lot of other 
things and will need a lot of heuristics - note that we have overlapping masks . 
This interface lets the super-user control the cache allocation and it may be 
very confusing for the user if he has allocated a cache mask and suddenly from 
under the floor the kernel changes it.


Thanks,
Vikas


On Fri, 31 Jul 2015, Marcelo Tosatti wrote:


On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:



On Thu, 30 Jul 2015, Marcelo Tosatti wrote:


On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:


How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0


#1 Currently in the rdt cgroup , the root cgroup always has all the
bits set and cant be changed (because the cgroup hierarchy would by
default make this to have all bits as all the children need to have
a subset of the root's bitmask). So if the user creates a cgroup and
not put any task in it , the tasks in the root cgroup could be still
using that part of the cache. Thats the reason i say we can have
really 'exclusive' masks.

Or in other words - there is always a desired clos (0) which has all
parts set which acts like a default pool.

Also the parts can overlap.  Please apply this for all the below
comments which will change the way they work.







p means part.


I am assuming p = (a contiguous cache capacity bit mask)


closid 1 is a exclusive cgroup.
closid 2 is a cache hog class.
closid 3 is default closid.

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0




Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);



Currently the root cgroup would have all the bits set which will act
like a default cgroup where all the otherwise unused parts (assuming
they are a set of contiguous cache capacity bits) will be used.


Right, but we don't want to place tasks in there in case one cgroup
wants exclusive cache access.

So whenever you want an exclusive cgroup you'd do:


Re: [summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Marcelo Tosatti
On Fri, Jul 31, 2015 at 09:41:58AM -0700, Vikas Shivappa wrote:
 
 To summarize  the ever growing thread :
 
 1. the rdt_cgroup can be used to configure exclusive cache bitmaps
 for the child nodes which can be used for the scenarios which
 Marcello mentions.
 
 simle examples which were mentioned :
 max bitmask length : 16 . hence full mask is 0x
 groupx_realtime - 0xff .
 group2_systemtraffic - 0xf. : put a lot of tasks from root node to
 here or which ever is offending and thrashing.
 groupy_mytraffic - 0x0f
 
 Now the groupx has its own area of cache that can used by the
 realtime/(specific scenario) apps. Similarly configure any groupy.
 
 2. Can the maps can let you specify which cache ways ways the cache
 is allocated ? - No , this is implementation specific as mentioned
 in the SDM. So when we configure a mask , you really dont know which
 ways or which exact lines are used on which SKUs .. We may not see
 any use case as well which is needed for apps to allocate cache in
 specific areas and the h/w does not support this as well.

Ok, can you comment whether the userspace interface proposed addresses
all your use cases ?

 3. Letting the user specify size in bytes instead of bitmap : we
 have already gone through this discussion in older versions. The
 user can simply check the size of the total cache and understand
 what map could be what size. I dont see a special need to specify an
 interface to enter the cache in bytes and then round off - user
 could instead use the roundoff values before hand or iow it
 automatically does when he specifies the bitmask.

When you move from processor A with CBM bitmask format X to hardware B
with CBM bitmask format Y, and the formats Y and X are different, you
have to manually adjust the format.

Please reply to the userspace proposal, the problem is very explicit
there.

 ex: find cache size from /proc/cpuinfo. - say 20MB
 bitmask max - 0xf.
 
 This means the roundoff(chunk) size supported is only 1MB , so when
 you specify the mask say 0x3(2MB) thats already taken care of.
 Same applies to percentage - the masks automatically round off the percentage.
 
 Please note that this is quite different from the way we can
 allocate memory in bytes and needs to be treated differently given
 that the hardware provides interface in a particular way.
 
 4. Letting the kernel automatically extend the bitmap may affect a
 lot of other things 

Lets talk about them. What other things?

 and will need a lot of heuristics - note that we
 have overlapping masks.

I proposed a way to avoid heuristics by exposing whether the cgroup is 
expandable or not and asked your input.

We really do not want to waste cache if we can avoid it.

 This interface lets the super-user control
 the cache allocation and it may be very confusing for the user if he
 has allocated a cache mask and suddenly from under the floor the
 kernel changes it.

Agree.

 
 Thanks,
 Vikas
 
 
 On Fri, 31 Jul 2015, Marcelo Tosatti wrote:
 
 On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:
 
 
 On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
 
 On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
 
 
 Marcello,
 
 
 On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
 
 How about this:
 
 desiredclos (closid  p1  p2  p3 p4)
   1   1   0   0  0
   2   0   0   0  1
   3   0   1   1  0
 
 #1 Currently in the rdt cgroup , the root cgroup always has all the
 bits set and cant be changed (because the cgroup hierarchy would by
 default make this to have all bits as all the children need to have
 a subset of the root's bitmask). So if the user creates a cgroup and
 not put any task in it , the tasks in the root cgroup could be still
 using that part of the cache. Thats the reason i say we can have
 really 'exclusive' masks.
 
 Or in other words - there is always a desired clos (0) which has all
 parts set which acts like a default pool.
 
 Also the parts can overlap.  Please apply this for all the below
 comments which will change the way they work.
 
 
 
 
 p means part.
 
 I am assuming p = (a contiguous cache capacity bit mask)
 
 closid 1 is a exclusive cgroup.
 closid 2 is a cache hog class.
 closid 3 is default closid.
 
 Desiredclos is what user has specified.
 
Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
 and vice-versa).
 
 effectiveclos (closid  p1  p2  p3 p4)
 1   0   0   0  0
 2   0   0   0  1
 3   0   1   1  0
 
 
Transition 2: effectiveclos --> expandedclos
 expandedclos (closid  p1  p2  p3 p4)
 1   0   0   0  0
 2   0   0   0  1
 3   1   1   1  0
 Then you have different inplacecos for each
 CPU (see pseudo-code below):
 
 On the following events.
 
 - task migration to new pCPU:
 - task creation:
 
  id = smp_processor_id();
  for (part = desiredclos.p1; ...; part++)
  /* 

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-30 Thread Vikas Shivappa



On Thu, 30 Jul 2015, Marcelo Tosatti wrote:


On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:


How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0


#1 Currently in the rdt cgroup , the root cgroup always has all the
bits set and cant be changed (because the cgroup hierarchy would by
default make this to have all bits as all the children need to have
a subset of the root's bitmask). So if the user creates a cgroup and
not put any task in it , the tasks in the root cgroup could be still
using that part of the cache. Thats the reason i say we can have
really 'exclusive' masks.

Or in other words - there is always a desired clos (0) which has all
parts set which acts like a default pool.

Also the parts can overlap.  Please apply this for all the below
comments which will change the way they work.







p means part.


I am assuming p = (a contiguous cache capacity bit mask)


closid 1 is a exclusive cgroup.
closid 2 is a "cache hog" class.
closid 3 is "default closid".

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0




Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);



Currently the root cgroup would have all the bits set which will act
like a default cgroup where all the otherwise unused parts (assuming
they are a set of contiguous cache capacity bits) will be used.


Right, but we don't want to place tasks in there in case one cgroup
wants exclusive cache access.

So whenever you want an exclusive cgroup you'd do:

create cgroup-exclusive; reserve desired part of the cache
for it.
create cgroup-default; reserved all cache minus that of cgroup-exclusive
for it.

place tasks that belong to cgroup-exclusive into it.
place all other tasks (including init) into cgroup-default.

Is that right?


Yes you could do that.

You can create cgroups to have masks which are exclusive in today's 
implementation, just that you could also create more cgroups to overlap the 
masks again.. iow we don't have an exclusive flag for the cgroup mask.
Is that a common use case in 
the server environment that you need to prevent other cgroups from using a 
certain mask ? (since the root user should control these allocations .. he 
should know?)








Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-30 Thread Marcelo Tosatti
On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> 
> 
> Marcello,
> 
> 
> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >
> >How about this:
> >
> >desiredclos (closid  p1  p2  p3 p4)
> >  1   1   0   0  0
> >  2   0   0   0  1
> >  3   0   1   1  0
> 
> #1 Currently in the rdt cgroup , the root cgroup always has all the
> bits set and cant be changed (because the cgroup hierarchy would by
> default make this to have all bits as all the children need to have
> a subset of the root's bitmask). So if the user creates a cgroup and
> not put any task in it , the tasks in the root cgroup could be still
> using that part of the cache. Thats the reason i say we can have
> really 'exclusive' masks.
> 
> Or in other words - there is always a desired clos (0) which has all
> parts set which acts like a default pool.
> 
> Also the parts can overlap.  Please apply this for all the below
> comments which will change the way they work.


> 
> >
> >p means part.
> 
> I am assuming p = (a contiguous cache capacity bit mask)
> 
> >closid 1 is a exclusive cgroup.
> >closid 2 is a "cache hog" class.
> >closid 3 is "default closid".
> >
> >Desiredclos is what user has specified.
> >
> >Transition 1: desiredclos --> effectiveclos
> >Clean all bits of unused closid's
> >(that must be updated whenever a
> >closid1 cgroup goes from empty->nonempty
> >and vice-versa).
> >
> >effectiveclos (closid  p1  p2  p3 p4)
> >1   0   0   0  0
> >2   0   0   0  1
> >3   0   1   1  0
> 
> >
> >Transition 2: effectiveclos --> expandedclos
> >expandedclos (closid  p1  p2  p3 p4)
> >1   0   0   0  0
> >2   0   0   0  1
> >3   1   1   1  0
> >Then you have different inplacecos for each
> >CPU (see pseudo-code below):
> >
> >On the following events.
> >
> >- task migration to new pCPU:
> >- task creation:
> >
> > id = smp_processor_id();
> > for (part = desiredclos.p1; ...; part++)
> > /* if my cosid is set and any other
> >cosid is clear, for the part,
> >synchronize desiredclos --> inplacecos */
> > if (part[mycosid] == 1 &&
> > part[any_othercosid] == 0)
> > wrmsr(part, desiredclos);
> >
> 
> Currently the root cgroup would have all the bits set which will act
> like a default cgroup where all the otherwise unused parts (assuming
> they are a set of contiguous cache capacity bits) will be used.

Right, but we don't want to place tasks in there in case one cgroup
wants exclusive cache access.

So whenever you want an exclusive cgroup you'd do:

create cgroup-exclusive; reserve desired part of the cache 
for it.
create cgroup-default; reserved all cache minus that of cgroup-exclusive
for it.

place tasks that belong to cgroup-exclusive into it.
place all other tasks (including init) into cgroup-default.

Is that right?



Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-30 Thread Marcelo Tosatti
On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> 
> 
> Marcello,
> 
> 
> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >
> >How about this:
> >
> >desiredclos (closid  p1  p2  p3 p4)
> >  1   1   0   0  0
> >  2   0   0   0  1
> >  3   0   1   1  0
> 
> #1 Currently in the rdt cgroup , the root cgroup always has all the
> bits set and cant be changed (because the cgroup hierarchy would by
> default make this to have all bits as all the children need to have
> a subset of the root's bitmask). So if the user creates a cgroup and
> not put any task in it , the tasks in the root cgroup could be still
> using that part of the cache. Thats the reason i say we can have
> really 'exclusive' masks.
> 
> Or in other words - there is always a desired clos (0) which has all
> parts set which acts like a default pool.
> 
> Also the parts can overlap.  Please apply this for all the below
> comments which will change the way they work.
> 
> >
> >p means part.
> 
> I am assuming p = (a contiguous cache capacity bit mask)

Yes.

> >closid 1 is a exclusive cgroup.
> >closid 2 is a "cache hog" class.
> >closid 3 is "default closid".
> >
> >Desiredclos is what user has specified.
> >
> >Transition 1: desiredclos --> effectiveclos
> >Clean all bits of unused closid's
> >(that must be updated whenever a
> >closid1 cgroup goes from empty->nonempty
> >and vice-versa).
> >
> >effectiveclos (closid  p1  p2  p3 p4)
> >1   0   0   0  0
> >2   0   0   0  1
> >3   0   1   1  0
> 
> >
> >Transition 2: effectiveclos --> expandedclos
> >expandedclos (closid  p1  p2  p3 p4)
> >1   0   0   0  0
> >2   0   0   0  1
> >3   1   1   1  0
> >Then you have different inplacecos for each
> >CPU (see pseudo-code below):
> >
> >On the following events.
> >
> >- task migration to new pCPU:
> >- task creation:
> >
> > id = smp_processor_id();
> > for (part = desiredclos.p1; ...; part++)
> > /* if my cosid is set and any other
> >cosid is clear, for the part,
> >synchronize desiredclos --> inplacecos */
> > if (part[mycosid] == 1 &&
> > part[any_othercosid] == 0)
> > wrmsr(part, desiredclos);
> >
> 
> Currently the root cgroup would have all the bits set which will act
> like a default cgroup where all the otherwise unused parts (assuming
> they are a set of contiguous cache capacity bits) will be used.
> 
> Otherwise the question is in the expandedclos - who decides to
> expand the closx parts to include some of the unused parts.. - that
> could just be a default root always ?

Right, so the problem is for certain closid's you might never want 
to expand (because doing so would cause data to be cached in a
cache way which might have high eviction rate in the future).
See the example from Will.

But for the default cache (that is "unclassified applications" 
i suppose it is beneficial to expand in most cases, that is, 
use maximum amount of cache irrespective of eviction rate, which 
is the behaviour that exists now without CAT).

So perhaps a new flag "expand=y/n" can be added to the cgroup 
directories... What do you say?

Userspace representation of CAT
---

Usage model:
1) measure application performance without L3 cache reservation.
2) measure application perf with L3 cache reservation and
X number of cache ways until desired performance is attained.

Requirements:
1) Persistency of CLOS configuration across hardware. On migration
of operating system or application between different hardware
systems we'd like the following to be maintained:
- exclusive number of bytes (*) reserved to a certain CLOSid.
- shared number of bytes (*) reserved between a certain group
  of CLOSid's.

For both code and data, rounded down or up in cache way size.

2) Reasoning:
Different CBM masks in different hardware platforms might be necessary
to specify the same CLOS configuration, in terms of exclusive number of
bytes and shared number of bytes. (cache-way rounded number of bytes).
For example, due to L3 allocation by other hardware entities in certain parts
of the cache it might be necessary to relocate CBM mask to achieve
the same CLOS configuration.

3) Proposed format:

sharedregionK.exclusive - Number of exclusive cache bytes reserved for 
shared region.
sharedregionK.excl_data - Number of exclusive cache data bytes reserved for 
shared region.
sharedregionK.excl_code - Number of exclusive cache code bytes reserved for 
shared region.
sharedregionK.round_down - Round down to cache way bytes from respective number
 specification (default is round up).
sharedregionK.expand - y/n - Expand shared region to more cache ways
when available (default N).
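
To illustrate the proposed round_down semantics, a minimal C sketch
(illustration only; none of these names exist in the patches, and way_size
would be read from the running platform's cache topology):

    #include <stdio.h>

    /* Sketch only: convert an exclusive byte reservation into a number
     * of whole cache ways, honouring a round_down flag. */
    static unsigned long bytes_to_ways(unsigned long bytes,
                                       unsigned long way_size,
                                       int round_down)
    {
            unsigned long ways = bytes / way_size;

            if (!round_down && (bytes % way_size))
                    ways++;            /* default: round up to a whole way */
            return ways;
    }

    int main(void)
    {
            unsigned long way_size = 2560 * 1024;  /* e.g. a 2.5MB cache way */
            unsigned long req = 6ul << 20;         /* 6MB exclusive request  */

            printf("round up:   %lu ways\n", bytes_to_ways(req, way_size, 0));
            printf("round down: %lu ways\n", bytes_to_ways(req, way_size, 1));
            return 0;
    }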


Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-30 Thread Vikas Shivappa



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:


How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0


#1 Currently in the rdt cgroup, the root cgroup always has all the bits set and 
can't be changed (because the cgroup hierarchy would by default make this have 
all bits, as all the children need to have a subset of the root's bitmask). So if 
the user creates a cgroup and does not put any task in it, the tasks in the root 
cgroup could still be using that part of the cache. That's the reason I say we 
can't have really 'exclusive' masks.


Or in other words - there is always a desired clos (0) which has all parts set 
which acts like a default pool.


Also the parts can overlap.  Please apply this for all the below comments which 
will change the way they work.




p means part.


I am assuming p = (a contiguous cache capacity bit mask)


closid 1 is a exclusive cgroup.
closid 2 is a "cache hog" class.
closid 3 is "default closid".

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0




Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);



Currently the root cgroup would have all the bits set which will act like a 
default cgroup where all the otherwise unused parts (assuming they are a 
set of contiguous cache capacity bits) will be used.


Otherwise the question is in the expandedclos - who decides to expand the closx 
parts to include some of the unused parts.. - that could just be a default root 
always ?


Thanks,
Vikas







Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-30 Thread Marcelo Tosatti
On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
 
 
 Marcello,
 
 
 On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
 
 How about this:
 
 desiredclos (closid  p1  p2  p3 p4)
   1   1   0   0  0
   2   0   0   0  1
   3   0   1   1  0
 
 #1 Currently in the rdt cgroup , the root cgroup always has all the
 bits set and cant be changed (because the cgroup hierarchy would by
 default make this to have all bits as all the children need to have
 a subset of the root's bitmask). So if the user creates a cgroup and
 not put any task in it , the tasks in the root cgroup could be still
 using that part of the cache. Thats the reason i say we can have
 really 'exclusive' masks.
 
 Or in other words - there is always a desired clos (0) which has all
 parts set which acts like a default pool.
 
 Also the parts can overlap.  Please apply this for all the below
 comments which will change the way they work.
 
 
 p means part.
 
 I am assuming p = (a contiguous cache capacity bit mask)

Yes.

 closid 1 is a exclusive cgroup.
 closid 2 is a cache hog class.
 closid 3 is default closid.
 
 Desiredclos is what user has specified.
 
Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
 and vice-versa).
 
 effectiveclos (closid  p1  p2  p3 p4)
 1   0   0   0  0
 2   0   0   0  1
 3   0   1   1  0
 
 
Transition 2: effectiveclos --> expandedclos
 expandedclos (closid  p1  p2  p3 p4)
 1   0   0   0  0
 2   0   0   0  1
 3   1   1   1  0
 Then you have different inplacecos for each
 CPU (see pseudo-code below):
 
 On the following events.
 
 - task migration to new pCPU:
 - task creation:
 
  id = smp_processor_id();
  for (part = desiredclos.p1; ...; part++)
  /* if my cosid is set and any other
 cosid is clear, for the part,
  synchronize desiredclos --> inplacecos */
  if (part[mycosid] == 1 &&
  part[any_othercosid] == 0)
  wrmsr(part, desiredclos);
 
 
 Currently the root cgroup would have all the bits set which will act
 like a default cgroup where all the otherwise unused parts (assuming
 they are a set of contiguous cache capacity bits) will be used.
 
 Otherwise the question is in the expandedclos - who decides to
 expand the closx parts to include some of the unused parts.. - that
 could just be a default root always ?

Right, so the problem is for certain closid's you might never want 
to expand (because doing so would cause data to be cached in a
cache way which might have high eviction rate in the future).
See the example from Will.

But for the default cache (that is unclassified applications 
i suppose it is beneficial to expand in most cases, that is, 
use maximum amount of cache irrespective of eviction rate, which 
is the behaviour that exists now without CAT).

So perhaps a new flag expand=y/n can be added to the cgroup 
directories... What do you say?

Userspace representation of CAT
---

Usage model:
1) measure application performance without L3 cache reservation.
2) measure application perf with L3 cache reservation and
X number of cache ways until desired performance is attained.

Requirements:
1) Persistency of CLOS configuration across hardware. On migration
of operating system or application between different hardware
systems we'd like the following to be maintained:
- exclusive number of bytes (*) reserved to a certain CLOSid.
- shared number of bytes (*) reserved between a certain group
  of CLOSid's.

For both code and data, rounded down or up in cache way size.

2) Reasoning:
Different CBM masks in different hardware platforms might be necessary
to specify the same CLOS configuration, in terms of exclusive number of
bytes and shared number of bytes. (cache-way rounded number of bytes).
For example, due to L3 allocation by other hardware entities in certain parts
of the cache it might be necessary to relocate CBM mask to achieve
the same CLOS configuration.

3) Proposed format:

sharedregionK.exclusive - Number of exclusive cache bytes reserved for 
shared region.
sharedregionK.excl_data - Number of exclusive cache data bytes reserved for 
shared region.
sharedregionK.excl_code - Number of exclusive cache code bytes reserved for 
shared region.
sharedregionK.round_down - Round down to cache way bytes from respective number
 specification (default is round up).
sharedregionK.expand - y/n - Expand shared region to more cache ways
when available (default N).

cgroupN.exclusive - Number of exclusive L3 cache bytes reserved 
for cgroup.
cgroupN.excl_data - Number of exclusive L3 data 

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-30 Thread Marcelo Tosatti
On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
 
 
 Marcello,
 
 
 On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
 
 How about this:
 
 desiredclos (closid  p1  p2  p3 p4)
   1   1   0   0  0
   2   0   0   0  1
   3   0   1   1  0
 
 #1 Currently in the rdt cgroup , the root cgroup always has all the
 bits set and cant be changed (because the cgroup hierarchy would by
 default make this to have all bits as all the children need to have
 a subset of the root's bitmask). So if the user creates a cgroup and
 not put any task in it , the tasks in the root cgroup could be still
 using that part of the cache. Thats the reason i say we can have
 really 'exclusive' masks.
 
 Or in other words - there is always a desired clos (0) which has all
 parts set which acts like a default pool.
 
 Also the parts can overlap.  Please apply this for all the below
 comments which will change the way they work.


 
 
 p means part.
 
 I am assuming p = (a contiguous cache capacity bit mask)
 
 closid 1 is a exclusive cgroup.
 closid 2 is a cache hog class.
 closid 3 is default closid.
 
 Desiredclos is what user has specified.
 
Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
 and vice-versa).
 
 effectiveclos (closid  p1  p2  p3 p4)
 1   0   0   0  0
 2   0   0   0  1
 3   0   1   1  0
 
 
Transition 2: effectiveclos --> expandedclos
 expandedclos (closid  p1  p2  p3 p4)
 1   0   0   0  0
 2   0   0   0  1
 3   1   1   1  0
 Then you have different inplacecos for each
 CPU (see pseudo-code below):
 
 On the following events.
 
 - task migration to new pCPU:
 - task creation:
 
  id = smp_processor_id();
  for (part = desiredclos.p1; ...; part++)
  /* if my cosid is set and any other
 cosid is clear, for the part,
  synchronize desiredclos --> inplacecos */
   if (part[mycosid] == 1 &&
   part[any_othercosid] == 0)
  wrmsr(part, desiredclos);
 
 
 Currently the root cgroup would have all the bits set which will act
 like a default cgroup where all the otherwise unused parts (assuming
 they are a set of contiguous cache capacity bits) will be used.

Right, but we don't want to place tasks in there in case one cgroup
wants exclusive cache access.

So whenever you want an exclusive cgroup you'd do:

create cgroup-exclusive; reserve desired part of the cache 
for it.
create cgroup-default; reserved all cache minus that of cgroup-exclusive
for it.

place tasks that belong to cgroup-exclusive into it.
place all other tasks (including init) into cgroup-default.

Is that right?



Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-30 Thread Vikas Shivappa



On Thu, 30 Jul 2015, Marcelo Tosatti wrote:


On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:


How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0


#1 Currently in the rdt cgroup , the root cgroup always has all the
bits set and cant be changed (because the cgroup hierarchy would by
default make this to have all bits as all the children need to have
a subset of the root's bitmask). So if the user creates a cgroup and
not put any task in it , the tasks in the root cgroup could be still
using that part of the cache. Thats the reason i say we can have
really 'exclusive' masks.

Or in other words - there is always a desired clos (0) which has all
parts set which acts like a default pool.

Also the parts can overlap.  Please apply this for all the below
comments which will change the way they work.







p means part.


I am assuming p = (a contiguous cache capacity bit mask)


closid 1 is a exclusive cgroup.
closid 2 is a cache hog class.
closid 3 is default closid.

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0




Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);



Currently the root cgroup would have all the bits set which will act
like a default cgroup where all the otherwise unused parts (assuming
they are a set of contiguous cache capacity bits) will be used.


Right, but we don't want to place tasks in there in case one cgroup
wants exclusive cache access.

So whenever you want an exclusive cgroup you'd do:

create cgroup-exclusive; reserve desired part of the cache
for it.
create cgroup-default; reserved all cache minus that of cgroup-exclusive
for it.

place tasks that belong to cgroup-exclusive into it.
place all other tasks (including init) into cgroup-default.

Is that right?


Yes you could do that.

You can create cgroups to have masks which are exclusive in today's 
implementation, just that you could also create more cgroups to overlap the 
masks again.. iow we don't have an exclusive flag for the cgroup mask.
Is that a common use case in 
the server environment that you need to prevent other cgroups from using a 
certain mask ? (since the root user should control these allocations .. he 
should know?)








Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-30 Thread Vikas Shivappa



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:


How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0


#1 Currently in the rdt cgroup, the root cgroup always has all the bits set and 
can't be changed (because the cgroup hierarchy would by default make this have 
all bits, as all the children need to have a subset of the root's bitmask). So if 
the user creates a cgroup and does not put any task in it, the tasks in the root 
cgroup could still be using that part of the cache. That's the reason I say we 
can't have really 'exclusive' masks.


Or in other words - there is always a desired clos (0) which has all parts set 
which acts like a default pool.


Also the parts can overlap.  Please apply this for all the below comments which 
will change the way they work.




p means part.


I am assuming p = (a contiguous cache capacity bit mask)


closid 1 is a exclusive cgroup.
closid 2 is a cache hog class.
closid 3 is default closid.

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0




Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);



Currently the root cgroup would have all the bits set which will act like a 
default cgroup where all the otherwise unused parts (assuming they are a 
set of contiguous cache capacity bits) will be used.


Otherwise the question is in the expandedclos - who decides to expand the closx 
parts to include some of the unused parts.. - that could just be a default root 
always ?


Thanks,
Vikas







RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-29 Thread Vikas Shivappa



On Tue, 28 Jul 2015, Auld, Will wrote:





-Original Message-

Same comment as above - Cgroup masks can always overlap and other cgroups
can allocate the same cache , and hence wont have exclusive cache allocation.


[Auld, Will] You can define all the cbm to provide one clos with an exclusive 
area


Do you mean a CLOS that has all the bits set? We do not support an exclusive area 
today. The bits in the mask can overlap.. hence they can always share the same cache 
allocation.






So naturally the cgroup with tasks would get to use the cache if it has the 
same
mask (say representing 50% of cache in your example) as others .


[Auld, Will] automatic adjustment of the cbm make me nervous. There are times
when we want to limit the cache for a process independent of whether there is
lots of unused cache.



Please see the example below - in general, I just mean the cache mask can have bits 
that overlap - it does not matter whether there are tasks in it or not.





(assume there are 8 bits max cbm)
cgroupa - mask - 0xf
cgroupb - mask - 0xf . Now if cgroupa has no tasks , cgroupb naturally gets all
the cache.

Thanks,
Vikas



Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-29 Thread Marcelo Tosatti
On Wed, Jul 29, 2015 at 01:28:38AM +, Auld, Will wrote:
> > > Whenever cgroupE has zero tasks, remove exclusivity (by allowing other
> > > cgroups to use the exclusive ways of it).
> > 
> > Same comment as above - Cgroup masks can always overlap and other cgroups
> > can allocate the same cache , and hence wont have exclusive cache 
> > allocation.
> 
> [Auld, Will] You can define all the cbm to provide one clos with an exclusive 
> area
> 
> > 
> > So naturally the cgroup with tasks would get to use the cache if it has 
> > the same
> > mask (say representing 50% of cache in your example) as others .
>  
> [Auld, Will] automatic adjustment of the cbm make me nervous. There are times 
> when we want to limit the cache for a process independent of whether there is 
> lots of unused cache. 

How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0

p means part. 
closid 1 is a exclusive cgroup.
closid 2 is a "cache hog" class.
closid 3 is "default closid".

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a 
closid1 cgroup goes from empty->nonempty 
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0

Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0

Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 && 
part[any_othercosid] == 0)
wrmsr(part, desiredclos);
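
To make the intent concrete, a minimal C sketch that expands the pseudo-code
above (illustration only, not the kernel implementation; the printf() stands
in for the wrmsr and the table mirrors the desiredclos example):

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CLOS  4   /* closid 0 unused here, 1..3 as in the tables */
    #define NR_PARTS 4

    /* desiredclos indexed as [part][closid], mirroring the table above */
    static const int desiredclos[NR_PARTS][NR_CLOS] = {
            /* p1 */ { 0, 1, 0, 0 },
            /* p2 */ { 0, 0, 0, 1 },
            /* p3 */ { 0, 0, 0, 1 },
            /* p4 */ { 0, 0, 1, 0 },
    };

    static void sync_on_switch_in(int mycosid)
    {
            for (int part = 0; part < NR_PARTS; part++) {
                    bool others_clear = true;

                    for (int cos = 0; cos < NR_CLOS; cos++)
                            if (cos != mycosid && desiredclos[part][cos])
                                    others_clear = false;

                    /* "if my cosid is set and any other cosid is clear,
                     *  for the part, synchronize desiredclos --> inplacecos" */
                    if (desiredclos[part][mycosid] && others_clear)
                            printf("wrmsr part %d for closid %d\n",
                                   part, mycosid);
            }
    }

    int main(void)
    {
            sync_on_switch_in(1);   /* e.g. a task of the exclusive closid 1 */
            return 0;
    }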



Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-29 Thread Marcelo Tosatti
On Wed, Jul 29, 2015 at 01:28:38AM +, Auld, Will wrote:
   Whenever cgroupE has zero tasks, remove exclusivity (by allowing other
   cgroups to use the exclusive ways of it).
  
  Same comment as above - Cgroup masks can always overlap and other cgroups
  can allocate the same cache , and hence wont have exclusive cache 
  allocation.
 
 [Auld, Will] You can define all the cbm to provide one clos with an exclusive 
 area
 
  
  So naturally the cgroup with tasks would get to use the cache if it has 
  the same
  mask (say representing 50% of cache in your example) as others .
  
 [Auld, Will] automatic adjustment of the cbm make me nervous. There are times 
 when we want to limit the cache for a process independent of whether there is 
 lots of unused cache. 

How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0

p means part. 
closid 1 is a exclusive cgroup.
closid 2 is a cache hog class.
closid 3 is default closid.

Desiredclos is what user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a 
closid1 cgroup goes from empty->nonempty 
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0

Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0

Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
/* if my cosid is set and any other
   cosid is clear, for the part,
   synchronize desiredclos --> inplacecos */
if (part[mycosid] == 1 &&
part[any_othercosid] == 0)
wrmsr(part, desiredclos);



RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-29 Thread Vikas Shivappa



On Tue, 28 Jul 2015, Auld, Will wrote:





-Original Message-

Same comment as above - Cgroup masks can always overlap and other cgroups
can allocate the same cache , and hence wont have exclusive cache allocation.


[Auld, Will] You can define all the cbm to provide one clos with an exclusive 
area


Do you mean a CLOS that has all the bits set? We do not support an exclusive area 
today. The bits in the mask can overlap.. hence they can always share the same cache 
allocation.






So naturally the cgroup with tasks would get to use the cache if it has the 
same
mask (say representing 50% of cache in your example) as others .


[Auld, Will] automatic adjustment of the cbm make me nervous. There are times
when we want to limit the cache for a process independent of whether there is
lots of unused cache.



Please see the example below - in general, I just mean the cache mask can have bits 
that overlap - it does not matter whether there are tasks in it or not.





(assume there are 8 bits max cbm)
cgroupa - mask - 0xf
cgroupb - mask - 0xf . Now if cgroupa has no tasks , cgroupb naturally gets all
the cache.

Thanks,
Vikas



RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-28 Thread Auld, Will


> -Original Message-
> From: Shivappa, Vikas
> Sent: Tuesday, July 28, 2015 5:07 PM
> To: Marcelo Tosatti
> Cc: Vikas Shivappa; linux-kernel@vger.kernel.org; Shivappa, Vikas;
> x...@kernel.org; h...@zytor.com; t...@linutronix.de; mi...@kernel.org;
> t...@kernel.org; pet...@infradead.org; Fleming, Matt; Auld, Will; Williamson,
> Glenn P; Juvva, Kanaka D
> Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and
> cgroup usage guide
> 
> 
> 
> On Tue, 28 Jul 2015, Marcelo Tosatti wrote:
> 
> > On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
> >> Adds a description of Cache allocation technology, overview of kernel
> >> implementation and usage of Cache Allocation cgroup interface.
> >>
> >> Cache allocation is a sub-feature of Resource Director
> >> Technology(RDT) Allocation or Platform Shared resource control which
> >> provides support to control Platform shared resources like L3 cache.
> >> Currently L3 Cache is the only resource that is supported in RDT.
> >> More information can be found in the Intel SDM, Volume 3, section 17.15.
> >>
> >> Cache Allocation Technology provides a way for the Software (OS/VMM)
> >> to restrict cache allocation to a defined 'subset' of cache which may
> >> be overlapping with other 'subsets'.  This feature is used when
> >> allocating a line in cache ie when pulling new data into the cache.
> >>
> >> Signed-off-by: Vikas Shivappa 
> >> ---
> >>  Documentation/cgroups/rdt.txt | 215
> >> ++
> >>  1 file changed, 215 insertions(+)
> >>  create mode 100644 Documentation/cgroups/rdt.txt
> >>
> >> diff --git a/Documentation/cgroups/rdt.txt
> >> b/Documentation/cgroups/rdt.txt new file mode 100644 index
> >> 000..dfff477
> >> --- /dev/null
> >> +++ b/Documentation/cgroups/rdt.txt
> >> @@ -0,0 +1,215 @@
> >> +RDT
> >> +---
> >> +
> >> +Copyright (C) 2014 Intel Corporation Written by
> >> +vikas.shiva...@linux.intel.com (based on contents and format from
> >> +cpusets.txt)
> >> +
> >> +CONTENTS:
> >> +=
> >> +
> >> +1. Cache Allocation Technology
> >> +  1.1 What is RDT and Cache allocation ?
> >> +  1.2 Why is Cache allocation needed ?
> >> +  1.3 Cache allocation implementation overview
> >> +  1.4 Assignment of CBM and CLOS
> >> +  1.5 Scheduling and Context Switch
> >> +2. Usage Examples and Syntax
> >> +
> >> +1. Cache Allocation Technology(Cache allocation)
> >> +===
> >> +
> >> +1.1 What is RDT and Cache allocation
> >> +
> >> +
> >> +Cache allocation is a sub-feature of Resource Director
> >> +Technology(RDT) Allocation or Platform Shared resource control which
> >> +provides support to control Platform shared resources like L3 cache.
> >> +Currently L3 Cache is the only resource that is supported in RDT.
> >> +More information can be found in the Intel SDM, Volume 3, section 17.15.
> >> +
> >> +Cache Allocation Technology provides a way for the Software (OS/VMM)
> >> +to restrict cache allocation to a defined 'subset' of cache which
> >> +may be overlapping with other 'subsets'.  This feature is used when
> >> +allocating a line in cache ie when pulling new data into the cache.
> >> +The programming of the h/w is done via programming  MSRs.
> >> +
> >> +The different cache subsets are identified by CLOS identifier (class
> >> +of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
> >> +contiguous set of bits which defines the amount of cache resource
> >> +that is available for each 'subset'.
> >> +
> >> +1.2 Why is Cache allocation needed
> >> +--
> >> +
> >> +In todays new processors the number of cores is continuously
> >> +increasing, especially in large scale usage models where VMs are
> >> +used like webservers and datacenters. The number of cores increase
> >> +the number of threads or workloads that can simultaneously be run.
> >> +When multi-threaded-applications, VMs, workloads run concurrently
> >> +they compete for shared resources including L3 cache.
> >> +
> >> +The Cache allocation  enables more cache resources to be made
> >> +avail

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-28 Thread Vikas Shivappa



On Tue, 28 Jul 2015, Marcelo Tosatti wrote:


On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:

Adds a description of Cache allocation technology, overview
of kernel implementation and usage of Cache Allocation cgroup interface.

Cache allocation is a sub-feature of Resource Director Technology(RDT)
Allocation or Platform Shared resource control which provides support to
control Platform shared resources like L3 cache.  Currently L3 Cache is
the only resource that is supported in RDT.  More information can be
found in the Intel SDM, Volume 3, section 17.15.

Cache Allocation Technology provides a way for the Software (OS/VMM)
to restrict cache allocation to a defined 'subset' of cache which may
be overlapping with other 'subsets'.  This feature is used when
allocating a line in cache ie when pulling new data into the cache.

Signed-off-by: Vikas Shivappa 
---
 Documentation/cgroups/rdt.txt | 215 ++
 1 file changed, 215 insertions(+)
 create mode 100644 Documentation/cgroups/rdt.txt

diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 000..dfff477
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,215 @@
+RDT
+---
+
+Copyright (C) 2014 Intel Corporation
+Written by vikas.shiva...@linux.intel.com
+(based on contents and format from cpusets.txt)
+
+CONTENTS:
+=
+
+1. Cache Allocation Technology
+  1.1 What is RDT and Cache allocation ?
+  1.2 Why is Cache allocation needed ?
+  1.3 Cache allocation implementation overview
+  1.4 Assignment of CBM and CLOS
+  1.5 Scheduling and Context Switch
+2. Usage Examples and Syntax
+
+1. Cache Allocation Technology(Cache allocation)
+===
+
+1.1 What is RDT and Cache allocation
+
+
+Cache allocation is a sub-feature of Resource Director Technology(RDT)
+Allocation or Platform Shared resource control which provides support to
+control Platform shared resources like L3 cache.  Currently L3 Cache is
+the only resource that is supported in RDT.  More information can be
+found in the Intel SDM, Volume 3, section 17.15.
+
+Cache Allocation Technology provides a way for the Software (OS/VMM)
+to restrict cache allocation to a defined 'subset' of cache which may
+be overlapping with other 'subsets'.  This feature is used when
+allocating a line in cache ie when pulling new data into the cache.
+The programming of the h/w is done via programming  MSRs.
+
+The different cache subsets are identified by CLOS identifier (class
+of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is Cache allocation needed
+--
+
+In todays new processors the number of cores is continuously increasing,
+especially in large scale usage models where VMs are used like
+webservers and datacenters. The number of cores increase the number
+of threads or workloads that can simultaneously be run. When
+multi-threaded-applications, VMs, workloads run concurrently they
+compete for shared resources including L3 cache.
+
+The Cache allocation  enables more cache resources to be made available
+for higher priority applications based on guidance from the execution
+environment.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput benefit.
+
+This technique may be useful in managing large computer systems which
+large L3 cache. Examples may be large servers running  instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placing of the available cache
+resources.
+
+1.3 Cache allocation implementation Overview
+
+
+Kernel implements a cgroup subsystem to support cache allocation.
+
+Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
+A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
+to the kernel and not exposed to user.  Each cgroup would have one CBM
+and would just represent one cache 'subset'.
+
+The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
+cgroup never fails.  When a child cgroup is created it inherits the
+CLOSid and the CBM from its parent.  When a user changes the default
+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
+used before.  The changing of 'l3_cache_mask' may fail with -ENOSPC once
+the kernel runs out of maximum CLOSids it can support.
+User can create as many cgroups as he wants but having different CBMs
+at the same time is restricted by the maximum number of CLOSids
+(multiple cgroups can have the same CBM).
+Kernel maintains a 

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-28 Thread Marcelo Tosatti
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
> Adds a description of Cache allocation technology, overview
> of kernel implementation and usage of Cache Allocation cgroup interface.
> 
> Cache allocation is a sub-feature of Resource Director Technology(RDT)
> Allocation or Platform Shared resource control which provides support to
> control Platform shared resources like L3 cache.  Currently L3 Cache is
> the only resource that is supported in RDT.  More information can be
> found in the Intel SDM, Volume 3, section 17.15.
> 
> Cache Allocation Technology provides a way for the Software (OS/VMM)
> to restrict cache allocation to a defined 'subset' of cache which may
> be overlapping with other 'subsets'.  This feature is used when
> allocating a line in cache ie when pulling new data into the cache.
> 
> Signed-off-by: Vikas Shivappa 
> ---
>  Documentation/cgroups/rdt.txt | 215 
> ++
>  1 file changed, 215 insertions(+)
>  create mode 100644 Documentation/cgroups/rdt.txt
> 
> diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> new file mode 100644
> index 000..dfff477
> --- /dev/null
> +++ b/Documentation/cgroups/rdt.txt
> @@ -0,0 +1,215 @@
> +RDT
> +---
> +
> +Copyright (C) 2014 Intel Corporation
> +Written by vikas.shiva...@linux.intel.com
> +(based on contents and format from cpusets.txt)
> +
> +CONTENTS:
> +=
> +
> +1. Cache Allocation Technology
> +  1.1 What is RDT and Cache allocation ?
> +  1.2 Why is Cache allocation needed ?
> +  1.3 Cache allocation implementation overview
> +  1.4 Assignment of CBM and CLOS
> +  1.5 Scheduling and Context Switch
> +2. Usage Examples and Syntax
> +
> +1. Cache Allocation Technology(Cache allocation)
> +===
> +
> +1.1 What is RDT and Cache allocation
> +
> +
> +Cache allocation is a sub-feature of Resource Director Technology(RDT)
> +Allocation or Platform Shared resource control which provides support to
> +control Platform shared resources like L3 cache.  Currently L3 Cache is
> +the only resource that is supported in RDT.  More information can be
> +found in the Intel SDM, Volume 3, section 17.15.
> +
> +Cache Allocation Technology provides a way for the Software (OS/VMM)
> +to restrict cache allocation to a defined 'subset' of cache which may
> +be overlapping with other 'subsets'.  This feature is used when
> +allocating a line in cache ie when pulling new data into the cache.
> +The programming of the h/w is done via programming  MSRs.
> +
> +The different cache subsets are identified by CLOS identifier (class
> +of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
> +contiguous set of bits which defines the amount of cache resource that
> +is available for each 'subset'.
> +
> +1.2 Why is Cache allocation needed
> +--
> +
> +In todays new processors the number of cores is continuously increasing,
> +especially in large scale usage models where VMs are used like
> +webservers and datacenters. The number of cores increase the number
> +of threads or workloads that can simultaneously be run. When
> +multi-threaded-applications, VMs, workloads run concurrently they
> +compete for shared resources including L3 cache.
> +
> +The Cache allocation  enables more cache resources to be made available
> +for higher priority applications based on guidance from the execution
> +environment.
> +
> +The architecture also allows dynamically changing these subsets during
> +runtime to further optimize the performance of the higher priority
> +application with minimal degradation to the low priority app.
> +Additionally, resources can be rebalanced for system throughput benefit.
> +
> +This technique may be useful in managing large computer systems which
> +large L3 cache. Examples may be large servers running  instances of
> +webservers or database servers. In such complex systems, these subsets
> +can be used for more careful placing of the available cache
> +resources.
> +
> +1.3 Cache allocation implementation Overview
> +
> +
> +Kernel implements a cgroup subsystem to support cache allocation.
> +
> +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
> +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
> +to the kernel and not exposed to user.  Each cgroup would have one CBM
> +and would just represent one cache 'subset'.
> +
> +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
> +cgroup never fails.  When a child cgroup is created it inherits the
> +CLOSid and the CBM from its parent.  When a user changes the default
> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> +used before.  The changing of 'l3_cache_mask' may fail with -ENOSPC once
> +the kernel runs out of maximum CLOSids it can support.
> +User can create as many cgroups as he 

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-28 Thread Peter Zijlstra
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:

Please edit this document to have consistent spacing. Its really hard to
read this. Every time I spot a misplaced space my brain stumbles and I
need to restart.

> diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> new file mode 100644
> index 000..dfff477
> --- /dev/null
> +++ b/Documentation/cgroups/rdt.txt
> @@ -0,0 +1,215 @@
> +RDT
> +---
> +
> +Copyright (C) 2014 Intel Corporation
> +Written by vikas.shiva...@linux.intel.com
> +(based on contents and format from cpusets.txt)
> +
> +CONTENTS:
> +=
> +
> +1. Cache Allocation Technology
> +  1.1 What is RDT and Cache allocation ?
> +  1.2 Why is Cache allocation needed ?
> +  1.3 Cache allocation implementation overview
> +  1.4 Assignment of CBM and CLOS
> +  1.5 Scheduling and Context Switch
> +2. Usage Examples and Syntax
> +
> +1. Cache Allocation Technology(Cache allocation)
> +===
> +
> +1.1 What is RDT and Cache allocation
> +
> +
> +Cache allocation is a sub-feature of Resource Director Technology(RDT)

missing ' ' before the '('.

> +Allocation or Platform Shared resource control which provides support to
> +control Platform shared resources like L3 cache.  Currently L3 Cache is

Double ' ' after '.' -- which _can_ be correct, but is inconsistent
throughout the document.

> +the only resource that is supported in RDT.  More information can be
> +found in the Intel SDM, Volume 3, section 17.15.

Please also include the SDM revision, like June 2015.

In fact, in the June 2015 V3 17.15 is CQM, not CAT.

> +Cache Allocation Technology provides a way for the Software (OS/VMM)
> +to restrict cache allocation to a defined 'subset' of cache which may
> +be overlapping with other 'subsets'.  This feature is used when
> +allocating a line in cache ie when pulling new data into the cache.
> +The programming of the h/w is done via programming  MSRs.

Double ' ' before 'MSRs'.

> +The different cache subsets are identified by CLOS identifier (class
> +of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
> +contiguous set of bits which defines the amount of cache resource that
> +is available for each 'subset'.
> +
> +1.2 Why is Cache allocation needed
> +--
> +
> +In todays new processors the number of cores is continuously increasing,
> +especially in large scale usage models where VMs are used like
> +webservers and datacenters. The number of cores increase the number

Single ' ' after .

> +of threads or workloads that can simultaneously be run. When
> +multi-threaded-applications, VMs, workloads run concurrently they
> +compete for shared resources including L3 cache.
> +
> +The Cache allocation  enables more cache resources to be made available

Double ' ' for no apparent reason.

> +for higher priority applications based on guidance from the execution
> +environment.
> +
> +The architecture also allows dynamically changing these subsets during
> +runtime to further optimize the performance of the higher priority
> +application with minimal degradation to the low priority app.
> +Additionally, resources can be rebalanced for system throughput benefit.
> +
> +This technique may be useful in managing large computer systems which
> +large L3 cache. Examples may be large servers running  instances of

Double ' '

> +webservers or database servers. In such complex systems, these subsets
> +can be used for more careful placing of the available cache
> +resources.
> +
> +1.3 Cache allocation implementation Overview
> +
> +
> +Kernel implements a cgroup subsystem to support cache allocation.
> +
> +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.

No ' ' before '('

> +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal

Idem, also, _no_ space after '.'

> +to the kernel and not exposed to user.  Each cgroup would have one CBM

Double space after '.'

> +and would just represent one cache 'subset'.
> +
> +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the

I'm thinking the convention is ' ' _after_ ',', not before.

> +cgroup never fails.  When a child cgroup is created it inherits the
> +CLOSid and the CBM from its parent.  When a user changes the default
> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> +used before.  The changing of 'l3_cache_mask' may fail with -ENOSPC once
> +the kernel runs out of maximum CLOSids it can support.
> +User can create as many cgroups as he wants but having different CBMs
> +at the same time is restricted by the maximum number of CLOSids
> +(multiple cgroups can have the same CBM).
> +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter

Above you had ' ' around the arrows.

> +for each cgroup using a CLOSid.
> +
> +The tasks in the cgroup would get to fill the L3 cache 

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-28 Thread Marcelo Tosatti
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
 Adds a description of Cache allocation technology, overview
 of kernel implementation and usage of Cache Allocation cgroup interface.
 
 Cache allocation is a sub-feature of Resource Director Technology(RDT)
 Allocation or Platform Shared resource control which provides support to
 control Platform shared resources like L3 cache.  Currently L3 Cache is
 the only resource that is supported in RDT.  More information can be
 found in the Intel SDM, Volume 3, section 17.15.
 
 Cache Allocation Technology provides a way for the Software (OS/VMM)
 to restrict cache allocation to a defined 'subset' of cache which may
 be overlapping with other 'subsets'.  This feature is used when
 allocating a line in cache ie when pulling new data into the cache.
 
 Signed-off-by: Vikas Shivappa vikas.shiva...@linux.intel.com
 ---
  Documentation/cgroups/rdt.txt | 215 
 ++
  1 file changed, 215 insertions(+)
  create mode 100644 Documentation/cgroups/rdt.txt
 
 diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
 new file mode 100644
 index 000..dfff477
 --- /dev/null
 +++ b/Documentation/cgroups/rdt.txt
 @@ -0,0 +1,215 @@
 +RDT
 +---
 +
 +Copyright (C) 2014 Intel Corporation
 +Written by vikas.shiva...@linux.intel.com
 +(based on contents and format from cpusets.txt)
 +
 +CONTENTS:
 +=
 +
 +1. Cache Allocation Technology
 +  1.1 What is RDT and Cache allocation ?
 +  1.2 Why is Cache allocation needed ?
 +  1.3 Cache allocation implementation overview
 +  1.4 Assignment of CBM and CLOS
 +  1.5 Scheduling and Context Switch
 +2. Usage Examples and Syntax
 +
 +1. Cache Allocation Technology(Cache allocation)
 +===
 +
 +1.1 What is RDT and Cache allocation
 +
 +
 +Cache allocation is a sub-feature of Resource Director Technology(RDT)
 +Allocation or Platform Shared resource control which provides support to
 +control Platform shared resources like L3 cache.  Currently L3 Cache is
 +the only resource that is supported in RDT.  More information can be
 +found in the Intel SDM, Volume 3, section 17.15.
 +
 +Cache Allocation Technology provides a way for the Software (OS/VMM)
 +to restrict cache allocation to a defined 'subset' of cache which may
 +be overlapping with other 'subsets'.  This feature is used when
 +allocating a line in cache ie when pulling new data into the cache.
 +The programming of the h/w is done via programming  MSRs.
 +
 +The different cache subsets are identified by CLOS identifier (class
 +of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
 +contiguous set of bits which defines the amount of cache resource that
 +is available for each 'subset'.
 +
 +1.2 Why is Cache allocation needed
 +--
 +
 +In todays new processors the number of cores is continuously increasing,
 +especially in large scale usage models where VMs are used like
 +webservers and datacenters. The number of cores increase the number
 +of threads or workloads that can simultaneously be run. When
 +multi-threaded-applications, VMs, workloads run concurrently they
 +compete for shared resources including L3 cache.
 +
 +The Cache allocation  enables more cache resources to be made available
 +for higher priority applications based on guidance from the execution
 +environment.
 +
 +The architecture also allows dynamically changing these subsets during
 +runtime to further optimize the performance of the higher priority
 +application with minimal degradation to the low priority app.
 +Additionally, resources can be rebalanced for system throughput benefit.
 +
 +This technique may be useful in managing large computer systems which
 +large L3 cache. Examples may be large servers running  instances of
 +webservers or database servers. In such complex systems, these subsets
 +can be used for more careful placing of the available cache
 +resources.
 +
 +1.3 Cache allocation implementation Overview
 +
 +
 +Kernel implements a cgroup subsystem to support cache allocation.
 +
 +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
 +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
 +to the kernel and not exposed to user.  Each cgroup would have one CBM
 +and would just represent one cache 'subset'.
 +
 +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
 +cgroup never fails.  When a child cgroup is created it inherits the
 +CLOSid and the CBM from its parent.  When a user changes the default
 +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
 +used before.  The changing of 'l3_cache_mask' may fail with -ENOSPC once
 +the kernel runs out of maximum CLOSids it can support.
 +User can create as many cgroups as he wants but having different CBMs
 +at the same time is restricted by the maximum 

RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-28 Thread Auld, Will


 -Original Message-
 From: Shivappa, Vikas
 Sent: Tuesday, July 28, 2015 5:07 PM
 To: Marcelo Tosatti
 Cc: Vikas Shivappa; linux-kernel@vger.kernel.org; Shivappa, Vikas;
 x...@kernel.org; h...@zytor.com; t...@linutronix.de; mi...@kernel.org;
 t...@kernel.org; pet...@infradead.org; Fleming, Matt; Auld, Will; Williamson,
 Glenn P; Juvva, Kanaka D
 Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and
 cgroup usage guide
 
 
 
 On Tue, 28 Jul 2015, Marcelo Tosatti wrote:
 
  On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
  Adds a description of Cache allocation technology, overview of kernel
  implementation and usage of Cache Allocation cgroup interface.
 
  Cache allocation is a sub-feature of Resource Director
  Technology(RDT) Allocation or Platform Shared resource control which
  provides support to control Platform shared resources like L3 cache.
  Currently L3 Cache is the only resource that is supported in RDT.
  More information can be found in the Intel SDM, Volume 3, section 17.15.
 
  Cache Allocation Technology provides a way for the Software (OS/VMM)
  to restrict cache allocation to a defined 'subset' of cache which may
  be overlapping with other 'subsets'.  This feature is used when
  allocating a line in cache ie when pulling new data into the cache.
 
  Signed-off-by: Vikas Shivappa vikas.shiva...@linux.intel.com
  ---
   Documentation/cgroups/rdt.txt | 215
  ++
   1 file changed, 215 insertions(+)
   create mode 100644 Documentation/cgroups/rdt.txt
 
  diff --git a/Documentation/cgroups/rdt.txt
  b/Documentation/cgroups/rdt.txt new file mode 100644 index
  000..dfff477
  --- /dev/null
  +++ b/Documentation/cgroups/rdt.txt
  @@ -0,0 +1,215 @@
  +RDT
  +---
  +
  +Copyright (C) 2014 Intel Corporation Written by
  +vikas.shiva...@linux.intel.com (based on contents and format from
  +cpusets.txt)
  +
  +CONTENTS:
  +=
  +
  +1. Cache Allocation Technology
  +  1.1 What is RDT and Cache allocation ?
  +  1.2 Why is Cache allocation needed ?
  +  1.3 Cache allocation implementation overview
  +  1.4 Assignment of CBM and CLOS
  +  1.5 Scheduling and Context Switch
  +2. Usage Examples and Syntax
  +
  +1. Cache Allocation Technology(Cache allocation)
  +===
  +
  +1.1 What is RDT and Cache allocation
  +
  +
  +Cache allocation is a sub-feature of Resource Director
  +Technology(RDT) Allocation or Platform Shared resource control which
  +provides support to control Platform shared resources like L3 cache.
  +Currently L3 Cache is the only resource that is supported in RDT.
  +More information can be found in the Intel SDM, Volume 3, section 17.15.
  +
  +Cache Allocation Technology provides a way for the Software (OS/VMM)
  +to restrict cache allocation to a defined 'subset' of cache which
  +may be overlapping with other 'subsets'.  This feature is used when
  +allocating a line in cache ie when pulling new data into the cache.
  +The programming of the h/w is done via programming  MSRs.
  +
  +The different cache subsets are identified by CLOS identifier (class
  +of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
  +contiguous set of bits which defines the amount of cache resource
  +that is available for each 'subset'.
  +
  +1.2 Why is Cache allocation needed
  +--
  +
  +In todays new processors the number of cores is continuously
  +increasing, especially in large scale usage models where VMs are
  +used like webservers and datacenters. The number of cores increase
  +the number of threads or workloads that can simultaneously be run.
  +When multi-threaded-applications, VMs, workloads run concurrently
  +they compete for shared resources including L3 cache.
  +
  +The Cache allocation  enables more cache resources to be made
  +available for higher priority applications based on guidance from
  +the execution environment.
  +
  +The architecture also allows dynamically changing these subsets
  +during runtime to further optimize the performance of the higher
  +priority application with minimal degradation to the low priority app.
  +Additionally, resources can be rebalanced for system throughput benefit.
  +
  +This technique may be useful in managing large computer systems
  +which large L3 cache. Examples may be large servers running
  +instances of webservers or database servers. In such complex
  +systems, these subsets can be used for more careful placing of the
  +available cache resources.
  +
  +1.3 Cache allocation implementation Overview
  +
  +
  +Kernel implements a cgroup subsystem to support cache allocation.
  +
  +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
  +A CLOS(Class of service) is represented by a CLOSid.CLOSid is
  +internal to the kernel and not exposed to user.  Each cgroup

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-28 Thread Vikas Shivappa



On Tue, 28 Jul 2015, Marcelo Tosatti wrote:


On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:

Adds a description of Cache allocation technology, overview
of kernel implementation and usage of Cache Allocation cgroup interface.

Cache allocation is a sub-feature of Resource Director Technology(RDT)
Allocation or Platform Shared resource control which provides support to
control Platform shared resources like L3 cache.  Currently L3 Cache is
the only resource that is supported in RDT.  More information can be
found in the Intel SDM, Volume 3, section 17.15.

Cache Allocation Technology provides a way for the Software (OS/VMM)
to restrict cache allocation to a defined 'subset' of cache which may
be overlapping with other 'subsets'.  This feature is used when
allocating a line in cache ie when pulling new data into the cache.

Signed-off-by: Vikas Shivappa vikas.shiva...@linux.intel.com
---
 Documentation/cgroups/rdt.txt | 215 ++
 1 file changed, 215 insertions(+)
 create mode 100644 Documentation/cgroups/rdt.txt

diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 000..dfff477
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,215 @@
+RDT
+---
+
+Copyright (C) 2014 Intel Corporation
+Written by vikas.shiva...@linux.intel.com
+(based on contents and format from cpusets.txt)
+
+CONTENTS:
+=
+
+1. Cache Allocation Technology
+  1.1 What is RDT and Cache allocation ?
+  1.2 Why is Cache allocation needed ?
+  1.3 Cache allocation implementation overview
+  1.4 Assignment of CBM and CLOS
+  1.5 Scheduling and Context Switch
+2. Usage Examples and Syntax
+
+1. Cache Allocation Technology(Cache allocation)
+===
+
+1.1 What is RDT and Cache allocation
+
+
+Cache allocation is a sub-feature of Resource Director Technology(RDT)
+Allocation or Platform Shared resource control which provides support to
+control Platform shared resources like L3 cache.  Currently L3 Cache is
+the only resource that is supported in RDT.  More information can be
+found in the Intel SDM, Volume 3, section 17.15.
+
+Cache Allocation Technology provides a way for the Software (OS/VMM)
+to restrict cache allocation to a defined 'subset' of cache which may
+be overlapping with other 'subsets'.  This feature is used when
+allocating a line in cache ie when pulling new data into the cache.
+The programming of the h/w is done via programming  MSRs.
+
+The different cache subsets are identified by CLOS identifier (class
+of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is Cache allocation needed
+--
+
+In todays new processors the number of cores is continuously increasing,
+especially in large scale usage models where VMs are used like
+webservers and datacenters. The number of cores increase the number
+of threads or workloads that can simultaneously be run. When
+multi-threaded-applications, VMs, workloads run concurrently they
+compete for shared resources including L3 cache.
+
+The Cache allocation  enables more cache resources to be made available
+for higher priority applications based on guidance from the execution
+environment.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority app.
+Additionally, resources can be rebalanced for system throughput benefit.
+
+This technique may be useful in managing large computer systems which
+large L3 cache. Examples may be large servers running  instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placing of the available cache
+resources.
+
+1.3 Cache allocation implementation Overview
+
+
+Kernel implements a cgroup subsystem to support cache allocation.
+
+Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
+A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
+to the kernel and not exposed to user.  Each cgroup would have one CBM
+and would just represent one cache 'subset'.
+
+The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
+cgroup never fails.  When a child cgroup is created it inherits the
+CLOSid and the CBM from its parent.  When a user changes the default
+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
+used before.  The changing of 'l3_cache_mask' may fail with -ENOSPC once
+the kernel runs out of maximum CLOSids it can support.
+User can create as many cgroups as he wants but having different CBMs
+at the same time is restricted by the maximum number of CLOSids
+(multiple cgroups can have the same CBM).

Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-28 Thread Peter Zijlstra
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:

Please edit this document to have consistent spacing. Its really hard to
read this. Every time I spot a misplaced space my brain stumbles and I
need to restart.

 diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
 new file mode 100644
 index 000..dfff477
 --- /dev/null
 +++ b/Documentation/cgroups/rdt.txt
 @@ -0,0 +1,215 @@
 +RDT
 +---
 +
 +Copyright (C) 2014 Intel Corporation
 +Written by vikas.shiva...@linux.intel.com
 +(based on contents and format from cpusets.txt)
 +
 +CONTENTS:
 +=
 +
 +1. Cache Allocation Technology
 +  1.1 What is RDT and Cache allocation ?
 +  1.2 Why is Cache allocation needed ?
 +  1.3 Cache allocation implementation overview
 +  1.4 Assignment of CBM and CLOS
 +  1.5 Scheduling and Context Switch
 +2. Usage Examples and Syntax
 +
 +1. Cache Allocation Technology(Cache allocation)
 +===
 +
 +1.1 What is RDT and Cache allocation
 +
 +
 +Cache allocation is a sub-feature of Resource Director Technology(RDT)

missing ' ' before the '('.

 +Allocation or Platform Shared resource control which provides support to
 +control Platform shared resources like L3 cache.  Currently L3 Cache is

Double ' ' after '.' -- which _can_ be correct, but is inconsistent
throughout the document.

 +the only resource that is supported in RDT.  More information can be
 +found in the Intel SDM, Volume 3, section 17.15.

Please also include the SDM revision, like June 2015.

In fact, in the June 2015 V3 17.15 is CQM, not CAT.

 +Cache Allocation Technology provides a way for the Software (OS/VMM)
 +to restrict cache allocation to a defined 'subset' of cache which may
 +be overlapping with other 'subsets'.  This feature is used when
 +allocating a line in cache ie when pulling new data into the cache.
 +The programming of the h/w is done via programming  MSRs.

Double ' ' before 'MSRs'.

 +The different cache subsets are identified by CLOS identifier (class
 +of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
 +contiguous set of bits which defines the amount of cache resource that
 +is available for each 'subset'.
 +
 +1.2 Why is Cache allocation needed
 +--
 +
 +In todays new processors the number of cores is continuously increasing,
 +especially in large scale usage models where VMs are used like
 +webservers and datacenters. The number of cores increase the number

Single ' ' after .

 +of threads or workloads that can simultaneously be run. When
 +multi-threaded-applications, VMs, workloads run concurrently they
 +compete for shared resources including L3 cache.
 +
 +The Cache allocation  enables more cache resources to be made available

Double ' ' for no apparent reason.

 +for higher priority applications based on guidance from the execution
 +environment.
 +
 +The architecture also allows dynamically changing these subsets during
 +runtime to further optimize the performance of the higher priority
 +application with minimal degradation to the low priority app.
 +Additionally, resources can be rebalanced for system throughput benefit.
 +
 +This technique may be useful in managing large computer systems which
 +large L3 cache. Examples may be large servers running  instances of

Double ' '

 +webservers or database servers. In such complex systems, these subsets
 +can be used for more careful placing of the available cache
 +resources.
 +
 +1.3 Cache allocation implementation Overview
 +
 +
 +Kernel implements a cgroup subsystem to support cache allocation.
 +
 +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.

No ' ' before '('

 +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal

Idem, also, _no_ space after '.'

 +to the kernel and not exposed to user.  Each cgroup would have one CBM

Double space after '.'

 +and would just represent one cache 'subset'.
 +
 +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the

I'm thinking the convention is ' ' _after_ ',', not before.

 +cgroup never fails.  When a child cgroup is created it inherits the
 +CLOSid and the CBM from its parent.  When a user changes the default
 +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
 +used before.  The changing of 'l3_cache_mask' may fail with -ENOSPC once
 +the kernel runs out of maximum CLOSids it can support.
 +User can create as many cgroups as he wants but having different CBMs
 +at the same time is restricted by the maximum number of CLOSids
 +(multiple cgroups can have the same CBM).
 +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter

Above you had ' ' around the arrows.

 +for each cgroup using a CLOSid.
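
To make the allocation rule above concrete -- a new CLOSid only when the CBM
was not seen before, sharing plus a reference count otherwise, and -ENOSPC
once the hardware limit is hit -- here is a rough lookup-or-allocate routine.
Every identifier in it is invented for the sketch, locking is omitted, and the
table size is an assumption standing in for the CPUID-reported maximum.

#include <errno.h>
#include <stdint.h>

#define MAX_CLOSID 16	/* placeholder; the real limit comes from CPUID */

static struct {
	uint64_t cbm;
	unsigned int refcnt;	/* number of cgroups sharing this CLOS */
} clos_table[MAX_CLOSID];

static int get_closid_for_cbm(uint64_t cbm)
{
	int i, free_slot = -1;

	for (i = 0; i < MAX_CLOSID; i++) {
		if (clos_table[i].refcnt && clos_table[i].cbm == cbm) {
			clos_table[i].refcnt++;	/* same CBM => share the CLOS */
			return i;
		}
		if (!clos_table[i].refcnt && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -ENOSPC;		/* every hardware CLOSid is in use */

	clos_table[free_slot].cbm = cbm;
	clos_table[free_slot].refcnt = 1;
	/* the new CBM would also be written to the CLOS mask MSR here */
	return free_slot;
}
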
 +
 +The tasks in the cgroup would get to fill the L3 cache represented by
 +the cgroup's 'l3_cache_mask' file.
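
As a user-level illustration of the two files mentioned here, the fragment
below moves the calling process into a group and gives that group a small
contiguous mask. The mount point and the "intel_rdt." prefix on the mask file
are assumptions about how such a hierarchy would typically be mounted and
named; they are not taken from the patch, and the group directory is assumed
to have been created beforehand with mkdir.

#include <stdio.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	char pid[32];

	snprintf(pid, sizeof(pid), "%d", (int)getpid());

	/* four contiguous bits of L3 for this group ... */
	if (write_str("/sys/fs/cgroup/intel_rdt/low_prio/intel_rdt.l3_cache_mask",
		      "0xf") ||
	    /* ... and put ourselves into it */
	    write_str("/sys/fs/cgroup/intel_rdt/low_prio/tasks", pid))
		perror("cache allocation cgroup not available");

	return 0;
}
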
 +
 +Root directory would have all available