Re: [summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Marcelo Tosatti
On Fri, Jul 31, 2015 at 09:41:58AM -0700, Vikas Shivappa wrote:
> 
> To summarize the ever-growing thread:
> 
> 1. The rdt_cgroup can be used to configure exclusive cache bitmaps
> for the child nodes, which can be used for the scenarios which
> Marcelo mentions.
> 
> Simple examples which were mentioned:
> max bitmask length: 16, hence full mask is 0xffff
> groupx_realtime - 0xff
> group2_systemtraffic - 0xf : put a lot of tasks from the root node in
> here, or whichever task is offending and thrashing.
> groupy_mytraffic - 0x0f
> 
> Now groupx has its own area of cache that can be used by the
> realtime/(specific scenario) apps. Similarly configure any groupy.
> 
> 2. Can the maps let you specify which cache ways the cache is
> allocated in? - No, this is implementation specific as mentioned
> in the SDM. So when we configure a mask, you really don't know which
> ways or which exact lines are used on which SKUs. We also do not see
> any use case where apps need to allocate cache in
> specific areas, and the h/w does not support this either.

Ok, can you comment on whether the proposed userspace interface addresses
all your use cases?

> 3. Letting the user specify size in bytes instead of a bitmap: we
> have already gone through this discussion in older versions. The
> user can simply check the size of the total cache and understand
> what size a given map corresponds to. I don't see a special need for an
> interface to enter the cache size in bytes and then round it off - the
> user could instead apply the round-off values beforehand, or in other
> words it happens automatically when he specifies the bitmask.

When you move from processor A with CBM bitmask format X to hardware B
with CBM bitmask format Y, and the formats Y and X are different, you
have to manually adjust the format.

Please reply to the userspace proposal, the problem is very explicit
there.

> ex: find the cache size from /proc/cpuinfo - say 20MB
> bitmask max - 0xfffff (20 bits).
> 
> This means the round-off (chunk) size supported is only 1MB, so when
> you specify a mask, say 0x3 (2MB), that is already taken care of.
> The same applies to percentages - the masks automatically round off the percentage.
> 
> Please note that this is quite different from the way we can
> allocate memory in bytes and needs to be treated differently, given
> that the hardware provides the interface in a particular way.
> 
> 4. Letting the kernel automatically extend the bitmap may affect a
> lot of other things 

Let's talk about them. What other things?

> and will need a lot of heuristics - note that we
> have overlapping masks.

I proposed a way to avoid heuristics by exposing whether the cgroup is
"expandable" or not, and asked for your input.

We really do not want to waste cache if we can avoid it.

> This interface lets the super-user control
> the cache allocation, and it may be very confusing for the user if he
> has allocated a cache mask and the kernel suddenly changes it from
> under him.

Agree.

> 
> Thanks,
> Vikas
> 
> 
> On Fri, 31 Jul 2015, Marcelo Tosatti wrote:
> 
> >On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:
> >>
> >>
> >>On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
> >>
> >>>On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> 
> 
> Marcello,
> 
> 
> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> >
> >How about this:
> >
> >desiredclos (closid  p1  p2  p3 p4)
> >  1   1   0   0  0
> >  2   0   0   0  1
> >  3   0   1   1  0
> 
> #1 Currently in the rdt cgroup, the root cgroup always has all the
> bits set and can't be changed (because the cgroup hierarchy would by
> default make it have all bits, as all the children need to have
> a subset of the root's bitmask). So if the user creates a cgroup and
> does not put any task in it, the tasks in the root cgroup could still be
> using that part of the cache. That's the reason I say we can't have
> really 'exclusive' masks.
> 
> Or in other words - there is always a desired clos (0) which has all
> parts set and acts like a default pool.
> 
> Also the parts can overlap. Please apply this to all the comments
> below, as it will change the way they work.
> >>>
> >>>
> 
> >
> >p means part.
> 
> I am assuming p = (a contiguous cache capacity bit mask)
> 
> >closid 1 is an exclusive cgroup.
> >closid 2 is a "cache hog" class.
> >closid 3 is the "default closid".
> >
> >Desiredclos is what the user has specified.
> >
> >Transition 1: desiredclos --> effectiveclos
> >Clean all bits of unused closid's
> >(that must be updated whenever a
> >closid1 cgroup goes from empty->nonempty
> >and vice-versa).
> >
> >effectiveclos (closid  p1  p2  p3 p4)
> >1   0   0   0  0
> >2   0   0   0  1
> >3   0   1   1  0
> 
> >
> >Transition 2: effectiveclos --> expandedclos

[summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

2015-07-31 Thread Vikas Shivappa


To summarize the ever-growing thread:

1. The rdt_cgroup can be used to configure exclusive cache bitmaps for the child
nodes, which can be used for the scenarios which Marcelo mentions.


Simple examples which were mentioned:
max bitmask length: 16, hence full mask is 0xffff
groupx_realtime - 0xff
group2_systemtraffic - 0xf : put a lot of tasks from the root node in here,
or whichever task is offending and thrashing.

groupy_mytraffic - 0x0f

Now groupx has its own area of cache that can be used by the realtime/(specific
scenario) apps. Similarly configure any groupy.
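
As a rough sketch of what the configuration above could look like from userspace
(not taken from this patch set - the /sys/fs/cgroup/rdt mount point, the
intel_rdt.l3_cbm file name, and the pid are assumptions and may not match what
the series actually exposes):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: write a single value into a cgroup control file. */
static int write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        /* Carve the 16-bit capacity mask up as in the example above.
         * Directory and file names here are assumptions, not from the patch. */
        mkdir("/sys/fs/cgroup/rdt/groupx_realtime", 0755);
        mkdir("/sys/fs/cgroup/rdt/group2_systemtraffic", 0755);

        write_file("/sys/fs/cgroup/rdt/groupx_realtime/intel_rdt.l3_cbm", "0xff");
        write_file("/sys/fs/cgroup/rdt/group2_systemtraffic/intel_rdt.l3_cbm", "0xf");

        /* Move an offending/thrashing task (placeholder pid) out of the root
         * group so it stops touching groupx_realtime's part of the cache. */
        write_file("/sys/fs/cgroup/rdt/group2_systemtraffic/tasks", "1234");
        return 0;
}

Tasks left in the root group still run with the root's full mask, which is why
the thrashing tasks get moved into group2_systemtraffic in the example.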


2. Can the maps let you specify which cache ways the cache is allocated in?
- No, this is implementation specific as mentioned in the SDM. So when we
configure a mask, you really don't know which ways or which exact lines are
used on which SKUs. We also do not see any use case where apps need to
allocate cache in specific areas, and the h/w does not support this either.


3. Letting the user specify size in bytes instead of a bitmap: we have already
gone through this discussion in older versions. The user can simply check the
size of the total cache and understand what size a given map corresponds to. I
don't see a special need for an interface to enter the cache size in bytes and
then round it off - the user could instead apply the round-off values beforehand,
or in other words it happens automatically when he specifies the bitmask.


ex: find the cache size from /proc/cpuinfo - say 20MB
bitmask max - 0xfffff (20 bits).

This means the round-off (chunk) size supported is only 1MB, so when you specify
a mask, say 0x3 (2MB), that is already taken care of.

The same applies to percentages - the masks automatically round off the percentage.

Please note that this is quite different from the way we can allocate memory in
bytes and needs to be treated differently, given that the hardware provides the
interface in a particular way.
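
To make the round-off concrete, here is a small user-space sketch. bytes_to_cbm
is a made-up helper, not part of the patch; in practice the cache size and CBM
length would come from /proc/cpuinfo or CPUID. It also shows why the same byte
request yields different masks on processors with different CBM lengths, which
is the portability point raised earlier in the thread:

#include <stdio.h>

/* Hypothetical: round a size request up to whole "chunks" and return a
 * contiguous capacity bitmask of that many bits, starting at bit 0. */
static unsigned long bytes_to_cbm(unsigned long bytes,
                                  unsigned long cache_size,
                                  unsigned int cbm_len)
{
        unsigned long chunk = cache_size / cbm_len;        /* 20MB / 20 bits = 1MB */
        unsigned long nbits = (bytes + chunk - 1) / chunk; /* round up to whole chunks */

        if (nbits == 0)
                nbits = 1;        /* the hardware needs at least one bit set */
        if (nbits > cbm_len)
                nbits = cbm_len;
        return (1UL << nbits) - 1;
}

int main(void)
{
        /* 2MB on a 20MB cache with a 20-bit CBM -> 0x3, as in the example above. */
        printf("0x%lx\n", bytes_to_cbm(2UL << 20, 20UL << 20, 20));

        /* The same 5MB request gives different masks for different CBM formats:
         * 0x1f with a 20-bit CBM, 0xf with a (hypothetical) 16-bit CBM. */
        printf("0x%lx\n", bytes_to_cbm(5UL << 20, 20UL << 20, 20));
        printf("0x%lx\n", bytes_to_cbm(5UL << 20, 20UL << 20, 16));
        return 0;
}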


4. Letting the kernel automatically extend the bitmap may affect a lot of other
things and will need a lot of heuristics - note that we have overlapping masks.
This interface lets the super-user control the cache allocation, and it may be
very confusing for the user if he has allocated a cache mask and the kernel
suddenly changes it from under him.


Thanks,
Vikas


On Fri, 31 Jul 2015, Marcelo Tosatti wrote:


On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:



On Thu, 30 Jul 2015, Marcelo Tosatti wrote:


On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:



Marcello,


On Wed, 29 Jul 2015, Marcelo Tosatti wrote:


How about this:

desiredclos (closid  p1  p2  p3 p4)
 1   1   0   0  0
 2   0   0   0  1
 3   0   1   1  0


#1 Currently in the rdt cgroup, the root cgroup always has all the
bits set and can't be changed (because the cgroup hierarchy would by
default make it have all bits, as all the children need to have
a subset of the root's bitmask). So if the user creates a cgroup and
does not put any task in it, the tasks in the root cgroup could still be
using that part of the cache. That's the reason I say we can't have
really 'exclusive' masks.

Or in other words - there is always a desired clos (0) which has all
parts set and acts like a default pool.

Also the parts can overlap. Please apply this to all the comments below,
as it will change the way they work.
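
A tiny sketch of the constraint being described (the helper names are
illustrative only, not code from the patch): a child group's bitmask has to be
a contiguous run of bits and a subset of its parent's, with the root keeping
the full mask:

#include <stdbool.h>

/* A CBM is valid if its set bits form one contiguous run. */
static bool cbm_is_contiguous(unsigned long cbm)
{
        if (cbm == 0)
                return false;
        cbm >>= __builtin_ctzl(cbm);       /* drop trailing zero bits */
        return (cbm & (cbm + 1)) == 0;     /* remaining bits look like 0...011...1 */
}

/* A child cgroup may only use capacity bits its parent already has. */
static bool cbm_valid_for_parent(unsigned long child_cbm, unsigned long parent_cbm)
{
        return cbm_is_contiguous(child_cbm) &&
               (child_cbm & ~parent_cbm) == 0;
}

With the root pinned at the full mask, any bits a child takes remain usable by
tasks still sitting in the root, which is exactly the overlap being pointed out
here.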







p means part.


I am assuming p = (a contiguous cache capacity bit mask)


closid 1 is an exclusive cgroup.
closid 2 is a "cache hog" class.
closid 3 is the "default closid".

Desiredclos is what the user has specified.

Transition 1: desiredclos --> effectiveclos
Clean all bits of unused closid's
(that must be updated whenever a
closid1 cgroup goes from empty->nonempty
and vice-versa).

effectiveclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   0   1   1  0




Transition 2: effectiveclos --> expandedclos
expandedclos (closid  p1  p2  p3 p4)
   1   0   0   0  0
   2   0   0   0  1
   3   1   1   1  0
Then you have different inplacecos for each
CPU (see pseudo-code below):

On the following events.

- task migration to new pCPU:
- task creation:

id = smp_processor_id();
for (part = desiredclos.p1; ...; part++)
        /* if my cosid is set and any other
           cosid is clear, for the part,
           synchronize desiredclos --> inplacecos */
        if (part[mycosid] == 1 &&
            part[any_othercosid] == 0)
                wrmsr(part, desiredclos);
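
A sketch of how the two table transitions above could be computed (the struct
and function names are made up purely for illustration; the per-CPU inplacecos
update would then follow the wrmsr pseudo-code just quoted):

#include <stdbool.h>

#define NUM_CLOSIDS 4
#define NUM_PARTS   4   /* p1..p4, one bit per part of the cache */

/* One row per closid; bit i set means part i belongs to that closid. */
struct clos_table {
        unsigned int mask[NUM_CLOSIDS];
};

/* Transition 1: desiredclos -> effectiveclos.
 * Clear every bit of a closid whose cgroup has no tasks; rerun whenever a
 * cgroup goes empty -> non-empty or back. */
static void compute_effectiveclos(const struct clos_table *desired,
                                  const bool has_tasks[NUM_CLOSIDS],
                                  struct clos_table *effective)
{
        for (int id = 0; id < NUM_CLOSIDS; id++)
                effective->mask[id] = has_tasks[id] ? desired->mask[id] : 0;
}

/* Transition 2: effectiveclos -> expandedclos.
 * Hand any part that no active closid is using to the default closid,
 * so cache owned by an empty exclusive group is not wasted. */
static void compute_expandedclos(const struct clos_table *effective,
                                 int default_closid,
                                 struct clos_table *expanded)
{
        unsigned int used = 0;

        for (int id = 0; id < NUM_CLOSIDS; id++) {
                expanded->mask[id] = effective->mask[id];
                used |= effective->mask[id];
        }
        expanded->mask[default_closid] |= ~used & ((1u << NUM_PARTS) - 1);
}

In the example tables, closid 1's cgroup is empty, so transition 1 drops its p1
bit and transition 2 hands p1 to the default closid 3 - which is how the unused
part of the cache avoids being wasted.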



Currently the root cgroup would have all the bits set which will act
like a default cgroup where all the otherwise unused parts (assuming
they are a set of contiguous cache capacity bits) will be used.


Right, but we don't want to place tasks in there in case one cgroup
wants exclusive cache access.

So whenever you want an exclusive cgroup you'd do:

