Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Tue, 28 Jul 2015, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:
>
> Please edit this document to have consistent spacing. It's really hard
> to read this. Every time I spot a misplaced space my brain stumbles and
> I need to restart.

Will fix all the spacing and other indentation issues mentioned. Thanks
for pointing them all out. The other documents I see don't completely
follow one consistent format, which is what confused me; this format
would be better.

> > +The following considerations are done for the PQR MSR write so that it
> > +has minimal impact on scheduling hot path:
> > +- This path doesnt exist on any non-intel platforms.
>
> !x86 I think you mean. It's entirely possible to have the code present
> on AMD systems, for instance.
>
> > +- On Intel platforms, this would not exist by default unless CGROUP_RDT
> > +is enabled.
>
> You can enable this just fine on AMD machines.

The cache alloc code is under CPU_SUP_INTEL.

Thanks,
Vikas
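As context for the hot-path concern, a sketch of the kind of gating being
discussed: the code is compiled in only under the Kconfig symbol, patched
out via a static key unless the feature is enumerated at boot, and the MSR
write is skipped when the CLOSid is unchanged. The helpers rdt_enable_key
and task_closid() are assumptions for illustration, not the patch's actual
identifiers:

    #include <linux/jump_label.h>
    #include <linux/percpu.h>
    #include <linux/sched.h>
    #include <asm/msr.h>

    #define MSR_IA32_PQR_ASSOC  0x0c8f

    extern struct static_key rdt_enable_key; /* set once CAT is enumerated */
    u32 task_closid(struct task_struct *t);  /* hypothetical helper */

    static DEFINE_PER_CPU(u32, pqr_closid);  /* last CLOSid this CPU wrote */

    static inline void rdt_sched_in(struct task_struct *next)
    {
    #ifdef CONFIG_CGROUP_RDT    /* Kconfig symbol sits under CPU_SUP_INTEL */
        u32 closid;

        /* Patched out at runtime unless cache allocation is enabled. */
        if (!static_key_false(&rdt_enable_key))
            return;

        closid = task_closid(next);
        if (closid == this_cpu_read(pqr_closid))
            return;     /* avoid the MSR write when nothing changed */

        /* IA32_PQR_ASSOC: low 32 bits RMID (monitoring), high 32 CLOSid. */
        wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);
        this_cpu_write(pqr_closid, closid);
    #endif
    }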
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
Hello Marcelo/Martin,

Like I mentioned, let me modify the documentation to better help
understand the usage. Things like updating each package bitmask are
already in the patches. Let's discuss offline and come up with a
well-defined proposal for change, if any, and then update that in the
next series. We seem to be just looping over the same items.

Thanks,
Vikas
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote:
> On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
> >
> > 3) Proposed format:
>
> Few questions from a random listener, I apologise if some of them are
> in a wrong place due to me missing some information from past threads.
>
> I'm not sure whether the following proposal to the format is the
> internal structure or what's going to be in cgroups. If this is a
> user-visible interface, I think it could be a little less detailed.

User visible interface. The idea is to have userspace code that
performs:

    [ user visible specification ] -> [ cbm bitmasks on present hardware platform ]

In systemd, probably (or whatever is between the user and the
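To make the specification-to-bitmask translation concrete, here is a
minimal userspace sketch. The helper name, rounding policy, and platform
numbers are invented for illustration; a real tool would query the cache
size and CBM width from CPUID or sysfs, and would still have to decide
where the contiguous run of bits sits relative to other groups' masks:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Translate a user-visible specification (bytes, round up or down)
     * into a contiguous low-bits CBM for a given platform.
     */
    static uint32_t bytes_to_cbm(uint64_t bytes, uint64_t cache_bytes,
                                 unsigned int cbm_bits, int round_down)
    {
        uint64_t way_bytes = cache_bytes / cbm_bits; /* bytes per CBM bit */
        uint64_t ways = bytes / way_bytes;

        if (!round_down && (bytes % way_bytes))
            ways++;             /* round up to a whole cache way */
        if (ways < 1)
            ways = 1;           /* a CBM needs at least one bit set */
        if (ways > cbm_bits)
            ways = cbm_bits;

        return (uint32_t)((1ULL << ways) - 1);
    }

    int main(void)
    {
        /* Same 4MB request on two hypothetical platforms:      */
        /* A: 20MB L3, 20-bit CBM (1MB/way)   -> 4 ways -> 0xf  */
        /* B: 24MB L3, 16-bit CBM (1.5MB/way) -> 3 ways -> 0x7  */
        printf("A: 0x%x\n", bytes_to_cbm(4ULL << 20, 20ULL << 20, 20, 0));
        printf("B: 0x%x\n", bytes_to_cbm(4ULL << 20, 24ULL << 20, 16, 0));
        return 0;
    }

Note how the same 4MB request produces a different mask on the two
hypothetical platforms; that difference is exactly the portability
problem raised elsewhere in the thread.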
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> >
> > Marcelo,
> >
> > On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
> > >
> > > How about this:
> > >
> > > desiredclos (closid  p1 p2 p3 p4)
> > >              1        1  0  0  0
> > >              2        0  0  0  1
> > >              3        0  1  1  0
> >
> > #1 Currently in the rdt cgroup, the root cgroup always has all the
> > bits set and can't be changed (because the cgroup hierarchy would by
> > default make this have all bits set, as all the children need to
> > have a subset of the root's bitmask). So if the user creates a
> > cgroup and does not put any task in it, the tasks in the root cgroup
> > could still be using that part of the cache. That's the reason I say
> > we can't have really 'exclusive' masks.
> >
> > Or in other words - there is always a desired clos (0) which has all
> > parts set, which acts like a default pool.
> >
> > Also the parts can overlap. Please apply this to all the below
> > comments, which will change the way they work.
> >
> > > p means part.
> >
> > I am assuming p = (a contiguous cache capacity bit mask)
>
> Yes.
>
> > > closid 1 is an exclusive cgroup.
> > > closid 2 is a "cache hog" class.
> > > closid 3 is the "default closid".
> > >
> > > Desiredclos is what the user has specified.
> > >
> > > Transition 1: desiredclos --> effectiveclos
> > > Clean all bits of unused closid's
> > > (that must be updated whenever a
> > > closid1 cgroup goes from empty->nonempty
> > > and vice-versa).
> > >
> > > effectiveclos (closid  p1 p2 p3 p4)
> > >                1        0  0  0  0
> > >                2        0  0  0  1
> > >                3        0  1  1  0
> > >
> > > Transition 2: effectiveclos --> expandedclos
> > >
> > > expandedclos (closid  p1 p2 p3 p4)
> > >               1        0  0  0  0
> > >               2        0  0  0  1
> > >               3        1  1  1  0
> > >
> > > Then you have a different inplacecos for each
> > > CPU (see pseudo-code below):
> > >
> > > On the following events.
> > >
> > > - task migration to new pCPU:
> > > - task creation:
> > >
> > >     id = smp_processor_id();
> > >     for (part = desiredclos.p1; ...; part++)
> > >         /* if my cosid is set and any other
> > >            cosid is clear, for the part,
> > >            synchronize desiredclos --> inplacecos */
> > >         if (part[mycosid] == 1 &&
> > >             part[any_othercosid] == 0)
> > >             wrmsr(part, desiredclos);
> >
> > Currently the root cgroup would have all the bits set, which will
> > act like a default cgroup where all the otherwise unused parts
> > (assuming they are a set of contiguous cache capacity bits) will be
> > used.
> >
> > Otherwise the question is, in the expandedclos - who decides to
> > expand the closx parts to include some of the unused parts.. - that
> > could just be a default root always?
>
> Right, so the problem is that for certain closid's you might never
> want to expand (because doing so would cause data to be cached in a
> cache way which might have a high eviction rate in the future).
> See the example from Will.
>
> But for the default cache (that is, "unclassified applications") I
> suppose it is beneficial to expand in most cases, that is, use the
> maximum amount of cache irrespective of eviction rate, which is the
> behaviour that exists now without CAT.
>
> So perhaps a new flag "expand=y/n" can be added to the cgroup
> directories... What do you say?
>
> Userspace representation of CAT
> -------------------------------
>
> Usage model:
> 1) measure application performance without L3 cache reservation.
> 2) measure application perf with L3 cache reservation and
>    X number of cache ways until desired performance is attained.
>
> Requirements:
> 1) Persistency of CLOS configuration across hardware. On migration
>    of the operating system or application between different hardware
>    systems we'd like the following to be maintained:
>    - exclusive number of bytes (*) reserved to a certain CLOSid.
>    - shared number of bytes (*) reserved between a certain group
>      of CLOSid's.
>
>    For both code and data, rounded down or up in cache way size.
>
> 2) Reasoning:
>    Different CBM masks in different hardware platforms might be
>    necessary to specify the same CLOS configuration, in terms of
>    exclusive number of bytes and shared number of bytes (cache-way
>    rounded number of bytes). For example, due to L3 allocation by
>    other hardware entities in certain parts of the cache it might be
>    necessary to relocate the CBM mask to achieve the same CLOS
>    configuration.
>
> 3) Proposed format:

Few questions from a random listener, I apologise if some of them are
in a wrong place due to me missing some information from past threads.

I'm not sure whether the following proposal to the format is the
internal structure or what's going to be in cgroups. If this is a
user-visible interface, I think it could be a little less detailed.

> sharedregionK.exclusive  - Number of exclusive cache bytes reserved
>                            for the shared region.
> sharedregionK.excl_data  - Number of exclusive cache data bytes
>                            reserved for the shared region.
> sharedregionK.excl_code  - Number of exclusive cache code bytes
>                            reserved for the shared region.
> sharedregionK.round_down - Round down to cache way bytes from the
>                            respective number specification (default
>                            is round up).
> sharedregionK.expand     - y/n - Expand the shared region to more
>                            cache ways when available.
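To help follow the desiredclos/inplacecos pseudo-code quoted above, here
is a minimal C rendering of the per-part synchronization it sketches. The
data layout (one bitmask of CLOSids per cache part) and the wrmsr_part()
helper are assumptions made for illustration; real CAT hardware exposes
one mask MSR per CLOSid rather than a per-part write:

    #include <linux/percpu.h>

    #define NR_PARTS 4

    /* Bit k of desiredclos[part] set means "closid k wants this part". */
    static u32 desiredclos[NR_PARTS];
    static DEFINE_PER_CPU(u32, inplacecos[NR_PARTS]);

    void wrmsr_part(int part, u32 users);  /* hypothetical hardware update */

    /* Run on the quoted events: task creation, migration to a new pCPU. */
    static void sync_closid_to_hw(u32 mycosid)
    {
        int part;

        for (part = 0; part < NR_PARTS; part++) {
            u32 users = desiredclos[part];

            /* "my cosid is set and any other cosid is clear" for this part */
            if (users != (1u << mycosid))
                continue;
            if (this_cpu_read(inplacecos[part]) == users)
                continue;   /* already in place on this CPU */

            wrmsr_part(part, users);
            this_cpu_write(inplacecos[part], users);
        }
    }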
Re: [summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Fri, Jul 31, 2015 at 09:41:58AM -0700, Vikas Shivappa wrote:
>
> To summarize the ever growing thread:
>
> 1. the rdt_cgroup can be used to configure exclusive cache bitmaps
> for the child nodes, which can be used for the scenarios which
> Marcelo mentions.
>
> simple examples which were mentioned:
> max bitmask length: 16, hence full mask is 0xffff
> groupx_realtime - 0xff.
> group2_systemtraffic - 0xf.: put a lot of tasks from root node to
> here, or whichever is offending and thrashing.
> groupy_mytraffic - 0x0f
>
> Now groupx has its own area of cache that can be used by the
> realtime/(specific scenario) apps. Similarly configure any groupy.
>
> 2. Can the maps let you specify in which cache ways the cache is
> allocated? - No, this is implementation specific as mentioned in
> the SDM. So when we configure a mask, you really don't know which
> ways or which exact lines are used on which SKUs.. We may not see
> any use case as well which needs apps to allocate cache in specific
> areas, and the h/w does not support this as well.

Ok, can you comment whether the userspace interface proposed addresses
all your use cases?

> 3. Letting the user specify size in bytes instead of bitmap: we
> have already gone through this discussion in older versions. The
> user can simply check the size of the total cache and understand
> what map could be what size. I don't see a special need to specify
> an interface to enter the cache in bytes and then round off - the
> user could instead use the rounded-off values beforehand, or iow it
> automatically does so when he specifies the bitmask.

When you move from processor A with CBM bitmask format X to hardware B
with CBM bitmask format Y, and the formats Y and X are different, you
have to manually adjust the format. Please reply to the userspace
proposal, the problem is very explicit there.

> ex: find cache size from /proc/cpuinfo. - say 20MB
> bitmask max - 0xfffff.
>
> This means the roundoff (chunk) size supported is only 1MB, so when
> you specify the mask, say 0x3 (2MB), that's already taken care of.
> Same applies to percentage - the masks automatically round off the
> percentage.
>
> Please note that this is quite different from the way we can
> allocate memory in bytes and needs to be treated differently, given
> that the hardware provides the interface in a particular way.
>
> 4. Letting the kernel automatically extend the bitmap may affect a
> lot of other things

Let's talk about them. What other things?

> and will need a lot of heuristics - note that we
> have overlapping masks.

I proposed a way to avoid heuristics by exposing whether the cgroup is
"expandable" or not and asked for your input. We really do not want to
waste cache if we can avoid it.

> This interface lets the super-user control
> the cache allocation and it may be very confusing for the user if he
> has allocated a cache mask and suddenly from under the floor the
> kernel changes it.

Agree.
[summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
To summarize the ever growing thread:

1. the rdt_cgroup can be used to configure exclusive cache bitmaps
for the child nodes, which can be used for the scenarios which
Marcelo mentions.

simple examples which were mentioned:
max bitmask length: 16, hence full mask is 0xffff
groupx_realtime - 0xff.
group2_systemtraffic - 0xf.: put a lot of tasks from root node to
here, or whichever is offending and thrashing.
groupy_mytraffic - 0x0f

Now groupx has its own area of cache that can be used by the
realtime/(specific scenario) apps. Similarly configure any groupy.

2. Can the maps let you specify in which cache ways the cache is
allocated? - No, this is implementation specific as mentioned in the
SDM. So when we configure a mask, you really don't know which ways or
which exact lines are used on which SKUs.. We may not see any use case
as well which needs apps to allocate cache in specific areas, and the
h/w does not support this as well.

3. Letting the user specify size in bytes instead of bitmap: we have
already gone through this discussion in older versions. The user can
simply check the size of the total cache and understand what map could
be what size. I don't see a special need to specify an interface to
enter the cache in bytes and then round off - the user could instead
use the rounded-off values beforehand, or iow it automatically does so
when he specifies the bitmask.

ex: find cache size from /proc/cpuinfo. - say 20MB
bitmask max - 0xfffff.

This means the roundoff (chunk) size supported is only 1MB, so when
you specify the mask, say 0x3 (2MB), that's already taken care of.
Same applies to percentage - the masks automatically round off the
percentage.

Please note that this is quite different from the way we can allocate
memory in bytes and needs to be treated differently, given that the
hardware provides the interface in a particular way.

4. Letting the kernel automatically extend the bitmap may affect a
lot of other things and will need a lot of heuristics - note that we
have overlapping masks. This interface lets the super-user control
the cache allocation and it may be very confusing for the user if he
has allocated a cache mask and suddenly from under the floor the
kernel changes it.

Thanks,
Vikas
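A worked version of the arithmetic in point 3, in the direction the mail
describes: the user reads the total cache size, derives the per-bit chunk
size from the CBM width, and from that knows what any mask is worth. The
numbers match the 20MB, 20-bit example above; the helper name is invented:

    #include <stdint.h>
    #include <stdio.h>

    /* How many bytes a given CBM covers: popcount(mask) * bytes-per-bit. */
    static uint64_t cbm_to_bytes(uint32_t cbm, uint64_t cache_bytes,
                                 unsigned int cbm_bits)
    {
        return (uint64_t)__builtin_popcount(cbm) * (cache_bytes / cbm_bits);
    }

    int main(void)
    {
        uint64_t sz = cbm_to_bytes(0x3, 20ULL << 20, 20); /* 20MB, 20 bits */

        /* 0x3 has two bits set, each worth 1MB: 2MB, 10% of the cache. */
        printf("0x3 -> %llu MB (%.0f%%)\n", (unsigned long long)(sz >> 20),
               100.0 * sz / (20ULL << 20));
        return 0;
    }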
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote:
> On Thu, 30 Jul 2015, Marcelo Tosatti wrote:
> > On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
> > >
> > > Currently the root cgroup would have all the bits set, which will
> > > act like a default cgroup where all the otherwise unused parts
> > > (assuming they are a set of contiguous cache capacity bits) will
> > > be used.
> >
> > Right, but we don't want to place tasks in there in case one cgroup
> > wants exclusive cache access.
> >
> > So whenever you want an exclusive cgroup you'd do:
> >
> > create cgroup-exclusive; reserve the desired part of the cache
> > for it.
> > create cgroup-default; reserve all cache minus that of
> > cgroup-exclusive for it.
> >
> > place tasks that belong to cgroup-exclusive into it.
> > place all other tasks (including init) into cgroup-default.
> >
> > Is that right?
>
> Yes you could do that.
>
> You can create cgroups to have masks which are exclusive in today's
> implementation, just that you could also create more cgroups to
> overlap the masks again.. iow we don't have an exclusive flag for the
> cgroup mask.
>
> Is that a common use case in the server environment, that you need to
> prevent other cgroups from using a certain mask? (since the root
> user should control these allocations.. he should know?)

Yes, there are two known use-cases that have this characteristic:

1) High performance numeric application which has been optimized
to use a certain fraction of the cache.

2) Low latency application in a multi-application OS.

For both cases exclusive cache access is wanted.
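Since there is no exclusive flag on a cgroup mask (any later cgroup may
overlap it), a root-side tool that wants the exclusivity these two use
cases need could verify non-overlap itself before committing a mask. A
sketch, assuming the masks have already been read out of the cgroup
files:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Return true if the mask at index idx shares no cache ways with
     * any other configured mask, i.e. it is effectively exclusive.
     */
    static bool cbm_is_exclusive(const uint32_t *cbms, int ncbms, int idx)
    {
        int i;

        for (i = 0; i < ncbms; i++) {
            if (i != idx && (cbms[i] & cbms[idx]))
                return false;   /* another closid overlaps these ways */
        }
        return true;
    }

Remember from earlier in the thread that the root cgroup keeps all bits
set, so against the root mask nothing is ever exclusive; the check is
only meaningful among the non-root masks.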
[summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
To summarize the ever growing thread : 1. the rdt_cgroup can be used to configure exclusive cache bitmaps for the child nodes which can be used for the scenarios which Marcello mentions. simle examples which were mentioned : max bitmask length : 16 . hence full mask is 0x groupx_realtime - 0xff . group2_systemtraffic - 0xf. : put a lot of tasks from root node to here or which ever is offending and thrashing. groupy_mytraffic - 0x0f Now the groupx has its own area of cache that can used by the realtime/(specific scenario) apps. Similarly configure any groupy. 2. Can the maps can let you specify which cache ways ways the cache is allocated ? - No , this is implementation specific as mentioned in the SDM. So when we configure a mask , you really dont know which ways or which exact lines are used on which SKUs .. We may not see any use case as well which is needed for apps to allocate cache in specific areas and the h/w does not support this as well. 3. Letting the user specify size in bytes instead of bitmap : we have already gone through this discussion in older versions. The user can simply check the size of the total cache and understand what map could be what size. I dont see a special need to specify an interface to enter the cache in bytes and then round off - user could instead use the roundoff values before hand or iow it automatically does when he specifies the bitmask. ex: find cache size from /proc/cpuinfo. - say 20MB bitmask max - 0xf. This means the roundoff(chunk) size supported is only 1MB , so when you specify the mask say 0x3(2MB) thats already taken care of. Same applies to percentage - the masks automatically round off the percentage. Please note that this is quite different from the way we can allocate memory in bytes and needs to be treated differently given that the hardware provides interface in a particular way. 4. Letting the kernel automatically extend the bitmap may affect a lot of other things and will need a lot of heuristics - note that we have overlapping masks . This interface lets the super-user control the cache allocation and it may be very confusing for the user if he has allocated a cache mask and suddenly from under the floor the kernel changes it. Thanks, Vikas On Fri, 31 Jul 2015, Marcelo Tosatti wrote: On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote: On Thu, 30 Jul 2015, Marcelo Tosatti wrote: On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: Marcello, On Wed, 29 Jul 2015, Marcelo Tosatti wrote: How about this: desiredclos (closid p1 p2 p3 p4) 1 1 0 0 0 2 0 0 0 1 3 0 1 1 0 #1 Currently in the rdt cgroup , the root cgroup always has all the bits set and cant be changed (because the cgroup hierarchy would by default make this to have all bits as all the children need to have a subset of the root's bitmask). So if the user creates a cgroup and not put any task in it , the tasks in the root cgroup could be still using that part of the cache. Thats the reason i say we can have really 'exclusive' masks. Or in other words - there is always a desired clos (0) which has all parts set which acts like a default pool. Also the parts can overlap. Please apply this for all the below comments which will change the way they work. p means part. I am assuming p = (a contiguous cache capacity bit mask) closid 1 is a exclusive cgroup. closid 2 is a cache hog class. closid 3 is default closid. Desiredclos is what user has specified. 
Transition 1: desiredclos --> effectiveclos Clean all bits of unused closid's (that must be updated whenever a closid1 cgroup goes from empty->nonempty and vice-versa). effectiveclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 0 1 1 0 Transition 2: effectiveclos --> expandedclos expandedclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 1 1 1 0 Then you have different inplacecos for each CPU (see pseudo-code below): On the following events. - task migration to new pCPU: - task creation: id = smp_processor_id(); for (part = desiredclos.p1; ...; part++) /* if my cosid is set and any other cosid is clear, for the part, synchronize desiredclos --> inplacecos */ if (part[mycosid] == 1 && part[any_othercosid] == 0) wrmsr(part, desiredclos); Currently the root cgroup would have all the bits set which will act like a default cgroup where all the otherwise unused parts (assuming they are a set of contiguous cache capacity bits) will be used. Right, but we don't want to place tasks in there in case one cgroup wants exclusive cache access. So whenever you want an exclusive cgroup you'd do:
Re: [summary] Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Fri, Jul 31, 2015 at 09:41:58AM -0700, Vikas Shivappa wrote: To summarize the ever-growing thread: 1. The rdt_cgroup can be used to configure exclusive cache bitmaps for the child nodes, which covers the scenarios Marcelo mentions. Simple examples which were mentioned: max bitmask length: 16, hence full mask is 0xffff. groupx_realtime - 0xff. group2_systemtraffic - 0xf: put a lot of tasks from the root node into here, or whichever group is offending and thrashing. groupy_mytraffic - 0x0f. Now groupx has its own area of cache that can be used by the realtime/(specific scenario) apps. Similarly configure any groupy. 2. Can the masks let you specify in which cache ways the cache is allocated? - No, this is implementation specific as mentioned in the SDM. So when we configure a mask, you really don't know which ways or which exact lines are used on which SKUs. We may not see any use case which needs apps to allocate cache in specific areas, and the h/w does not support this either. Ok, can you comment whether the userspace interface proposed addresses all your use cases? 3. Letting the user specify size in bytes instead of a bitmap: we have already gone through this discussion in older versions. The user can simply check the size of the total cache and understand what map could be what size. I don't see a special need for an interface that takes the cache size in bytes and then rounds off - the user could instead apply the round-off values beforehand, or in other words that happens automatically when he specifies the bitmask. When you move from processor A with CBM bitmask format X to hardware B with CBM bitmask format Y, and the formats Y and X are different, you have to manually adjust the format. Please reply to the userspace proposal, the problem is very explicit there. ex: find the cache size from /proc/cpuinfo - say 20MB, bitmask max - 0xfffff. This means the round-off (chunk) size supported is only 1MB, so when you specify a mask, say 0x3 (2MB), that is already taken care of. The same applies to percentages - the masks automatically round off the percentage. Please note that this is quite different from the way we can allocate memory in bytes and needs to be treated differently, given that the hardware provides the interface in a particular way. 4. Letting the kernel automatically extend the bitmap may affect a lot of other things Let's talk about them. What other things? and will need a lot of heuristics - note that we have overlapping masks. I proposed a way to avoid heuristics by exposing whether the cgroup is expandable or not and asked for your input. We really do not want to waste cache if we can avoid it. This interface lets the super-user control the cache allocation, and it may be very confusing for the user if he has allocated a cache mask and suddenly, from under the floor, the kernel changes it. Agree. Thanks, Vikas On Fri, 31 Jul 2015, Marcelo Tosatti wrote: On Thu, Jul 30, 2015 at 04:03:07PM -0700, Vikas Shivappa wrote: On Thu, 30 Jul 2015, Marcelo Tosatti wrote: On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: Marcelo, On Wed, 29 Jul 2015, Marcelo Tosatti wrote: How about this: desiredclos (closid p1 p2 p3 p4) 1 1 0 0 0 2 0 0 0 1 3 0 1 1 0 #1 Currently in the rdt cgroup, the root cgroup always has all the bits set and can't be changed (because the cgroup hierarchy would by default make this have all bits set, as all the children need to have a subset of the root's bitmask).
So if the user creates a cgroup and does not put any task in it, the tasks in the root cgroup could still be using that part of the cache. That's the reason I say we can have really 'exclusive' masks. Or in other words - there is always a desired clos (0) which has all parts set and which acts like a default pool. Also the parts can overlap. Please apply this to all the below comments, as it changes the way they work. p means part. I am assuming p = (a contiguous cache capacity bit mask) closid 1 is an exclusive cgroup. closid 2 is a "cache hog" class. closid 3 is "default closid". Desiredclos is what the user has specified. Transition 1: desiredclos --> effectiveclos Clean all bits of unused closid's (that must be updated whenever a closid1 cgroup goes from empty->nonempty and vice-versa). effectiveclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 0 1 1 0 Transition 2: effectiveclos --> expandedclos expandedclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 1 1 1 0 Then you have different inplacecos for each CPU (see pseudo-code below): On the following events. - task migration to new pCPU: - task creation: id = smp_processor_id(); for (part = desiredclos.p1; ...; part++) /*
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Thu, 30 Jul 2015, Marcelo Tosatti wrote: On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: Marcelo, On Wed, 29 Jul 2015, Marcelo Tosatti wrote: How about this: desiredclos (closid p1 p2 p3 p4) 1 1 0 0 0 2 0 0 0 1 3 0 1 1 0 #1 Currently in the rdt cgroup, the root cgroup always has all the bits set and can't be changed (because the cgroup hierarchy would by default make this have all bits set, as all the children need to have a subset of the root's bitmask). So if the user creates a cgroup and does not put any task in it, the tasks in the root cgroup could still be using that part of the cache. That's the reason I say we can have really 'exclusive' masks. Or in other words - there is always a desired clos (0) which has all parts set and which acts like a default pool. Also the parts can overlap. Please apply this to all the below comments, as it changes the way they work. p means part. I am assuming p = (a contiguous cache capacity bit mask) closid 1 is an exclusive cgroup. closid 2 is a "cache hog" class. closid 3 is "default closid". Desiredclos is what the user has specified. Transition 1: desiredclos --> effectiveclos Clean all bits of unused closid's (that must be updated whenever a closid1 cgroup goes from empty->nonempty and vice-versa). effectiveclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 0 1 1 0 Transition 2: effectiveclos --> expandedclos expandedclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 1 1 1 0 Then you have different inplacecos for each CPU (see pseudo-code below): On the following events. - task migration to new pCPU: - task creation: id = smp_processor_id(); for (part = desiredclos.p1; ...; part++) /* if my cosid is set and any other cosid is clear, for the part, synchronize desiredclos --> inplacecos */ if (part[mycosid] == 1 && part[any_othercosid] == 0) wrmsr(part, desiredclos); Currently the root cgroup would have all the bits set which will act like a default cgroup where all the otherwise unused parts (assuming they are a set of contiguous cache capacity bits) will be used. Right, but we don't want to place tasks in there in case one cgroup wants exclusive cache access. So whenever you want an exclusive cgroup you'd do: create cgroup-exclusive; reserve desired part of the cache for it. create cgroup-default; reserved all cache minus that of cgroup-exclusive for it. place tasks that belong to cgroup-exclusive into it. place all other tasks (including init) into cgroup-default. Is that right? Yes, you could do that. You can create cgroups that have masks which are exclusive in today's implementation, it is just that you could also create more cgroups to overlap the masks again.. in other words, we don't have an exclusive flag for the cgroup mask. Is that a common use case in the server environment, that you need to prevent other cgroups from using a certain mask? (Since the root user should control these allocations.. he should know?) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
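Since today's implementation has no exclusive flag, the only way to know whether a cgroup's mask is effectively exclusive is to compare it against every other CBM in use. A hedged illustration of that check in C - cbm_of[] and nr_closids are hypothetical stand-ins for the kernel's CLOSid-to-CBM bookkeeping, not names from the patches:

#include <stdbool.h>

/* Return true if closid's CBM shares no cache ways with any other CLOS. */
static bool cbm_is_exclusive(unsigned int closid, const unsigned long *cbm_of,
                             unsigned int nr_closids)
{
        unsigned int i;

        for (i = 0; i < nr_closids; i++) {
                if (i == closid)
                        continue;
                if (cbm_of[i] & cbm_of[closid])
                        return false;   /* at least one way overlaps */
        }
        return true;
}

With the earlier example masks (0xff, 0xf, 0x0f) this returns false for every CLOS, which is exactly the point made above: overlap is always possible, so exclusivity is a convention enforced by the root user rather than by the interface.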
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: > > > Marcello, > > > On Wed, 29 Jul 2015, Marcelo Tosatti wrote: > > > >How about this: > > > >desiredclos (closid p1 p2 p3 p4) > > 1 1 0 0 0 > > 2 0 0 0 1 > > 3 0 1 1 0 > > #1 Currently in the rdt cgroup , the root cgroup always has all the > bits set and cant be changed (because the cgroup hierarchy would by > default make this to have all bits as all the children need to have > a subset of the root's bitmask). So if the user creates a cgroup and > not put any task in it , the tasks in the root cgroup could be still > using that part of the cache. Thats the reason i say we can have > really 'exclusive' masks. > > Or in other words - there is always a desired clos (0) which has all > parts set which acts like a default pool. > > Also the parts can overlap. Please apply this for all the below > comments which will change the way they work. > > > > >p means part. > > I am assuming p = (a contiguous cache capacity bit mask) > > >closid 1 is a exclusive cgroup. > >closid 2 is a "cache hog" class. > >closid 3 is "default closid". > > > >Desiredclos is what user has specified. > > > >Transition 1: desiredclos --> effectiveclos > >Clean all bits of unused closid's > >(that must be updated whenever a > >closid1 cgroup goes from empty->nonempty > >and vice-versa). > > > >effectiveclos (closid p1 p2 p3 p4) > >1 0 0 0 0 > >2 0 0 0 1 > >3 0 1 1 0 > > > > >Transition 2: effectiveclos --> expandedclos > >expandedclos (closid p1 p2 p3 p4) > >1 0 0 0 0 > >2 0 0 0 1 > >3 1 1 1 0 > >Then you have different inplacecos for each > >CPU (see pseudo-code below): > > > >On the following events. > > > >- task migration to new pCPU: > >- task creation: > > > > id = smp_processor_id(); > > for (part = desiredclos.p1; ...; part++) > > /* if my cosid is set and any other > >cosid is clear, for the part, > >synchronize desiredclos --> inplacecos */ > > if (part[mycosid] == 1 && > > part[any_othercosid] == 0) > > wrmsr(part, desiredclos); > > > > Currently the root cgroup would have all the bits set which will act > like a default cgroup where all the otherwise unused parts (assuming > they are a set of contiguous cache capacity bits) will be used. Right, but we don't want to place tasks in there in case one cgroup wants exclusive cache access. So whenever you want an exclusive cgroup you'd do: create cgroup-exclusive; reserve desired part of the cache for it. create cgroup-default; reserved all cache minus that of cgroup-exclusive for it. place tasks that belong to cgroup-exclusive into it. place all other tasks (including init) into cgroup-default. Is that right? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
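The pseudo-code quoted above can be read as the following self-contained C sketch. desiredclos and inplacecos are modeled here as one bitmask per cache part (bit c set means closid c owns that part), and wrmsr_part() is a stub standing in for the real MSR write; this only illustrates the proposed synchronization rule, not an actual implementation:

#define NR_PARTS   4
#define NR_CLOSIDS 4

/* desiredclos[p]: bit c set => closid c is assigned cache part p. */
static unsigned int desiredclos[NR_PARTS];
static unsigned int inplacecos[NR_PARTS];       /* per-CPU copy in reality */

static void wrmsr_part(int part, unsigned int val)
{
        inplacecos[part] = val;                 /* stand-in for wrmsr() */
}

/* Run on task creation or migration, for the incoming task's closid. */
static void sync_clos(unsigned int mycosid)
{
        int part;

        for (part = 0; part < NR_PARTS; part++) {
                unsigned int mine   = desiredclos[part] & (1u << mycosid);
                unsigned int others = desiredclos[part] & ~(1u << mycosid);

                /* "if my cosid is set and any other cosid is clear" */
                if (mine && !others && inplacecos[part] != desiredclos[part])
                        wrmsr_part(part, desiredclos[part]);
        }
}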
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote: > > > Marcello, > > > On Wed, 29 Jul 2015, Marcelo Tosatti wrote: > > > >How about this: > > > >desiredclos (closid p1 p2 p3 p4) > > 1 1 0 0 0 > > 2 0 0 0 1 > > 3 0 1 1 0 > > #1 Currently in the rdt cgroup , the root cgroup always has all the > bits set and cant be changed (because the cgroup hierarchy would by > default make this to have all bits as all the children need to have > a subset of the root's bitmask). So if the user creates a cgroup and > not put any task in it , the tasks in the root cgroup could be still > using that part of the cache. Thats the reason i say we can have > really 'exclusive' masks. > > Or in other words - there is always a desired clos (0) which has all > parts set which acts like a default pool. > > Also the parts can overlap. Please apply this for all the below > comments which will change the way they work. > > > > >p means part. > > I am assuming p = (a contiguous cache capacity bit mask) Yes. > >closid 1 is a exclusive cgroup. > >closid 2 is a "cache hog" class. > >closid 3 is "default closid". > > > >Desiredclos is what user has specified. > > > >Transition 1: desiredclos --> effectiveclos > >Clean all bits of unused closid's > >(that must be updated whenever a > >closid1 cgroup goes from empty->nonempty > >and vice-versa). > > > >effectiveclos (closid p1 p2 p3 p4) > >1 0 0 0 0 > >2 0 0 0 1 > >3 0 1 1 0 > > > > >Transition 2: effectiveclos --> expandedclos > >expandedclos (closid p1 p2 p3 p4) > >1 0 0 0 0 > >2 0 0 0 1 > >3 1 1 1 0 > >Then you have different inplacecos for each > >CPU (see pseudo-code below): > > > >On the following events. > > > >- task migration to new pCPU: > >- task creation: > > > > id = smp_processor_id(); > > for (part = desiredclos.p1; ...; part++) > > /* if my cosid is set and any other > >cosid is clear, for the part, > >synchronize desiredclos --> inplacecos */ > > if (part[mycosid] == 1 && > > part[any_othercosid] == 0) > > wrmsr(part, desiredclos); > > > > Currently the root cgroup would have all the bits set which will act > like a default cgroup where all the otherwise unused parts (assuming > they are a set of contiguous cache capacity bits) will be used. > > Otherwise the question is in the expandedclos - who decides to > expand the closx parts to include some of the unused parts.. - that > could just be a default root always ? Right, so the problem is for certain closid's you might never want to expand (because doing so would cause data to be cached in a cache way which might have high eviction rate in the future). See the example from Will. But for the default cache (that is "unclassified applications" i suppose it is beneficial to expand in most cases, that is, use maximum amount of cache irrespective of eviction rate, which is the behaviour that exists now without CAT). So perhaps a new flag "expand=y/n" can be added to the cgroup directories... What do you say? Userspace representation of CAT --- Usage model: 1) measure application performance without L3 cache reservation. 2) measure application perf with L3 cache reservation and X number of cache ways until desired performance is attained. Requirements: 1) Persistency of CLOS configuration across hardware. On migration of operating system or application between different hardware systems we'd like the following to be maintained: - exclusive number of bytes (*) reserved to a certain CLOSid. - shared number of bytes (*) reserved between a certain group of CLOSid's. 
For both code and data, rounded down or up in cache-way size. 2) Reasoning: Different CBM masks on different hardware platforms might be necessary to specify the same CLOS configuration in terms of exclusive number of bytes and shared number of bytes (cache-way rounded number of bytes). For example, due to L3 allocation by other hardware entities in certain parts of the cache, it might be necessary to relocate the CBM mask to achieve the same CLOS configuration. 3) Proposed format: sharedregionK.exclusive - Number of exclusive cache bytes reserved for the shared region. sharedregionK.excl_data - Number of exclusive cache data bytes reserved for the shared region. sharedregionK.excl_code - Number of exclusive cache code bytes reserved for the shared region. sharedregionK.round_down - Round down to cache-way bytes from the respective number specification (default is round up). sharedregionK.expand - y/n - Expand shared region to more cache ways when available (default N). cgroupN.exclusive - Number of exclusive L3 cache bytes reserved for the cgroup. cgroupN.excl_data - Number of exclusive L3 data bytes reserved for the cgroup.
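A sketch of how the byte counts and the proposed round_down flag could be turned into a contiguous CBM. The way size, the CBM length, and the placement starting at bit 0 are assumptions made for illustration; real code would also have to pick which ways, which is exactly the platform-specific relocation problem described in 2):

/* Build a contiguous CBM covering 'bytes', rounded to whole cache ways.
 * Assumes cbm_len is smaller than the number of bits in unsigned long. */
static unsigned long bytes_to_cbm(unsigned long bytes, unsigned long way_size,
                                  unsigned int cbm_len, int round_down)
{
        unsigned long ways;

        if (round_down)
                ways = bytes / way_size;
        else
                ways = (bytes + way_size - 1) / way_size;   /* default: up */

        if (ways == 0 || ways > cbm_len)
                return 0;                     /* caller must reject this */

        return (1UL << ways) - 1;             /* contiguous mask from bit 0 */
}

For the 20MB/20-way example discussed elsewhere in the thread, way_size is 1MB, so a 2MB request maps to two ways, i.e. a CBM of 0x3.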
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
Marcello, On Wed, 29 Jul 2015, Marcelo Tosatti wrote: How about this: desiredclos (closid p1 p2 p3 p4) 1 1 0 0 0 2 0 0 0 1 3 0 1 1 0 #1 Currently in the rdt cgroup , the root cgroup always has all the bits set and cant be changed (because the cgroup hierarchy would by default make this to have all bits as all the children need to have a subset of the root's bitmask). So if the user creates a cgroup and not put any task in it , the tasks in the root cgroup could be still using that part of the cache. Thats the reason i say we can have really 'exclusive' masks. Or in other words - there is always a desired clos (0) which has all parts set which acts like a default pool. Also the parts can overlap. Please apply this for all the below comments which will change the way they work. p means part. I am assuming p = (a contiguous cache capacity bit mask) closid 1 is a exclusive cgroup. closid 2 is a "cache hog" class. closid 3 is "default closid". Desiredclos is what user has specified. Transition 1: desiredclos --> effectiveclos Clean all bits of unused closid's (that must be updated whenever a closid1 cgroup goes from empty->nonempty and vice-versa). effectiveclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 0 1 1 0 Transition 2: effectiveclos --> expandedclos expandedclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 1 1 1 0 Then you have different inplacecos for each CPU (see pseudo-code below): On the following events. - task migration to new pCPU: - task creation: id = smp_processor_id(); for (part = desiredclos.p1; ...; part++) /* if my cosid is set and any other cosid is clear, for the part, synchronize desiredclos --> inplacecos */ if (part[mycosid] == 1 && part[any_othercosid] == 0) wrmsr(part, desiredclos); Currently the root cgroup would have all the bits set which will act like a default cgroup where all the otherwise unused parts (assuming they are a set of contiguous cache capacity bits) will be used. Otherwise the question is in the expandedclos - who decides to expand the closx parts to include some of the unused parts.. - that could just be a default root always ? Thanks, Vikas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
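The "who decides to expand" question above reduces to: after clearing the bits of unused closids, hand every cache part that no closid claims over to the default pool. A minimal sketch of that effectiveclos-to-expandedclos step, assuming the same one-bitmask-per-part layout as the tables above and assuming the default pool is a designated closid (the root/default clos in this discussion):

/* effectiveclos -> expandedclos: unclaimed parts go to the default closid. */
static void expand_into_default(unsigned int *clos_bits, int nr_parts,
                                unsigned int default_closid)
{
        int part;

        for (part = 0; part < nr_parts; part++)
                if (clos_bits[part] == 0)       /* no closid uses this part */
                        clos_bits[part] |= 1u << default_closid;
}

In the tables above this is what moves p1, unused in effectiveclos, into the default closid's row of the expandedclos matrix.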
RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Tue, 28 Jul 2015, Auld, Will wrote: -----Original Message----- Same comment as above - Cgroup masks can always overlap and other cgroups can allocate the same cache, and hence won't have exclusive cache allocation. [Auld, Will] You can define all the cbm to provide one clos with an exclusive area Do you mean a CLOS that has all the bits set? We do not support an exclusive area today. The bits in the mask can overlap.. hence they can always share the same cache allocation. So naturally the cgroup with tasks would get to use the cache if it has the same mask (say representing 50% of cache in your example) as others. [Auld, Will] automatic adjustment of the cbm makes me nervous. There are times when we want to limit the cache for a process independent of whether there is lots of unused cache. Please see the example below - in general, I just mean the cache mask can have bits that overlap - it does not matter whether there are tasks in it or not. (assume there are 8 bits max cbm) cgroupa - mask - 0xf cgroupb - mask - 0xf. Now if cgroupa has no tasks, cgroupb naturally gets all the cache. Thanks, Vikas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Wed, Jul 29, 2015 at 01:28:38AM +, Auld, Will wrote: > > > Whenever cgroupE has zero tasks, remove exclusivity (by allowing other > > > cgroups to use the exclusive ways of it). > > > > Same comment as above - Cgroup masks can always overlap and other cgroups > > can allocate the same cache , and hence wont have exclusive cache > > allocation. > > [Auld, Will] You can define all the cbm to provide one clos with an exclusive > area > > > > > So natuarally the cgroup with tasks would get to use the cache if it has > > the same > > mask (say representing 50% of cache in your example) as others . > > [Auld, Will] automatic adjustment of the cbm make me nervous. There are times > when we want to limit the cache for a process independent of whether there is > lots of unused cache. How about this: desiredclos (closid p1 p2 p3 p4) 1 1 0 0 0 2 0 0 0 1 3 0 1 1 0 p means part. closid 1 is a exclusive cgroup. closid 2 is a "cache hog" class. closid 3 is "default closid". Desiredclos is what user has specified. Transition 1: desiredclos --> effectiveclos Clean all bits of unused closid's (that must be updated whenever a closid1 cgroup goes from empty->nonempty and vice-versa). effectiveclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 0 1 1 0 Transition 2: effectiveclos --> expandedclos expandedclos (closid p1 p2 p3 p4) 1 0 0 0 0 2 0 0 0 1 3 1 1 1 0 Then you have different inplacecos for each CPU (see pseudo-code below): On the following events. - task migration to new pCPU: - task creation: id = smp_processor_id(); for (part = desiredclos.p1; ...; part++) /* if my cosid is set and any other cosid is clear, for the part, synchronize desiredclos --> inplacecos */ if (part[mycosid] == 1 && part[any_othercosid] == 0) wrmsr(part, desiredclos); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
> -Original Message- > From: Shivappa, Vikas > Sent: Tuesday, July 28, 2015 5:07 PM > To: Marcelo Tosatti > Cc: Vikas Shivappa; linux-kernel@vger.kernel.org; Shivappa, Vikas; > x...@kernel.org; h...@zytor.com; t...@linutronix.de; mi...@kernel.org; > t...@kernel.org; pet...@infradead.org; Fleming, Matt; Auld, Will; Williamson, > Glenn P; Juvva, Kanaka D > Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and > cgroup usage guide > > > > On Tue, 28 Jul 2015, Marcelo Tosatti wrote: > > > On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: > >> Adds a description of Cache allocation technology, overview of kernel > >> implementation and usage of Cache Allocation cgroup interface. > >> > >> Cache allocation is a sub-feature of Resource Director > >> Technology(RDT) Allocation or Platform Shared resource control which > >> provides support to control Platform shared resources like L3 cache. > >> Currently L3 Cache is the only resource that is supported in RDT. > >> More information can be found in the Intel SDM, Volume 3, section 17.15. > >> > >> Cache Allocation Technology provides a way for the Software (OS/VMM) > >> to restrict cache allocation to a defined 'subset' of cache which may > >> be overlapping with other 'subsets'. This feature is used when > >> allocating a line in cache ie when pulling new data into the cache. > >> > >> Signed-off-by: Vikas Shivappa > >> --- > >> Documentation/cgroups/rdt.txt | 215 > >> ++ > >> 1 file changed, 215 insertions(+) > >> create mode 100644 Documentation/cgroups/rdt.txt > >> > >> diff --git a/Documentation/cgroups/rdt.txt > >> b/Documentation/cgroups/rdt.txt new file mode 100644 index > >> 000..dfff477 > >> --- /dev/null > >> +++ b/Documentation/cgroups/rdt.txt > >> @@ -0,0 +1,215 @@ > >> +RDT > >> +--- > >> + > >> +Copyright (C) 2014 Intel Corporation Written by > >> +vikas.shiva...@linux.intel.com (based on contents and format from > >> +cpusets.txt) > >> + > >> +CONTENTS: > >> += > >> + > >> +1. Cache Allocation Technology > >> + 1.1 What is RDT and Cache allocation ? > >> + 1.2 Why is Cache allocation needed ? > >> + 1.3 Cache allocation implementation overview > >> + 1.4 Assignment of CBM and CLOS > >> + 1.5 Scheduling and Context Switch > >> +2. Usage Examples and Syntax > >> + > >> +1. Cache Allocation Technology(Cache allocation) > >> +=== > >> + > >> +1.1 What is RDT and Cache allocation > >> + > >> + > >> +Cache allocation is a sub-feature of Resource Director > >> +Technology(RDT) Allocation or Platform Shared resource control which > >> +provides support to control Platform shared resources like L3 cache. > >> +Currently L3 Cache is the only resource that is supported in RDT. > >> +More information can be found in the Intel SDM, Volume 3, section 17.15. > >> + > >> +Cache Allocation Technology provides a way for the Software (OS/VMM) > >> +to restrict cache allocation to a defined 'subset' of cache which > >> +may be overlapping with other 'subsets'. This feature is used when > >> +allocating a line in cache ie when pulling new data into the cache. > >> +The programming of the h/w is done via programming MSRs. > >> + > >> +The different cache subsets are identified by CLOS identifier (class > >> +of service) and each CLOS has a CBM (cache bit mask). The CBM is a > >> +contiguous set of bits which defines the amount of cache resource > >> +that is available for each 'subset'. 
> >> + > >> +1.2 Why is Cache allocation needed > >> +-- > >> + > >> +In todays new processors the number of cores is continuously > >> +increasing, especially in large scale usage models where VMs are > >> +used like webservers and datacenters. The number of cores increase > >> +the number of threads or workloads that can simultaneously be run. > >> +When multi-threaded-applications, VMs, workloads run concurrently > >> +they compete for shared resources including L3 cache. > >> + > >> +The Cache allocation enables more cache resources to be made > >> +avail
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Tue, 28 Jul 2015, Marcelo Tosatti wrote: On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: Adds a description of Cache allocation technology, overview of kernel implementation and usage of Cache Allocation cgroup interface. Cache allocation is a sub-feature of Resource Director Technology(RDT) Allocation or Platform Shared resource control which provides support to control Platform shared resources like L3 cache. Currently L3 Cache is the only resource that is supported in RDT. More information can be found in the Intel SDM, Volume 3, section 17.15. Cache Allocation Technology provides a way for the Software (OS/VMM) to restrict cache allocation to a defined 'subset' of cache which may be overlapping with other 'subsets'. This feature is used when allocating a line in cache ie when pulling new data into the cache. Signed-off-by: Vikas Shivappa --- Documentation/cgroups/rdt.txt | 215 ++ 1 file changed, 215 insertions(+) create mode 100644 Documentation/cgroups/rdt.txt diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt new file mode 100644 index 000..dfff477 --- /dev/null +++ b/Documentation/cgroups/rdt.txt @@ -0,0 +1,215 @@ +RDT +--- + +Copyright (C) 2014 Intel Corporation +Written by vikas.shiva...@linux.intel.com +(based on contents and format from cpusets.txt) + +CONTENTS: += + +1. Cache Allocation Technology + 1.1 What is RDT and Cache allocation ? + 1.2 Why is Cache allocation needed ? + 1.3 Cache allocation implementation overview + 1.4 Assignment of CBM and CLOS + 1.5 Scheduling and Context Switch +2. Usage Examples and Syntax + +1. Cache Allocation Technology(Cache allocation) +=== + +1.1 What is RDT and Cache allocation + + +Cache allocation is a sub-feature of Resource Director Technology(RDT) +Allocation or Platform Shared resource control which provides support to +control Platform shared resources like L3 cache. Currently L3 Cache is +the only resource that is supported in RDT. More information can be +found in the Intel SDM, Volume 3, section 17.15. + +Cache Allocation Technology provides a way for the Software (OS/VMM) +to restrict cache allocation to a defined 'subset' of cache which may +be overlapping with other 'subsets'. This feature is used when +allocating a line in cache ie when pulling new data into the cache. +The programming of the h/w is done via programming MSRs. + +The different cache subsets are identified by CLOS identifier (class +of service) and each CLOS has a CBM (cache bit mask). The CBM is a +contiguous set of bits which defines the amount of cache resource that +is available for each 'subset'. + +1.2 Why is Cache allocation needed +-- + +In todays new processors the number of cores is continuously increasing, +especially in large scale usage models where VMs are used like +webservers and datacenters. The number of cores increase the number +of threads or workloads that can simultaneously be run. When +multi-threaded-applications, VMs, workloads run concurrently they +compete for shared resources including L3 cache. + +The Cache allocation enables more cache resources to be made available +for higher priority applications based on guidance from the execution +environment. + +The architecture also allows dynamically changing these subsets during +runtime to further optimize the performance of the higher priority +application with minimal degradation to the low priority app. +Additionally, resources can be rebalanced for system throughput benefit. 
+ +This technique may be useful in managing large computer systems which +large L3 cache. Examples may be large servers running instances of +webservers or database servers. In such complex systems, these subsets +can be used for more careful placing of the available cache +resources. + +1.3 Cache allocation implementation Overview + + +Kernel implements a cgroup subsystem to support cache allocation. + +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal +to the kernel and not exposed to user. Each cgroup would have one CBM +and would just represent one cache 'subset'. + +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the +cgroup never fails. When a child cgroup is created it inherits the +CLOSid and the CBM from its parent. When a user changes the default +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once +the kernel runs out of maximum CLOSids it can support. +User can create as many cgroups as he wants but having different CBMs +at the same time is restricted by the maximum number of CLOSids +(multiple cgroups can have the same CBM). +Kernel maintains a
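The CLOSid <-> CBM mapping with reference counting described above can be pictured with the short sketch below: a table of (cbm, refcount) entries where writing a cgroup's mask either reuses the CLOSid of an identical CBM or claims a free one, and -ENOSPC is returned once the hardware's CLOSids are exhausted. The structure and names are illustrative only, not the actual patch code:

#include <errno.h>

#define MAX_CLOSIDS 16                  /* hardware dependent in reality */

struct clos_cbm_entry {
        unsigned long cbm;              /* cache bit mask for this CLOS  */
        unsigned int  refcount;         /* cgroups sharing this CLOSid   */
};

static struct clos_cbm_entry clos_map[MAX_CLOSIDS];

/* Find a CLOSid for 'cbm': reuse an identical mask, else take a free slot. */
static int closid_for_cbm(unsigned long cbm)
{
        int i, free_slot = -1;

        for (i = 0; i < MAX_CLOSIDS; i++) {
                if (clos_map[i].refcount && clos_map[i].cbm == cbm) {
                        clos_map[i].refcount++;
                        return i;       /* many cgroups, one CBM, one CLOSid */
                }
                if (!clos_map[i].refcount && free_slot < 0)
                        free_slot = i;
        }
        if (free_slot < 0)
                return -ENOSPC;         /* out of CLOSids */

        clos_map[free_slot].cbm = cbm;
        clos_map[free_slot].refcount = 1;
        return free_slot;
}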
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: > Adds a description of Cache allocation technology, overview > of kernel implementation and usage of Cache Allocation cgroup interface. > > Cache allocation is a sub-feature of Resource Director Technology(RDT) > Allocation or Platform Shared resource control which provides support to > control Platform shared resources like L3 cache. Currently L3 Cache is > the only resource that is supported in RDT. More information can be > found in the Intel SDM, Volume 3, section 17.15. > > Cache Allocation Technology provides a way for the Software (OS/VMM) > to restrict cache allocation to a defined 'subset' of cache which may > be overlapping with other 'subsets'. This feature is used when > allocating a line in cache ie when pulling new data into the cache. > > Signed-off-by: Vikas Shivappa > --- > Documentation/cgroups/rdt.txt | 215 > ++ > 1 file changed, 215 insertions(+) > create mode 100644 Documentation/cgroups/rdt.txt > > diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt > new file mode 100644 > index 000..dfff477 > --- /dev/null > +++ b/Documentation/cgroups/rdt.txt > @@ -0,0 +1,215 @@ > +RDT > +--- > + > +Copyright (C) 2014 Intel Corporation > +Written by vikas.shiva...@linux.intel.com > +(based on contents and format from cpusets.txt) > + > +CONTENTS: > += > + > +1. Cache Allocation Technology > + 1.1 What is RDT and Cache allocation ? > + 1.2 Why is Cache allocation needed ? > + 1.3 Cache allocation implementation overview > + 1.4 Assignment of CBM and CLOS > + 1.5 Scheduling and Context Switch > +2. Usage Examples and Syntax > + > +1. Cache Allocation Technology(Cache allocation) > +=== > + > +1.1 What is RDT and Cache allocation > + > + > +Cache allocation is a sub-feature of Resource Director Technology(RDT) > +Allocation or Platform Shared resource control which provides support to > +control Platform shared resources like L3 cache. Currently L3 Cache is > +the only resource that is supported in RDT. More information can be > +found in the Intel SDM, Volume 3, section 17.15. > + > +Cache Allocation Technology provides a way for the Software (OS/VMM) > +to restrict cache allocation to a defined 'subset' of cache which may > +be overlapping with other 'subsets'. This feature is used when > +allocating a line in cache ie when pulling new data into the cache. > +The programming of the h/w is done via programming MSRs. > + > +The different cache subsets are identified by CLOS identifier (class > +of service) and each CLOS has a CBM (cache bit mask). The CBM is a > +contiguous set of bits which defines the amount of cache resource that > +is available for each 'subset'. > + > +1.2 Why is Cache allocation needed > +-- > + > +In todays new processors the number of cores is continuously increasing, > +especially in large scale usage models where VMs are used like > +webservers and datacenters. The number of cores increase the number > +of threads or workloads that can simultaneously be run. When > +multi-threaded-applications, VMs, workloads run concurrently they > +compete for shared resources including L3 cache. > + > +The Cache allocation enables more cache resources to be made available > +for higher priority applications based on guidance from the execution > +environment. > + > +The architecture also allows dynamically changing these subsets during > +runtime to further optimize the performance of the higher priority > +application with minimal degradation to the low priority app. 
> +Additionally, resources can be rebalanced for system throughput benefit. > + > +This technique may be useful in managing large computer systems which > +large L3 cache. Examples may be large servers running instances of > +webservers or database servers. In such complex systems, these subsets > +can be used for more careful placing of the available cache > +resources. > + > +1.3 Cache allocation implementation Overview > + > + > +Kernel implements a cgroup subsystem to support cache allocation. > + > +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. > +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal > +to the kernel and not exposed to user. Each cgroup would have one CBM > +and would just represent one cache 'subset'. > + > +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the > +cgroup never fails. When a child cgroup is created it inherits the > +CLOSid and the CBM from its parent. When a user changes the default > +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not > +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once > +the kernel runs out of maximum CLOSids it can support. > +User can create as many cgroups as he
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote: Please edit this document to have consistent spacing. Its really hard to read this. Every time I spot a misplaced space my brain stumbles and I need to restart. > diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt > new file mode 100644 > index 000..dfff477 > --- /dev/null > +++ b/Documentation/cgroups/rdt.txt > @@ -0,0 +1,215 @@ > +RDT > +--- > + > +Copyright (C) 2014 Intel Corporation > +Written by vikas.shiva...@linux.intel.com > +(based on contents and format from cpusets.txt) > + > +CONTENTS: > += > + > +1. Cache Allocation Technology > + 1.1 What is RDT and Cache allocation ? > + 1.2 Why is Cache allocation needed ? > + 1.3 Cache allocation implementation overview > + 1.4 Assignment of CBM and CLOS > + 1.5 Scheduling and Context Switch > +2. Usage Examples and Syntax > + > +1. Cache Allocation Technology(Cache allocation) > +=== > + > +1.1 What is RDT and Cache allocation > + > + > +Cache allocation is a sub-feature of Resource Director Technology(RDT) missing ' ' before the '('. > +Allocation or Platform Shared resource control which provides support to > +control Platform shared resources like L3 cache. Currently L3 Cache is Double ' ' after '.' -- which _can_ be correct, but is inconsistent throughout the document. > +the only resource that is supported in RDT. More information can be > +found in the Intel SDM, Volume 3, section 17.15. Please also include the SDM revision, like June 2015. In fact, in the June 2015 V3 17.15 is CQM, not CAT. > +Cache Allocation Technology provides a way for the Software (OS/VMM) > +to restrict cache allocation to a defined 'subset' of cache which may > +be overlapping with other 'subsets'. This feature is used when > +allocating a line in cache ie when pulling new data into the cache. > +The programming of the h/w is done via programming MSRs. Double ' ' before 'MSRs'. > +The different cache subsets are identified by CLOS identifier (class > +of service) and each CLOS has a CBM (cache bit mask). The CBM is a > +contiguous set of bits which defines the amount of cache resource that > +is available for each 'subset'. > + > +1.2 Why is Cache allocation needed > +-- > + > +In todays new processors the number of cores is continuously increasing, > +especially in large scale usage models where VMs are used like > +webservers and datacenters. The number of cores increase the number Single ' ' after . > +of threads or workloads that can simultaneously be run. When > +multi-threaded-applications, VMs, workloads run concurrently they > +compete for shared resources including L3 cache. > + > +The Cache allocation enables more cache resources to be made available Double ' ' for no apparent reason. > +for higher priority applications based on guidance from the execution > +environment. > + > +The architecture also allows dynamically changing these subsets during > +runtime to further optimize the performance of the higher priority > +application with minimal degradation to the low priority app. > +Additionally, resources can be rebalanced for system throughput benefit. > + > +This technique may be useful in managing large computer systems which > +large L3 cache. Examples may be large servers running instances of Double ' ' > +webservers or database servers. In such complex systems, these subsets > +can be used for more careful placing of the available cache > +resources. 
> + > +1.3 Cache allocation implementation Overview > + > + > +Kernel implements a cgroup subsystem to support cache allocation. > + > +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping. No ' ' before '(' > +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal Idem, also, _no_ space after '.' > +to the kernel and not exposed to user. Each cgroup would have one CBM Double space after '.' > +and would just represent one cache 'subset'. > + > +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the I'm thinking the convention is ' ' _after_ ',', not before. > +cgroup never fails. When a child cgroup is created it inherits the > +CLOSid and the CBM from its parent. When a user changes the default > +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not > +used before. The changing of 'l3_cache_mask' may fail with -ENOSPC once > +the kernel runs out of maximum CLOSids it can support. > +User can create as many cgroups as he wants but having different CBMs > +at the same time is restricted by the maximum number of CLOSids > +(multiple cgroups can have the same CBM). > +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter Above you had ' ' around the arrows. > +for each cgroup using a CLOSid. > + > +The tasks in the cgroup would get to fill the L3 cache
Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
On Wed, Jul 01, 2015 at 03:21:04PM -0700, Vikas Shivappa wrote:

Adds a description of Cache allocation technology, overview of kernel
implementation and usage of Cache Allocation cgroup interface.

Signed-off-by: Vikas Shivappa vikas.shiva...@linux.intel.com
---
 Documentation/cgroups/rdt.txt | 215 ++
 1 file changed, 215 insertions(+)
 create mode 100644 Documentation/cgroups/rdt.txt
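As a rough illustration of the cgroup usage the changelog refers to,
the sequence below creates a group, gives it a four-way mask and moves
the calling task into it. Only the 'l3_cache_mask' file, the
inherit-on-mkdir behaviour and the -ENOSPC failure mode come from the
document; the /sys/fs/cgroup/rdt mount point, the group name and the
0xf mask are assumptions made for this sketch.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fprintf(f, "%s\n", val) < 0 || fclose(f) == EOF) {
		perror(path);		/* ENOSPC here = out of CLOSids */
		exit(EXIT_FAILURE);
	}
}

int main(void)
{
	char pid[32];

	/* A new child cgroup inherits the parent's CLOSid and CBM. */
	if (mkdir("/sys/fs/cgroup/rdt/group1", 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}

	/* Restrict the group to the low four cache ways (contiguous CBM). */
	write_str("/sys/fs/cgroup/rdt/group1/l3_cache_mask", "0xf");

	/* Move this process in; it now fills only those ways of L3. */
	snprintf(pid, sizeof(pid), "%d", getpid());
	write_str("/sys/fs/cgroup/rdt/group1/tasks", pid);
	return 0;
}

Note that the mask write is the step that can legitimately fail: once
the kernel has handed out all CLOSids to distinct CBMs, further
'l3_cache_mask' changes return -ENOSPC until an existing mask is
reused or freed.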