Re: [PATCH 2/6] x86/intel_rdt/mba_sc: Enable/disable MBA software controller

2018-05-15 Thread Shivappa Vikas

On Sun, 13 May 2018, Thomas Gleixner wrote:

> On Fri, 20 Apr 2018, Vikas Shivappa wrote:
> > +/*
> > + * Enable or disable the MBA software controller
> > + * which helps user specify bandwidth in MBps.
> > + * MBA software controller is supported only if
> > + * MBM is supported and MBA is in linear scale.
> > + */
> > +static int set_mba_sc(bool mba_sc)
> > +{
> > +   struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
> > +
> > +   if (!is_mbm_enabled() || !is_mba_linear() ||
> > +   mba_sc == is_mba_sc(r))
> > +   return -1;
> 
> Please use a proper return value as this gets propagated.

Will fix; -EINVAL would be better.
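
For reference, a minimal sketch of the corrected check (only the hunk
quoted above is shown; the enable/disable body of set_mba_sc() is
elided here):

    if (!is_mbm_enabled() || !is_mba_linear() ||
        mba_sc == is_mba_sc(r))
        return -EINVAL;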

Thanks,
Vikas

> 
> Thanks,
> 
>   tglx
> 


Re: [PATCH V2 0/6] Memory bandwidth allocation software controller(mba_sc)

2018-04-30 Thread Shivappa Vikas

Hello Thomas,

I have sent a new version trying to address your feedback; it is also 
cleaner now. It would be great if you could let me know your feedback.

Regards,
Vikas

On Fri, 20 Apr 2018, Vikas Shivappa wrote:

> Sending the second version of the MBA software controller, which
> addresses the feedback on V1. Thanks to Thomas for the feedback on V1:
> he was unhappy about the poor structure and English in the
> documentation and comments explaining the changes, and about the
> duct-taping of the data structure which saves the throttle MSRs. Issues
> were also pointed out in the mounting and other init code. This series
> also changes the counting and feedback loop patches with some
> improvements, to avoid doing any division and to take care of hierarchy
> and some L2 -> L3 traffic scenarios.
> 
> The patches are based on 4.16.
> 
> Background:
> 
> Intel RDT memory bandwidth allocation (MBA) currently uses the resctrl
> interface and uses the schemata file in each rdtgroup to specify the max
> "bandwidth percentage" that is allowed to be used by the "threads" and
> "cpus" in the rdtgroup. These values are specified "per package" in each
> rdtgroup in the schemata file as below:
> 
> $ cat /sys/fs/resctrl/p1/schemata 
> L3:0=7ff;1=7ff
> MB:0=100;1=50
> 
> In the above example, MB is the memory bandwidth percentage, and "0"
> and "1" are the package/socket ids. The threads in rdtgroup "p1"
> would get 100% memory bandwidth on socket0 and 50% bandwidth on socket1.
> 
> Problem:
> 
> However, there are confusions in specifying the MBA in "percentage":
> 
> 1. In some scenarios, when the user increases the bandwidth percentage
>    values, the raw bandwidth in "MBps" does not increase.
> 2. The same bandwidth "percentage values" may mean different raw
>    bandwidths in "MBps".
> 3. This interface may also end up unnecessarily controlling the L2 <->
>    L3 traffic which has no or very minimal L3 external traffic.
> 
> Proposed solution:
> 
> In short, we let user specify the bandwidth in "MBps" and we introduce
> a software feedback loop which measures the bandwidth using MBM and
> restricts the bandwidth "percentage" internally.
> 
> The fact that memory bandwidth allocation (MBA) is a core-specific
> mechanism, whereas memory bandwidth monitoring (MBM) is done at the
> package level, is what leads to confusion when users try to apply
> control via MBA and then monitor the bandwidth to see if the controls
> are effective. Below are details on such scenarios:
> 
> 1. User may *not* see increase in actual bandwidth when bandwidth
>percentage values are increased:
> 
> This can occur when the aggregate L2 external bandwidth is more than the
> L3 external bandwidth. Consider an SKL SKU with 24 cores on a package,
> where the L2 external bandwidth is 10GBps (hence the aggregate L2
> external bandwidth is 240GBps) and the L3 external bandwidth is 100GBps.
> Now a workload with '20 threads, having 50% bandwidth, each consuming
> 5GBps' consumes the max L3 bandwidth of 100GBps although the percentage
> value specified is only 50% << 100%. Hence increasing the bandwidth
> percentage will not yield any more bandwidth. This is because although
> the L2 external bandwidth still has capacity, the L3 external bandwidth
> is fully used. Also note that this is dependent on the number of cores
> the benchmark is run on.
> 
> 2. The same bandwidth percentage may mean different actual bandwidth
>    depending on the number of threads:
> 
> For the same SKU as in #1, a 'single thread, with 10% bandwidth' and '4
> threads, with 10% bandwidth' can consume up to 10GBps and 40GBps
> respectively, although they have the same percentage bandwidth of 10%.
> This is simply because as threads start using more cores in an
> rdtgroup, the actual bandwidth may increase or vary although the
> user-specified bandwidth percentage is the same.
> 
> In order to mitigate this and make the interface more user friendly,
> resctrl added support for specifying the bandwidth in "MBps" as well.
> The kernel underneath uses a software feedback mechanism, or
> a "Software Controller", which reads the actual bandwidth using MBM
> counters and adjusts the memory bandwidth percentages to ensure
> 
>   "actual bandwidth < user specified bandwidth".
> 
> By default, the schemata takes the bandwidth percentage values,
> whereas the user can switch to the "MBA software controller" mode using
> the mount option 'mba_MBps'. The schemata format is shown below.
> 
> To use the feature, mount the file system with the 'mba_MBps' option:
>  
> $ mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
> 
> If MBA is specified in MBps, the user can enter the max bandwidth in
> MBps rather than the percentage values. The default value when mounted
> is max_u32.
> 
> $ echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
> $ echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
> 
> In the above example the tasks in the "p0" and "p1" rdtgroups
> would use a max bandwidth of 1024MBps on socket0 and 
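
A minimal sketch of one iteration of the software feedback loop described
above; the names and the fixed-step policy are illustrative assumptions,
not the series' actual code:

    /*
     * One feedback step: compare the bandwidth measured via MBM deltas
     * (cur_bw, in MBps) against the user limit (user_bw, in MBps) and
     * nudge the throttle percentage programmed into the MBA throttle
     * MSRs.
     */
    static u32 update_mba_bw(u32 cur_msr_pct, u64 cur_bw, u64 user_bw,
                             u32 bw_gran, u32 min_bw)
    {
        if (cur_bw > user_bw && cur_msr_pct >= min_bw + bw_gran)
            return cur_msr_pct - bw_gran;   /* throttle harder */
        if (cur_bw < user_bw && cur_msr_pct + bw_gran <= 100)
            return cur_msr_pct + bw_gran;   /* relax throttling */
        return cur_msr_pct;
    }

The real series additionally avoids division and uses the measured
bandwidth delta per granularity step to decide when raising the
percentage is safe; this sketch only shows the direction of the
guarantee "actual bandwidth < user specified bandwidth".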


Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

2018-04-04 Thread Shivappa Vikas


On Wed, 4 Apr 2018, Thomas Gleixner wrote:

> On Tue, 3 Apr 2018, Shivappa Vikas wrote:
> > On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> > > On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> > > The L2 external bandwidth is higher than the L3 external bandwidth.
> > > 
> > >  Is there any information available from CPUID or whatever source which
> > >  allows us to retrieve the bandwidth ratio or the absolute maximum
> > >  bandwidth per level?
> > 
> > There is no information in cpuid on the bandwidth available. Also we have 
> > seen
> > from our experiments that the increase is not perfectly linear (delta
> > bandwidth increase from 30% to 40% may not be same as 70% to 80%). So we
> > currently dynamically caliberate this delta for the software controller.
> 
> I assume you mean: calibrate
> 
> Though I don't see anything which looks remotely like calibration.
> Calibration means that you determine the exact parameters by observation and
> then can use the calibrated values afterwards. But that's not what you are
> doing. So please don't claim its calibration.
> 
> You observe behaviour which depends on the workload and other
> factors. That's not calibration. If you change the MSR by a granularity
> value then you calculate the bandwidth delta vs. the previous MSR
> value. That only makes sense and works when the application is having the
> same memory access patterns accross both observation periods.
> 
> And of course, this won't be necessarily linear because if you throttle the
> application then it gets less work done per CPU time slice and the
> resulting stalls will also have side effects on the requested amount of
> memory and therefore distort the measurement. Ditto the other way
> around.
> 
> There are too many factors influencing this, so claiming that it's
> calibration is window dressing at best. Even worse it suggests that it's
> something accurate, which subverts your goal of reducing confusion.
> 
> Adaptive control might be an acceptable description, though given the
> amount of factors which play into that it's still an euphemism for
> 'heuristic'.

Agreed, we do not really calibrate; the only thing we guarantee is that 
the actual bandwidth in bytes stays below the user-specified bandwidth in 
bytes. This is what the hardware guaranteed when we specified the values 
in percentage as well, it was just confusing.

> 
> > > What's also missing from your explanation is how that feedback loop 
> > > behaves
> > > under different workloads.
> > > 
> > >  Is this assuming that the involved threads/cpus actually try to utilize
> > >  the bandwidth completely?
> > 
> > No, the feedback loop only guarantees that the usage will not exceed
> > what the user specifies as max bandwidth. If it is using below the max
> > value it does not matter how much less it is using.
> > > 
> > >  What happens if the threads/cpus are only using a small set because they
> > >  are idle or their computations are mostly cache local and do not need
> > >  external bandwidth? Looking at the implementation I don't see how that is
> > >  taken into account.
> > 
> > The feedback only kicks into action if an rdtgroup uses more
> > bandwidth than the max specified by the user. I specified that it is
> > always "ensure the "actual b/w < user b/w"" and can add more
> > explanation on these scenarios.
> 
> Please finally stop to use this horrible 'b/w' thingy. It makes my eyes bleed
> everytime.

Will fix - this text was from the already existing documentation.

> 
> > Also note that we are using the MBM counters for this feedback loop. Now 
> > that
> > the interface is much more useful because we have the same rdtgroup that is
> > being monitored and controlled. (vs. if we had the perf mbm the group of
> > threads in resctrl mba and in mbm could be different and would be hard to
> > measure what the threads/cpus in the resctrl are using).
> 
> Why does that make me smile?

I know why :) Full credit to you, as you had suggested rewriting 
cqm/mbm in resctrl, which is definitely very good in the long term!

Thanks,
Vikas

> 
> Thanks,
> 
>   tglx
> 


Re: [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support

2018-04-03 Thread Shivappa Vikas



On Tue, 3 Apr 2018, Thomas Gleixner wrote:


On Thu, 29 Mar 2018, Vikas Shivappa wrote:

+void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
+{
+   int i;
+
+   /*
+* Initialize the Control MSRs to having no control.
+* For Cache Allocation: Set all bits in cbm
+* For Memory Allocation: Set b/w requested to 100%
+* and the b/w in MB to U32_MAX
+*/
+   for (i = 0; i < r->num_closid; i++, dc++, dm++) {
+   *dc = r->membw.bw_byte ? MBA_BW_MAX_MB : r->default_ctrl;
+   *dm = r->default_ctrl;


No! Please stop duct taping your stuff into the existing code. So far the
ctrl value was the same as the value which was actually written into the
MSR. With your new mode you have to split that up into the user supplied
value and the value which gets written into the MSR.

So the right thing to do is to separate the user value and the MSR value
first and independent of the mode. Then the new mode falls into place
naturally because r->default_ctrl and r->default_msrval are set up at mount
time with the values which correspond to the mount mode.


Will fix. I tried both, and this implementation assumes that what the user 
modifies is the control values (because then schemata read and write are 
easy, as the user does it directly), but I agree we can change that.
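
A rough sketch of the separation suggested above; the default_msrval
field and the is_mba_sc() helper are assumptions following the review,
not the actual patch:

    /* Keep the user-visible control value (dc) and the value written
     * to the MSR (dm) separate, independent of the mount mode. */
    for (i = 0; i < r->num_closid; i++, dc++, dm++) {
        *dc = is_mba_sc(r) ? MBA_BW_MAX_MB : r->default_ctrl;
        *dm = r->default_msrval;
    }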


Thanks,
Vikas



Thanks,

tglx



Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

2018-04-03 Thread Shivappa Vikas



On Tue, 3 Apr 2018, Thomas Gleixner wrote:


On Tue, 3 Apr 2018, Thomas Gleixner wrote:

On Thu, 29 Mar 2018, Vikas Shivappa wrote:
You said above:


This may lead to confusion in scenarios below:


Reading the blurb after that creates even more confusion than being
helpful.

First of all this information should not be under the section 'Memory
bandwidth in MB/s'.

Also please write bandwidth. The weird acronym b/w (band per width???) is
really not increasing legibility.

What you really want is a general section about memory bandwidth allocation
where you explain the technical background in purely technical terms w/o
fairy tale mode. Technical descriptions have to be factual and not
'could/may/would'.

If I decode the above correctly then the current percentage based
implementation was buggy from the very beginning in several ways.

Now the obvious question which is in no way answered by the cover letter is
why the current percentage based implementation cannot be fixed and we need
some feedback driven magic to achieve that. I assume you spent some brain
cycles on that question, so it would be really helpful if you shared that.

If I understand it correctly then the problem is that the throttling
mechanism is per core and affects the L2 external bandwidth.

  Is this really per core? What about hyper threads. Both threads have that
  MSR. How is that working?

The L2 external bandwidth is higher than the L3 external bandwidth.

  Is there any information available from CPUID or whatever source which
  allows us to retrieve the bandwidth ratio or the absolute maximum
  bandwidth per level?

What's also missing from your explanation is how that feedback loop behaves
under different workloads.

  Is this assuming that the involved threads/cpus actually try to utilize
  the bandwidth completely?

  What happens if the threads/cpus are only using a small set because they
  are idle or their computations are mostly cache local and do not need
  external bandwidth? Looking at the implementation I don't see how that is
  taken into account.


Forgot to mention the following:

 The proposed new interface has no upper limit. The existing percentage
 based implementation has at least some notion of limit and scale; not
 really helpful either because of the hardware implementation. but I

 How is the poor admin supposed to configure that new thing without
 knowing what the actual hardware limits are in the first place?


That is true. The default values only set it to a very high bandwidth, 
which means the user gets to use everything. There seems to be no way 
other than calibrating to know the actual max bandwidth in bytes. That 
could be a better value to have as the default, so the admin knows the 
limit. I will explore whether there is a way to calculate it without 
calibrating.


Thanks,
Vikas



Thanks,

tglx





Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

2018-04-03 Thread Shivappa Vikas



On Tue, 3 Apr 2018, Thomas Gleixner wrote:


On Thu, 29 Mar 2018, Vikas Shivappa wrote:

+Memory bandwidth(b/w) in MegaBytes
+--
+
+Memory bandwidth is a core specific mechanism which means that when the
+Memory b/w percentage is specified in the schemata per package it
+actually is applied on a per core basis via IA32_MBA_THRTL_MSR
+interface. This may lead to confusion in scenarios below:
+
+1. User may not see increase in actual b/w when percentage values are
+   increased:
+
+This can occur when aggregate L2 external b/w is more than L3 external
+b/w. Consider an SKL SKU with 24 cores on a package and where L2
+external b/w is 10GBps (hence aggregate L2 external b/w is 240GBps) and
+L3 external b/w is 100GBps. Now a workload with '20 threads, having 50%
+b/w, each consuming 5GBps' consumes the max L3 b/w of 100GBps although
+the percentage value specified is only 50% << 100%. Hence increasing
+the b/w percentage will not yield any more b/w. This is because
+although the L2 external b/w still has capacity, the L3 external b/w
+is fully used. Also note that this would be dependent on number of
+cores the benchmark is run on.
+
+2. Same b/w percentage may mean different actual b/w depending on # of
+   threads:
+
+For the same SKU in #1, a 'single thread, with 10% b/w' and '4 thread,
+with 10% b/w' can consume upto 10GBps and 40GBps although they have same
+percentage b/w of 10%. This is simply because as threads start using
+more cores in an rdtgroup, the actual b/w may increase or vary although
+user specified b/w percentage is same.
+
+In order to mitigate this and make the interface more user friendly, we
+can let the user specify the max bandwidth per rdtgroup in bytes(or mega
+bytes). The kernel underneath would use a software feedback mechanism or
+a "Software Controller" which reads the actual b/w using MBM counters
+and adjusts the memory bandwidth percentages to ensure the "actual b/w
+< user b/w".
+
+The legacy behaviour is default and user can switch to the "MBA software
+controller" mode using a mount option 'mba_MB'.


You said above:


This may lead to confusion in scenarios below:


Reading the blurb after that creates even more confusion than being
helpful.

First of all this information should not be under the section 'Memory
bandwidth in MB/s'.

Also please write bandwidth. The weird acronym b/w (band per width???) is
really not increasing legibility.


Ok, will fix and add a separate section.



What you really want is a general section about memory bandwidth allocation
where you explain the technical background in purely technical terms w/o
fairy tale mode. Technical descriptions have to be factual and not
'could/may/would'.

If I decode the above correctly then the current percentage based
implementation was buggy from the very beginning in several ways.

Now the obvious question which is in no way answered by the cover letter is
why the current percentage based implementation cannot be fixed and we need
some feedback driven magic to achieve that. I assume you spent some brain
cycles on that question, so it would be really helpful if you shared that.

If I understand it correctly then the problem is that the throttling
mechanism is per core and affects the L2 external bandwidth.

 Is this really per core? What about hyper threads. Both threads have that
 MSR. How is that working?


It is a per-core mechanism. With hyperthreads, it just takes the lowest 
bandwidth among the thread siblings. We have the text below to explain 
this - I can add more description if needed.


"The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth."



The L2 external bandwidth is higher than the L3 external bandwidth.

 Is there any information available from CPUID or whatever source which
 allows us to retrieve the bandwidth ratio or the absolute maximum
 bandwidth per level?


There is no information in cpuid on the bandwidth available. Also we have seen 
from our experiments that the increase is not perfectly linear (delta bandwidth 
increase from 30% to 40% may not be same as 70% to 80%). So we currently 
dynamically caliberate this delta for the software controller.




What's also missing from your explanation is how that feedback loop behaves
under different workloads.

 Is this assuming that the involved threads/cpus actually try to utilize
 the bandwidth completely?


No, the feedback loop only guarantees that the usage will not exceed what the 
user specifies as max bandwidth. If it is using below the max value, it does not 
matter how much less it is using.




 What happens if the threads/cpus are only using a small set because they
 are idle or their computations are mostly cache local and do not need
 external bandwidth? Looking at the implementation I don't see how that is
 taken into account.



Re: [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option

2018-03-30 Thread Shivappa Vikas


Hello Thomas,

On Fri, 30 Mar 2018, Thomas Gleixner wrote:


On Thu, 29 Mar 2018, Vikas Shivappa wrote:

Subject: x86/intel_rdt/mba_sc: Add support to enable/disable via mount option

Huch? From Documentation:

 The ``summary phrase`` in the email's Subject should concisely
 describe the patch which that email contains.

You're introducing somthing new: mba_sc

It's completely unclear what that is and what it means.

x86/intel_rdt: Add mount option for bandwidth allocation in MB/s

or something like that.


Would 'Mount option to enable the MBA software controller' be better, given 
that I have a documentation patch which explains what the MBA software 
controller is?





Specify a new mount option "mba_MB" to enable the user to specify MBA
bandwidth in Megabytes(Software Controller/SC) instead of b/w


You cannot specify bandwidth in Megabytes. Bandwidth is a bit-rate and the
units are multiples of bits per second and not Megabytes.


--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -259,6 +259,7 @@ struct rdt_cache {
  * @min_bw:Minimum memory bandwidth percentage user can request
  * @bw_gran:   Granularity at which the memory bandwidth is allocated
  * @delay_linear:  True if memory B/W delay is in linear scale
+ * @bw_byte:   True if memory B/W is specified in bytes


So the mount parameter says Megabytes, but here you say bytes? What?

And bw_byte is a misnomer. bw_bytes if you really mean bytes. bw_mb if it's 
megabytes.


Will fix the naming. Thanks for pointing out that it should be MBps everywhere.




+#define is_mba_linear() rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear
+#define is_mba_MBctrl() rdt_resources_all[RDT_RESOURCE_MBA].membw.bw_byte


Please use inlines and no camel case. That's horrible.


Will fix..
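
A sketch of the requested inline helpers replacing the macros; the
second helper's shape is an assumption:

    static inline bool is_mba_linear(void)
    {
        return rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear;
    }

    static inline bool is_mba_mbps_ctrl(struct rdt_resource *r)
    {
        return r->membw.bw_byte;    /* field name per this patch */
    }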




+
 /**
  * struct rdt_resource - attributes of an RDT resource
  * @rid:   The index of the resource
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c 
b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index fca759d..0707191 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1041,6 +1041,24 @@ static int set_cache_qos_cfg(int level, bool enable)
return 0;
 }

+static void __set_mba_byte_ctrl(bool byte_ctrl)
+{
+   struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
+
+   r->membw.bw_byte = byte_ctrl;


I don't see the point of this extra function. It has exactly one user.


+}
+
+/*
+ * MBA allocation in bytes is only supported if
+ * MBM is supported and MBA is in linear scale.
+*/


Hint: checkpatch.pl is not optional


+static void set_mba_byte_ctrl(bool byte_ctrl)
+{
+   if ((is_mbm_enabled() && is_mba_linear()) &&
+   byte_ctrl != is_mba_MBctrl())
+   __set_mba_byte_ctrl(byte_ctrl);


And that user is a small enough function. To avoid indentation you can
simply return when the condition is false.

Also if the user wants to mount with the MB option and it's not supported,
why are you not returning an error code and refuse the mount? That's just
wrong.


Will fix. can merge into one function and return error when not available.




+
 static int cdp_enable(int level, int data_type, int code_type)
 {
struct rdt_resource *r_ldata = &rdt_resources_all[data_type];
@@ -1104,7 +1122,7 @@ static void cdp_disable_all(void)
cdpl2_disable();
 }

-static int parse_rdtgroupfs_options(char *data)
+static int parse_rdtgroupfs_options(char *data, bool *mba_MBctrl)


What?


 {
char *token, *o = data;
int ret = 0;
@@ -1123,6 +1141,8 @@ static int parse_rdtgroupfs_options(char *data)
ret = cdpl2_enable();
if (ret)
goto out;
+   } else if (!strcmp(token, "mba_MB")) {
+   *mba_MBctrl = true;


That's mindless hackery. Really. What's wrong with setting the flag in the
resource and then add the actual register fiddling right in the

if (is_mbm_enabled()) {

section in rdt_mount()? That would be too obvious and fit into the existing
code, right?


Will fix.
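
A hedged sketch of what the review suggests; the error handling and
exact placement are assumptions:

    /* In parse_rdtgroupfs_options(): set the flag in the resource and
     * refuse the mount when the mode is unsupported. */
    } else if (!strcmp(token, "mba_MB")) {
        if (!is_mbm_enabled() || !is_mba_linear()) {
            ret = -EINVAL;
            goto out;
        }
        rdt_resources_all[RDT_RESOURCE_MBA].membw.bw_byte = true;

The actual register fiddling would then sit in the is_mbm_enabled()
section of rdt_mount(), as the review points out.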




+   /*Set the control values before the rest of reset*/


Space after '/*' and before '*/

Aside of that the comment is pretty useless. 'the control values' ??? Which
control values?



Will fix the comment or remove it. I wanted to point out here that we reset 
the control values (the delay values that go into the IA32_MBA_THRTL_MSRs), 
but that's done anyway in the reset_all_ctrls call after this, so the comment 
can be removed.


Will fix the checkpatch issues as pointed out.

In general, I wanted to know whether it is a sane idea to have a software 
feedback loop and let the user specify bandwidth in MBps rather than the 
confusing percentage values. The typical confusing scenarios are documented 
with examples in the documentation patch. The use case can occur in any 
rdtgroups which are trying to group jobs where different number of threads are 

Re: [PATCH 3/3] x86/intel_rdt/cqm: Improve limbo list processing

2017-08-14 Thread Shivappa Vikas



On Mon, 14 Aug 2017, Shivappa Vikas wrote:




On Mon, 14 Aug 2017, Thomas Gleixner wrote:


On Wed, 9 Aug 2017, Vikas Shivappa wrote:

@@ -426,6 +426,9 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)

   GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
+   INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+   if (has_busy_rmid(r, d))
+   cqm_setup_limbo_handler(d);


This is beyond silly. d->rmid_busy_llc is allocated a few lines above. How
would a bit be set here?


If we logically offline all cpus in a package and bring it back, the worker 
needs to be scheduled on the package if there were busy RMIDs on this 
package. Otherwise that RMID never gets freed, as its rmid->busy stays at 1.


I needed to scan the limbo list and set the bits for all limbo RMIDs after 
the alloc and before doing the 'has_busy_rmid' check. Will fix


Tony pointed out that there is no guarantee that a domain will come back up 
once it's down, so the above issue of rmid->busy staying at > 0 can still 
happen. So I will delete this -

if (has_busy_rmid(r, d))
cqm_setup_limbo_handler(d);

and add this when a domain is powered down -
for each rmid in d->rmid_busy_llc
if (!--entry->busy)
free_rmid(rmid);

We have no way to know whether the L3 was indeed flushed (or the package was 
powered off). This may lead to incorrect counts in rare scenarios, but we can 
document that.
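
A hedged sketch of that teardown pass, assuming the existing
rmid_busy_llc bitmap, an __rmid_entry() lookup helper and r->num_rmid:

    unsigned int idx;
    struct rmid_entry *entry;

    /* On domain removal: retire this domain's limbo RMIDs, since we
     * cannot tell whether the package's L3 was actually flushed. */
    for_each_set_bit(idx, d->rmid_busy_llc, r->num_rmid) {
        entry = __rmid_entry(idx);
        if (!--entry->busy)
            free_rmid(entry->rmid);
    }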


Thanks,
vikas


Re: [PATCH 3/3] x86/intel_rdt/cqm: Improve limbo list processing

2017-08-14 Thread Shivappa Vikas



On Mon, 14 Aug 2017, Thomas Gleixner wrote:


On Wed, 9 Aug 2017, Vikas Shivappa wrote:


@@ -426,6 +426,9 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
   GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
+   INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+   if (has_busy_rmid(r, d))
+   cqm_setup_limbo_handler(d);


This is beyond silly. d->rmid_busy_llc is allocated a few lines above. How
would a bit be set here?


If we logically offline all cpus in a package and bring it back, the worker 
needs to be scheduled on the package if there were busy RMIDs on this package. 
Otherwise that RMID never gets freed, as its rmid->busy stays at 1.


I needed to scan the limbo list and set the bits for all limbo RMIDs after the 
alloc and before doing the 'has_busy_rmid' check. Will fix





}
if (is_mbm_total_enabled()) {
tsize = sizeof(*d->mbm_total);
@@ -536,11 +539,25 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
list_del(&d->list);
if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
+
+   if (is_llc_occupancy_enabled() &&
+   has_busy_rmid(r, d))


What is that line break helping here and why can't you just unconditionally
cancel the work?


Will fix the line break. The has_busy_rmid check makes sure the worker was 
indeed scheduled - that way we only cancel a worker which was actually scheduled.





+   cancel_delayed_work(&d->cqm_limbo);
+
kfree(d);
-   } else if (r == &rdt_resources_all[RDT_RESOURCE_L3] &&
-  cpu == d->mbm_work_cpu && is_mbm_enabled()) {
-   cancel_delayed_work(&d->mbm_over);
-   mbm_setup_overflow_handler(d);
+   return;
+   }
+
+   if (r == &rdt_resources_all[RDT_RESOURCE_L3]) {
+   if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
+   cancel_delayed_work(&d->mbm_over);
+   mbm_setup_overflow_handler(d);


I think this is the wrong approach. If the timer is about to fire you
essentially double the interval. So you better flush the work, which will
reschedule it if needed.


Ok, will fix. We can flush (set up and run it immediately) the work here 
on the new cpu.
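
A hedged sketch of that flush-style move; the extra delay argument to
mbm_setup_overflow_handler() is an assumption:

    if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
        /* Move the work to another online CPU and run it immediately,
         * instead of cancel + re-arm, which can double the interval. */
        cancel_delayed_work(&d->mbm_over);
        mbm_setup_overflow_handler(d, 0);
    }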





+   }
+   if (is_llc_occupancy_enabled() && cpu == d->mbm_work_cpu &&


That wants to be d->cbm_work_cpu, right?


Correct - thanks for pointing that out, will fix.




+   has_busy_rmid(r, d)) {
+   cancel_delayed_work(&d->cqm_limbo);
+   cqm_setup_limbo_handler(d);


See above.


For cqm the 1s interval is not a hard requirement, but we can flush the work 
like mbm to keep it uniform.


Thanks,
Vikas



Thanks,

tglx



Re: [PATCH 09/28] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

2017-08-02 Thread Shivappa Vikas



On Tue, 1 Aug 2017, Thomas Gleixner wrote:


On Tue, 25 Jul 2017, Vikas Shivappa wrote:

Hardware uses RMID(Resource monitoring ID) to keep track of each of the
RDT events associated with tasks. The number of RMIDs is dependent on
the SKU and is enumerated via CPUID. We add support to manage the RMIDs
which includes managing the RMID allocation and reading LLC occupancy
for an RMID.

RMID allocation is managed by keeping a free list which is initialized
to all available RMIDs except for RMID 0 which is always reserved for
root group. RMIDs go to a limbo list once they are
freed since the RMIDs are still tagged to cache lines of the tasks which
were using them - thereby still having some occupancy. They continue to
be in the limbo list until the occupancy < threshold_occupancy. The
threshold_occupancy is a user configurable value.
OS uses the IA32_QM_CTR MSR to read the occupancy associated with an RMID
after programming the IA32_QM_EVTSEL MSR with the RMID.

[Tony: Improved limbo search]


The search is smarter, but the smp function call problem per se
persists. It's still more than 100us worst case on a BDW box and it's going
to be worse with the next generation


As an alternative we could have delayed worker threads (hence not in interrupt 
context) run every 1s to check the limbo lists, using the same optimized search 
with 1 bit per RMID.


When a mkdir tries to get an RMID: if no RMID is in the free list and some are 
present in limbo we return -EBUSY, and if both the free and limbo lists are 
empty we return -ENOSPC.


If that is reasonable I can send an updated patch for the same.

Thanks,
Vikas



Thanks,

tglx
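
For reference, a sketch of the allocation policy described above; the list
names are illustrative assumptions, but this is close to what the thread
converged on:

	static int alloc_rmid(void)
	{
		struct rmid_entry *entry;

		lockdep_assert_held(&rdtgroup_mutex);

		/* -ENOSPC: no RMIDs at all; -EBUSY: all remaining RMIDs in limbo */
		if (list_empty(&rmid_free_lru))
			return list_empty(&rmid_limbo_lru) ? -ENOSPC : -EBUSY;

		entry = list_first_entry(&rmid_free_lru, struct rmid_entry, list);
		list_del(&entry->list);

		return entry->rmid;
	}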



Re: [PATCH 00/28 V2] Cqm3 patch series(along with MBM support)

2017-08-01 Thread Shivappa Vikas


Hello Thomas,

On Tue, 1 Aug 2017, Thomas Gleixner wrote:


On Tue, 25 Jul 2017, Vikas Shivappa wrote:


Sending the V2 series to address all the feedback that was received in
V1. Apart from that this series adds a patch from Reinette to fix a
global declaration in existing RDT and also fixes some issues where the
PQR MSR was not updated when masks or tasks were moved in some cases,
plus some cleanups.


I merged that pile (after fixing the compiler warnings - SIGH) with the
proviso that

- the cache line issue vs. the default and current pqr state gets resolved

- the worst case latency problem with the smp function call gets addressed

I decided to merge it now so it gets exposure in next, as I'm going to
vanish for my usual summer camp activity for the next 10 days and I don't
want to do that huge thing as a last minute merge post rc5.

Upstream merging into 4.14 depends on the resolution of the open issues.


Thanks for all the feedback. Will send a patch on top of tip for open issues.

Vikas



Thanks,

tglx





Re: [PATCH 16/28] x86/intel_rdt: Prepare to add RDT monitor cpus file support

2017-08-01 Thread Shivappa Vikas



On Tue, 1 Aug 2017, Thomas Gleixner wrote:


On Tue, 25 Jul 2017, Vikas Shivappa wrote:

 /*
  * The cached intel_pqr_state is strictly per CPU and can never be
  * updated from a remote CPU. Functions which modify the state
@@ -49,6 +47,8 @@
  */
 DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

+DEFINE_PER_CPU_READ_MOSTLY(struct intel_pqr_state, rdt_cpu_default);


Cacheline wise this is suboptimal. You have to touch two cachelines on each
context switch (at least for read).

If you make that:

struct intel_pqr_state {
u32 default_cosid;
u32 default_rmid;
u32 cur_cosid;
u32 cur_rmid;
};

DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

then it's all together and you spare one cache line.


Will fix..

Thanks,
Vikas



Thanks,

tglx
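
A sketch of the combined layout (field names are assumptions): the defaults
and the currently programmed values live in one per-cpu object, so the
context switch path touches a single cache line.

	struct intel_pqr_state {
		u32 cur_rmid;		/* currently programmed RMID */
		u32 cur_closid;		/* currently programmed CLOSID */
		u32 default_rmid;	/* per-cpu default RMID */
		u32 default_closid;	/* per-cpu default CLOSID */
	};

	DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);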



Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support

2017-07-13 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


 {
+   struct rdtgroup *pr = rdtgrp->parent, *cr;


*pr and *cr really suck.


We used 'r' for rdtgroup before. 'pr' would be the parent rdtgrp. Wanted to keep them short 
as there are more in this function.


prgrp can be used if that's not ok?


Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support

2017-07-13 Thread Shivappa Vikas



On Thu, 6 Jul 2017, Thomas Gleixner wrote:


On Thu, 6 Jul 2017, Shivappa Vikas wrote:

On Sun, 2 Jul 2017, Thomas Gleixner wrote:

+   /* Check whether cpus belong to parent ctrl group */
+   cpumask_andnot(tmpmask, newmask, &pr->cpu_mask);
+   if (cpumask_weight(tmpmask)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   /* Check whether cpus are dropped from this group */
+   cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
+   if (cpumask_weight(tmpmask)) {
+   /* Give any dropped cpus to parent rdtgroup */
+   cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);


This does not make any sense. The check above verifies that all cpus in
newmask belong to the parent->cpu_mask. If they don't then you return
-EINVAL, but here you give them back to parent->cpu_mask. How is that
supposed to work? You never get into this code path!


The parent->cpu_mask is always the parent->cpus_valid_mask if I understand
right. With monitor groups, a cpu is always present in "one"
ctrl_mon group and one mon_group. And the mon group can have only cpus in its
parent. Maybe it needs a comment? (it's explained in the documentation patch).


Sigh, the code needs to be written in a way that it is halfways obvious
what's going on.


# mkdir /sys/fs/resctrl/p1
# mkdir /sys/fs/resctrl/p1/mon_groups/m1
# echo 5-10 > /sys/fs/resctr/p1/cpus_list
Say p1 has RMID 2
cpus 5-10 have RMID 2


So what you say, is that parent is always the resource control group
itself.

Can we please have a proper distinction in the code? I tripped over those
ambiguities several times.

The normal meaning of parent->child relations is that both have the same
type. While this is the case at the implementation detail level (both are
type struct rdtgroup), from a conceptual level they are different:

 parent is a resource group and child is a monitoring group

That should be expressed in the code, at the very least by variable naming,
so it becomes immediately clear that this operates on two different
entities.

The proper solution is to have different data types or at least embed the
monitoring bits in a separate entity inside of struct rdtgroup.


Yes, they are conceptually different. There is data which is 
specific to monitoring only, but the two share a lot of data. So I was still 
thinking what's best, but kept a type field which separates them both. The 
monitoring-only data seems to be just the 'parent', so we can embed the monitoring 
bits in a separate struct (the parent is initialized for a ctrl_mon group but 
never really used).


Thanks,
Vikas
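
One possible shape of that, as a sketch (struct and field names are
illustrative, not the final patch): keep the monitoring-only bits in their
own struct embedded in struct rdtgroup, so the two roles stay visible.

	struct mongroup {
		struct rdtgroup	 *parent;	/* only valid for RDTMON_GROUP */
		struct list_head crdtgrp_list;	/* children of a ctrl_mon group */
		u32		 rmid;
	};

	struct rdtgroup {
		/* ... existing control group fields ... */
		enum rdt_group_type	type;	/* RDTCTRL_GROUP or RDTMON_GROUP */
		struct mongroup		mon;	/* monitoring-only data */
	};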


Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management

2017-07-11 Thread Shivappa Vikas



On Mon, 3 Jul 2017, Thomas Gleixner wrote:


On Sun, 2 Jul 2017, Thomas Gleixner wrote:

Thinking a bit more about that limbo mechanics.

In case that a RMID was never used on a particular package, the state check
forces an IPI on all packages unconditionally. That's suboptimal at least.

We know on which package a given RMID was used, so we could restrict the
checks to exactly these packages, but I'm not sure it's worth the
trouble. We might at least document that and explain why this is
implemented in that way.


Second thoughts on that. The allocation logic is:


+   if (list_empty(&rmid_free_lru)) {
+   ret = try_freeing_limbo_rmid();
+   if (list_empty(&rmid_free_lru))
+   return ret ? -ENOSPC : -EBUSY;
+   }
+
+   entry = list_first_entry(&rmid_free_lru,
+struct rmid_entry, list);
+   list_del(&entry->list);
+
+   return entry->rmid;


That means, the free list is used as the primary source. One of my boxes
has 143 RMIDs. So it only takes 142 mkdir/rmdir invocations to move all
RMIDs to the limbo list. On the next mkdir invocation the allocation goes
into the limbo path and the SMP function call has to walk the list with 142
entries on ALL online domains whether they used the RMID or not!


Would it be better if we do this in the MBM 1s overflow timer delayed_work? That 
is not in interrupt context. So we do a periodic flush of the limbo list, and 
then mkdir fails with -EBUSY if list_empty(&free_list) && 
!list_empty(&limbo_list).

To improve that, we may also include the optimization Tony suggested, to 
skip the checks for RMIDs which have already been checked to be < threshold 
(however that needs a domain mask like I mention below, but maybe we can just 
check the list here).
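
A sketch of that worker variant, assuming helper names like __check_limbo()
and has_busy_rmid() (illustrative, not the posted patch): the limbo scan runs
in process context under rdtgroup_mutex and re-arms itself only while busy
RMIDs remain in the domain.

	static void cqm_handle_limbo(struct work_struct *work)
	{
		unsigned long delay = msecs_to_jiffies(1000);
		struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
		struct rdt_domain *d;

		mutex_lock(&rdtgroup_mutex);

		d = container_of(work, struct rdt_domain, cqm_limbo.work);
		/* move RMIDs with occupancy < threshold back to the free list */
		__check_limbo(d);

		if (has_busy_rmid(r, d))
			schedule_delayed_work_on(smp_processor_id(),
						 &d->cqm_limbo, delay);

		mutex_unlock(&rdtgroup_mutex);
	}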




That's bad enough already and the number of RMIDs will not become smaller;
it doubled from HSW to BDW ...

The HPC and RT folks will love you for that - NOT!

So this needs to be solved differently.

Let's have a look at the context switch path first. That's the most
sensitive part of it.

if (static_branch_likely(&rdt_mon_enable_key)) {
if (current->rmid)
newstate.rmid = current->rmid;
}

That's optimized for the !monitoring case. So we can really penalize the
per task monitoring case.

if (static_branch_likely(&rdt_mon_enable_key)) {
if (unlikely(current->rmid)) {
newstate.rmid = current->rmid;
__set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
}
}

Now in rmid_free() we can collect that information:

cpumask_clear(tmpmask);
cpumask_clear(rmid_entry->mask);

cpus_read_lock();
for_each_online_cpu(cpu) {
if (test_and_clear_bit(rmid, per_cpu_ptr(cpu, rmid_bitmap)))
cpumask_set(cpu, tmpmask);
}

for_each_domain(d, resource) {
cpu = cpumask_any_and(d->cpu_mask, tmpmask);
if (cpu < nr_cpu_ids)
cpumask_set(cpu, rmid_entry->mask);


When this cpu goes offline the rmid_entry->mask needs an update. Otherwise 
the work function would skip the entry, because

 if (!cpumask_test_cpu(cpu, rme->mask))

evaluates true once the work has been moved to a different cpu.

So we really need a package mask? Or really a per-domain mask, and for that we 
don't know the max domain number (which is why we use a list..)



}

list_add(&rmid_entry->list, &limbo_list);

for_each_cpu(cpu, rmid_entry->mask)
schedule_delayed_work_on(cpu, rmid_work);
cpus_read_unlock();

The work function:

bool resched = false;

list_for_each_entry(rme, &limbo_list, ...) {
if (!cpumask_test_cpu(cpu, rme->mask))
continue;

if (!rmid_is_reusable(rme)) {
resched = true;
continue;
}

cpumask_clear_cpu(cpu, rme->mask);
if (!cpumask_empty(rme->mask))
continue;

/* Ready for reuse */
list_del(&rme->list);
list_add(&rme->list, &free_list);
}

The alloc function then becomes:

if (list_empty(&free_list))
return list_empty(&limbo_list) ? -ENOSPC : -EBUSY;

The switch_to() covers the task rmids. The per cpu default rmids can be
marked at the point where they are installed on a CPU in the per cpu
rmid_bitmap. The free path is the same for per task and per cpu.

Another thing which needs some thought is the CPU hotplug code. We need to
make sure that pending work which is scheduled on an outgoing CPU is moved
in the offline callback to a still online CPU of the same domain and not
moved to some random CPU by the workqueue hotplug code.

There is another subtle issue. Assume a RMID is freed. The limbo stuff is
scheduled on all domains which have online CPUs.

Now the 

Re: [PATCH 14/21] x86/intel_rdt/cqm: Add mon_data

2017-07-11 Thread Shivappa Vikas



On Thu, 6 Jul 2017, Thomas Gleixner wrote:


On Thu, 6 Jul 2017, Shivappa Vikas wrote:

On Sun, 2 Jul 2017, Thomas Gleixner wrote:

+static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
+{
+   u64 tval;
+
+   tval = __rmid_read(rmid, rr->evtid);
+   if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
+   rr->val = tval;
+   return false;
+   }
+   switch (rr->evtid) {
+   case QOS_L3_OCCUP_EVENT_ID:
+   rr->val += tval;
+   return true;
+   default:
+   return false;


I have no idea what that return code means.


false for the invalid event id and all errors from __rmid_read (IOW all errors
for __mon_event_count)


Sure, but why bool? What's wrong with proper error return codes, so issues
can be distinguished and potentially propagated in the callchain?


Ok, the error is propagated with rr->val actually. Is this better?

Hardware throws the RMID_VAL_ERROR (bit 63) when an invalid RMID or 
event is written to event select - this case seems similar.


default:
rr->val = RMID_VAL_ERROR;
return -EINVAL;
}
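
Fleshed out a bit, the variant with proper error codes could look like this
(a sketch; -EIO for the hardware error case is an assumption):

	static int __mon_event_count(u32 rmid, struct rmid_read *rr)
	{
		u64 tval = __rmid_read(rmid, rr->evtid);

		if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
			/* raw error bits still travel to the reader in rr->val */
			rr->val = tval;
			return -EIO;
		}

		switch (rr->evtid) {
		case QOS_L3_OCCUP_EVENT_ID:
			rr->val += tval;
			return 0;
		default:
			/* unknown event: flag it like a hardware error */
			rr->val = RMID_VAL_ERROR;
			return -EINVAL;
		}
	}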

Thanks,
Vikas



Thanks,

tglx







Re: [PATCH 21/21] x86/intel_rdt/mbm: Handle counter overflow

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:

+static void mbm_update(struct rdt_domain *d, int rmid)
+{
+   struct rmid_read rr;
+
+   rr.first = false;
+   rr.d = d;
+
+   if (is_mbm_total_enabled()) {
+   rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
+   __mon_event_count(rmid, &rr);


This is broken as it is not protected against a concurrent read from user
space which comes in via a smp function call.


The read from user space also holds the rdtgroup_mutex.

Thanks,
Vikas



This means both the internal state and __rmid_read() are unprotected.

I'm not sure whether it's enough to disable interrupts around
__mon_event_count(), but that's the minimal protection required. It's
definitely good enough for __rmid_read(), but it might not be sufficient
for protecting domain->mbm_[local|total]. I leave the exercise of figuring
that out to you.

Thanks,

tglx
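
The minimal protection described above would be something like the sketch
below (whether this also suffices for the domain's mbm_total/mbm_local state
is exactly the open question):

	static void mbm_update(struct rdt_domain *d, int rmid)
	{
		struct rmid_read rr = { .first = false, .d = d };
		unsigned long flags;

		/* keep __rmid_read() atomic vs. the smp-function-call reader */
		local_irq_save(flags);
		if (is_mbm_total_enabled()) {
			rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
			__mon_event_count(rmid, &rr);
		}
		local_irq_restore(flags);
	}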



Re: [PATCH 19/21] x86/intel_rdt/mbm: Basic counting of MBM events (total and local)

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:

INIT_LIST_HEAD(&r->evt_list);

if (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID))
list_add_tail(&llc_occupancy_event.list, &r->evt_list);
+   if (is_mbm_total_enabled())
+   list_add_tail(&mbm_total_event.list, &r->evt_list);
+   if (is_mbm_local_enabled())
+   list_add_tail(&mbm_local_event.list, &r->evt_list);


Confused. This hooks all monitoring features to RDT_RESOURCE_L3. Why?


They are really L3 resource events as per the spec:
CPUID.(EAX=0FH, ECX=0):EDX.L3[bit 1] = 1 indicates L3 monitoring, and we query 
the llc_occupancy, L3 total and local b/w events with the same resource id 1.




Thanks,

tglx





Re: [PATCH 17/21] x86/intel_rdt/cqm: Add sched_in support

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:

 DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);


Please make this a two stage change. Add rdt_enable_key first and then the
monitoring stuff. Ideally you introduce rdt_enable_key here and in the
control code in one go.


+static void __intel_rdt_sched_in(void)
 {
-   if (static_branch_likely(&rdt_alloc_enable_key)) {
-   struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-   int closid;
+   struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+   u32 closid = 0;
+   u32 rmid = 0;

+   if (static_branch_likely(&rdt_alloc_enable_key)) {
/*
 * If this task has a closid assigned, use it.
 * Else use the closid assigned to this cpu.
@@ -55,14 +59,31 @@ static inline void intel_rdt_sched_in(void)
closid = current->closid;
if (closid == 0)
closid = this_cpu_read(cpu_closid);
+   }
+
+   if (static_branch_likely(&rdt_mon_enable_key)) {
+   /*
+* If this task has a rmid assigned, use it.
+* Else use the rmid assigned to this cpu.
+*/
+   rmid = current->rmid;
+   if (rmid == 0)
+   rmid = this_cpu_read(cpu_rmid);
+   }

-   if (closid != state->closid) {
-   state->closid = closid;
-   wrmsr(IA32_PQR_ASSOC, state->rmid, closid);
-   }
+   if (closid != state->closid || rmid != state->rmid) {
+   state->closid = closid;
+   state->rmid = rmid;
+   wrmsr(IA32_PQR_ASSOC, rmid, closid);


This can be written smarter.

struct intel_pqr_state newstate = this_cpu_read(rdt_cpu_default);
struct intel_pqr_state *curstate = this_cpu_ptr(&pqr_state);

if (static_branch_likely(&rdt_alloc_enable_key)) {
if (current->closid)
newstate.closid = current->closid;
}

if (static_branch_likely(&rdt_mon_enable_key)) {
if (current->rmid)
newstate.rmid = current->rmid;
}

if (newstate != *curstate) {
*curstate = newstate;
wrmsr(IA32_PQR_ASSOC, newstate.rmid, newstate.closid);
}

The unconditional read of rdt_cpu_default is the right thing to do because
the default behaviour is exactly this.


Ok makes sense. Will fix.

Thanks,
Vikas


Thanks,

tglx
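
One note on the suggestion: C has no comparison operator for structs, and
this_cpu_read() only handles scalar-sized types, so the real version needs a
pointer copy and a field-wise compare. A sketch, assuming the rdt_cpu_default
per-cpu defaults from above:

	static inline void __intel_rdt_sched_in(void)
	{
		struct intel_pqr_state newstate = *this_cpu_ptr(&rdt_cpu_default);
		struct intel_pqr_state *curstate = this_cpu_ptr(&pqr_state);

		if (static_branch_likely(&rdt_alloc_enable_key) && current->closid)
			newstate.closid = current->closid;

		if (static_branch_likely(&rdt_mon_enable_key) && current->rmid)
			newstate.rmid = current->rmid;

		if (newstate.closid != curstate->closid ||
		    newstate.rmid != curstate->rmid) {
			*curstate = newstate;
			wrmsr(IA32_PQR_ASSOC, newstate.rmid, newstate.closid);
		}
	}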






Re: [PATCH 16/21] x86/intel_rdt/cqm: Add mount,umount support

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:


list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
+   /* Free any child rmids */
+   llist = &rdtgrp->crdtgrp_list;
+   list_for_each_entry_safe(sentry, stmp, llist, crdtgrp_list) {
+   free_rmid(sentry->rmid);
+   list_del(&sentry->crdtgrp_list);
+   kfree(sentry);
+   }


I'm pretty sure, that I've seen exactly this code sequence already. Please
create a helper instead of copying stuff over and over.


That's right, the same is done during rmdir of a ctrl_mon group, which deletes 
all its child mon groups. Will fix.


Thanks,
Vikas


Thanks,

tglx
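
The shared helper could look like this sketch (name illustrative), callable
from both the umount path quoted above and rmdir of a ctrl_mon group:

	static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
	{
		struct rdtgroup *sentry, *stmp;
		struct list_head *head = &rdtgrp->crdtgrp_list;

		list_for_each_entry_safe(sentry, stmp, head, crdtgrp_list) {
			free_rmid(sentry->rmid);
			list_del(&sentry->crdtgrp_list);
			kfree(sentry);
		}
	}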



Re: [PATCH 15/21] x86/intel_rdt/cqm: Add rmdir support

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:


Resource groups (ctrl_mon and monitor groups) are represented by
directories in resctrl fs. Add support to remove the directories.


Again. Please split that patch into two parts; separate the ctrl stuff from 
rmdir and then add monitoring support.


+   rdtgrp->flags = RDT_DELETED;
+   free_rmid(rdtgrp->rmid);
+
+   /*
+* Remove your rmid from the parent ctrl groups list


You are not removing a rmid. You remove the group from the parents group
list. Please be more accurate with your comments. Wrong comments are worse
than no comments.


+   WARN_ON(list_empty(&prgrp->crdtgrp_list));
+   list_del(&rdtgrp->crdtgrp_list);



+static int rdtgroup_rmdir_ctrl(struct kernfs_node *kn, struct rdtgroup *rdtgrp)
+{
+   int cpu, closid = rdtgroup_default.closid;
+   struct rdtgroup *entry, *tmp;
+   struct list_head *llist;


*head please.


+   cpumask_var_t tmpmask;
+
+   if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+   return -ENOMEM;


Allocation/free can be done at the call site for both functions.


+static int rdtgroup_rmdir(struct kernfs_node *kn)
+{
+   struct kernfs_node *parent_kn = kn->parent;
+   struct rdtgroup *rdtgrp;
+   int ret = 0;
+
+   rdtgrp = rdtgroup_kn_lock_live(kn);
+   if (!rdtgrp) {
+   ret = -EPERM;
+   goto out;
+   }
+
+   if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn)
+   ret = rdtgroup_rmdir_ctrl(kn, rdtgrp);
+   else if (rdtgrp->type == RDTMON_GROUP &&
+!strcmp(parent_kn->name, "mon_groups"))
+   ret = rdtgroup_rmdir_mon(kn, rdtgrp);
+   else
+   ret = -EPERM;


Like in the other patch, please make this parseable.


Will fix all..

Thanks,
Vikas



Thanks,

tglx
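
A more parseable shape of the rmdir dispatch, as a sketch:

	static int rdtgroup_rmdir(struct kernfs_node *kn)
	{
		struct kernfs_node *parent_kn = kn->parent;
		struct rdtgroup *rdtgrp;
		int ret = -EPERM;

		rdtgrp = rdtgroup_kn_lock_live(kn);
		if (!rdtgrp) {
			rdtgroup_kn_unlock(kn);
			return -EPERM;
		}

		/* ctrl_mon groups live directly below the root directory */
		if (rdtgrp->type == RDTCTRL_GROUP &&
		    parent_kn == rdtgroup_default.kn)
			ret = rdtgroup_rmdir_ctrl(kn, rdtgrp);

		/* monitor groups live below <group>/mon_groups/ */
		if (rdtgrp->type == RDTMON_GROUP &&
		    !strcmp(parent_kn->name, "mon_groups"))
			ret = rdtgroup_rmdir_mon(kn, rdtgrp);

		rdtgroup_kn_unlock(kn);
		return ret;
	}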



Re: [PATCH 14/21] x86/intel_rdt/cqm: Add mon_data

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:


Add a mon_data directory for the root rdtgroup and all other rdtgroups.
The directory holds all of the monitored data for all domains and events
of all resources being monitored.


Again. This does two things at once. Move the existing code to a new file
and add the monitoring stuff. Please split it apart.


Will fix.




+static bool __mon_event_count(u32 rmid, struct rmid_read *rr)
+{
+   u64 tval;
+
+   tval = __rmid_read(rmid, rr->evtid);
+   if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
+   rr->val = tval;
+   return false;
+   }
+   switch (rr->evtid) {
+   case QOS_L3_OCCUP_EVENT_ID:
+   rr->val += tval;
+   return true;
+   default:
+   return false;


I have no idea what that return code means.


false for the invalid event id and all errors from __rmid_read (IOW all errors 
for __mon_event_count)





+   }
+}
+
+void mon_event_count(void *info)


Some explanation why this is a void pointer and how that function is called
(I assume it's via IPI) would be appreciated.


+{
+   struct rdtgroup *rdtgrp, *entry;
+   struct rmid_read *rr = info;
+   struct list_head *llist;


 *head;


+
+   rdtgrp = rr->rgrp;
+
+   if (!__mon_event_count(rdtgrp->rmid, rr))
+   return;
+
+   /*
+* For Ctrl groups read data from child monitor groups.
+*/
+   llist = &rdtgrp->crdtgrp_list;
+
+   if (rdtgrp->type == RDTCTRL_GROUP) {
+   list_for_each_entry(entry, llist, crdtgrp_list) {
+   if (!__mon_event_count(entry->rmid, rr))
+   return;
+   }
+   }
+}



+static int get_rdt_resourceid(struct rdt_resource *r)
+{
+   if (r > (rdt_resources_all + RDT_NUM_RESOURCES - 1) ||
+   r < rdt_resources_all ||
+   ((r - rdt_resources_all) % sizeof(struct rdt_resource)))
+   return -EINVAL;


If that ever happens, then you have other problems than a wrong pointer.


+
+   return ((r - rdt_resources_all) / sizeof(struct rdt_resource));


Moo. Can't you simply put an index field into struct rdt_resource,
initialize it with the resource ID and use that?


Ok will fix all above,

thanks,
Vikas



Thanks,

tglx
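
The index-field variant is a one-liner at use sites, as a sketch (rid as the
field name is an assumption):

	struct rdt_resource {
		int	rid;	/* set to the RDT_RESOURCE_* enum value at init */
		/* ... */
	};

	static int get_rdt_resourceid(struct rdt_resource *r)
	{
		return r->rid;
	}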



Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:

-static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
-  char *buf, size_t nbytes, loff_t off)
+static ssize_t cpus_mon_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ struct rdtgroup *rdtgrp)


Again. Please make the split of rdtgroup_cpus_write() as a seperate
preparatory change first and just move the guts of the existing write
function out into cpus_ctrl_write() and then add the mon_write stuff as an
extra patch.


 {
+   struct rdtgroup *pr = rdtgrp->parent, *cr;


*pr and *cr really suck.


cpumask_var_t tmpmask, newmask;
-   struct rdtgroup *rdtgrp, *r;
+   struct list_head *llist;
int ret;

-   if (!buf)
-   return -EINVAL;
-
if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
return -ENOMEM;
if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) {
@@ -233,10 +235,89 @@ static ssize_t rdtgroup_cpus_write(struct 
kernfs_open_file *of,
return -ENOMEM;
}

-   rdtgrp = rdtgroup_kn_lock_live(of->kn);
-   if (!rdtgrp) {
-   ret = -ENOENT;
-   goto unlock;
+   if (is_cpu_list(of))
+   ret = cpulist_parse(buf, newmask);
+   else
+   ret = cpumask_parse(buf, newmask);


The cpumask allocation and parsing of the user buffer can be done in the
common code. No point in duplicating that.


+
+   if (ret)
+   goto out;
+
+   /* check that user didn't specify any offline cpus */
+   cpumask_andnot(tmpmask, newmask, cpu_online_mask);
+   if (cpumask_weight(tmpmask)) {
+   ret = -EINVAL;
+   goto out;
+   }


Common code.


Will fix all above




+   /* Check whether cpus belong to parent ctrl group */
+   cpumask_andnot(tmpmask, newmask, &pr->cpu_mask);
+   if (cpumask_weight(tmpmask)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   /* Check whether cpus are dropped from this group */
+   cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask);
+   if (cpumask_weight(tmpmask)) {
+   /* Give any dropped cpus to parent rdtgroup */
+   cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);


This does not make any sense. The check above verifies that all cpus in
newmask belong to the parent->cpu_mask. If they don't then you return
-EINVAL, but here you give them back to parent->cpu_mask. How is that
supposed to work? You never get into this code path!


The parent->cpu_mask is always the parent->cpus_valid_mask if I understand 
right. With monitor groups, a cpu is always present in "one" 
ctrl_mon group and one mon_group. And the mon group can have only cpus in its 
parent. Maybe it needs a comment? (it's explained in the documentation patch).


# mkdir /sys/fs/resctrl/p1
# mkdir /sys/fs/resctrl/p1/mon_groups/m1
# echo 5-10 > /sys/fs/resctr/p1/cpus_list
Say p1 has RMID 2
cpus 5-10 have RMID 2

# echo 5-6 > /sys/fs/resctrl/p1/mon_groups/m1/cpus_list
cpus 5-6 have RMID 3
cpus 7-10 have RMID 2

# cat /sys/fs/resctrl/p1/cpus_list
5-10

This is because when we query the data for p1 it adds its own data (RMID 2) and 
all the data for its child mon groups (hence all cpus from 5-10).


But
 >> + cpumask_or(&pr->cpu_mask, &pr->cpu_mask, tmpmask);
can be removed because, like you suggest, it does nothing as the parent already 
has these cpus. We just need the update_closid_rmid(tmpmask, pr).




So you need a seperate mask in the parent rdtgroup to store the CPUs which
are valid in any monitoring group which belongs to it. So the logic
becomes:

/*
 * Check whether the CPU mask is a subset of the CPUs
 * which belong to the parent group.
 */
cpumask_andnot(tmpmask, newmask, parent->cpus_valid_mask);
if (cpumask_weight(tmpmask))
return -EINVAL;

When CAT is not available, then parent->cpus_valid_mask is a pointer to
cpu_online_mask. When CAT is enabled, then parent->cpus_valid_mask is a
pointer to the CAT group cpu mask.


When CAT is unavailable we cannot create any ctrl_mon groups.




+   update_closid_rmid(tmpmask, pr);
+   }
+
+   /*
+* If we added cpus, remove them from previous group that owned them
+* and update per-cpu rmid
+*/
+   cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask);
+   if (cpumask_weight(tmpmask)) {
+   llist = &rdtgrp->crdtgrp_list;


llist is a bad name. We have a facility llist, i.e. lockless list. head ?


Will fix.




+   list_for_each_entry(cr, llist, crdtgrp_list) {
+   if (cr == rdtgrp)
+   continue;
+   cpumask_andnot(&cr->cpu_mask, &cr->cpu_mask, tmpmask);
+   }
+   update_closid_rmid(tmpmask, rdtgrp);
+   }



+static void 
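
A sketch of the subset check with the suggested cpus_valid_mask (the field,
the helper name and its shape are assumptions following the outline above):

	static int cpus_mon_validate(struct rdtgroup *prgrp,
				     cpumask_var_t newmask,
				     cpumask_var_t tmpmask)
	{
		/*
		 * newmask must be a subset of the CPUs owned by the parent
		 * group; cpus_valid_mask points to cpu_online_mask when CAT
		 * is not in play, else to the CAT group's cpu mask.
		 */
		cpumask_andnot(tmpmask, newmask, prgrp->cpus_valid_mask);
		if (!cpumask_empty(tmpmask))
			return -EINVAL;

		return 0;
	}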

Re: [PATCH 13/21] x86/intel_rdt/cqm: Add cpus file support

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index fdf3654..fec8ba9 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -37,6 +37,8 @@ struct mon_evt {
 extern bool rdt_alloc_enabled;
 extern int rdt_mon_features;

+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_rmid);


u32



+DEFINE_PER_CPU_READ_MOSTLY(int, cpu_rmid);
 static inline struct rmid_entry *__rmid_entry(u32 rmid)


Bah. Please add a new line between the DEFINE... and the function.

But that whole thing is wrong. The per cpu default closid and rmid want to
be in a single place, not in two distinct per cpu variables.

struct rdt_cpu_default {
  u32   rmid;
  u32   closid;
};

DEFINE_PER_CPU_READ_MOSTLY(struct rdt_cpu_default, rdt_cpu_default);

or something like this. That way it's guaranteed that the context switch
code touches a single cache line for the per cpu defaults.


Will fix and add both rmid and closid into common struct.

Thanks,
Vikas



Thanks,

tglx




Re: [PATCH 12/21] x86/intel_rdt/cqm: Add tasks file support

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:

@@ -866,6 +866,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_INTEL_RDT
int closid;
+   u32 rmid;


Can you please make a preparatory change which makes closid an u32 as well?
We should have done that in the first place, but in hindsight we are always
smarter...


Ok, makes sense. Will fix. I think Fenghua or David had suggested this but I 
missed it.


Thanks,
Vikas



Thanks,

tglx



Re: [PATCH 11/21] x86/intel_rdt/cqm: Add mkdir support for RDT monitoring

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:

+/*
+ * Common code for ctrl_mon and monitor group mkdir.
+ * The caller needs to unlock the global mutex upon success.
+ */
+static int mkdir_rdt_common(struct kernfs_node *pkn, struct kernfs_node *prkn,


pkn and prkn are horrible to distinguish. What's wrong with keeping
*parent_kn and have *kn as the new thing?


prkn is always the kn for the parent rdtgroup, whereas pkn is the parent kn. 
Maybe parent_kn and parent_kn_rdtgrp? Wanted to make it shorter.





+   const char *name, umode_t mode,
+   enum rdt_group_type rtype, struct rdtgroup **r)
 {


Can you please split out that mkdir_rdt_common() change into a separate
patch? It can be done as a preparatory stand alone change just for the
existing rdt group code. Then the monitoring add ons come on top of it.


-   struct rdtgroup *parent, *rdtgrp;
+   struct rdtgroup *prgrp, *rdtgrp;
struct kernfs_node *kn;
-   int ret, closid;
-
-   /* Only allow mkdir in the root directory */
-   if (parent_kn != rdtgroup_default.kn)
-   return -EPERM;
-
-   /* Do not accept '\n' to avoid unparsable situation. */
-   if (strchr(name, '\n'))
-   return -EINVAL;
+   uint fshift = 0;
+   int ret;

-   parent = rdtgroup_kn_lock_live(parent_kn);
-   if (!parent) {
+   prgrp = rdtgroup_kn_lock_live(prkn);
+   if (!prgrp) {
ret = -ENODEV;
goto out_unlock;
}

-   ret = closid_alloc();
-   if (ret < 0)
-   goto out_unlock;
-   closid = ret;
-
/* allocate the rdtgroup. */
rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
if (!rdtgrp) {
ret = -ENOSPC;
-   goto out_closid_free;
+   goto out_unlock;
}
-   rdtgrp->closid = closid;
-   list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
+   *r = rdtgrp;
+   rdtgrp->parent = prgrp;
+   rdtgrp->type = rtype;
+   INIT_LIST_HEAD(&rdtgrp->crdtgrp_list);

/* kernfs creates the directory for rdtgrp */
-   kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
+   kn = kernfs_create_dir(pkn, name, mode, rdtgrp);
if (IS_ERR(kn)) {
ret = PTR_ERR(kn);
goto out_cancel_ref;
@@ -1138,27 +1166,138 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
if (ret)
goto out_destroy;

-   ret = rdtgroup_add_files(kn, RF_CTRL_BASE);
+   fshift = 1 << (RF_CTRLSHIFT + rtype);
+   ret = rdtgroup_add_files(kn, RFTYPE_BASE | fshift);



I'd rather make this:

files = RFTYPE_BASE | (1U << (RF_CTRLSHIFT + rtype));
ret = rdtgroup_add_files(kn, files);


if (ret)
goto out_destroy;

+   if (rdt_mon_features) {
+   ret = alloc_rmid();
+   if (ret < 0)
+   return ret;
+
+   rdtgrp->rmid = ret;
+   }
kernfs_activate(kn);

-   ret = 0;
-   goto out_unlock;


What unlocks prkn now? The caller, right? Please add a comment ...


+   return 0;

 out_destroy:
kernfs_remove(rdtgrp->kn);
 out_cancel_ref:
-   list_del(&rdtgrp->rdtgroup_list);
kfree(rdtgrp);
-out_closid_free:
+out_unlock:
+   rdtgroup_kn_unlock(prkn);
+   return ret;
+}
+
+static void mkdir_rdt_common_clean(struct rdtgroup *rgrp)
+{
+   kernfs_remove(rgrp->kn);
+   if (rgrp->rmid)
+   free_rmid(rgrp->rmid);


Please put that conditional into free_rmid().


Will fix all above.
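
For the free_rmid() part, something like this should do (just a sketch; assumes
RMID 0 is reserved for the default group and never handed out by alloc_rmid()):

	void free_rmid(u32 rmid)
	{
		struct rmid_entry *entry;

		/* Nothing was allocated for this group */
		if (!rmid)
			return;

		entry = __rmid_entry(rmid);
		/* Sketch: the real code may need to park dirty RMIDs in limbo first */
		list_add_tail(&entry->list, &rmid_free_lru);
	}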




+   kfree(rgrp);
+}



+static int rdtgroup_mkdir(struct kernfs_node *pkn, const char *name,
+ umode_t mode)
+{
+   /* Do not accept '\n' to avoid unparsable situation. */
+   if (strchr(name, '\n'))
+   return -EINVAL;
+
+   /*
+* We don't allow rdtgroup ctrl_mon directories to be created anywhere
+* except the root directory and don't allow rdtgroup monitor
+* directories to be created anywhere except inside the mon_groups
+* directory.
+*/
+   if (rdt_alloc_enabled && pkn == rdtgroup_default.kn)
+   return rdtgroup_mkdir_ctrl_mon(pkn, pkn, name, mode);
+   else if (rdt_mon_features &&
+!strcmp(pkn->name, "mon_groups"))
+   return rdtgroup_mkdir_mon(pkn, pkn->parent, name, mode);
+   else
+   return -EPERM;


TBH, this is really convoluted (including the comment).

/*
 * If the parent directory is the root directory and RDT
 * allocation is supported, add a control and monitoring
 * subdirectory.
 */
if (rdt_alloc_capable && parent_kn == rdtgroup_default.kn)
return rdtgroup_mkdir_ctrl_mon(...);

/*
 * If the parent directory is a monitoring group and RDT
 * monitoring 

Re: [PATCH 09/21] x86/intel_rdt: Simplify info and base file lists

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:

@@ -82,6 +82,7 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval  = parse_cbm,
.format_str = "%d=%0*x",
+   .fflags = RFTYPE_RES_CACHE,
},


Can you please convert that array to use explicit array member
initializers? I've noticed this back when I reviewed the initial RDT
implementation, but it somehow escaped. i.e.:

 [RESOURCE_ID] =
 {
.
 }


Will fix.
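
E.g. roughly like this (sketch; only a couple of fields shown for illustration):

	struct rdt_resource rdt_resources_all[] = {
		[RDT_RESOURCE_L3] = {
			.name		= "L3",
			.cache_level	= 3,
			.format_str	= "%d=%0*x",
			.fflags		= RFTYPE_RES_CACHE,
		},
		[RDT_RESOURCE_MBA] = {
			.name		= "MB",
			.fflags		= RFTYPE_RES_MB,
		},
	};

That way a new resource cannot silently end up at the wrong index.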
Thanks,
Vikas



Thanks,

tglx




Re: [PATCH 07/21] x86/intel_rdt/cqm: Add RDT monitoring initialization

2017-07-06 Thread Shivappa Vikas



On Sun, 2 Jul 2017, Thomas Gleixner wrote:


On Mon, 26 Jun 2017, Vikas Shivappa wrote:

+/*
+ * Global boolean for rdt_alloc which is true if any
+ * resource allocation is enabled.
+ */
+bool rdt_alloc_enabled;


That should be rdt_alloc_capable. It's not enabled at probe time. Probing
merely detects the capability. That mirrors the capable/enabled bits in the
rdt resource struct.


 static void
 mba_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
 static void
@@ -230,7 +236,7 @@ static bool rdt_get_mem_config(struct rdt_resource *r)
return true;
 }

-static void rdt_get_cache_config(int idx, struct rdt_resource *r)
+static void rdt_get_cache_alloc_config(int idx, struct rdt_resource *r)
 {
union cpuid_0x10_1_eax eax;
union cpuid_0x10_x_edx edx;
@@ -422,7 +428,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

d->id = id;

-   if (domain_setup_ctrlval(r, d)) {
+   if (r->alloc_capable && domain_setup_ctrlval(r, d)) {


This should be done in the name space cleanup patch or in a separate one.


kfree(d);
return;
}
@@ -513,34 +519,39 @@ static __init void rdt_init_padding(void)

 static __init bool get_rdt_resources(void)
 {
-   bool ret = false;
-
if (cache_alloc_hsw_probe())
-   return true;
+   rdt_alloc_enabled = true;

-   if (!boot_cpu_has(X86_FEATURE_RDT_A))
+   if ((!rdt_alloc_enabled && !boot_cpu_has(X86_FEATURE_RDT_A)) &&
+   !boot_cpu_has(X86_FEATURE_CQM))
return false;

+   if (boot_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
+   rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);


Instead of artificially cramming the CQM bits into this function, it
would be cleaner to leave that function alone, rename it to

 get_rdt_alloc_resources()

and have a new function

 get_rdt_mon_resources()

and handle the aggregation at the call site.

rdt_alloc_capable = get_rdt_alloc_resources();
rdt_mon_capable = get_rdt_mon_resources();

if (!rdt_alloc_capable && !rdt_mon_capable)
return -ENODEV;

I'd make both variables boolean and have rdt_mon_features as a separate
one, which carries the actual available feature bits. This is neither
hotpath nor are we in a situation where we need to spare the last 4 bytes of
memory. Clean separation of code and functionality is more important.
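
A sketch of that split (illustrative only; the real probe functions would still
fill in the per resource data):

	static __init bool get_rdt_alloc_resources(void)
	{
		if (cache_alloc_hsw_probe())
			return true;

		return boot_cpu_has(X86_FEATURE_RDT_A);
	}

	static __init bool get_rdt_mon_resources(void)
	{
		if (boot_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
			rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
		if (boot_cpu_has(X86_FEATURE_CQM_MBM_TOTAL))
			rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID);
		if (boot_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
			rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);

		return rdt_mon_features != 0;
	}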


+/*
+ * Event IDs are used to program IA32_QM_EVTSEL before reading event
+ * counter from IA32_QM_CTR
+ */
+#define QOS_L3_OCCUP_EVENT_ID  0x01
+#define QOS_L3_MBM_TOTAL_EVENT_ID  0x02
+#define QOS_L3_MBM_LOCAL_EVENT_ID  0x03
+
+/**
+ * struct mon_evt - Entry in the event list of a resource
+ * @evtid: event id
+ * @name:  name of the event
+ */
+struct mon_evt {
+   u32                 evtid;
+   char                *name;
+   struct list_head    list;
+};
+
+extern unsigned int intel_cqm_threshold;
+extern bool rdt_alloc_enabled;
+extern int rdt_mon_features;


Please do not use 'int' for variables which contain bit flags. unsigned int
is the proper choice here.


+struct rmid_entry {
+   u32 rmid;
+   enum rmid_recycle_state state;
+   struct list_head list;


Please make it tabular as you did with mon_evt and other structs.


+};
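
i.e. with the member names aligned, like (sketch):

	struct rmid_entry {
		u32				rmid;
		enum rmid_recycle_state		state;
		struct list_head		list;
	};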
+
+/**
+ * @rmid_free_lru   A least recently used list of free RMIDs
+ * These RMIDs are guaranteed to have an occupancy less than the
+ * threshold occupancy
+ */
+static struct list_head    rmid_free_lru;
+
+/**
+ * @rmid_limbo_lru   list of currently unused but (potentially)
+ * dirty RMIDs.
+ * This list contains RMIDs that no one is currently using but that
+ * may have an occupancy value > intel_cqm_threshold. User can change
+ * the threshold occupancy value.
+ */
+static struct list_head    rmid_limbo_lru;
+
+/**
+ * @rmid_entry - The entry in the limbo and free lists.
+ */
+static struct rmid_entry   *rmid_ptrs;
+
+/*
+ * Global boolean for rdt_monitor which is true if any


Boolean !?!?!


+ * resource monitoring is enabled.
+ */
+int rdt_mon_features;
+
+/*
+ * This is the threshold cache occupancy at which we will consider an
+ * RMID available for re-allocation.
+ */
+unsigned int intel_cqm_threshold;
+
+static inline struct rmid_entry *__rmid_entry(u32 rmid)
+{
+   struct rmid_entry *entry;
+
+   entry = &rmid_ptrs[rmid];
+   WARN_ON(entry->rmid != rmid);
+
+   return entry;
+}
+
+static int dom_data_init(struct rdt_resource *r)
+{
+   struct rmid_entry *entry = NULL;
+   int i = 0, nr_rmids;
+
+   INIT_LIST_HEAD(&rmid_free_lru);
+   INIT_LIST_HEAD(&rmid_limbo_lru);


You can spare that by declaring the list head with

static LIST_HEAD(rmid_xxx_lru);


+
+   nr_rmids = r->num_rmid;
+   rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
+   if (!rmid_ptrs)
+   return -ENOMEM;
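
For context, the rest of dom_data_init() would then seed the free list, roughly
like this (sketch; assumes RMID 0 stays reserved for the default group):

	for (i = 0; i < nr_rmids; i++) {
		entry = &rmid_ptrs[i];
		INIT_LIST_HEAD(&entry->list);
		entry->rmid = i;

		/* RMID 0 is special and stays with the default group */
		if (i)
			list_add_tail(&entry->list, &rmid_free_lru);
	}

	return 0;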

Re: [PATCH 0/1] x86/cqm: Cqm3 Design Documentation

2017-05-22 Thread Shivappa Vikas



On Mon, 22 May 2017, Thomas Gleixner wrote:


On Mon, 17 Apr 2017, Shivappa Vikas wrote:



Hello Thomas,

Wanted to know if you had any feedback on the new interface for cqm design
which is based on monitoring the resctrl groups. It tries to address all the
requirements discussed in:

https://marc.info/?l=linux-kernel&m=148891934720489

-basically to monitor resctrl groups / per thread including kernel threads.
-monitor thread from creation (a forked child thread also belongs to the same
group as its parent)
-monitor subset of tasks in a resctrl group.
-monitor subset of cpus in a resctrl group.
-monitor different events (however the first version only shows llc_occupancy)
-monitor all domains and show per domain data.

I have a patch series but was thinking to send it after the design document is
reviewed so that it's easy to review and I can make changes that are discussed
here.


Send it along. This looks sane. I've wanted to reply long ago, but forgot
about it again. Sorry for the delay.


Ok - will do. Expecting to send by next week or so.

Thanks,
Vikas



Thanks,

tglx



Re: [PATCH] perf/x86/intel/cqm: Make sure the head event of cache_groups always has valid RMID

2017-05-22 Thread Shivappa Vikas




Besides there's another bug that we retry rotating without resetting
nr_needed and start in __intel_cqm_rmid_rotate().

Those bugs combined together led to the following oops.

WARNING: at arch/x86/kernel/cpu/perf_event_intel_cqm.c:186 __put_rmid+0x28/0x80()
...
 [] __put_rmid+0x28/0x80
 [] intel_cqm_rmid_rotate+0xba/0x440
 [] process_one_work+0x17b/0x470
 [] worker_thread+0x11b/0x400
...
BUG: unable to handle kernel NULL pointer dereference at   (null)


Yes, I saw these issues in the RMID rotation, and also that it even rotated the
rmids which were in use before the limbo ones. That did not crash, but it just
made it output weird data. Like David mentioned, that's why we started
writing the new patches.



...
 [] intel_cqm_rmid_rotate+0xba/0x440
 [] process_one_work+0x17b/0x470
 [] worker_thread+0x11b/0x400


I've managed to forget most if not all of that horror show. Vikas and
David seem to be working on a replacement, but until such a time it
would be good if this thing would not crash the kernel.

Guys, could you have a look? To me it appears to mostly have the right
shape, but like I said, I forgot most details...


The new patch is on the way after Thomas agreed on the requirements we sent a
few weeks back. Expect to send it in a week or so. This fix seems fine.


Thanks,
Vikas



Re: [PATCH 2/3] x86/intel_rdt: Trim whitespace while parsing schemata input

2017-04-20 Thread Shivappa Vikas



On Thu, 20 Apr 2017, Thomas Gleixner wrote:


On Thu, 20 Apr 2017, Shivappa Vikas wrote:

Will fix. I went overboard not wanting to add a line.


Finish reading the thread before you start :)


Got it :) Had not seen the tip bot emails on my other email.





Re: [PATCH 3/3] x86/intel_rdt: Return error for incorrect resource names in schemata

2017-04-20 Thread Shivappa Vikas



On Thu, 20 Apr 2017, Thomas Gleixner wrote:


On Wed, 19 Apr 2017, Vikas Shivappa wrote:


}


The resulting loop is just horrible to read. We can do better than
that. Patch below.

Thanks,

tglx
8<-

--- a/arch/x86/kernel/cpu/intel_rdt_schemata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_schemata.c
@@ -187,6 +187,17 @@ static int update_domains(struct rdt_res
return 0;
}

+static int rdtgroup_parse_resource(char *resname, char *tok, int closid)
+{
+   struct rdt_resource *r;
+
+   for_each_enabled_rdt_resource(r) {
+   if (!strcmp(resname, r->name) && closid < r->num_closid)
+   return parse_line(tok, r);
+   }
+   return -EINVAL;
+}
+
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -209,9 +220,10 @@ ssize_t rdtgroup_schemata_write(struct k

closid = rdtgrp->closid;

-   for_each_enabled_rdt_resource(r)
+   for_each_enabled_rdt_resource(r) {
list_for_each_entry(dom, &r->domains, list)
dom->have_new_ctrl = false;
+   }

while ((tok = strsep(&buf, "\n")) != NULL) {
resname = strsep(&tok, ":");
@@ -219,19 +231,9 @@ ssize_t rdtgroup_schemata_write(struct k
ret = -EINVAL;
goto out;
}
-   for_each_enabled_rdt_resource(r) {
-   if (!strcmp(strim(resname), r->name) &&
-   closid < r->num_closid) {
-   ret = parse_line(tok, r);
-   if (ret)
-   goto out;
-   break;
-   }
-   }
-   if (!r->name) {
-   ret = -EINVAL;
+   ret = rdtgroup_parse_resource(strim(resname), tok, closid);
+   if (ret)
goto out;
-   }
}


Ok, agree it's a better way. I was trying to add a NULL which Tony had already
nacked; I did not really want to use a pointer which may be out of bounds,
although we don't check the contents.


Thanks,
Vikas


Re: [PATCH 2/3] x86/intel_rdt: Trim whitespace while parsing schemata input

2017-04-20 Thread Shivappa Vikas



On Thu, 20 Apr 2017, Thomas Gleixner wrote:


On Wed, 19 Apr 2017, Vikas Shivappa wrote:


Schemata is displayed in tabular format which introduces some whitespace
to show data in a tabular format. If user wants to input the same data
that is displayed, the parsing fails. Trim the leading and trailing
whitespace to help parse such data.

Reported-by: Prakhya, Sai Praneeth 
Signed-off-by: Vikas Shivappa 
Tested-by: Prakhya, Sai Praneeth 
---
 arch/x86/kernel/cpu/intel_rdt_schemata.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_schemata.c b/arch/x86/kernel/cpu/intel_rdt_schemata.c
index 9467a00..3cfa1ca 100644
--- a/arch/x86/kernel/cpu/intel_rdt_schemata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_schemata.c
@@ -143,7 +143,7 @@ static int parse_line(char *line, struct rdt_resource *r)
return -EINVAL;
list_for_each_entry(d, >domains, list) {
if (d->id == dom_id) {
-   if (r->parse_ctrlval(dom, r, d))
+   if (r->parse_ctrlval(strim(dom), r, d))
return -EINVAL;
goto next;
}
@@ -220,7 +220,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
goto out;
}
for_each_enabled_rdt_resource(r) {
-   if (!strcmp(resname, r->name) &&
+   if (!strcmp(strim(resname), r->name) &&


Both strims() invocations are just silly. They can be done before the loop
once instead of doing them over and over again.


Will fix. I went overboard not wanting to add a line.
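
E.g. trimming once per token instead (sketch; reuses the
rdtgroup_parse_resource() helper from the other schemata thread):

	while ((tok = strsep(&buf, "\n")) != NULL) {
		resname = strim(strsep(&tok, ":"));
		if (!tok) {
			ret = -EINVAL;
			goto out;
		}
		ret = rdtgroup_parse_resource(resname, tok, closid);
		if (ret)
			goto out;
	}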

Thanks,
Vikas



Thanks,

tglx



Re: [PATCH 0/1] x86/cqm: Cqm3 Design Documentation

2017-04-17 Thread Shivappa Vikas


Hello Thomas,

Wanted to know if you had any feedback on the new interface for cqm design which 
is based on monitoring the resctrl groups. It tries to address all the 
requirements discussed in:


https://marc.info/?l=linux-kernel&m=148891934720489

-basically to monitor resctrl groups / per thread including kernel threads.
-monitor thread from creation (a forked child thread also belongs to the same group 
as its parent)

-monitor subset of tasks in a resctrl group.
-monitor subset of cpus in a resctrl group.
-monitor different events (however the first version only shows llc_occupancy)
-monitor all domains and show per domain data.

I have a patch series but was thinking to send it after the design document is 
reviewed so that it's easy to review and I can make changes that are discussed 
here.


Thanks,
Vikas

On Mon, 3 Apr 2017, Vikas Shivappa wrote:


Resending the design doc for cqm3

Includes one cqm documentation patch which explains the design for the
new resctrl based cqm interface. This is a follow-up with the design document
after the requirements for the new cqm were reviewed:
https://marc.info/?l=linux-kernel&m=148891934720489

[PATCH 1/1] x86/cqm: Cqm3 Documentation



Re: [PATCH 0/8 V4] x86/intel_rdt: Intel Memory bandwidth allocation

2017-04-14 Thread Shivappa Vikas



On Fri, 14 Apr 2017, Shivappa Vikas wrote:




On Fri, 14 Apr 2017, Thomas Gleixner wrote:


Please do the following:

1) Verify that it still works as I have no hardware to test it. Once you
   confirmed, it's going to show up in -next. So please do that ASAP,
   i.e. yesterday.

2) Go through the patches one by one and compare it to your own to figure
   out yourself how it should be done. Next time, I'm simply going to drop
   such crap whether that makes it miss the merge window or not.


Ok doing the testing now. Will update soon.
Also will follow up with the type of changes and implement the same convention 
in the future patches.


All your changes are tested to be fine when used on the hardware and function 
as before. We did, however, discover some minor parsing issues (which existed in 
the version I sent originally) and will send a fix on top of 
current tip/x86/cpu soon.


Thanks,
Vikas



Thanks,
Vikas



Yours grumpy

 tglx





Re: [PATCH 0/8 V4] x86/intel_rdt: Intel Memory bandwidth allocation

2017-04-14 Thread Shivappa Vikas



On Fri, 14 Apr 2017, Thomas Gleixner wrote:


On Thu, 13 Apr 2017, Thomas Gleixner wrote:


On Wed, 12 Apr 2017, Shivappa Vikas wrote:

This series has minor changes with respect to V3 addressing all your comments.
Was wondering if there was any feedback or if we still have a chance for 4.12.


It's on my radar and should make it, unless there is some major hiccup.


To be honest, I almost dropped it because as usual you cobbled it together
in a hurry just to get it out the door.

I asked for putting the CBM and MBA related data into separate structs and
making an anon union of them in struct rdt_resource. Instead you went and made
it an anon union of anon structs, so you did not have to change anything
else in the code. What's the point of this? That's a completely useless
exercise and even worse than the data lump which was there before.

I also asked several times in the past to split preparatory stuff from new
stuff. No, the msr_update crap comes in one go. You introduce an update
function and instead of replacing _ALL_ loops you keep one and then fix it
up in some completely unrelated patch.

The ordering of the new struct members was also completely random along
with the kernel doc comments not being aligned.

As a bonus, you reimplemented roundup() open-coded in the bandwidth
validation function.

Instead of wasting my time for another round of review and another delivery
of half-baked crap, I fixed it up myself. The result is pushed out to
tip/x86/cpu.

Please do the following:

1) Verify that it still works as I have no hardware to test it. Once you
   confirmed, it's going to show up in -next. So please do that ASAP,
   i.e. yesterday.

2) Go through the patches one by one and compare it to your own to figure
   out yourself how it should be done. Next time, I'm simply going to drop
   such crap whether that makes it miss the merge window or not.


Ok doing the testing now. Will update soon.
Also will follow up with the type of changes and implement the same convention in 
the future patches.


Thanks,
Vikas



Yours grumpy

 tglx



Re: [PATCH 0/8 V4] x86/intel_rdt: Intel Memory bandwidth allocation

2017-04-12 Thread Shivappa Vikas


Hello Thomas,

This series has minor changes with respect to V3 addressing all your comments.
Was wondering if there was any feedback or if we still have a chance for 4.12.

Thanks,
Vikas

On Fri, 7 Apr 2017, Vikas Shivappa wrote:


Sending another version of the MBA patch series with changes to the V3 version
as per Thomas' feedback here:
https://marc.info/?l=linux-kernel&m=149125664024881

Changes:
- Fixed the wrong names in the struct documentation for rdt_domain
- Changed the mba and cache ctrl values to have separate structs and put
them in a union in the rdt_resource structure.

patches apply on tip x86/cpu

[PATCH 1/8] Documentation, x86: Intel Memory bandwidth allocation
[PATCH 2/8] x86/intel_rdt/mba: Generalize the naming to get ready for
[PATCH 3/8] x86/intel_rdt/mba: Memory b/w allocation feature detect
[PATCH 4/8] x86/intel_rct/mba: Add MBA structures and initialize MBA
[PATCH 5/8] x86/intel_rdt: Prep to add info files for MBA
[PATCH 6/8] x86/intel_rdt/mba: Add info directory files for MBA
[PATCH 7/8] x86/intel_rdt: Prep to add schemata file for MBA
[PATCH 8/8] x86/intel_rdt/mba: Add schemata file support for MBA



Re: [PATCH 1/3] x86/intel_rdt: Fix issue when mkdir uses a freed CLOSid

2017-04-11 Thread Shivappa Vikas



On Mon, 10 Apr 2017, Thomas Gleixner wrote:


On Wed, 5 Apr 2017, Luck, Tony wrote:

On Wed, Apr 05, 2017 at 05:20:24PM +0200, Thomas Gleixner wrote:

That's just wrong.

The proper behaviour for a new control group is, that at the time when it
is created it copies the CBM values of the default group and not claiming
access to ALL of the cache by default.


I don't see that as any more helpful. When you make a new
control group it is because none of the existing groups
provides the QoS that you want.  So the first thing the
user will do is write the schemata file with the values
they do want.

So "all access", or "same as default group" are both the
same to the user ... not what they want.

We do need to make sure that the schemata matches what is
in the registers. We need to make sure that changes to the
schemata file result in the MSRs being written where needed.


That's true today. The MSRs and the schemata file of a newly created group
always match. So there's nothing to fix, right?


Yes. The upstream patch has the MSRs matching what the schemata shows. We can drop 
this patch. I sent a new MBA version without this just on the tip, which has the 
other two patches.


Thanks,
Vikas



Thanks,

tglx





Re: [PATCH 4/8] x86/intel_rct/mba: Add MBA structures and initialize MBA

2017-04-05 Thread Shivappa Vikas



On Wed, 5 Apr 2017, Thomas Gleixner wrote:


On Mon, 3 Apr 2017, Vikas Shivappa wrote:


 /**
+ * struct rdt_domain - group of cpus sharing an RDT resource
+ * @list:  all instances of this resource
+ * @id:unique id for this instance
+ * @cpu_mask:  which cpus share this resource
+ * @ctrl_val:  array of cache or mem ctrl values (indexed by CLOSID)
+ * @new_cbm:   new cbm value to be loaded
+ * @have_new_cbm: did user provide new_cbm for this domain


The version which you removed below has the kernel-doc comments correct 


Will fix




+/**
  * struct rdt_resource - attributes of an RDT resource
  * @enabled:   Is this feature enabled on this machine
  * @capable:   Is this feature available on this machine
@@ -78,6 +109,16 @@ struct rftype {
  * @data_width:Character width of data when displaying
  * @min_cbm_bits:  Minimum number of consecutive bits to be set
  * in a cache bit mask
+ * @msr_update:    Function pointer to update QOS MSRs
+ * @max_delay:     Max throttle delay. Delay is the hardware
+ *                 understandable value for memory bandwidth.
+ * @min_bw:        Minimum memory bandwidth percentage user
+ *                 can request
+ * @bw_gran:       Granularity at which the memory bandwidth
+ *                 is allocated
+ * @delay_linear:  True if memory b/w delay is in linear scale
+ * @mb_map:        Mapping of memory b/w percentage to
+ *                 memory b/w delay values
  * @domains:   All domains for this resource
  * @msr_base:  Base MSR address for CBMs
  * @cache_level:   Which cache level defines scope of this domain
@@ -94,6 +135,14 @@ struct rdt_resource {
int min_cbm_bits;
u32 default_ctrl;
int data_width;
+   void (*msr_update)  (struct rdt_domain *d, struct msr_param *m,
+struct rdt_resource *r);
+   u32 max_delay;
+   u32 min_bw;
+   u32 bw_gran;
+   u32 delay_linear;
+   u32 *mb_map;


I don't know what other weird controls will be added over time, but we are
probably better off to have

struct cache_ctrl {
int cbm_len;
int min_cbm_bits;
};

struct mba_ctrl {
u32 max_delay;
u32 min_bw;
u32 bw_gran;
u32 delay_linear;
u32 *mb_map;
};

and in then in struct rdt_resource:

  
  union {
struct cache_ctrl   foo;
struct mba_ctrl bla;
} ctrl;


That avoids that rdt_resource becomes a hodgepodge of unrelated or even
contradicting fields.

Hmm?


Ok, makes sense. Will fix. I thought of a union when I had added a couple of fields 
and gave up, but it's grown a lot now.
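
Roughly (sketch; members trimmed down for illustration):

	struct cache_ctrl {
		int	cbm_len;
		int	min_cbm_bits;
	};

	struct mba_ctrl {
		u32	max_delay;
		u32	min_bw;
		u32	bw_gran;
		u32	delay_linear;
		u32	*mb_map;
	};

	struct rdt_resource {
		bool			enabled;
		bool			capable;
		char			*name;
		u32			default_ctrl;
		int			data_width;
		union {
			struct cache_ctrl	cache;
			struct mba_ctrl		mba;
		} ctrl;
		struct list_head	domains;
	};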


Thanks,
Vikas



Thanks,

tglx



Re: [PATCH 1/8] Documentation, x86: Intel Memory bandwidth allocation

2017-04-05 Thread Shivappa Vikas



On Wed, 5 Apr 2017, Thomas Gleixner wrote:


On Mon, 3 Apr 2017, Vikas Shivappa wrote:

+Cache resource(L3/L2)  subdirectory contains the following files:

-"num_closids":  The number of CLOSIDs which are valid for this
-   resource. The kernel uses the smallest number of
-   CLOSIDs of all enabled resources as limit.
+"num_closids":   The number of CLOSIDs which are valid for this
+   resource. The kernel uses the smallest number of
+   CLOSIDs of all enabled resources as limit.

-"cbm_mask": The bitmask which is valid for this resource. This
-   mask is equivalent to 100%.
+"cbm_mask":  The bitmask which is valid for this resource.
+   This mask is equivalent to 100%.

-"min_cbm_bits": The minimum number of consecutive bits which must be
-   set when writing a mask.
+"min_cbm_bits":  The minimum number of consecutive bits which
+   must be set when writing a mask.

+Memory bandwidth(MB) subdirectory contains the following files:


This has a num_closids file as well, right?


Yes. Copy below - just the same text, indented:

 -"num_closids":  The number of CLOSIDs which are valid for this
 -resource. The kernel uses the smallest number of
 -CLOSIDs of all enabled resources as limit.
 +"num_closids":  The number of CLOSIDs which are valid for this
 +resource. The kernel uses the smallest number of
 +CLOSIDs of all enabled resources as limit.

Thanks,
Vikas



Thanks,

tglx



Re: [PATCH 1/3] x86/intel_rdt: Fix issue when mkdir uses a freed CLOSid

2017-04-05 Thread Shivappa Vikas



On Wed, 5 Apr 2017, Thomas Gleixner wrote:


On Mon, 3 Apr 2017, Vikas Shivappa wrote:


Subject: x86/intel_rdt: Fix issue when mkdir uses a freed CLOSid


This subject line is useless again. It wants to be descriptive.

"Fix issue" Which issue?


Each resctrl directory has one CLOSid allocated which is mapped to a
control register/QOS_MSR. During an rmdir this CLOSid is freed and can
be reused later when a new directory is created.  Currently we do not
reset the QOS_MSR to a default when the CLOSid is freed. So when the
next mkdir uses a freed CLOSid, the new directory is associated with a
stale QOS_MSR.


That's not an issue. That's maybe something people are not expecting.


Fix this issue by writing a default value to the QOS_MSR when the
associated CLOSid is freed. The default value is all 1s for CBM which
means no control is enforced when a mkdir reuses this CLOSid.


That's just wrong.

The proper behaviour for a new control group is, that at the time when it
is created it copies the CBM values of the default group and not claiming
access to ALL of the cache by default.


The behaviour of the default/root group is having access to all of the cache, because 
we set the value of all CBMs to all 1s and then the root group takes CLOS 0. This is 
trying to set the same values for all newly created groups.


Thanks,
Vikas



Thanks,

tglx





Re: [PATCH 4/8] x86/intel_rct/mba: Add MBA structures and initialize MBA

2017-04-04 Thread Shivappa Vikas




On Mon, 3 Apr 2017, Tracy Smith wrote:


Hi All,

No JTAG available and need to understand why Linux 4.8.3 doesn't boot on a
x86_64 corei7-64.  Hangs at the typical "Starting kernel" location after
the last message of the U-boot.  The bootcmd is given below.


Do you see the issue when you apply the MBA patches? It seems like you use a 4.8 
kernel, but the MBA patches are dependent on:

https://marc.info/?l=linux-kernel&m=149125583824700

which is applied on 4.11-rc4.



See a fault FFS and a message indicating the image didn't load after
failover.

1) How can I verify if the kernel image was loaded from uboot
2) What is this fault?
3) Has the bootargs or bootcmd changed between 4.1 and 4.8.3?
4) If the boot cmd/arg has changed, what should the boot cmd/arg be for
4.8.3 to boot on x86_64 corei7-64?

Initial RAM disk at linear address 0x2000, size 11638378 bytes
Kernel command line: "BOOT_IMAGE=/imgx/bzImage LABEL=BOOT root=/dev/ram0
imgmnt=/media/sda2 imgdir=imgx img=image.rootfs rootdelay=2 slub_debug=F
console=ttyS1,115200  bootcount=1 bootcount_addr=0xa4000
acpi_enforce_resources=lax pram_size=0x80 pram_addr=1000
pram_loc=ddr crashkernel=128M memmap=0x80$0x1000  "
EFI table at 683392c0, mmap 68339300, mmap size 4c0, version 1, descr. size
0x30

Starting kernel ...

Timer summary in microseconds:
  MarkElapsed  Stage
 0  0  reset
 1  1  board_init_r
   105104  board_init_f
10,180,048 10,179,943  id=64
10,221,985 41,937  id=65
10,356,645134,660  main_loop
12,366,521  2,009,876  usb_start
18,747,284  6,380,763  start_kernel
Accumulated time:
   10,162,689  ahci

On a 4.1.26-yocto-standard #1 SMP it boots with no issues.  Basically same
.config used in both cases except for anything deprecated between 4.1 and
4.8.3.

root@:~# cat /proc/cmdline
BOOT_IMAGE=/imgy/bzImage LABEL=BOOT root=/dev/ram0 imgmnt=/media/sda2
imgdir=imgy img=image.rootfs rootdelay=2 slub_debug=F console=ttyS1,115200
fault=FFS bootcount=3 bootcount_addr=0xa4000  acpi_enforce_resources=lax
pram_size=0x80 pram_addr=1000 pram_loc=ddr crashkernel=128M
memmap=0x80$0x1000
root@CLX3001:~# cat /proc/consoles
ttyS1-W- (EC p a)4:65
netcon0  -W- (E )

thx,
Tray



Re: [PATCH] x86/cqm: Cqm3 Documentation

2017-04-03 Thread Shivappa Vikas


On Mon, 3 Apr 2017, Vikas Shivappa wrote:


Explains the design for the interface


Explains the design for the new resctrl based cqm interface. A followup
with design document after the requirements for new cqm was reviewed :
https://marc.info/?l=linux-kernel=148891934720489



Signed-off-by: Vikas Shivappa 
---
Documentation/x86/intel_rdt_ui.txt | 210 ++---
1 file changed, 197 insertions(+), 13 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt 
b/Documentation/x86/intel_rdt_ui.txt
index d918d26..46a2efd 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -1,12 +1,13 @@
-User Interface for Resource Allocation in Intel Resource Director Technology
+User Interface for Resource Allocation and Monitoring in Intel Resource
+Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu 
Tony Luck 

-This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
-X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
+This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
+X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".

To use the feature mount the file system:

@@ -16,14 +17,20 @@ mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.

+The mount succeeds if either allocation or monitoring is present.
+Monitoring is enabled for each resource which has support in the
+hardware. For more details on the behaviour of the interface during
+monitoring and allocation, see resctrl group section.

Info directory
--

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
-names reflect the resource names. Each subdirectory contains the
-following files:
+names reflect the resource names.
+
+Each subdirectory contains the following files with respect to
+allocation:

"num_closids":  The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
@@ -35,15 +42,36 @@ following files:
"min_cbm_bits": The minimum number of consecutive bits which must be
set when writing a mask.

+Each subdirectory contains the following files with respect to
+monitoring:
+
+"num_rmids": The number of RMIDs which are valid for
+   this resource.
+
+"mon_enabled":   Indicates if monitoring is enabled for
+   the resource.

-Resource groups

+"max_threshold_occupancy":   This is specific to LLC_occupancy
+   monitoring. Provides an upper bound on
+   the threshold and is measured in bytes
+   because it's exposed to userland.
+
+Resource alloc and monitor groups (ALLOC_MON group)
+---
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".

-There are three files associated with each group:
+Users can allocate and monitor resources via these resource
+groups created in the root directory.
+
+Note that the creation of new ALLOC_MON groups is only allowed when RDT
+allocation is supported. This means user can still monitor the root
+group when only RDT monitoring is supported.
+
+There are three files associated with each group with respect to
+resource allocation:

"tasks": A list of tasks that belongs to this group. Tasks can be
added to a group by writing the task ID to the "tasks" file
@@ -75,6 +103,56 @@ the CPU's group is used.

3) Otherwise the schemata for the default group is used.

+There are three files associated with each group with respect to
+resource monitoring:
+
+"data": A list of all the monitored resource data available to this
+   group. This includes the monitored data for all the tasks in the
+   'tasks' and the cpus in 'cpus' file. Each resource has its own
+   line and format - see the 'data' file description below for
+   details. The monitored data for
+   the ALLOC_MON group is the sum of all the data for its sub MON
+   groups.
+
+"mon_tasks": A directory where in user can create Resource monitor
+   groups (MON groups). This will let user create a group to
+   monitor a subset of tasks in the above 'tasks' file.
+
+Resource monitor groups (MON group)
+---
+
+Resource monitor groups are directories inside the mon_tasks directory.
+There is one mon_tasks directory inside every ALLOC_MON group including
+the root group.
+
+MON groups help the user monitor a subset of tasks and cpus within
+the parent ALLOC_MON group.
+
+Each MON group has 3 files:
+
+"tasks": This behaves exactly as the 'tasks' file above in the ALLOC_MON
+ 

Re: [PATCH] x86/intel_rdt: Implement "update" mode when writing schemata file

2017-03-31 Thread Shivappa Vikas



On Fri, 24 Mar 2017, Luck, Tony wrote:


+++ b/arch/x86/kernel/cpu/intel_rdt_schemata.c
@@ -56,7 +56,7 @@ static bool cbm_validate(unsigned long var, struct 
rdt_resource *r)
 * Read one cache bit mask (hex). Check that it is valid for the current
 * resource type.
 */
-static int parse_cbm(char *buf, struct rdt_resource *r)
+static int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d)
{
unsigned long data;
int ret;
@@ -66,7 +66,8 @@ static int parse_cbm(char *buf, struct rdt_resource *r)
return ret;


Would need a check here for repeated domain entries, to error out on something like:

echo -e "L3:1=7f;1=7;1=7ff" | cat > /sys/fs/resctrl/p1/schemata

if (d->have_new_cbm)
return -EINVAL;



if (!cbm_validate(data, r))
return -EINVAL;
-   r->tmp_cbms[r->num_tmp_cbms++] = data;
+   d->new_cbm = data;
+   d->have_new_cbm = true;




Re: [PATCH] x86/intel_rdt: Implement "update" mode when writing schemata file

2017-03-31 Thread Shivappa Vikas



On Fri, 31 Mar 2017, Thomas Gleixner wrote:


On Fri, 24 Mar 2017, Luck, Tony wrote:

+Reading/writing the schemata file
+-
+Reading the schemata file will show the state of all resources
+on all domains. When writing you only need to specify those values
+which you wish to change.  E.g.
+
+# cat schemata
+L3DATA:0=f;1=f;2=f;3=f
+L3CODE:0=f;1=f;2=f;3=f
+# echo "L3DATA:2=3c0;" > schemata
+# cat schemata
+L3DATA:0=f;1=f;2=3c0;3=f
+L3CODE:0=f;1=f;2=f;3=f


It would be nice if the output would cover the mask width, i.e:

L3DATA:0=f;1=f;2=003c0;3=f
L3CODE:0=f;1=f;2=f;3=f

because that makes it tabular and simpler to read.


Will add this.

A sample with MBA:

  L3DATA:0=f;1=f;2=003c0;3=f
  L3CODE:0=f;1=f;2=f;3=f
  MB:0=   10;1=   20;2=  100;3=  100
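
The padding can be done generically in the show path. A sketch, where format_str, 
max_name_width and max_data_width are illustrative names (e.g. format_str would be 
"%d=%0*x" for the cache resources and "%d=%*d" for MB):

static void show_doms(struct seq_file *seq, struct rdt_resource *r, int closid)
{
	struct rdt_domain *d;
	bool sep = false;

	seq_printf(seq, "%*s:", max_name_width, r->name);
	list_for_each_entry(d, &r->domains, list) {
		if (sep)
			seq_putc(seq, ';');
		/* Pad every domain value to the widest mask/percentage */
		seq_printf(seq, r->format_str, d->id, max_data_width,
			   d->ctrl_val[closid]);
		sep = true;
	}
	seq_putc(seq, '\n');
}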

Thanks,
Vikas



Thanks,

tglx




Re: [PATCH 7/8] x86/intel_rdt: schemata file support for MBA prepare

2017-03-30 Thread Shivappa Vikas



On Wed, 1 Mar 2017, Thomas Gleixner wrote:


On Fri, 17 Feb 2017, Vikas Shivappa wrote:

Subject: x86/intel_rdt: schemata file support for MBA prepare

I have no idea what MBA prepare is. Is that yet another variant aside of
MBE?


Add support to introduce generic APIs for control validation and writing
QOS_MSRs for RDT resources. The control validation api is meant to
validate the control values like cache bit mask for CAT and memory b/w
percentage for MBA. A resource generic display format is also added and
used for the resources depending on whether its displayed in
hex/decimal.


The usual impenetrable mess of random sentences lumped together.


diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 24de64c..8748b0d 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -77,6 +77,9 @@ struct rftype {
  * @default_ctrl:  Specifies default cache cbm or mem b/w percent.
  * @min_cbm_bits:  Minimum number of consecutive bits to be set
  * in a cache bit mask
+ * @format_str:Per resource format string to show domain val


Can you please spell out words in comments instead of using random
abbreviations just as you see fit?


+ * @parse_ctrlval: Per resource API to parse the ctrl values


That's not an API. That's a function pointer.


+ * @msr_update:API to update QOS MSRs


Ditto.


  * @info_files:resctrl info files for the resource
  * @infofiles_len: Number of info files
  * @max_delay: Max throttle delay. Delay is the hardware
@@ -105,6 +108,9 @@ struct rdt_resource {
int cbm_len;
int min_cbm_bits;
u32 default_ctrl;
+   const char  *format_str;
+   int (*parse_ctrlval)(char *buf, struct rdt_resource *r);
+   void (*msr_update)  (void *a1, void *a2, struct rdt_resource *r);


void *a1, *a2? Dammit, both implementations (CAT and MBA) use the same
types. This just avoids type safety which does not magically come back by
your completely nonsensical typecasts inside the callback implementations.


+void cqm_wrmsr(void *a1, void *a2, struct rdt_resource *r)


And this is global because it's only used in this file, right?


+{
+   struct rdt_domain *d = (struct rdt_domain *)a2;
+   struct msr_param *m = (struct msr_param *)a1;


Oh well..


+   int i;
+
+   for (i = m->low; i < m->high; i++) {
+   int idx = cbm_idx(r, i);
+
+   wrmsrl(r->msr_base + idx, d->ctrl_val[i]);
+   }
+}
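
The type-safe variant being asked for just takes the real types; roughly, with 
mba_wrmsr as an illustrative name:

static void mba_wrmsr(struct rdt_domain *d, struct msr_param *m,
		      struct rdt_resource *r)
{
	unsigned int i;

	/* MBA throttle MSRs are indexed linearly, no cbm_idx() remapping */
	for (i = m->low; i < m->high; i++)
		wrmsrl(r->msr_base + i, d->ctrl_val[i]);
}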



-static bool cbm_validate(unsigned long var, struct rdt_resource *r)
+static int cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 {
-   unsigned long first_bit, zero_bit;
+   unsigned long first_bit, zero_bit, var;
+   int ret;
+
+   ret = kstrtoul(buf, 16, &var);
+   if (ret)
+   return ret;

if (var == 0 || var > r->default_ctrl)
-   return false;
+   return -EINVAL;


So you change this function and the whole call chain to return either
-EINVAL or 0 instead of false/true.

And then you treat the integer return value as boolean on the call site
again.
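
A corrected call site would simply propagate the error code instead of testing it 
like a boolean, roughly:

	ret = cbm_validate(buf, &data, r);
	if (ret)
		return ret;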


Will fix wrt all the comments.

Thanks,
Vikas




Re: [PATCH 5/5] x86/intel_rdt: hotcpu updates for RDT

2017-03-30 Thread Shivappa Vikas



On Wed, 1 Mar 2017, Thomas Gleixner wrote:


WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);

@@ -1585,12 +1580,17 @@ static int intel_cqm_cpu_starting(unsigned int cpu)

 static int intel_cqm_cpu_exit(unsigned int cpu)
 {
+   struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);


Can be this_cpu_ptr() because the callback is guaranteed to run on the
outgoing CPU.


Will fix this. I assumed the callbacks are set up the same way as cache alloc - 
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ...).





int target;

/* Is @cpu the current cqm reader for this package ? */
if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
return 0;


So if the CPU is not the current cqm reader then the per cpu state of this
CPU is left stale. Great improvement.


+   state->rmid = 0;
+   state->rmid_usecnt = 0;
+   wrmsr(MSR_IA32_PQR_ASSOC, 0, state->closid);


What clears state->closid? And what guarantees that state->rmid is not
updated before the CPU has really gone away?


- The rdt code takes care of clearing the closid state now. Will update the comment. 
- The cqm code however was never writing a zero to PQR_ASSOC.


So the update needs to be: remove the state->closid = 0 from the cqm code, as the 
rdt code takes care of the closid state in clear_closid(), which is called from 
both the offline and online cpu paths.

And also write an rmid of 0 to PQR_ASSOC.

We can integrate these two hotcpu callbacks (from CAT and CQM) to write PQR_ASSOC 
only once.
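
A sketch of what that combined exit path could look like (illustrative only; the 
actual cleanup went in with the later cqm rework):

static int intel_cqm_cpu_exit(unsigned int cpu)
{
	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

	/*
	 * Reset the per cpu QOS state unconditionally, not only when
	 * this CPU happens to be the cqm reader for the package.
	 */
	state->rmid = 0;
	state->closid = 0;
	state->rmid_usecnt = 0;
	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);

	/* ... reader migration to another CPU on the package as before ... */
	return 0;
}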


I guess I can skip all of these here and send them as part of the cqm changes we 
planned anyway, because this is really a cqm change.


Thanks,
Vikas




Re: [PATCH] x86/intel_rdt: Implement "update" mode when writing schemata file

2017-03-30 Thread Shivappa Vikas



On Fri, 24 Mar 2017, Fenghua Yu wrote:


On Fri, Mar 24, 2017 at 10:51:58AM -0700, Luck, Tony wrote:

From: Tony Luck 

The schemata file can have multiple lines and it is cumbersome to
update from shell scripts.


"from shell scripts" makes people a bit confused. Not just shell scripts,
C or Java code also can be cumbersome to update the file, right?


+Reading/writing the schemata file
+-
+Reading the schemata file will show the state of all resources
+on all domains. When writing you only need to specify those values
+which you wish to change.  E.g.


Users can still write all values, including the unchanged ones, just like 4.10
does. We may say "When writing you can either write all values or specify only
those values which you wish to change."?


Will integrate these changes into Tony's patch and send it as part of the next 
series.


Thanks,
Vikas



Re: [PATCH 6/8] x86/intel_rdt/mba: Add info directory files for MBA

2017-03-10 Thread Shivappa Vikas



On Wed, 1 Mar 2017, Thomas Gleixner wrote:


On Fri, 17 Feb 2017, Vikas Shivappa wrote:


Add files in info directory for MBA.
The files in the info directory are as follows :
- num_closids: max number of closids for MBA which represents the max
class of service user can configure.
- min_bw: the minimum memory bandwidth(b/w) values in percentage.

OS maps the b/w percentage values to memory b/w throttle delay values
and configures them via MSR interface by writing to the QOS_MSRs. These
delay values can have a linear or nonlinear scale.

- bw_gran: The memory b/w granularity that can be configured.
For ex: If the granularity is 10% and min_bw is 10, valid bandwidth
values are 10,20,30...


This is unreadable. It's possible to structure ASCII text cleanly.

x86/rdt: Add info directory files for MBA

The info directory for MBA contains the following files:

num_closids

The maximum number of class of service slots available for MBA

min_bandwidth

The minimum memory bandwidth percentage value

bandwidth_gran

The granularity of the bandwidth control for the particular
hardware in percent. The available bandwidth control steps are:
min_bw + N * bw_gran. Intermediate values are rounded to the next


Is this next or previous? Meaning when 12 is requested on a 10 granularity, do we 
give 10?



control step available on the hardware.

delay_linear

If set, the control registers take a linear percentage based value
between min_bandwidth and 100 percent.

If not set, the control registers take a power of 2 based value
which is mapped by the kernel to percentage based values.

This file is of pure informational nature and has no influence on
the values which are written to the schemata files. These are
always percentage based.
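
For the rounding question above, a sketch of a validation helper; whether roundup() 
or rounddown() is the right choice was exactly the open point (bw_validate and the 
field names are illustrative):

static int bw_validate(unsigned long bw, unsigned long *data,
		       struct rdt_resource *r)
{
	if (bw < r->min_bw || bw > 100)
		return -EINVAL;

	/*
	 * Round up to the next control step: with bw_gran = 10, a
	 * request of 12 becomes 20; rounddown() would give 10 instead.
	 */
	*data = roundup(bw, (unsigned long)r->bw_gran);
	return 0;
}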


Will update the changelogs



Note, that this uses the actual file names and not some random
abbreviations thereof. It also documents delay_linear and gets rid of the
implementation details of QOS_MSRs. They are irrelevant here.

And exactly this information wants to go into Documentation/... preferably
in exactly this patch and not in a disconnected one which describes stuff
differently for whatever reasons.


+static int rdt_min_bw_show(struct kernfs_open_file *of,
+struct seq_file *seq, void *v)
+{
+   struct rdt_resource *r = of->kn->parent->priv;
+
+   seq_printf(seq, "%d\n", r->min_bw);
+


Can you please get rid of these pointless extra new lines before the
'return 0;' ? They are just eating screen estate and do not make the code
more readable.


All the rest of the show functions have that blank line before the return, like 
rdt_min_cbm_bits_show etc. I have tried to always keep a blank line before the 
return like the old cqm/rapl code - if that's ok.





+/* rdtgroup information files for MBE. */


What is MBE?


+static struct rftype res_mbe_info_files[] = {


Randomizing names make the code more secure or what are you trying to
achieve?


+void rdt_get_mba_infofile(struct rdt_resource *r)
+{
+   r->info_files = &res_mbe_info_files[0];


See other mail.



Will fix the MBE to MBA and the pointer init.

Thanks,
Vikas


Thanks,

tglx




Re: [PATCH 5/8] x86/intel_rdt: info file support for MBA prepare

2017-03-10 Thread Shivappa Vikas



On Wed, 1 Mar 2017, Thomas Gleixner wrote:


On Fri, 17 Feb 2017, Vikas Shivappa wrote:


As a preparatory patch to MBA info file setup, generalize the info file
setup to have the option to choose between different set of files.
Although multiple cache resources have same info files, Memory resources
have different set of info files. That way we have the option to choose
between memory resource and cache resource info files.


Sigh.


@@ -77,6 +77,8 @@ struct rftype {
  * @default_ctrl:  Specifies default cache cbm or mem b/w percent.
  * @min_cbm_bits:  Minimum number of consecutive bits to be set
  * in a cache bit mask
+ * @info_files:resctrl info files for the resource
+ * @infofiles_len: Number of info files


len == length, nr == number. No? Too intuitive, right?


Took it as len since rdtgroup_add_files calls its parameter len. Will fix it 
to nr_infofiles as that seems better.




And no, not infofiles_nr. It wants to be nr_infofiles. And while at it
please either use infofiles or info_files, but not a mixture of
both. Random underscores are not enhancing readability at all.


+void rdt_get_cache_infofile(struct rdt_resource *r)
+{
+   r->info_files = &res_cache_info_files[0];


What's wrong with

   r->info_files = res_cache_info_files;

Nothing, but it would be too easy to read. This is not the obfuscated
c-code contest.


Will fix..

Thanks,
Vikas




+   r->infofiles_len = ARRAY_SIZE(res_cache_info_files);
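
So the fixed helper ends up as simple as this (a sketch, with nr_infofiles per the 
naming request above):

void rdt_get_cache_infofile(struct rdt_resource *r)
{
	r->info_files = res_cache_info_files;
	r->nr_infofiles = ARRAY_SIZE(res_cache_info_files);
}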


Thanks,

tglx


