Re: [PATCH v2 2/6] perf report: Caculate and return the branch counting in callchain

2016-10-20 Thread Nilay Vaish
On 20 October 2016 at 11:48, Andi Kleen  wrote:
> On Thu, Oct 20, 2016 at 11:41:11AM -0500, Nilay Vaish wrote:
>> On 19 October 2016 at 17:01, Jin Yao  wrote:
>> > diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
>> > index 40ecf25..4f6bf6c 100644
>> > --- a/tools/perf/util/callchain.h
>> > +++ b/tools/perf/util/callchain.h
>> > @@ -115,6 +115,10 @@ struct callchain_list {
>> > boolunfolded;
>> > boolhas_children;
>> > };
>> > +   u64 branch_count;
>> > +   u64 predicted_count;
>> > +   u64 abort_count;
>>
>> Can you explain what abort count is?  It seems you are referring to
>> miss-speculated branches.  If that is the case, I would prefer that we
>> replace abort by miss_speculated or miss_predicted.
>
> abort refers to TSX aborts. It has nothing to do with branch
> mispredictions.

OK, I am more confused now.  Are you predicting some quantity related
to transactions?  Why would you divide abort count by branch count?
Further, I just looked at patch 6/6.  It has the following text:

+ Also show with some branch flags that can be:
+ - Predicted: display the average percentage of predicated branches.
+ (predicated number / total number)
+ - Abort: display the average percentage of abort branches.
+ (abort number /total number)
+ - Cycles: cycles in basic block.


I think there is an inconsistency between what you are suggesting and
what the patch has.

--
Nilay


Re: [PATCH v2 2/6] perf report: Caculate and return the branch counting in callchain

2016-10-20 Thread Nilay Vaish
On 19 October 2016 at 17:01, Jin Yao  wrote:
> diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
> index 40ecf25..4f6bf6c 100644
> --- a/tools/perf/util/callchain.h
> +++ b/tools/perf/util/callchain.h
> @@ -115,6 +115,10 @@ struct callchain_list {
> boolunfolded;
> boolhas_children;
> };
> +   u64 branch_count;
> +   u64 predicted_count;
> +   u64 abort_count;

Can you explain what abort count is?  It seems you are referring to
miss-speculated branches.  If that is the case, I would prefer that we
replace abort by miss_speculated or miss_predicted.

--
Nilay


Re: [PATCH 1/9] perf/jit: improve error messages from JVMTI

2016-10-13 Thread Nilay Vaish
On 13 October 2016 at 05:59, Stephane Eranian  wrote:
> diff --git a/tools/perf/jvmti/libjvmti.c b/tools/perf/jvmti/libjvmti.c
> index ac12e4b91a92..2d9bc04b79a8 100644
> --- a/tools/perf/jvmti/libjvmti.c
> +++ b/tools/perf/jvmti/libjvmti.c
> @@ -12,6 +12,17 @@
>  static int has_line_numbers;
>  void *jvmti_agent;
>
> +static void print_error(jvmtiEnv *jvmti, const char *msg, jvmtiError ret)
> +{
> +   char *err_msg = NULL;
> +   jvmtiError err;
> +   err = (*jvmti)->GetErrorName(jvmti, ret, &err_msg);
> +   if (err == JVMTI_ERROR_NONE) {
> +   warnx("%s failed with %s", msg, err_msg);
> +   (*jvmti)->Deallocate(jvmti, (unsigned char *)err_msg);
> +   }
> +}

Do we not need to release the memory for err_msg if the condition for
the 'if' statement evaluates to false?  Is it that we are going to
kill the process, so no need to release the memory?
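
To make the question concrete, this is roughly what I had in mind (just a
sketch, not a request for this exact code; the else branch assumes that a
failed GetErrorName() does not allocate anything, which is the part I am
unsure about):

static void print_error(jvmtiEnv *jvmti, const char *msg, jvmtiError ret)
{
	char *err_msg = NULL;
	jvmtiError err;

	err = (*jvmti)->GetErrorName(jvmti, ret, &err_msg);
	if (err == JVMTI_ERROR_NONE) {
		warnx("%s failed with %s", msg, err_msg);
		(*jvmti)->Deallocate(jvmti, (unsigned char *)err_msg);
	} else {
		/* assumption: nothing was allocated here, so just report
		 * the raw error code instead of staying silent */
		warnx("%s failed with error %d", msg, (int)ret);
	}
}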

--
Nilay


Re: [PATCH v3 05/18] Documentation, x86: Documentation for Intel resource allocation user interface

2016-10-11 Thread Nilay Vaish
On 10 October 2016 at 12:19, Luck, Tony  wrote:
> On Sat, Oct 08, 2016 at 01:33:06PM -0700, Fenghua Yu wrote:
>> On Sat, Oct 08, 2016 at 12:12:07PM -0500, Nilay Vaish wrote:
>> > On 7 October 2016 at 21:45, Fenghua Yu  wrote:
>> > > From: Fenghua Yu 
>> > >
>> > > +L3 details (code and data prioritization disabled)
>> > > +--
>> > > +With CDP disabled the L3 schemata format is:
>> > > +
>> > > +   L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
>> > > +
>> > > +L3 details (CDP enabled via mount option to resctrl)
>> > > +
>> > > +When CDP is enabled, you need to specify separate cache bit masks for
>> > > +code and data access. The generic format is:
>> > > +
>> > > +   L3:<cache_id0>=<d_cbm>,<i_cbm>;<cache_id1>=<d_cbm>,<i_cbm>;...
>> >
>> > Can we drop L3 here and instead say:
>> > L<level>:<cache_id0>=<d_cbm>,<i_cbm>;<cache_id1>=<d_cbm>,<i_cbm>;...
>> >
>> > and similarly for without CDP as well.
>>
>> L3 and L2 are similar but different. L2 doesn't have CDP feature. It would
>> be better to talk them separately here.
>>
>
> Perhaps we should document the general form, and then show examples
> for L3 (and later L2 and other resources as they are added). Note
> in particular that other resources are not cache-like ... so the
> pattern you see between L3 and L2 is short lived. Future resources
> that can be controlled are not caches.
>
> General form of a resource line is:
>
> {resource type}:[{resource instanceM}={resource value};]*N
>
> where "M" iterates over all online instances of this resource,
> and "N" is the total number. [and the ";" is a separator, not a terminator,
> but I'm not sure how to write that].
>
> For the "L3" resource there are two formats for the value.
> With CDP enabled:
> {d-cache bit mask},{i-cache bit mask}
> with CDP disabled:
> {cache bit mask}


I am in favour of documenting the general form, but Fenghua and you
can take a call on this.
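
For illustration, with that general form a schemata line could look like the
following (the cache ids and bit masks are made up), without CDP:

	L3:0=fffff;1=3ff

and with CDP enabled, giving the data mask and then the code mask for each
instance:

	L3:0=fffff,fffff;1=3ff,3ff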

>
> "L2" resource doesn't support CDP, so the only format is {cache bit mask}
>
>
> The next resource coming will have values that are simple ranges {0 .. max}
>

Regarding the addition of more resources, I was wondering whether one
common definition of struct rdt_resource, used for each resource, is the
way to go.  As you point out, L2 caches will not support CDP, yet cdp
would still be part of the variable describing the L2 cache.  Should we
have separate structs for each resource?
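
To make that concrete, the "one common struct" option I am picturing is
roughly the sketch below; the field names are lifted from the patches in
this series, it is not a finished design:

struct rdt_resource {
	bool			enabled;
	const char		*name;
	int			num_closid;
	int			max_closid;
	bool			cdp_capable;	/* false for L2, true for L3 */
	bool			cdp_enabled;
	struct list_head	domains;	/* one rdt_domain per instance */
};

A non-cache resource would then simply leave the CDP fields false, which is
what made me wonder whether separate per-resource structs would be cleaner.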

--
Nilay


Re: [PATCH v3 04/18] x86/intel_rdt: Feature discovery

2016-10-11 Thread Nilay Vaish
On 8 October 2016 at 14:52, Borislav Petkov  wrote:
> On Sat, Oct 08, 2016 at 01:54:54PM -0700, Fenghua Yu wrote:
>> > I think these #defines are specific to Intel.  I would prefer if we
>> > have _INTEL_ somewhere in them.
>
> We don't generally add vendor names to those defines. Even more so if
> the 0x0... leaf range is Intel-specific anyway.
>
>> Is adding "Intel" in comment good?
>
> I don't see any need if the leaf has already this heading:
>
> /* Intel-defined CPU features, CPUID level 0x0007:0 (ebx), word 9 */
>

I think we should go with Fenghua's suggestion on this.  Reading the
code around the edits from this patch, it seems word 7 is not owned by
anyone.  Both AMD and Intel seem to be using it.

--
Nilay


Re: [PATCH v3 01/18] Documentation, ABI: Add a document entry for cache id

2016-10-11 Thread Nilay Vaish
On 10 October 2016 at 11:45, Luck, Tony  wrote:
> On Sat, Oct 08, 2016 at 12:11:08PM -0500, Nilay Vaish wrote:
>> On 7 October 2016 at 21:45, Fenghua Yu  wrote:
>> > From: Fenghua Yu 
>
>> > +   caches typically exist per core, but there may not be a
>> > +   power of two cores on a socket, so these caches may be
>> > +   numbered 0, 1, 2, 3, 4, 5, 8, 9, 10, ...
>> > +
>>
>> While it is ok that the caches are not numbered contiguously, it is
>> unclear how this is related to number of cores on a socket being a
>> power of 2 or not.
>
> That's a side effect of the x86 algorithm to generate the unique ID
> which uses a shift to put the socket number in some upper bits while
> leaving the "id within a socket" in the low bits.
>
> I don't think it worth documenting here, but I noticed that we don't
> keep the IDs within a core contguous either.  On my 24 core Broadwell
> they are not 0 ... 23 then a gap from 24 to 31.  I actually have on
> socket 0:
>
>  0,  1,  2,  3,  4,  5
>  8,  9, 10, 11, 12, 13
> 16, 17, 18, 19, 20, 21
> 24, 25, 26, 27, 28, 29
>

Thanks for the info.

--
Nilay


Re: [PATCH v3 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-10 Thread Nilay Vaish
On 7 October 2016 at 21:45, Fenghua Yu  wrote:
> From: Fenghua Yu 
> diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c 
> b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> index 28df92e..efcbfe7 100644
> --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> @@ -36,6 +36,52 @@ struct kernfs_root *rdt_root;
>  struct rdtgroup rdtgroup_default;
>  LIST_HEAD(rdt_all_groups);
>
> +/*
> + * Trivial allocator for CLOSIDs. Since h/w only supports
> + * a small number, we can keep a bitmap of free CLOSIDs in
> + * a single integer.
> + */
> +static int closid_free_map;
> +
> +static void closid_init(void)
> +{
> +   closid_free_map = BIT_MASK(rdt_max_closid) - 1;
> +
> +   /* CLOSID 0 is always reserved for the default group */
> +   closid_free_map &= ~1;
> +}
> +
> +int closid_alloc(void)
> +{
> +   int closid = ffs(closid_free_map);
> +
> +   if (closid == 0)
> +   return -ENOSPC;
> +   closid--;
> +   closid_free_map &= ~(1 << closid);
> +
> +   return closid;
> +}
> +
> +static void closid_free(int closid)
> +{
> +   closid_free_map |= 1 << closid;
> +}
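
(Just to check my reading of this trivial allocator, a quick walk-through
with a made-up rdt_max_closid of 4:

	closid_init():  BIT_MASK(4) - 1 = 0b1111, then & ~1 -> 0b1110
	                (CLOSID 0 reserved, CLOSIDs 1..3 free)
	closid_alloc(): ffs(0b1110) = 2, so it returns 1; map becomes 0b1100
	closid_alloc(): ffs(0b1100) = 3, so it returns 2; map becomes 0b1000
	closid_free(2): map becomes 0b1100 again
	map == 0:       ffs() returns 0 and closid_alloc() returns -ENOSPC

so CLOSID 0 is indeed never handed out.)
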
> +
> +static struct rdtgroup *rdtgroup_alloc(void)
> +{
> +   struct rdtgroup *rdtgrp;
> +
> +   rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
> +
> +   return rdtgrp;
> +}
> +
> +static void rdtgroup_free(struct rdtgroup *rdtgroup)
> +{
> +   kfree(rdtgroup);
> +}

Any reason why you defined the above two functions?  Why not call
kzalloc() and kfree() directly?

--
Nilay


Re: [PATCH v3 11/18] x86/intel_rdt: Add basic resctrl filesystem support

2016-10-09 Thread Nilay Vaish
On 7 October 2016 at 21:45, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> diff --git a/arch/x86/include/asm/intel_rdt.h 
> b/arch/x86/include/asm/intel_rdt.h
> index bad8dc7..f63815c 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -2,6 +2,23 @@
>  #define _ASM_X86_INTEL_RDT_H
>
>  /**
> + * struct rdtgroup - store rdtgroup's data in resctrl file system.
> + * @kn:kernfs node
> + * @rdtgroup_list: linked list for all rdtgroups
> + * @closid:closid for this rdtgroup
> + */
> +struct rdtgroup {
> +   struct kernfs_node  *kn;
> +   struct list_headrdtgroup_list;
> +   int closid;
> +};
> +
> +/* List of all resource groups */
> +extern struct list_head rdt_all_groups;
> +
> +int __init rdtgroup_init(void);
> +
> +/**
>   * struct rdt_resource - attributes of an RDT resource
>   * @enabled:   Is this feature enabled on this machine
>   * @name:  Name to use in "schemata" file
> @@ -39,6 +56,7 @@ struct rdt_resource {
> for (r = rdt_resources_all; r->name; r++) \
> if (r->enabled)
>
> +#define IA32_L3_QOS_CFG0xc81
>  #define IA32_L3_CBM_BASE   0xc90
>
>  /**
> @@ -72,7 +90,7 @@ extern struct mutex rdtgroup_mutex;
>
>  int __init rdtgroup_init(void);

This statement appears about twenty lines above as well.


> diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c 
> b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> new file mode 100644
> index 000..c99d3a0
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> @@ -0,0 +1,236 @@
> +/*
> + * User interface for Resource Alloction in Resource Director Technology(RDT)
> + *
> + * Copyright (C) 2016 Intel Corporation
> + *
> + * Author: Fenghua Yu 
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * More information about RDT be found in the Intel (R) x86 Architecture
> + * Software Developer Manual.
> + */
> +
> +#define pr_fmt(fmt)KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#include 
> +
> +DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
> +struct kernfs_root *rdt_root;
> +struct rdtgroup rdtgroup_default;
> +LIST_HEAD(rdt_all_groups);
> +
> +static void l3_qos_cfg_update(void *arg)
> +{
> +   struct rdt_resource *r = arg;
> +
> +   wrmsrl(IA32_L3_QOS_CFG, r->cdp_enabled);
> +}
> +
> +static void set_l3_qos_cfg(struct rdt_resource *r)
> +{
> +   struct list_head *l;
> +   struct rdt_domain *d;
> +   struct cpumask cpu_mask;
> +
> +   cpumask_clear(&cpu_mask);
> +   list_for_each(l, &r->domains) {
> +   d = list_entry(l, struct rdt_domain, list);
> +   cpumask_set_cpu(cpumask_any(&d->cpu_mask), &cpu_mask);
> +   }
> +   smp_call_function_many(&cpu_mask, l3_qos_cfg_update, r, 1);
> +}
> +
> +static int parse_rdtgroupfs_options(char *data, struct rdt_resource *r)
> +{
> +   char *token, *o = data;
> +
> +   while ((token = strsep(&o, ",")) != NULL) {
> +   if (!*token)
> +   return -EINVAL;
> +
> +   if (!strcmp(token, "cdp"))
> +   if (r->enabled && r->cdp_capable)
> +   r->cdp_enabled = true;
> +   }
> +
> +   return 0;
> +}
> +
> +static struct dentry *rdt_mount(struct file_system_type *fs_type,
> +   int flags, const char *unused_dev_name,
> +   void *data)
> +{
> +   struct dentry *dentry;
> +   int ret;
> +   bool new_sb;
> +   struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
> +
> +   mutex_lock(&rdtgroup_mutex);
> +   /*
> +* resctrl file system can only be mounted once.
> +*/
> +   if (static_branch_unlikely(&rdt_enable_key)) {
> +   dentry = ERR_PTR(-EBUSY);
> +   goto out;
> +   }
> +
> +   r->cdp_enabled = false;
> +   ret = parse_rdtgroupfs_options(data, r);
> +   if (ret) {
> +   dentry = ERR_PTR(ret);
> +   goto out;
> +   }
> +   if (r->cdp_enabled)
> +   r->num_closid = r->max_closid / 2;
> +   else
> +   r->num_closid = r->max_closid;
> +
> +   /* Recompute rdt_max_closid because CDP may have changed things. */
> +   rdt_max_closid = 0;
> +   for_each_rdt_resource(r)
> +   rdt_max_closid = max(rdt_max_closid, r->num_closid);
> +   

Re: [PATCH v3 10/18] x86/intel_rdt: Build structures for each resource based on cache topology

2016-10-09 Thread Nilay Vaish
On 7 October 2016 at 21:45, Fenghua Yu  wrote:
> From: Tony Luck 
>
> diff --git a/arch/x86/include/asm/intel_rdt.h 
> b/arch/x86/include/asm/intel_rdt.h
> index 251ac2a..bad8dc7 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -40,6 +40,39 @@ struct rdt_resource {
> if (r->enabled)
>
>  #define IA32_L3_CBM_BASE   0xc90
> +
> +/**
> + * struct rdt_domain - group of cpus sharing an RDT resource
> + * @list:  all instances of this resource
> + * @id:unique id for this instance
> + * @cpu_mask:  which cpus share this resource
> + * @cbm:   array of cache bit masks (indexed by CLOSID)
> + */
> +struct rdt_domain {
> +   struct list_headlist;
> +   int id;
> +   struct cpumask  cpu_mask;
> +   u32 *cbm;
> +};
> +
> +/**
> + * struct msr_param - set a range of MSRs from a domain
> + * @res:   The resource to use
> + * @low:   Beginning index from base MSR
> + * @high:  End index
> + */
> +struct msr_param {
> +   struct rdt_resource *res;
> +   int low;
> +   int high;
> +};
> +
> +extern struct static_key_false rdt_enable_key;
> +extern struct mutex rdtgroup_mutex;
> +
> +int __init rdtgroup_init(void);
> +
> +extern struct rdtgroup *rdtgroup_default;

struct rdtgroup has not been defined yet.  The statement above should
be part of the next patch.

--
Nilay


Re: [PATCH v3 02/18] cacheinfo: Introduce cache id

2016-10-08 Thread Nilay Vaish
On 7 October 2016 at 21:45, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> index e9fd32e..2a21c15 100644
> --- a/drivers/base/cacheinfo.c
> +++ b/drivers/base/cacheinfo.c
> @@ -233,6 +233,7 @@ static ssize_t file_name##_show(struct device *dev,   
>   \
> return sprintf(buf, "%u\n", this_leaf->object); \
>  }
>
> +show_one(id, id);
>  show_one(level, level);
>  show_one(coherency_line_size, coherency_line_size);
>  show_one(number_of_sets, number_of_sets);
> @@ -314,6 +315,7 @@ static ssize_t write_policy_show(struct device *dev,
> return n;
>  }
>
> +static DEVICE_ATTR_RO(id);
>  static DEVICE_ATTR_RO(level);
>  static DEVICE_ATTR_RO(type);
>  static DEVICE_ATTR_RO(coherency_line_size);
> @@ -327,6 +329,7 @@ static DEVICE_ATTR_RO(shared_cpu_list);
>  static DEVICE_ATTR_RO(physical_line_partition);
>
>  static struct attribute *cache_default_attrs[] = {
> +   &dev_attr_id.attr,
> &dev_attr_type.attr,
> &dev_attr_level.attr,
> &dev_attr_shared_cpu_map.attr,
> @@ -350,6 +353,8 @@ cache_default_attrs_is_visible(struct kobject *kobj,
> const struct cpumask *mask = &this_leaf->shared_cpu_map;
> umode_t mode = attr->mode;
>
> +   if ((attr == &dev_attr_id.attr) && this_leaf->attributes & CACHE_ID)

Can you put parentheses around 'this_leaf->attributes & CACHE_ID'?
Whenever I look at expressions using bitwise operators without
parentheses, I have to lookup the precedence table to make sure
everything is fine.
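
For instance (a made-up snippet, not from this patch), the classic pitfall
that such parentheses guard against:

	/* '==' binds tighter than '&', so this tests
	 * attributes & (CACHE_ID == 0), which is rarely what was meant */
	if (this_leaf->attributes & CACHE_ID == 0)

	/* versus the intended */
	if ((this_leaf->attributes & CACHE_ID) == 0)

The expression in your patch happens to parse as intended because '&' binds
tighter than '&&', but the parentheses save the reader from checking that.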

--
Nilay


Re: [PATCH v3 05/18] Documentation, x86: Documentation for Intel resource allocation user interface

2016-10-08 Thread Nilay Vaish
On 7 October 2016 at 21:45, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> +L3 details (code and data prioritization disabled)
> +--
> +With CDP disabled the L3 schemata format is:
> +
> +   L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
> +
> +L3 details (CDP enabled via mount option to resctrl)
> +
> +When CDP is enabled, you need to specify separate cache bit masks for
> +code and data access. The generic format is:
> +
> +   L3:<cache_id0>=<d_cbm>,<i_cbm>;<cache_id1>=<d_cbm>,<i_cbm>;...

Can we drop L3 here and instead say:
L<level>:<cache_id0>=<d_cbm>,<i_cbm>;<cache_id1>=<d_cbm>,<i_cbm>;...

and similarly for without CDP as well.

--
Nilay


Re: [PATCH v3 04/18] x86/intel_rdt: Feature discovery

2016-10-08 Thread Nilay Vaish
On 7 October 2016 at 21:45, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> Check CPUID leaves for all the Resource Director Technology (RDT)
> Cache Allocation Technology (CAT) bits.
>
> Prescence of allocation features:

Presence

> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 92a8308..64dd8274 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -196,6 +196,10 @@
>
>  #define X86_FEATURE_INTEL_PT   ( 7*32+15) /* Intel Processor Trace */
>
> +#define X86_FEATURE_CAT_L3 ( 7*32+16) /* Cache Allocation Technology L3 
> */
> +#define X86_FEATURE_CAT_L2 ( 7*32+17) /* Cache Allocation Technology L2 
> */
> +#define X86_FEATURE_CDP_L3 ( 7*32+18) /* Code and Data Prioritization L3 
> */
> +
>  /* Virtualization flags: Linux defined, word 8 */
>  #define X86_FEATURE_TPR_SHADOW  ( 8*32+ 0) /* Intel TPR Shadow */
>  #define X86_FEATURE_VNMI( 8*32+ 1) /* Intel Virtual NMI */
> @@ -220,6 +224,7 @@
>  #define X86_FEATURE_RTM( 9*32+11) /* Restricted 
> Transactional Memory */
>  #define X86_FEATURE_CQM( 9*32+12) /* Cache QoS Monitoring */
>  #define X86_FEATURE_MPX( 9*32+14) /* Memory Protection 
> Extension */
> +#define X86_FEATURE_RDT_A  ( 9*32+15) /* Resource Director Technology 
> Allocation */
>  #define X86_FEATURE_AVX512F( 9*32+16) /* AVX-512 Foundation */
>  #define X86_FEATURE_AVX512DQ   ( 9*32+17) /* AVX-512 DQ (Double/Quad 
> granular) Instructions */
>  #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */

I think these #defines are specific to Intel.  I would prefer if we
have _INTEL_ somewhere in them.

--
Nilay


Re: [PATCH v3 01/18] Documentation, ABI: Add a document entry for cache id

2016-10-08 Thread Nilay Vaish
On 7 October 2016 at 21:45, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> Add an ABI document entry for /sys/devices/system/cpu/cpu*/cache/index*/id.
>
> Signed-off-by: Fenghua Yu 
> Signed-off-by: Tony Luck 
> ---
>  Documentation/ABI/testing/sysfs-devices-system-cpu | 16 
>  1 file changed, 16 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
> b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index 4987417..b1c3d69 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -272,6 +272,22 @@ Description:   Parameters for the CPU cache 
> attributes
>  the modified cache line is written to 
> main
>  memory only when it is replaced
>
> +
> +What:  /sys/devices/system/cpu/cpu*/cache/index*/id
> +Date:  September 2016
> +Contact:   Linux kernel mailing list 
> +Description:   Cache id
> +
> +   The id provides a unique name for a specific instance of
> +   a cache of a particular type. E.g. there may be a level
> +   3 unified cache on each socket in a server and we may
> +   assign them ids 0, 1, 2, ...
> +
> +   Note that id value may not be contiguous. E.g. level 1
> +   caches typically exist per core, but there may not be a
> +   power of two cores on a socket, so these caches may be
> +   numbered 0, 1, 2, 3, 4, 5, 8, 9, 10, ...
> +

While it is ok that the caches are not numbered contiguously, it is
unclear how this is related to number of cores on a socket being a
power of 2 or not.

--
Nilay


Re: [PATCH v5 1/9] sched: Extend scheduler's asym packing

2016-10-01 Thread Nilay Vaish
On 1 October 2016 at 06:45, Srinivas Pandruvada
 wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e86c4a5..08135ca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6237,7 +6237,25 @@ static void init_sched_groups_capacity(int cpu, struct 
> sched_domain *sd)
> WARN_ON(!sg);
>
> do {
> +   int cpu, max_cpu = -1, prev_cpu = -1;
> +
> sg->group_weight = cpumask_weight(sched_group_cpus(sg));
> +
> +   if (!(sd->flags & SD_ASYM_PACKING))
> +   goto next;
> +
> +   for_each_cpu(cpu, sched_group_cpus(sg)) {
> +   if (prev_cpu < 0) {
> +   prev_cpu = cpu;
> +   max_cpu = cpu;

It seems that you can drop prev_cpu and put the check on max_cpu instead.

--
Nilay


Re: [PATCH 05/61] perf tools: Introduce c2c_decode_stats function

2016-09-19 Thread Nilay Vaish
On 19 September 2016 at 08:09, Jiri Olsa  wrote:
> diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
> index 7f69bf9d789d..27c6bb5abafb 100644
> --- a/tools/perf/util/mem-events.h
> +++ b/tools/perf/util/mem-events.h
> @@ -2,6 +2,10 @@
>  #define __PERF_MEM_EVENTS_H
>
>  #include 
> +#include 
> +#include 
> +#include 
> +#include "stat.h"
>
>  struct perf_mem_event {
> boolrecord;
> @@ -33,4 +37,36 @@ int perf_mem__lck_scnprintf(char *out, size_t sz, struct 
> mem_info *mem_info);
>
>  int perf_script__meminfo_scnprintf(char *bf, size_t size, struct mem_info 
> *mem_info);
>
> +struct c2c_stats {
> +   int nr_entries;
> +
> +   int locks;   /* count of 'lock' transactions */
> +   int store;   /* count of all stores in trace */
> +   int st_uncache;  /* stores to uncacheable address */
> +   int st_noadrs;   /* cacheable store with no address */

No address! Why would that happen?


--
Nilay


Re: [PATCH v3 03/15] lockdep: Refactor lookup_chain_cache()

2016-09-19 Thread Nilay Vaish
On 18 September 2016 at 22:05, Byungchul Park  wrote:
> On Thu, Sep 15, 2016 at 10:33:46AM -0500, Nilay Vaish wrote:
>> On 13 September 2016 at 04:45, Byungchul Park  wrote:
>> > @@ -2215,6 +2178,75 @@ cache_hit:
>> > return 1;
>> >  }
>> >
>> > +/*
>> > + * Look up a dependency chain.
>> > + */
>> > +static inline struct lock_chain *lookup_chain_cache(u64 chain_key)
>> > +{
>> > +   struct hlist_head *hash_head = chainhashentry(chain_key);
>> > +   struct lock_chain *chain;
>> > +
>> > +   /*
>> > +* We can walk it lock-free, because entries only get added
>> > +* to the hash:
>> > +*/
>> > +   hlist_for_each_entry_rcu(chain, hash_head, entry) {
>> > +   if (chain->chain_key == chain_key) {
>> > +   debug_atomic_inc(chain_lookup_hits);
>> > +   return chain;
>> > +   }
>> > +   }
>> > +   return NULL;
>> > +}
>>
>> Byungchul,  do you think we should increment chain_lookup_misses
>> before returning NULL from the above function?
>
> Hello,
>
> No, I don't think so.
> It will be done in add_chain_cache().
>

I think you are assuming that a call to lookup will always be followed
by add.  I thought the point of breaking the original function into
two was that each of the functions can be used individually, without
the other being called.  This means we would not increment the number
of misses when only lookup() gets called, but not add().  Or we would
increment the number of misses when only add() is called and not
lookup().

It really seems odd to me that hits get incremented in lookup and misses don't.
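
Concretely, what I have in mind is only this (a sketch on top of your quoted
hunk, assuming add_chain_cache() would then stop counting misses itself):

static inline struct lock_chain *lookup_chain_cache(u64 chain_key)
{
	struct hlist_head *hash_head = chainhashentry(chain_key);
	struct lock_chain *chain;

	hlist_for_each_entry_rcu(chain, hash_head, entry) {
		if (chain->chain_key == chain_key) {
			debug_atomic_inc(chain_lookup_hits);
			return chain;
		}
	}
	/* count the miss here, so a bare lookup keeps the stats consistent */
	debug_atomic_inc(chain_lookup_misses);
	return NULL;
}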

--
Nilay


Re: [PATCH v3 15/15] lockdep: Crossrelease feature documentation

2016-09-16 Thread Nilay Vaish
On 13 September 2016 at 04:45, Byungchul Park  wrote:
> This document describes the concept of crossrelease feature, which
> generalizes what causes a deadlock and how can detect a deadlock.
>
> Signed-off-by: Byungchul Park 
> ---
>  Documentation/locking/crossrelease.txt | 785 
> +
>  1 file changed, 785 insertions(+)
>  create mode 100644 Documentation/locking/crossrelease.txt
>
> diff --git a/Documentation/locking/crossrelease.txt 
> b/Documentation/locking/crossrelease.txt
> new file mode 100644
> index 000..78558af
> --- /dev/null
> +++ b/Documentation/locking/crossrelease.txt
> @@ -0,0 +1,785 @@
> +Crossrelease
> +
> +
> +Started by Byungchul Park 
> +
> +Contents:
> +
> + (*) Background.
> +
> + - What causes deadlock.
> + - What lockdep detects.
> + - How lockdep works.
> +
> + (*) Limitation.
> +
> + - Limit to typical lock.
> + - Pros from the limitation.
> + - Cons from the limitation.
> +
> + (*) Generalization.
> +
> + - Relax the limitation.
> +
> + (*) Crossrelease.
> +
> + - Introduce crossrelease.
> + - Introduce commit.
> +
> + (*) Implementation.
> +
> + - Data structures.
> + - How crossrelease works.
> +
> + (*) Optimizations.
> +
> + - Avoid duplication.
> + - Avoid lock contention.
> +
> +
> +==
> +Background
> +==
> +
> +What causes deadlock
> +
> +
> +A deadlock occurs when a context is waiting for an event to be issued
> +which cannot be issued because the context or another context who can
> +issue the event is also waiting for an event to be issued which cannot
> +be issued.

I think 'some event happened' and 'context triggered an event' is
better than 'some event issued' or 'context issued an event'.  I think
'happen' and 'trigger' are more widely used words when we talk about
events.  For example, I would prefer the following version of the
above:

A deadlock occurs when a context is waiting for an event to happen,
which cannot happen because the context which can trigger the event is
also waiting for an event to happen which cannot happen either.

> +Single context or more than one context both waiting for an
> +event and issuing an event may paricipate in a deadlock.

I am not able to make sense of the line above.

> +
> +For example,
> +
> +A context who can issue event D is waiting for event A to be issued.
> +A context who can issue event A is waiting for event B to be issued.
> +A context who can issue event B is waiting for event C to be issued.
> +A context who can issue event C is waiting for event D to be issued.
> +
> +A deadlock occurs when these four operations are run at a time because
> +event D cannot be issued if event A isn't issued which in turn cannot be
> +issued if event B isn't issued which in turn cannot be issued if event C
> +isn't issued which in turn cannot be issued if event D isn't issued. No
> +event can be issued since any of them never meets its precondition.
> +
> +We can easily recognize that each wait operation creates a dependency
> +between two issuings e.g. between issuing D and issuing A like, 'event D
> +cannot be issued if event A isn't issued', in other words, 'issuing
> +event D depends on issuing event A'. So the whole example can be
> +rewritten in terms of dependency,
> +
> +Do an operation making 'event D cannot be issued if event A isn't issued'.
> +Do an operation making 'event A cannot be issued if event B isn't issued'.
> +Do an operation making 'event B cannot be issued if event C isn't issued'.
> +Do an operation making 'event C cannot be issued if event D isn't issued'.
> +
> +or,

I think we can remove the text above.  The example only needs to be
provided once.

> +
> +Do an operation making 'issuing event D depends on issuing event A'.
> +Do an operation making 'issuing event A depends on issuing event B'.
> +Do an operation making 'issuing event B depends on issuing event C'.
> +Do an operation making 'issuing event C depends on issuing event D'.
> +
> +What causes a deadlock is a set of dependencies a chain of which forms a
> +cycle, which means that issuing event D depending on issuing event A
> +depending on issuing event B depending on issuing event C depending on
> +issuing event D, finally depends on issuing event D itself, which means
> +no event can be issued.
> +
> +Any set of operations creating dependencies causes a deadlock. The set
> +of lock operations e.g. acquire and release is an example. Waiting for a
> +lock to be released corresponds to waiting for an event and releasing a
> +lock corresponds to issuing an event. So the description of dependency
> +above can be altered to one in terms of lock.
> +
> +In terms of event, issuing event A depends on issuing event B if,
> +
> +   Event A cannot be issued if event B isn't issued.
> +
> +In terms of lock, releasing lock A depends on releasing lock B if,
> +
> +   Lock A cannot be released if lock B isn't released.

Re: [PATCH v3 15/15] lockdep: Crossrelease feature documentation

2016-09-15 Thread Nilay Vaish
On 13 September 2016 at 04:45, Byungchul Park  wrote:
> This document describes the concept of crossrelease feature, which
> generalizes what causes a deadlock and how can detect a deadlock.
>
> Signed-off-by: Byungchul Park 
> ---
>  Documentation/locking/crossrelease.txt | 785 
> +
>  1 file changed, 785 insertions(+)
>  create mode 100644 Documentation/locking/crossrelease.txt

Byungchul, I mostly skimmed through the document.  I suggest that we
split this document.  The initial 1/4 of the document talks about
lockdep's current implementation, which I believe should be combined
with the file Documentation/locking/lockdep-design.txt.  Tomorrow I
will try to understand the document in detail and hopefully provide
some useful comments.

--
Nilay


Re: [PATCH v3 03/15] lockdep: Refactor lookup_chain_cache()

2016-09-15 Thread Nilay Vaish
On 13 September 2016 at 04:45, Byungchul Park  wrote:
> @@ -2215,6 +2178,75 @@ cache_hit:
> return 1;
>  }
>
> +/*
> + * Look up a dependency chain.
> + */
> +static inline struct lock_chain *lookup_chain_cache(u64 chain_key)
> +{
> +   struct hlist_head *hash_head = chainhashentry(chain_key);
> +   struct lock_chain *chain;
> +
> +   /*
> +* We can walk it lock-free, because entries only get added
> +* to the hash:
> +*/
> +   hlist_for_each_entry_rcu(chain, hash_head, entry) {
> +   if (chain->chain_key == chain_key) {
> +   debug_atomic_inc(chain_lookup_hits);
> +   return chain;
> +   }
> +   }
> +   return NULL;
> +}

Byungchul,  do you think we should increment chain_lookup_misses
before returning NULL from the above function?

--
Nilay


Re: [PATCH v2 11/33] x86/intel_rdt: Hot cpu support for Cache Allocation

2016-09-13 Thread Nilay Vaish
On 8 September 2016 at 04:57, Fenghua Yu  wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 9f30492..4537658 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -141,6 +145,80 @@ static inline bool rdt_cpumask_update(int cpu)
> return false;
>  }
>
> +/*
> + * cbm_update_msrs() - Updates all the existing IA32_L3_MASK_n MSRs
> + * which are one per CLOSid on the current package.
> + */
> +static void cbm_update_msrs(void *dummy)
> +{
> +   int maxid = boot_cpu_data.x86_cache_max_closid;
> +   struct rdt_remote_data info;
> +   unsigned int i;
> +
> +   for (i = 0; i < maxid; i++) {
> +   if (cctable[i].clos_refcnt) {
> +   info.msr = CBM_FROM_INDEX(i);
> +   info.val = cctable[i].cbm;
> +   msr_cpu_update(&info);
> +   }
> +   }
> +}
> +
> +static int intel_rdt_online_cpu(unsigned int cpu)
> +{
> +   struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> +
> +   state->closid = 0;
> +   mutex_lock(&rdtgroup_mutex);
> +   /* The cpu is set in root rdtgroup after online. */
> +   cpumask_set_cpu(cpu, &root_rdtgrp->cpu_mask);
> +   per_cpu(cpu_rdtgroup, cpu) = root_rdtgrp;
> +   /*
> +* If the cpu is first time found and set in its siblings that
> +* share the same cache, update the CBM MSRs for the cache.
> +*/

I am finding it slightly hard to parse the comment above.  Does the
following sound better:  If the cpu is the first one found and set
amongst its siblings that ...

> +   if (rdt_cpumask_update(cpu))
> +   smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
> +   mutex_unlock(&rdtgroup_mutex);
> +}
> +
> +static int clear_rdtgroup_cpumask(unsigned int cpu)
> +{
> +   struct list_head *l;
> +   struct rdtgroup *r;
> +
> +   list_for_each(l, &rdtgroup_lists) {
> +   r = list_entry(l, struct rdtgroup, rdtgroup_list);
> +   if (cpumask_test_cpu(cpu, &r->cpu_mask)) {
> +   cpumask_clear_cpu(cpu, &r->cpu_mask);
> +   return 0;
> +   }
> +   }
> +
> +   return -EINVAL;
> +}
> +
> +static int intel_rdt_offline_cpu(unsigned int cpu)
> +{
> +   int i;
> +
> +   mutex_lock(&rdtgroup_mutex);
> +   if (!cpumask_test_and_clear_cpu(cpu, &rdt_cpumask)) {
> +   mutex_unlock(&rdtgroup_mutex);
> +   return;
> +   }
> +
> +   cpumask_and(&tmp_cpumask, topology_core_cpumask(cpu), 
> cpu_online_mask);
> +   cpumask_clear_cpu(cpu, &tmp_cpumask);
> +   i = cpumask_any(&tmp_cpumask);
> +
> +   if (i < nr_cpu_ids)
> +   cpumask_set_cpu(i, &rdt_cpumask);
> +
> +   clear_rdtgroup_cpumask(cpu);
> +   mutex_unlock(&rdtgroup_mutex);
> +}
> +

Just for my info, why do we not need to update the MSRs when a cpu goes offline?



Thanks
Nilay


Re: [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT

2016-09-13 Thread Nilay Vaish
On 8 September 2016 at 04:57, Fenghua Yu  wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 9cf3a7d..9f30492 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -21,6 +21,8 @@
>   */
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>
>  /*
> @@ -41,12 +43,28 @@ static cpumask_t rdt_cpumask;
>   */
>  static cpumask_t tmp_cpumask;
>  static DEFINE_MUTEX(rdtgroup_mutex);
> +struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;

I had pointed this out for the previous version of the patch as well.  You
should use DEFINE_STATIC_KEY_FALSE(rdt_enable_key), instead of what
you have here.
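
That is, something along these lines (sketch only):

	/* declares the key and gives it the proper "false" initializer */
	DEFINE_STATIC_KEY_FALSE(rdt_enable_key);

	/* readers then test it through the static_branch_*() helpers, e.g. */
	if (static_branch_unlikely(&rdt_enable_key)) {
		/* per context-switch RDT work */
	}

which avoids open-coding STATIC_KEY_INIT_FALSE.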

--
Nilay


Re: [PATCH v2 09/33] x86/intel_rdt: Add L3 cache capacity bitmask management

2016-09-12 Thread Nilay Vaish
On 8 September 2016 at 04:57, Fenghua Yu  wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index b25940a..9cf3a7d 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -31,8 +31,22 @@ static struct clos_cbm_table *cctable;
>   * closid availability bit map.
>   */
>  unsigned long *closmap;
> +/*
> + * Mask of CPUs for writing CBM values. We only need one CPU per-socket.

Does the second line make sense here?

> + */
> +static cpumask_t rdt_cpumask;
> +/*
> + * Temporary cpumask used during hot cpu notificaiton handling. The usage
> + * is serialized by hot cpu locks.

s/notificaiton/notification


--
Nilay


Re: [PATCH v2 08/33] x86/intel_rdt: Add Class of service management

2016-09-12 Thread Nilay Vaish
On 8 September 2016 at 04:57, Fenghua Yu  wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index fcd0642..b25940a 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -21,17 +21,94 @@
>   */
>  #include 
>  #include 
> +#include 
> +
> +/*
> + * cctable maintains 1:1 mapping between CLOSid and cache bitmask.
> + */
> +static struct clos_cbm_table *cctable;
> +/*
> + * closid availability bit map.
> + */
> +unsigned long *closmap;
> +static DEFINE_MUTEX(rdtgroup_mutex);
> +
> +static inline void closid_get(u32 closid)
> +{
> +   struct clos_cbm_table *cct = &cctable[closid];
> +
> +   lockdep_assert_held(&rdtgroup_mutex);
> +
> +   cct->clos_refcnt++;
> +}
> +
> +static int closid_alloc(u32 *closid)
> +{
> +   u32 maxid;
> +   u32 id;
> +
> +   lockdep_assert_held(&rdtgroup_mutex);
> +
> +   maxid = boot_cpu_data.x86_cache_max_closid;
> +   id = find_first_zero_bit(closmap, maxid);
> +   if (id == maxid)
> +   return -ENOSPC;
> +
> +   set_bit(id, closmap);
> +   closid_get(id);
> +   *closid = id;
> +
> +   return 0;
> +}
> +
> +static inline void closid_free(u32 closid)
> +{
> +   clear_bit(closid, closmap);
> +   cctable[closid].cbm = 0;
> +}
> +
> +static void closid_put(u32 closid)
> +{
> +   struct clos_cbm_table *cct = &cctable[closid];
> +
> +   lockdep_assert_held(&rdtgroup_mutex);
> +   if (WARN_ON(!cct->clos_refcnt))
> +   return;
> +
> +   if (!--cct->clos_refcnt)
> +   closid_free(closid);
> +}
>

I would suggest a small change in the function names here.  I think
the names closid_free and closid_put should be swapped. As I
understand, under the current naming scheme, the opposite of
closid_alloc() is closid_put() and the opposite of closid_get() is
closid_free().  This is not the normal way these names are paired.
Typically, alloc and free go as a pair and get and put go as a pair.
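
In other words, the pairing I would expect (prototypes only, to show the
intent of the renaming):

	static int  closid_alloc(u32 *closid);	/* find a free id, refcount = 1 */
	static void closid_free(u32 closid);	/* release the id itself */
	static void closid_get(u32 closid);	/* increment the refcount */
	static void closid_put(u32 closid);	/* decrement, free at zero */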

--
Nilay


Re: [PATCH v2 02/33] Documentation, ABI: Add a document entry for cache id

2016-09-09 Thread Nilay Vaish
On 8 September 2016 at 14:33, Thomas Gleixner  wrote:
> On Thu, 8 Sep 2016, Fenghua Yu wrote:
>> +What:/sys/devices/system/cpu/cpu*/cache/index*/id
>> +Date:July 2016
>> +Contact: Linux kernel mailing list 
>> +Description: Cache id
>> +
>> + The id identifies a hardware cache of the system within a given
>> + cache index in a set of cache indices. The "index" name is
>> + simply a nomenclature from CPUID's leaf 4 which enumerates all
>> + caches on the system by referring to each one as a cache index.
>> + The (cache index, cache id) pair is unique for the whole
>> + system.
>> +
>> + Currently id is implemented on x86. On other platforms, id is
>> + not enabled yet.
>
> And it never will be available on anything else than x86 because there is
> no other architecture providing CPUID leaf 4 
>
> If you want this to be generic then get rid of the x86isms in the
> explanation and describe the x86 specific part seperately.
>

I second tglx here.  Also, I think this patch should be renumbered to
be patch 01/33.  The current 01/33 patch should be 02/33.  That way we
get to read about the definition of cache id first and its use comes
later.

--
Nilay


Re: [PATCH v2 01/33] cacheinfo: Introduce cache id

2016-09-09 Thread Nilay Vaish
On 8 September 2016 at 04:56, Fenghua Yu  wrote:
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index 2189935..cf6984d 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -18,6 +18,7 @@ enum cache_type {
>
>  /**
>   * struct cacheinfo - represent a cache leaf node
> + * @id: This cache's id. ID is unique in the same index on the platform.

Can you explain what 'index on the platform' means?  I have absolutely
no idea what that is.

--
Nilay


Re: [PATCH v2 1/1] x86, relocs: add function attributes to die()

2016-09-05 Thread Nilay Vaish
  die("Seek to %"PRIuELF" failed: %s\n",
> sec->shdr.sh_offset, strerror(errno));
> }
> if (fread(sec->symtab, 1, sec->shdr.sh_size, fp)
> @@ -489,11 +489,11 @@ static void read_relocs(FILE *fp)
> }
> sec->reltab = malloc(sec->shdr.sh_size);
> if (!sec->reltab) {
> -   die("malloc of %d bytes for relocs failed\n",
> +   die("malloc of %"PRIuELF" bytes for relocs failed\n",
> sec->shdr.sh_size);
> }
> if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
> -   die("Seek to %d failed: %s\n",
> +   die("Seek to %"PRIuELF" failed: %s\n",
> sec->shdr.sh_offset, strerror(errno));
> }
> if (fread(sec->reltab, 1, sec->shdr.sh_size, fp)
> diff --git a/arch/x86/tools/relocs.h b/arch/x86/tools/relocs.h
> index f59590645b68..6b7969719833 100644
> --- a/arch/x86/tools/relocs.h
> +++ b/arch/x86/tools/relocs.h
> @@ -16,7 +16,8 @@
>  #include 
>  #include 
>
> -void die(char *fmt, ...);
> +__attribute__((format(printf, 1, 2))) __attribute__((noreturn))
> +void die(const char *fmt, ...);
>
>  #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
>
> diff --git a/arch/x86/tools/relocs_32.c b/arch/x86/tools/relocs_32.c
> index b2ade2bb4162..8024ec473e6a 100644
> --- a/arch/x86/tools/relocs_32.c
> +++ b/arch/x86/tools/relocs_32.c
> @@ -14,4 +14,7 @@
>  #define ELF_ST_BIND(o) ELF32_ST_BIND(o)
>  #define ELF_ST_VISIBILITY(o)   ELF32_ST_VISIBILITY(o)
>
> +/* printf format for Elf32_Off */
> +#define PRIuELF PRIu32
> +
>  #include "relocs.c"
> diff --git a/arch/x86/tools/relocs_64.c b/arch/x86/tools/relocs_64.c
> index 56b61b743c4c..2cf4de5c9d99 100644
> --- a/arch/x86/tools/relocs_64.c
> +++ b/arch/x86/tools/relocs_64.c
> @@ -14,4 +14,7 @@
>  #define ELF_ST_BIND(o)  ELF64_ST_BIND(o)
>  #define ELF_ST_VISIBILITY(o)ELF64_ST_VISIBILITY(o)
>
> +/* printf format for Elf64_Off */
> +#define PRIuELF PRIu64
> +
>  #include "relocs.c"
> diff --git a/arch/x86/tools/relocs_common.c b/arch/x86/tools/relocs_common.c
> index acab636bcb34..30adb44eff79 100644
> --- a/arch/x86/tools/relocs_common.c
> +++ b/arch/x86/tools/relocs_common.c
> @@ -1,6 +1,6 @@
>  #include "relocs.h"
>
> -void die(char *fmt, ...)
> +void die(const char *fmt, ...)
>  {
> va_list ap;
> va_start(ap, fmt);
> --
> 2.9.3
>

I have to admit that this is the first time I am looking at these
format specifier macros from inttypes.h.  I looked at that file.  Your
usage seems correct to me and should achieve the desired outcome.
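
For anyone else seeing these macros for the first time: they expand to
string literals, and adjacent string literals are concatenated by the
compiler, so a call from the patch such as

	die("Seek to %"PRIuELF" failed: %s\n",
	    sec->shdr.sh_offset, strerror(errno));

ends up with a single format string whose length modifier matches Elf32_Off
or Elf64_Off, depending on which of relocs_32.c or relocs_64.c included the
file.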

Reviewed-by: Nilay Vaish 


Re: [PATCH 0/3] perf report: Recognize hugetlb mapping as anon mapping

2016-09-03 Thread Nilay Vaish
On 2 September 2016 at 08:59, Wang Nan  wrote:
> The requirement of this function is first proposed at 2015.
> Please refer to
>
> http://lkml.iu.edu/hypermail/linux/kernel/1506.2/02372.html
> http://lkml.iu.edu/hypermail/linux/kernel/1506.3/02290.html
> http://lkml.iu.edu/hypermail/linux/kernel/1506.3/03512.html
>
> For systems which use hugetlbfs, if a sample is captured inside
> hugetlbfs, perf should not resolve symbols from the file shown in
> /proc/<pid>/mmap, because 'files' in hugetlbfs are constructed
> during execution and not ELF files. If user knows positions of
> symbols, he/she should provide /tmp/perf-<pid>.map file.
>
> This 3 patches makes perf recognize hugetlbfs mapping as anon mapping.
> Before this 3 patches user has no chance to use his/her own .map file
> to resolve symbols because perf tries to use hugetlbfs file.
>
> Wang Nan (3):
>   perf tools: Recognize hugetlb mapping as anon mapping
>   tools lib api fs: Add hugetlbfs filesystem detector
>   perf record: Mark MAP_HUGETLB during synthesizing mmap events
>
>  tools/lib/api/fs/fs.c   | 15 +++
>  tools/lib/api/fs/fs.h   |  1 +
>  tools/perf/util/event.c |  7 +++
>  tools/perf/util/map.c   |  8 +---
>  4 files changed, 28 insertions(+), 3 deletions(-)
>
> Signed-off-by: Wang Nan 
> Cc: Hou Pengyang 
> Cc: He Kuang 
> Cc: Arnaldo Carvalho de Melo 


Reviewed-by: Nilay Vaish 


Re: [PATCH 1/3] perf tools: Recognize hugetlb mapping as anon mapping

2016-09-03 Thread Nilay Vaish
On 2 September 2016 at 08:59, Wang Nan  wrote:
> Hugetlbfs mapping should be recognized as anon mapping so user has
a chance to create /tmp/perf-<pid>.map file for symbol resolving. This
> patch utilizes MAP_HUGETLB to identify hugetlb mapping.
>
> After this patch, if perf is started before the program uses huge pages
> starting, perf is able to recognize hugetlb mapping as anon mapping.

Wang, everything in the patch series looks good to me.  Just a slight
change that I would suggest for the summary above:  replace "program
uses huge pages starting" with "program starts using huge pages".

--
Nilay


Re: [PATCH 1/1] x86, relocs: add function attributes to die()

2016-09-03 Thread Nilay Vaish
On 3 September 2016 at 09:50, Nicolas Iooss  wrote:
>
> arch/x86/tools/relocs.c:460:5: error: format specifies type 'int'
> but the argument has type 'Elf64_Xword' (aka 'unsigned long')
> [-Werror,-Wformat]
> sec->shdr.sh_size);
> ^
> arch/x86/tools/relocs.c:464:5: error: format specifies type 'int'
> but the argument has type 'Elf64_Off' (aka 'unsigned long')
> [-Werror,-Wformat]
> sec->shdr.sh_offset, strerror(errno));
> ^~~
>
> To support both 32-bit and 64-bit modes, add casts to long types and use
> %lu and %ld to format the numbers.
>

Nicolas,  should not just changing the format specifiers fix the
problem?  How do those type casts help?

--
Nilay


Re: [PATCH 04/13] perf/core: Extend perf_output_sample_regs() to include perf_arch_regs

2016-08-30 Thread Nilay Vaish
On 28 August 2016 at 16:00, Madhavan Srinivasan
 wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 274288819829..e16bf4d057d1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5371,16 +5371,24 @@ u64 __attribute__((weak)) perf_arch_reg_value(struct 
> perf_arch_regs *regs,
>
>  static void
>  perf_output_sample_regs(struct perf_output_handle *handle,
> -   struct pt_regs *regs, u64 mask)
> +   struct perf_regs *regs, u64 mask)
>  {
> int bit;
> DECLARE_BITMAP(_mask, 64);
> +   u64 arch_regs_mask = regs->arch_regs_mask;
>
> bitmap_from_u64(_mask, mask);
> for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) {
> u64 val;
>
> -   val = perf_reg_value(regs, bit);
> +   val = perf_reg_value(regs->regs, bit);
> +   perf_output_put(handle, val);
> +   }
> +
> +   bitmap_from_u64(_mask, arch_regs_mask);
> +   for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) {
> +   u64 val;
> +   val = perf_arch_reg_value(regs->arch_regs, bit);
> perf_output_put(handle, val);
> }
>  }
> @@ -5792,7 +5800,7 @@ void perf_output_sample(struct perf_output_handle 
> *handle,
> if (abi) {
> u64 mask = event->attr.sample_regs_user;
> perf_output_sample_regs(handle,
> -   data->regs_user.regs,
> +   &data->regs_user,
> mask);
> }
> }
> @@ -5827,7 +5835,7 @@ void perf_output_sample(struct perf_output_handle 
> *handle,
> u64 mask = event->attr.sample_regs_intr;
>
> perf_output_sample_regs(handle,
> -   data->regs_intr.regs,
> +   &data->regs_intr,
> mask);
> }
> }
> --
> 2.7.4
>

I would like to suggest a slightly different version.  Would it make
more sense to have something like the following:

@@ -5792,7 +5800,7 @@ void perf_output_sample(struct perf_output_handle *handle,
 		if (abi) {
 			u64 mask = event->attr.sample_regs_user;
 			perf_output_sample_regs(handle,
 						data->regs_user.regs,
 						mask);
 		}
+
+		if (arch_regs_mask) {
+			perf_output_pmu_regs(handle,
+					     data->regs_user.arch_regs,
+					     arch_regs_mask);
+		}
 	}


Somehow I don't like outputting the two sets of registers through the
same function call.

--
Nilay


Re: [PATCH 00/13] Add support for perf_arch_regs

2016-08-30 Thread Nilay Vaish
On 28 August 2016 at 16:00, Madhavan Srinivasan
 wrote:
> Patchset to extend PERF_SAMPLE_REGS_INTR to include
> platform specific PMU registers.
>
> Patchset applies cleanly on tip:perf/core branch
>
> It's a perennial request from hardware folks to be able to
> see the raw values of the pmu registers. Partly it's so that
> they can verify perf is doing what they want, and some
> of it is that they're interested in some of the more obscure
> info that isn't plumbed out through other perf interfaces.
>
> Over the years internally we have used various hack to get
> the requested data out but this is an attempt to use a
> somewhat standard mechanism (using PERF_SAMPLE_REGS_INTR).
>
> This would also be helpful for those of us working on the perf
> hardware backends, to be able to verify that we're programming
> things correctly, without resorting to debug printks etc.
>
> Mechanism proposed:
>
> 1)perf_regs structure is extended with a perf_arch_regs structure
> which each arch/ can populate with their specific platform
> registers to sample on each perf interrupt and an arch_regs_mask
> variable, which is for perf tool to know about the perf_arch_regs
> that are supported.
>
> 2)perf/core func perf_sample_regs_intr() extended to update
> the perf_arch_regs structure and the perf_arch_reg_mask. Set of new
> support functions added perf_get_arch_regs_mask() and
> perf_get_arch_reg() to aid the updates from arch/ side.
>
> 3) perf/core funcs perf_prepare_sample() and perf_output_sample()
> are extended to support the update for the perf_arch_regs_mask and
> perf_arch_regs in the sample
>
> 4)perf/core func perf_output_sample_regs() extended to dump
> the arch_regs to the output sample.
>
> 5)Finally, perf tool side is updated to include a new element
> "arch_regs_mask" in the "struct regs_dump", event sample funcs
> and print functions are updated to support perf_arch_regs.
>

I read the patch series and I have one suggestion to make.  I think we
should not use 'arch regs' to refer to these pmu registers.  I think
architectural registers typically refer to the ones that hold the
state of the process.  Can we replace arch_regs by pmu_regs, or some
other choice?

Thanks
Nilay


Re: [PATCH v2 1/4] tracing: Added hardware latency tracer

2016-08-22 Thread Nilay Vaish
On 10 August 2016 at 08:53, Steven Rostedt  wrote:
> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> new file mode 100644
> index ..08dfabe4e862
> --- /dev/null
> +++ b/kernel/trace/trace_hwlat.c
> @@ -0,0 +1,527 @@
> +/*
> + * trace_hwlatdetect.c - A simple Hardware Latency detector.
> + *
> + * Use this tracer to detect large system latencies induced by the behavior 
> of
> + * certain underlying system hardware or firmware, independent of Linux 
> itself.
> + * The code was developed originally to detect the presence of SMIs on Intel
> + * and AMD systems, although there is no dependency upon x86 herein.
> + *
> + * The classical example usage of this tracer is in detecting the presence of
> + * SMIs or System Management Interrupts on Intel and AMD systems. An SMI is a
> + * somewhat special form of hardware interrupt spawned from earlier CPU debug
> + * modes in which the (BIOS/EFI/etc.) firmware arranges for the South Bridge
> + * LPC (or other device) to generate a special interrupt under certain
> + * circumstances, for example, upon expiration of a special SMI timer device,
> + * due to certain external thermal readings, on certain I/O address accesses,
> + * and other situations. An SMI hits a special CPU pin, triggers a special
> + * SMI mode (complete with special memory map), and the OS is unaware.
> + *
> + * Although certain hardware-inducing latencies are necessary (for example,
> + * a modern system often requires an SMI handler for correct thermal control
> + * and remote management) they can wreak havoc upon any OS-level performance
> + * guarantees toward low-latency, especially when the OS is not even made
> + * aware of the presence of these interrupts. For this reason, we need a
> + * somewhat brute force mechanism to detect these interrupts. In this case,
> + * we do it by hogging all of the CPU(s) for configurable timer intervals,
> + * sampling the built-in CPU timer, looking for discontiguous readings.
> + *
> + * WARNING: This implementation necessarily introduces latencies. Therefore,
> + *  you should NEVER use this tracer while running in a production
> + *  environment requiring any kind of low-latency performance
> + *  guarantee(s).
> + *
> + * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. 
> + * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. 
> 
> + *
> + * Includes useful feedback from Clark Williams 
> + *
> + * This file is licensed under the terms of the GNU General Public
> + * License version 2. This program is licensed "as is" without any
> + * warranty of any kind, whether express or implied.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "trace.h"
> +
> +static struct trace_array  *hwlat_trace;
> +
> +#define U64STR_SIZE22  /* 20 digits max */
> +
> +#define BANNER "hwlat_detector: "
> +#define DEFAULT_SAMPLE_WINDOW  100 /* 1s */
> +#define DEFAULT_SAMPLE_WIDTH   50  /* 0.5s */
> +#define DEFAULT_LAT_THRESHOLD  10  /* 10us */
> +
> +/* sampling thread*/
> +static struct task_struct *hwlat_kthread;
> +
> +static struct dentry *hwlat_sample_width;  /* sample width us */
> +static struct dentry *hwlat_sample_window; /* sample window us */
> +
> +/* Save the previous tracing_thresh value */
> +static unsigned long save_tracing_thresh;
> +
> +/* If the user changed threshold, remember it */
> +static u64 last_tracing_thresh = DEFAULT_LAT_THRESHOLD * NSEC_PER_USEC;
> +
> +/* Individual latency samples are stored here when detected. */
> +struct hwlat_sample {
> +   u64 seqnum; /* unique sequence */
> +   u64 duration;   /* delta */
> +   u64 outer_duration; /* delta (outer loop) */
> +   struct timespec timestamp;  /* wall time */
> +};
> +
> +/* keep the global state somewhere. */
> +static struct hwlat_data {
> +
> +   struct mutex lock;  /* protect changes */
> +
> +   u64 count;  /* total since reset */
> +
> +   u64 sample_window;  /* total sampling window (on+off) */
> +   u64 sample_width;   /* active sampling portion of window 
> */
> +
> +} hwlat_data = {
> +   .sample_window  = DEFAULT_SAMPLE_WINDOW,
> +   .sample_width   = DEFAULT_SAMPLE_WIDTH,
> +};
> +
> +static void trace_hwlat_sample(struct hwlat_sample *sample)
> +{
> +   struct trace_array *tr = hwlat_trace;
> +   struct trace_event_call *call = &event_hwlat;

Steven, where is this variable event_hwlat declared?  To me it seems
that some macro is declaring it (most likely DEFINE_EVENT) but I was
not able to figure out the chain that ends up in the declaration.
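
My assumption is that it comes from token pasting somewhere, along the lines
of this purely illustrative (made-up) macro:

	#define DECLARE_TRACE_EVENT(name) \
		struct trace_event_call event_##name

	DECLARE_TRACE_EVENT(hwlat);	/* -> struct trace_event_call event_hwlat */

but I could not find the actual chain of macros that ends up doing it.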

--
Nilay


Re: [PATCH v3 14/51] x86/asm/head: put real return address on idle task stack

2016-08-17 Thread Nilay Vaish
On 12 August 2016 at 09:28, Josh Poimboeuf  wrote:
> The frame at the end of each idle task stack has a zeroed return
> address.  This is inconsistent with real task stacks, which have a real
> return address at that spot.  This inconsistency can be confusing for
> stack unwinders.
>
> Make it a real address by using the side effect of a call instruction to
> push the instruction pointer on the stack.
>
> Signed-off-by: Josh Poimboeuf 
> ---
>  arch/x86/kernel/head_64.S | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 3621ad2..c90f481 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -298,8 +298,9 @@ ENTRY(start_cpu)
>  *  REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
>  *  address given in m16:64.
>  */
> -   movqinitial_code(%rip),%rax
> -   pushq   $0  # fake return address to stop unwinder
> +   call1f  # put return address on stack for unwinder
> +1: xorq%rbp, %rbp  # clear frame pointer
> +   movqinitial_code(%rip), %rax
> pushq   $__KERNEL_CS# set correct cs
> pushq   %rax# target address in negative space
> lretq


Josh,  I have a couple of questions.

It seems to me that this patch and the patch 16/51 are both aiming at
the same thing, but are for two different architectures: 32-bit and
64-bit versions of x86.  But you have taken slightly different
approaches in the two patches (for 64-bit, we first jump and then make
a function call, for 32-bit we directly call the function).  Is there
any particular reason for this?  May be I missed out on something.

Second, this is for the whole patch series.  If I wanted to test this
series, how should I go about doing so?

Thanks
Nilay


Re: [PATCH 3/3] perf/x86/intel/uncore: Add Skylake server uncore support

2016-08-17 Thread Nilay Vaish
On 15 August 2016 at 09:12, Liang, Kan  wrote:
>
>
>> > diff --git a/arch/x86/events/intel/uncore_snbep.c
>> > b/arch/x86/events/intel/uncore_snbep.c
>> > index 3719af5..55a081e 100644
>> > --- a/arch/x86/events/intel/uncore_snbep.c
>> > +++ b/arch/x86/events/intel/uncore_snbep.c
>> > +
>> > +static int skx_count_chabox(void)
>> > +{
>> > +   struct pci_dev *chabox_dev = NULL;
>> > +   int bus, count = 0;
>> > +
>> > +   while (1) {
>> > +   chabox_dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x208d,
>> chabox_dev);
>> > +   if (!chabox_dev)
>> > +   break;
>> > +   if (count == 0)
>> > +   bus = chabox_dev->bus->number;
>> > +   if (bus != chabox_dev->bus->number)
>> > +   break;
>> > +   count++;
>> > +   }
>> > +
>> > +   pci_dev_put(chabox_dev);
>> > +   return count;
>> > +}
>>
>> Kan,  do we not need to call pci_dev_put() each time we call 
>> pci_get_device()?
>
> The pci_get_device will always decrease the reference count for chabox_dev, if
> it is not NULL.
> So I think it's OK to only pci_dev_put it at last.
>

Yup, you are right.  I was not aware that pci_dev_put() gets called
automatically when pci_dev argument is not NULL.
Thanks for the info.
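
For my own notes, the usual iteration idiom relies on exactly that behaviour
(sketch):

	struct pci_dev *dev = NULL;

	/* pci_get_device() drops the reference on the 'from' argument and
	 * returns the next match with a fresh reference held */
	while ((dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x208d, dev)) != NULL) {
		/* ... use dev ... */
	}

	/* the loop only exits once dev is NULL, so there is nothing left to
	 * put; breaking out early with dev still non-NULL is the case that
	 * needs pci_dev_put(dev), as skx_count_chabox() does */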

--
Nilay


Re: [PATCH 3/3] perf/x86/intel/uncore: Add Skylake server uncore support

2016-08-13 Thread Nilay Vaish
On 12 August 2016 at 16:30, Kan Liang  wrote:
> From: Kan Liang 
>
> This patch implements the uncore monitoring driver for Skylake server.
> The uncore subsystem in Skylake server is similar to previous
> server. There are some differences in config register encoding and pci
> device IDs. Besides, Skylake introduces many new boxes to reflect the
> MESH architecture changes.
>
> The control registers for IIO and UPI have been extended to 64 bit. This
> patch also introduces event_mask_ext to handle the high 32 bit mask.
>
> The CHA box number could vary for different machines. This patch gets
> the CHA box number by counting the CHA register space during
> initialization at runtime.
>

The code in this patch contains a lot of constants related to the
Skylake server.  Is it possible to mention in the description the
source from which these values were taken?  Maybe Intel provides some
manual that has these values.  It would be good if we could mention
the name of the manual (or a URL) and its version.

> Signed-off-by: Kan Liang 
> ---
>  arch/x86/events/intel/uncore.c   |   9 +-
>  arch/x86/events/intel/uncore.h   |   3 +
>  arch/x86/events/intel/uncore_snbep.c | 589 
> +++
>  3 files changed, 600 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/events/intel/uncore_snbep.c 
> b/arch/x86/events/intel/uncore_snbep.c
> index 3719af5..55a081e 100644
> --- a/arch/x86/events/intel/uncore_snbep.c
> +++ b/arch/x86/events/intel/uncore_snbep.c
> +
> +static int skx_count_chabox(void)
> +{
> +   struct pci_dev *chabox_dev = NULL;
> +   int bus, count = 0;
> +
> +   while (1) {
> +   chabox_dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x208d, 
> chabox_dev);
> +   if (!chabox_dev)
> +   break;
> +   if (count == 0)
> +   bus = chabox_dev->bus->number;
> +   if (bus != chabox_dev->bus->number)
> +   break;
> +   count++;
> +   }
> +
> +   pci_dev_put(chabox_dev);
> +   return count;
> +}

Kan,  do we not need to call pci_dev_put() each time we call pci_get_device()?


--
Nilay


Re: [PATCH 1/3] perf/x86/intel/uncore: remove hard code for Node ID mapping location

2016-08-13 Thread Nilay Vaish
On 12 August 2016 at 16:30, Kan Liang  wrote:
> From: Kan Liang 
>
> The method to build PCI bus to socket mapping is similar among
> platforms. However, the PCI location where store Node ID mapping could

I think we should replace "where store" with "which store".

> vary for different platforms. For example, the Node ID mapping address
> on Skylake server is different as the previous platform. Also, to build

I think we should change "as the previous platform" to "from the
previous platforms".

> the mapping for the PCI bus without UBOX, it has to start from bus 0 on
> Skylake server.
>
> This patch removes the hard code in current implementation, and adds
> three parameters for snbep_pci2phy_map_init. So the Node ID mapping
> address and bus searching direction could be configured according to
> different platforms.
>
> Signed-off-by: Kan Liang 
> ---
>  arch/x86/events/intel/uncore_snbep.c | 37 
> 
>  1 file changed, 25 insertions(+), 12 deletions(-)
>

Kan,  I have suggested a couple of changes to the patch's description
above.  The main content of the patch seems good to me.


--
Nilay


Re: [PATCH v3 04/51] x86/asm/head: use a common function for starting CPUs

2016-08-12 Thread Nilay Vaish
On 12 August 2016 at 09:28, Josh Poimboeuf  wrote:
> There are two different pieces of code for starting a CPU: start_cpu0()
> and the end of secondary_startup_64().  They're identical except for the
> stack setup.  Combine the common parts into a shared start_cpu()
> function.
>
> Signed-off-by: Josh Poimboeuf 
> ---
>  arch/x86/kernel/head_64.S | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index e048142..a212310 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -264,13 +264,17 @@ ENTRY(secondary_startup_64)
> movl$MSR_GS_BASE,%ecx
> movlinitial_gs(%rip),%eax
> movlinitial_gs+4(%rip),%edx
> -   wrmsr
> +   wrmsr
>
> /* rsi is pointer to real mode structure with interesting info.
>pass it to C */
> movq%rsi, %rdi
> -
> -   /* Finally jump to run C code and to be on real kernel address
> +   jmp start_cpu
> +ENDPROC(secondary_startup_64)
> +
> +ENTRY(start_cpu)
> +   /*
> +* Jump to run C code and to be on a real kernel address.
>  * Since we are running on identity-mapped space we have to jump
>  * to the full 64bit address, this is only possible as indirect
>  * jump.  In addition we need to ensure %cs is set so we make this
> @@ -299,7 +303,7 @@ ENTRY(secondary_startup_64)
> pushq   $__KERNEL_CS# set correct cs
> pushq   %rax# target address in negative space
> lretq
> -ENDPROC(secondary_startup_64)
> +ENDPROC(start_cpu)
>
>  #include "verify_cpu.S"
>
> @@ -307,15 +311,11 @@ ENDPROC(secondary_startup_64)
>  /*
>   * Boot CPU0 entry point. It's called from play_dead(). Everything has been 
> set
>   * up already except stack. We just set up stack here. Then call
> - * start_secondary().
> + * start_secondary() via start_cpu().
>   */
>  ENTRY(start_cpu0)
> -   movq initial_stack(%rip),%rsp
> -   movqinitial_code(%rip),%rax
> -   pushq   $0  # fake return address to stop unwinder
> -   pushq   $__KERNEL_CS# set correct cs
> -   pushq   %rax# target address in negative space
> -   lretq
> +   movqinitial_stack(%rip), %rsp
> +   jmp start_cpu
>  ENDPROC(start_cpu0)
>  #endif
>

Josh,  I think this patch looks good now, better than the previous
version.  Thanks for making the change.

--
Nilay


Re: [PATCH v2 0/4] remove unnecessary IPI reading uncore events

2016-08-10 Thread Nilay Vaish
On 6 August 2016 at 22:12, David Carrillo-Cisneros  wrote:
> This patch series adds a new flag to the struct perf_event
> (and a flag field to store it) to allow a PMU to tag a CPU or
> cgroup event as readable from any CPU in the same package and not
> just the CPU where the is currently active.
>
> This capability is used with uncore events to potentially avoid
> an unnecessary IPI when executing perf_event_read.
>
> A previous version of this change was introduced in the last Intel's
> CQM/CMT driver series (under review), but now we present it separately
> here since it is also useful for other uncore events.
>
> The next version of Intel CQM/CMT will add 3 new flags that use
> the pmu_event_flags field (added in patch 03 in this series).
>
> Patch 02 generalizes event->group_flags so that new flags can use it
> in a similar way that PERF_GROUP_SOFTWARE was used.
>
> Patches rebased at peterz/queue/perf/core
>
> Changes in v2:
>   - Change logic to use event->group_flags instead of individually
>   testing sibling in perf_event_read.
>   - Remove erroneous read of inactive events.
>
> David Carrillo-Cisneros (4):
>   perf/core: check return value of perf_event_read IPI
>   perf/core: generalize event->group_flags
>   perf/core: introduce PMU_EV_CAP_READ_ACTIVE_PKG
>   perf/x86: use PMUEF_READ_CPU_PKG in uncore events
>
>  arch/x86/events/intel/rapl.c   |  2 ++
>  arch/x86/events/intel/uncore.c |  2 ++
>  arch/x86/events/intel/uncore_snb.c |  2 ++
>  include/linux/perf_event.h | 21 +++
>  kernel/events/core.c   | 52 
> +++++-----
>  5 files changed, 57 insertions(+), 22 deletions(-)
>
> --
> 2.8.0.rc3.226.g39d4020
>

Reviewed-by: Nilay Vaish 


Re: [PATCH 3/4] perf/core: introduce PMU_EV_CAP_READ_ACTIVE_PKG

2016-08-10 Thread Nilay Vaish
On 9 August 2016 at 17:28, David Carrillo-Cisneros  wrote:
> Introduce the flag PMU_EV_CAP_READ_ACTIVE_PKG, useful for uncore events,
> that allows a PMU to signal the generic perf code that an event is readable
> in the current CPU if the event is active in a CPU in the same package as
> the current CPU.
>
> This is an optimization that avoids a unnecessary IPI for the common case
> where uncore events are run and read in the same package but in
> different CPUs.
>
> As an example, the IPI removal speeds up perf_read in my Haswell system
> as follows:
>   - For event UNC_C_LLC_LOOKUP: From 260 us to 31 us.
>   - For event RAPL's power/energy-cores/: From to 255 us to 27 us.
>
> For the optimization to work, all events in the group must have it
> (similarly to PERF_EV_CAP_SOFTWARE).
>
> Signed-off-by: David Carrillo-Cisneros 
> ---
>  include/linux/perf_event.h |  3 +++
>  kernel/events/core.c   | 26 --
>  2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index fa5617f..c8bb1b3 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -501,8 +501,11 @@ typedef void (*perf_overflow_handler_t)(struct 
> perf_event *,
>   * Event capabilities. For event_caps and groups caps.
>   *
>   * PERF_EV_CAP_SOFTWARE: Is a software event.
> + * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be 
> read
> + * from any CPU in the package where it is active.
>   */
>  #define PERF_EV_CAP_SOFTWARE   BIT(0)
> +#define PERF_EV_CAP_READ_ACTIVE_PKGBIT(1)
>
>  #define SWEVENT_HLIST_BITS 8
>  #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 34049cc..38ec68d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -,6 +,22 @@ struct perf_read_data {
> int ret;
>  };
>
> +static int find_cpu_to_read(struct perf_event *event, int local_cpu)
> +{
> +   int event_cpu = event->oncpu;
> +   u16 local_pkg, event_pkg;
> +
> +   if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
> +   event_pkg =  topology_physical_package_id(event_cpu);
> +   local_pkg =  topology_physical_package_id(local_cpu);
> +
> +   if (event_pkg == local_pkg)
> +   return local_cpu;
> +   }
> +
> +   return event_cpu;
> +}
> +
>  /*
>   * Cross CPU call to read the hardware event
>   */
> @@ -3454,7 +3470,7 @@ u64 perf_event_read_local(struct perf_event *event)
>
>  static int perf_event_read(struct perf_event *event, bool group)
>  {
> -   int ret = 0;
> +   int ret = 0, cpu_to_read, local_cpu;
>
> /*
>  * If event is enabled and currently active on a CPU, update the
> @@ -3466,8 +3482,14 @@ static int perf_event_read(struct perf_event *event, 
> bool group)
> .group = group,
> .ret = 0,
> };
> -   ret = smp_call_function_single(event->oncpu,
> +
> +   local_cpu = get_cpu();
> +   cpu_to_read = find_cpu_to_read(event, local_cpu);
> +   put_cpu();
> +
> +   ret = smp_call_function_single(cpu_to_read,
>__perf_event_read, &data, 1);
> +
> ret = ret ? : data.ret;
> } else if (event->state == PERF_EVENT_STATE_INACTIVE) {
> struct perf_event_context *ctx = event->ctx;
> --
> 2.8.0.rc3.226.g39d4020
>

David, thanks for making the changes.  This patch looks good to me.
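
For anyone following along: patch 4/4 is where a PMU actually opts in.
From my reading it boils down to a one-line capability set in the uncore
event-init path, roughly like this (the exact spot is my assumption):

    /* e.g. somewhere in the uncore PMU's event init */
    event->event_caps |= PERF_EV_CAP_READ_ACTIVE_PKG;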

--
Nilay


Re: [PATCH v2 30/44] x86/unwind: add new unwind interface and implementations

2016-08-09 Thread Nilay Vaish
On 4 August 2016 at 17:22, Josh Poimboeuf  wrote:
> diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> new file mode 100644
> index 000..f28f1b5
> --- /dev/null
> +++ b/arch/x86/kernel/unwind_frame.c
> @@ -0,0 +1,84 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define FRAME_HEADER_SIZE (sizeof(long) * 2)
> +
> +unsigned long unwind_get_return_address(struct unwind_state *state)
> +{
> +   unsigned long *addr_p = unwind_get_return_address_ptr(state);
> +   unsigned long addr;
> +
> +   if (state->stack_info.type == STACK_TYPE_UNKNOWN)
> +   return 0;
> +
> +   addr = ftrace_graph_ret_addr(state->task, &state->graph_idx, *addr_p,
> +addr_p);
> +
> +   return __kernel_text_address(addr) ? addr : 0;
> +}
> +EXPORT_SYMBOL_GPL(unwind_get_return_address);
> +
> +static bool update_stack_state(struct unwind_state *state, void *addr,
> +  size_t len)
> +{
> +   struct stack_info *info = &state->stack_info;
> +
> +   if (on_stack(info, addr, len))
> +   return true;
> +
> +   if (get_stack_info(info->next_sp, state->task, info,
> +  &state->stack_mask))
> +   goto unknown;
> +
> +   if (!on_stack(info, addr, len))
> +   goto unknown;
> +
> +   return true;
> +
> +unknown:
> +   info->type = STACK_TYPE_UNKNOWN;
> +   return false;
> +}
> +
> +bool unwind_next_frame(struct unwind_state *state)
> +{
> +   unsigned long *next_bp;
> +
> +   if (unwind_done(state))
> +   return false;
> +
> +   next_bp = (unsigned long *)*state->bp;
> +
> +   /*
> +* Make sure the next frame is on a valid stack and can be accessed
> +* safely.
> +*/
> +   if (!update_stack_state(state, next_bp, FRAME_HEADER_SIZE))
> +   return false;
> +
> +   /* move to the next frame */
> +   state->bp = next_bp;
> +   return true;
> +}
> +EXPORT_SYMBOL_GPL(unwind_next_frame);
> +
> +void __unwind_start(struct unwind_state *state, struct task_struct *task,
> +   struct pt_regs *regs, unsigned long *sp)
> +{
> +   memset(state, 0, sizeof(*state));
> +
> +   state->task = task;
> +   state->bp = get_frame_pointer(task, regs);
> +
> +   get_stack_info(state->bp, state->task, &state->stack_info,
> +  &state->stack_mask);
> +   update_stack_state(state, state->bp, FRAME_HEADER_SIZE);
> +
> +   /* unwind to the first frame after the specified stack pointer */
> +   while (state->bp < sp && !unwind_done(state))
> +   unwind_next_frame(state);

Do we unwind all the frames here?  It seems strange to me that in a
function named __unwind_start(), we unwind all the frames.

> +}
> +EXPORT_SYMBOL_GPL(__unwind_start);
> diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c
> new file mode 100644
> index 000..e03df5a
> --- /dev/null
> +++ b/arch/x86/kernel/unwind_guess.c
> @@ -0,0 +1,40 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +bool unwind_next_frame(struct unwind_state *state)
> +{
> +   struct stack_info *info = &state->stack_info;
> +
> +   if (info->type == STACK_TYPE_UNKNOWN)
> +   return false;
> +
> +   do {
> +   for (state->sp++; state->sp < info->end; state->sp++)
> +   if (__kernel_text_address(*state->sp))
> +   return true;
> +
> +   state->sp = info->next_sp;
> +
> +   } while (!get_stack_info(state->sp, state->task, info,
> +&state->stack_mask));
> +
> +   return false;
> +}
> +
> +void __unwind_start(struct unwind_state *state, struct task_struct *task,
> +   struct pt_regs *regs, unsigned long *sp)
> +{
> +   memset(state, 0, sizeof(*state));
> +
> +   state->task = task;
> +   state->sp   = sp;
> +
> +   get_stack_info(sp, state->task, &state->stack_info, 
> &state->stack_mask);
> +
> +   if (!__kernel_text_address(*sp))
> +   unwind_next_frame(state);
> +}
> --
> 2.7.4
>

Why is it that you need to export symbols in unwind_frame.c but not in
unwind_guess.c?  As per the Makefile, we would be compiling either of
those two files.  Should not EXPORT_SYMBOL_GPL(__unwind_start) appear
in both files?

--
Nilay


Re: [PATCH v2 3/4] perf/core: introduce PMU_EV_CAP_READ_ACTIVE_PKG

2016-08-08 Thread Nilay Vaish
On 7 August 2016 at 15:10, David Carrillo-Cisneros  wrote:
> Hi Nilay,
>
>>>  static int perf_event_read(struct perf_event *event, bool group)
>>>  {
>>> -   int ret = 0;
>>> +   int ret = 0, cpu_to_read;
>>>
>>> -   /*
>>> -* If event is enabled and currently active on a CPU, update the
>>> -* value in the event structure:
>>> -*/
>>> -   if (event->state == PERF_EVENT_STATE_ACTIVE) {
>>> +   cpu_to_read = find_cpu_to_read(event);
>>> +
>>> +   if (cpu_to_read >= 0) {
>>> struct perf_read_data data = {
>>> .event = event,
>>> .group = group,
>>> .ret = 0,
>>> };
>>> -   ret = smp_call_function_single(event->oncpu,
>>> +   ret = smp_call_function_single(cpu_to_read,
>>>__perf_event_read, &data,
>>> 1);
>>> ret = ret ? : data.ret;
>>> } else if (event->state == PERF_EVENT_STATE_INACTIVE) {
>>>
>>
>> I would like to suggest a small change to this patch.  I think the check on
>> PERF_EVENT_STATE_ACTIVE should be retained in the perf_event_read()
>> function.  The new function should assume that the event is active.  I find
>> this more readable since the next check in function perf_event_read() is on
>> PERF_EVENT_STATE_INACTIVE.
>
> Two upcoming flags that Intel CQM/CMT will use are meant to allow reads
> even if the event is inactive. This makes sense in CQM/CMT because the hw
> RMID is always reserved. I am ok with keeping the check for
> STATE_ACTIVE until those flags are actually introduced, though.


Hello David

Let's go with checking PERF_EVENT_STATE_ACTIVE in perf_event_read() for
the time being.  With the new version of the patch that you posted, I
find that checking PERF_EVENT_STATE_ACTIVE in find_cpu_to_read() makes
you introduce another if statement for checking STATE_INACTIVE.

If your CQM/CMT patches later need the code structure you have now, I
would also support it.  But as of now, I think, it is better to check
STATE_ACTIVE in perf_event_read().


Thanks
Nilay


Re: [PATCH v2 3/4] perf/core: introduce PMU_EV_CAP_READ_ACTIVE_PKG

2016-08-07 Thread Nilay Vaish

On 08/06/16 22:12, David Carrillo-Cisneros wrote:

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 34049cc..77f1bd3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -,6 +,26 @@ struct perf_read_data {
int ret;
 };

+static int find_cpu_to_read(struct perf_event *event)
+{
+   bool active = event->state == PERF_EVENT_STATE_ACTIVE;
+   int event_cpu = event->oncpu, local_cpu = smp_processor_id();
+   u16 local_pkg, event_pkg;
+
+   if (!active)
+   return -1;
+
+   if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
+   event_pkg =  topology_physical_package_id(event_cpu);
+   local_pkg =  topology_physical_package_id(local_cpu);
+
+   if (event_pkg == local_pkg)
+   return local_cpu;
+   }
+
+   return event_cpu;
+}
+
 /*
  * Cross CPU call to read the hardware event
  */
@@ -3454,19 +3474,17 @@ u64 perf_event_read_local(struct perf_event *event)

 static int perf_event_read(struct perf_event *event, bool group)
 {
-   int ret = 0;
+   int ret = 0, cpu_to_read;

-   /*
-* If event is enabled and currently active on a CPU, update the
-* value in the event structure:
-*/
-   if (event->state == PERF_EVENT_STATE_ACTIVE) {
+   cpu_to_read = find_cpu_to_read(event);
+
+   if (cpu_to_read >= 0) {
struct perf_read_data data = {
.event = event,
.group = group,
.ret = 0,
};
-   ret = smp_call_function_single(event->oncpu,
+   ret = smp_call_function_single(cpu_to_read,
   __perf_event_read, &data, 1);
ret = ret ? : data.ret;
} else if (event->state == PERF_EVENT_STATE_INACTIVE) {



I would like to suggest a small change to this patch.  I think the check 
on PERF_EVENT_STATE_ACTIVE should be retained in the perf_event_read() 
function.  The new function should assume that the event is active.  I 
find this more readable since the next check in function 
perf_event_read() is on PERF_EVENT_STATE_INACTIVE.



Diff below:

+static int find_cpu_to_read(struct perf_event *event)
+{
+       int event_cpu = event->oncpu, local_cpu = smp_processor_id();
+       u16 local_pkg, event_pkg;
+
+       if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
+               event_pkg = topology_physical_package_id(event_cpu);
+               local_pkg = topology_physical_package_id(local_cpu);
+
+               if (event_pkg == local_pkg)
+                       return local_cpu;
+       }
+
+       return event_cpu;
+}
+
 /*
  * Cross CPU call to read the hardware event
  */
@@ -3477,17 +3493,15 @@ static int perf_event_read(struct perf_event *event, bool group)
 {
        int ret = 0;

-       /*
-        * If event is enabled and currently active on a CPU, update the
-        * value in the event structure:
-        */
-       if (event->state == PERF_EVENT_STATE_ACTIVE) {
+       if (event->state == PERF_EVENT_STATE_ACTIVE) {
+               int cpu_to_read = find_cpu_to_read(event);
+
                struct perf_read_data data = {
                        .event = event,
                        .group = group,
                        .ret = 0,
                };
-               ret = smp_call_function_single(event->oncpu,
+               ret = smp_call_function_single(cpu_to_read,
                                               __perf_event_read, &data, 1);

                ret = ret ? : data.ret;
        } else if (event->state == PERF_EVENT_STATE_INACTIVE) {



Re: [PATCH v2 04/44] x86/asm/head: use a common function for starting CPUs

2016-08-05 Thread Nilay Vaish
On 4 August 2016 at 17:22, Josh Poimboeuf  wrote:
> There are two different pieces of code for starting a CPU: start_cpu0()
> and the end of secondary_startup_64().  They're identical except for the
> stack setup.  Combine the common parts into a shared start_cpu()
> function.
>
> Signed-off-by: Josh Poimboeuf 
> ---
>  arch/x86/kernel/head_64.S | 18 --
>  1 file changed, 8 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index aa10a53..8822c20 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -264,13 +264,15 @@ ENTRY(secondary_startup_64)
> movl$MSR_GS_BASE,%ecx
> movlinitial_gs(%rip),%eax
> movlinitial_gs+4(%rip),%edx
> -   wrmsr
> +   wrmsr
>
> /* rsi is pointer to real mode structure with interesting info.
>pass it to C */
> movq%rsi, %rdi
> -
> -   /* Finally jump to run C code and to be on real kernel address
> +
> +ENTRY(start_cpu)
> +   /*
> +* Jump to run C code and to be on a real kernel address.
>  * Since we are running on identity-mapped space we have to jump
>  * to the full 64bit address, this is only possible as indirect
>  * jump.  In addition we need to ensure %cs is set so we make this
> @@ -307,15 +309,11 @@ ENDPROC(secondary_startup_64)
>  /*
>   * Boot CPU0 entry point. It's called from play_dead(). Everything has been 
> set
>   * up already except stack. We just set up stack here. Then call
> - * start_secondary().
> + * start_secondary() via start_cpu().
>   */
>  ENTRY(start_cpu0)
> -   movq initial_stack(%rip),%rsp
> -   movqinitial_code(%rip),%rax
> -   pushq   $0  # fake return address to stop unwinder
> -   pushq   $__KERNEL_CS# set correct cs
> -   pushq   %rax# target address in negative space
> -   lretq
> +   movqinitial_stack(%rip), %rsp
> +   jmp start_cpu
>  ENDPROC(start_cpu0)
>  #endif
>

I have a small suggestion here.  To me, jumping from start_cpu0 into the
middle of secondary_startup_64 just seems strange.  Maybe we can
define a separate ENTRY/ENDPROC pair for start_cpu and jump there
from start_cpu0 and also from secondary_startup_64.

--
Nilay


Re: [PATCH v2 03/44] x86/asm/head: rename 'stack_start' -> 'initial_stack'

2016-08-05 Thread Nilay Vaish
On 4 August 2016 at 17:21, Josh Poimboeuf  wrote:
> The 'stack_start' variable is similar in usage to 'initial_code' and
> 'initial_gs': they're all stored in head_64.S and they're all updated by
> SMP and ACPI suspend before starting a CPU.
>
> Rename it to 'initial_stack' to be consistent with the others.
>

Maybe change the following line as well:

./arch/x86/kernel/head_64.S:69: * Setup stack for verify_cpu().
"-8" because stack_start is defined


--
Nilay


Re: [PATCH 21/32] x86/intel_rdt.h: Header for inter_rdt.c

2016-07-28 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> The header mainly provides functions to call from the user interface
> file intel_rdt_rdtgroup.c.
>
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
>  arch/x86/include/asm/intel_rdt.h | 87 
> +---
>  1 file changed, 81 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/intel_rdt.h 
> b/arch/x86/include/asm/intel_rdt.h
> index f2cb91d..4c5e0ac 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -3,27 +3,99 @@
>
>  #ifdef CONFIG_INTEL_RDT
>
> +#include 
>  #include 
>
> -#define MAX_CBM_LENGTH 32
>  #define IA32_L3_CBM_BASE   0xc90
> -#define CBM_FROM_INDEX(x)  (IA32_L3_CBM_BASE + x)
> -#define MSR_IA32_PQOS_CFG  0xc81
> +#define L3_CBM_FROM_INDEX(x)   (IA32_L3_CBM_BASE + x)
> +
> +#define MSR_IA32_L3_QOS_CFG0xc81
> +
> +enum resource_type {
> +   RESOURCE_L3  = 0,
> +   RESOURCE_NUM = 1,
> +};
> +
> +#define MAX_CACHE_LEAVES4
> +#define MAX_CACHE_DOMAINS   64
> +
> +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_l3_domain);
> +DECLARE_PER_CPU_READ_MOSTLY(struct rdtgroup *, cpu_rdtgroup);
>
>  extern struct static_key rdt_enable_key;
>  void __intel_rdt_sched_in(void *dummy);
> +extern bool use_rdtgroup_tasks;
> +
> +extern bool cdp_enabled;
> +
> +struct rdt_opts {
> +   bool cdp_enabled;
> +   bool verbose;
> +   bool simulate_cat_l3;
> +};
> +
> +struct cache_domain {
> +   cpumask_t shared_cpu_map[MAX_CACHE_DOMAINS];
> +   unsigned int max_cache_domains_num;
> +   unsigned int level;
> +   unsigned int shared_cache_id[MAX_CACHE_DOMAINS];
> +};
> +
> +extern struct rdt_opts rdt_opts;
>
>  struct clos_cbm_table {
> -   unsigned long l3_cbm;
> +   unsigned long cbm;
> unsigned int clos_refcnt;
>  };
>
>  struct clos_config {
> -   unsigned long *closmap;
> +   unsigned long **closmap;
> u32 max_closid;
> -   u32 closids_used;
>  };
>
> +struct shared_domain {
> +   struct cpumask cpumask;
> +   int l3_domain;
> +};
> +
> +#define for_each_cache_domain(domain, start_domain, max_domain)\
> +   for (domain = start_domain; domain < max_domain; domain++)
> +
> +extern struct clos_config cconfig;
> +extern struct shared_domain *shared_domain;
> +extern int shared_domain_num;
> +
> +extern struct rdtgroup *root_rdtgrp;
> +extern void rdtgroup_fork(struct task_struct *child);
> +extern void rdtgroup_post_fork(struct task_struct *child);
> +
> +extern struct clos_cbm_table **l3_cctable;
> +
> +extern unsigned int min_bitmask_len;
> +extern void msr_cpu_update(void *arg);
> +extern inline void closid_get(u32 closid, int domain);
> +extern void closid_put(u32 closid, int domain);
> +extern void closid_free(u32 closid, int domain, int level);
> +extern int closid_alloc(u32 *closid, int domain);
> +extern bool cat_l3_enabled;
> +extern unsigned int get_domain_num(int level);
> +extern struct shared_domain *shared_domain;
> +extern int shared_domain_num;
> +extern inline int get_dcbm_table_index(int x);
> +extern inline int get_icbm_table_index(int x);
> +
> +extern int get_cache_leaf(int level, int cpu);
> +
> +extern void cbm_update_l3_msr(void *pindex);
> +extern int level_to_leaf(int level);
> +
> +extern void init_msrs(bool cdpenabled);
> +extern bool cat_enabled(int level);
> +extern u64 max_cbm(int level);
> +extern u32 max_cbm_len(int level);
> +
> +extern void rdtgroup_exit(struct task_struct *tsk);
> +
>  /*
>   * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>   *
> @@ -54,6 +126,9 @@ static inline void intel_rdt_sched_in(void)
>  #else
>
>  static inline void intel_rdt_sched_in(void) {}
> +static inline void rdtgroup_fork(struct task_struct *child) {}
> +static inline void rdtgroup_post_fork(struct task_struct *child) {}
> +static inline void rdtgroup_exit(struct task_struct *tsk) {}
>
>  #endif
>  #endif
> --
> 2.5.0
>


Hi Fenghua


There are a few things about this patch that I think can be improved upon.

* some of the variables that were introduced in the first few patches
have been renamed in the later patches.  Since these patches are still
not part of the kernel, I think we should use the variable names we
want to use right from the beginning, rather than changing them midway
through the patch series.

* I think struct rdtgroup is being used before it has been declared anywhere
(see the one-line sketch after this list).

* I somehow do not like this mass declaration of prototypes of all the
functions and variables that would be used later.  I think the
prototype and the implementation should be part of the same patch.  If
possible, combine this patch with the ones that have the
implementations.
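
Since the header only ever uses pointers to struct rdtgroup, a one-line
forward declaration near the top would at least avoid the ordering
problem (just a sketch of what I mean; the full definition still comes
in a later patch):

    struct rdtgroup;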

--
Nilay


Re: [PATCH 20/32] magic number for rscctrl file system

2016-07-27 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
>  include/uapi/linux/magic.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 546b388..655036a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -57,6 +57,8 @@
>  #define CGROUP_SUPER_MAGIC 0x27e0eb
>  #define CGROUP2_SUPER_MAGIC0x63677270
>
> +#define RDTGROUP_SUPER_MAGIC   0x7655821
> +
>
>  #define STACK_END_MAGIC0x57AC6E9D


Just for my information, can you tell me how you chose the magic number?

--
Nilay


Re: [PATCH 19/32] sched.h: Add rg_list and rdtgroup in task_struct

2016-07-27 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> rg_list is linked list to connect to other tasks in a rdtgroup.
>
> The point of rdtgroup allows the task to access its own rdtgroup directly.
>
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
>  include/linux/sched.h | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 253538f..55adf17 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1761,6 +1761,10 @@ struct task_struct {
> /* cg_list protected by css_set_lock and tsk->alloc_lock */
> struct list_head cg_list;
>  #endif
> +#ifdef CONFIG_INTEL_RDT
> +   struct list_head rg_list;
> +   struct rdtgroup *rdtgroup;
> +#endif
>  #ifdef CONFIG_FUTEX
> struct robust_list_head __user *robust_list;
>  #ifdef CONFIG_COMPAT
> --
> 2.5.0
>

I think this patch should be merged with patch 22/32 since struct
rdtgroup's definition appears in patch 22/32.  I do not recall
seeing any forward declaration either.

--
Nilay


Re: [PATCH 17/32] x86, intel_cacheinfo: Enable cache id in x86

2016-07-27 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> Enable cache id in x86. Cache id comes from APIC ID and CPUID4.
>

I think one of these patches on cache ids should refer to some
documentation from Intel on this subject, either in the commit message
or in the comments in some file.  I found one:
https://software.intel.com/sites/default/files/63/1a/Kuo_CpuTopology_rc1.rh1.final.pdf.
You would know better than me which document we should be looking at.

Thanks
Nilay


Re: [PATCH 15/32] cacheinfo: Introduce cache id

2016-07-27 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> Each cache is described by cacheinfo and is unique in the same index
> across the platform. But there is no id for a cache. We introduce cache
> ID to identify a cache.
>
> Intel Cache Allocation Technology (CAT) allows some control on the
> allocation policy within each cache that it controls. We need a unique
> cache ID for each cache level to allow the user to specify which
> controls are applied to which cache. Cache id is a concise way to specify
> a cache.
>
> Cache id is first enabled on x86. It can be enabled on other platforms
> as well. The cache id is not necessary contiguous.
>
> Add an "id" entry to /sys/devices/system/cpu/cpu*/cache/index*/


Can you explain what index and platform mean here?  Those terms seem
quite generic.  Maybe an example of how cache IDs would be assigned
to different caches would help.

--
Nilay


Re: [PATCH 14/32] x86/cpufeatures: Get max closid and max cbm len and clean feature comments and code

2016-07-27 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> From: Fenghua Yu 
>
> Define two new cpuid leaves for CAT and CDP. The leaves are used in
> x86_capability to avoid hard coded index.
>
> Clean comments for RDT, CAT_L3, and CDP_L3 cpufeatures.
>
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
>  arch/x86/include/asm/cpufeature.h  |  2 ++
>  arch/x86/include/asm/cpufeatures.h |  6 +++---
>  arch/x86/include/asm/processor.h   |  6 +++---
>  arch/x86/kernel/cpu/common.c   | 11 +++
>  4 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeature.h 
> b/arch/x86/include/asm/cpufeature.h
> index 483fb54..cd3b0bd 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -28,6 +28,8 @@ enum cpuid_leafs
> CPUID_8000_000A_EDX,
> CPUID_7_ECX,
> CPUID_8000_0007_EBX,
> +   CPUID_10_0_EBX,
> +   CPUID_10_1_ECX,
>  };
>
>  #ifdef CONFIG_X86_FEATURE_NAMES
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 16489b3..588932a 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -220,7 +220,7 @@
>  #define X86_FEATURE_RTM( 9*32+11) /* Restricted 
> Transactional Memory */
>  #define X86_FEATURE_CQM( 9*32+12) /* Cache QoS Monitoring */
>  #define X86_FEATURE_MPX( 9*32+14) /* Memory Protection 
> Extension */
> -#define X86_FEATURE_RDT( 9*32+15) /* Resource Allocation */
> +#define X86_FEATURE_RDT( 9*32+15) /* Resource Director 
> Technology */
>  #define X86_FEATURE_AVX512F( 9*32+16) /* AVX-512 Foundation */
>  #define X86_FEATURE_AVX512DQ   ( 9*32+17) /* AVX-512 DQ (Double/Quad 
> granular) Instructions */
>  #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
> @@ -289,10 +289,10 @@
>  #define X86_FEATURE_SMCA   (17*32+ 3) /* Scalable MCA */
>
>  /* Intel-defined CPU features, CPUID level 0x0010:0 (ebx), word 18 */
> -#define X86_FEATURE_CAT_L3  (18*32+ 1) /* Cache Allocation L3 */
> +#define X86_FEATURE_CAT_L3 (18*32+ 1) /* Cache Allocation L3 */
>
>  /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0010:1 (ecx), word 19 */
> -#define X86_FEATURE_CDP_L3 (19*32+ 2) /* Code data prioritization L3 */
> +#define X86_FEATURE_CDP_L3 (19*32+ 2) /* Code Data Prioritization L3 */
>
>  /*
>   * BUG word(s)
> diff --git a/arch/x86/include/asm/processor.h 
> b/arch/x86/include/asm/processor.h
> index 598c9bc..308aa03 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -119,9 +119,9 @@ struct cpuinfo_x86 {
> int x86_cache_occ_scale;/* scale to bytes */
> int x86_power;
> unsigned long   loops_per_jiffy;
> -   /* Cache Allocation values: */
> -   u16 x86_cache_max_cbm_len;
> -   u16 x86_cache_max_closid;
> +   /* Cache Allocation l3 values: */
> +   u16 x86_l3_max_cbm_len;
> +   u16 x86_l3_max_closid;
> /* cpuid returned max cores value: */
> u16  x86_max_cores;
> u16 apicid;
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index a695e58..e945e70 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -716,14 +716,17 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
> u32 eax, ebx, ecx, edx;
>
> cpuid_count(0x0010, 0, &eax, &ebx, &ecx, &edx);
> -   c->x86_capability[17] = ebx;
> +   c->x86_capability[CPUID_10_0_EBX] = ebx;
>
> if (cpu_has(c, X86_FEATURE_CAT_L3)) {
>
> cpuid_count(0x0010, 1, &eax, &ebx, &ecx, &edx);
> -   c->x86_cache_max_closid = edx + 1;
> -   c->x86_cache_max_cbm_len = eax + 1;
> -   c->x86_capability[18] = ecx;
> +   c->x86_l3_max_closid = edx + 1;
> +   c->x86_l3_max_cbm_len = eax + 1;
> +   c->x86_capability[CPUID_10_1_ECX] = ecx;
> +   } else {
> +   c->x86_l3_max_closid = -1;
> +   c->x86_l3_max_cbm_len = -1;
> }
> }
>

I think this patch should come earlier in the series, where the
constants 17 and 18 were first used.  Similarly, the renaming of
variables should, I think, be merged with the patch that originally
introduced them.

--
Nilay


Re: [PATCH 13/32] Documentation, x86: Documentation for Intel resource allocation user interface

2016-07-27 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> +1. Terms
> +
> +
> +We use the following terms and concepts in this documentation.
> +
> +RDT: Intel Resoure Director Technology
> +
> +CAT: Cache Allocation Technology
> +
> +CDP: Code and Data Prioritization
> +
> +CBM: Cache Bit Mask
> +
> +Cache ID: A cache identification. It is unique in one cache index on the
> +platform. User can find cache ID in cache sysfs interface:
> +/sys/devices/system/cpu/cpu*/cache/index*/id
> +
> +Share resource domain: A few different resources can share same QoS mask
> +MSRs array. For example, one L2 cache can share QoS MSRs with its next level
> +L3 cache. A domain number represents the L2 cache, the L3 cache, the L2
> +cache's shared cpumask, and the L3 cache's shared cpumask.
> +

I think CLOS ID should be defined here.  Is it the same as Cache ID?


> +As one example, CAT L3's schema format is:
> +
> +L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
> +
> +On a two socket machine, L3's schema line could be:
> +
> +L3:0=ff;1=c0
> +
> +which means this line in "schemas" file is for CAT L3, L3 cache id 0's CBM
> +is 0xff, and L3 cache id 1's CBM is 0xc0.
> +
> +If one resource is disabled, its line is not shown in schemas file.
> +
> +The schema line can be expanded for some situations. The L3 cbms format can
> +be expanded to the CDP enabled L3 cbms format:
> +
> +L3:<cache_id0>=<d_cbm>,<i_cbm>;<cache_id1>=<d_cbm>,<i_cbm>;...
> +
> +Initial value is all ones which means all tasks use all resources initially.
> +

Your example here makes me feel that Cache ID and CLOS ID mean the same thing.


> +7. Some usage examples
> +==
> +
> +7.1 Example 1 for sharing CLOSID on socket 0 between two partitions
> +
> +Only L3 cbm is enabled. Assume the machine is 2-socket and dual-core without
> +hyperthreading.
> +
> +#mount -t rscctrl rscctrl /sys/fs/rscctrl
> +#cd /sys/fs/rscctrl
> +#mkdir p0 p1
> +#echo "L3:0=3;1=c" > /sys/fs/rscctrl/p0/schemas
> +#echo "L3:0=3;1=3" > /sys/fs/rscctrl/p1/schemas
> +
> +In partition p0, kernel allocates CLOSID 0 for L3 cbm=0x3 on socket 0 and
> +CLOSID 0 for cbm=0xc on socket 1.
> +
> +In partition p1, kernel allocates CLOSID 0 for L3 cbm=0x3 on socket 0 and
> +CLOSID 1 for cbm=0x3 on socket 1.


And over here you have switched to using CLOS ID and you do not
mention Cache ID at all.
As I said above, I think Cache ID and CLOS ID are the same thing.  If
that is the case, I think Cache ID should be completely replaced with
CLOS ID.

--
Nilay


Re: [PATCH 10/32] x86/intel_rdt: Adds support to enable Code Data Prioritization

2016-07-26 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> From: Vikas Shivappa 
>
> On Intel SKUs that support Code Data Prioritization(CDP), intel_rdt
> operates in 2 modes - legacy cache allocation mode/default or CDP mode.
>
> When CDP is enabled, the number of available CLOSids is halved. Hence the
> enabling is done when less than half the number of CLOSids available are
> used. When CDP is enabled each CLOSid maps to a
> data cache mask and an instruction cache mask. The enabling itself is done
> by writing to the IA32_PQOS_CFG MSR and can dynamically be enabled or
> disabled.
>
> CDP is disabled when for each (dcache_cbm,icache_cbm) pair, the
> dcache_cbm = icache_cbm.
>
> Signed-off-by: Vikas Shivappa 
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
>  arch/x86/include/asm/intel_rdt.h |  7 +
>  arch/x86/kernel/cpu/intel_rdt.c  | 66 
> ++--
>  2 files changed, 51 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/include/asm/intel_rdt.h 
> b/arch/x86/include/asm/intel_rdt.h
> index 6e20314..f2cb91d 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -8,6 +8,7 @@
>  #define MAX_CBM_LENGTH 32
>  #define IA32_L3_CBM_BASE   0xc90
>  #define CBM_FROM_INDEX(x)  (IA32_L3_CBM_BASE + x)
> +#define MSR_IA32_PQOS_CFG  0xc81
>
>  extern struct static_key rdt_enable_key;
>  void __intel_rdt_sched_in(void *dummy);
> @@ -17,6 +18,12 @@ struct clos_cbm_table {
> unsigned int clos_refcnt;
>  };
>
> +struct clos_config {
> +   unsigned long *closmap;
> +   u32 max_closid;
> +   u32 closids_used;
> +};
> +

I think most of this patch is not about CDP, but about moving from an
independently defined closmap to one defined as part of struct
clos_config.  I suggest we combine part of this patch with patch 03/32
and work with struct clos_config right from the beginning.


Thanks
Nilay


Re: [PATCH 02/32] x86/intel_rdt: Add support for Cache Allocation detection

2016-07-26 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 0fe6953..42c90cb 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -711,6 +711,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
> }
> }
>
> +   /* Additional Intel-defined flags: level 0x0010 */
> +   if (c->cpuid_level >= 0x0010) {
> +   u32 eax, ebx, ecx, edx;
> +
> +   cpuid_count(0x0010, 0, &eax, &ebx, &ecx, &edx);
> +   c->x86_capability[14] = ebx;

Should we have a name for this capability / leaf, instead of the bare
constant 14?  grep informs me that everywhere else we use an enumerator
of type cpuid_leafs.
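
Something along these lines is what I have in mind; patch 14/32 in this
series ends up naming the entries CPUID_10_0_EBX and CPUID_10_1_ECX, so
I am reusing that name here:

    enum cpuid_leafs {
        /* ... existing entries ... */
        CPUID_10_0_EBX,     /* leaf 0x10, subleaf 0, EBX */
    };

    /* then in get_cpu_cap(): */
    c->x86_capability[CPUID_10_0_EBX] = ebx;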


--
Nilay


Re: [PATCH 05/32] x86/intel_rdt: Implement scheduling support for Intel RDT

2016-07-25 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> From: Vikas Shivappa 
>
> Adds support for IA32_PQR_ASSOC MSR writes during task scheduling. For
> Cache Allocation, MSR write would let the task fill in the cache
> 'subset' represented by the task's capacity bit mask.
>
> The high 32 bits in the per processor MSR IA32_PQR_ASSOC represents the
> CLOSid. During context switch kernel implements this by writing the
> CLOSid of the task belongs to the CPU's IA32_PQR_ASSOC MSR.
>
> This patch also implements a common software cache for IA32_PQR_MSR
> (RMID 0:9, CLOSId 32:63) to be used by both Cache monitoring (CMT) and
> Cache allocation. CMT updates the RMID where as cache_alloc updates the
> CLOSid in the software cache. During scheduling when the new RMID/CLOSid
> value is different from the cached values, IA32_PQR_MSR is updated.
> Since the measured rdmsr latency for IA32_PQR_MSR is very high (~250
>  cycles) this software cache is necessary to avoid reading the MSR to
> compare the current CLOSid value.
>
> The following considerations are done for the PQR MSR write so that it
> minimally impacts scheduler hot path:
>  - This path does not exist on any non-intel platforms.
>  - On Intel platforms, this would not exist by default unless INTEL_RDT
>  is enabled.
>  - remains a no-op when INTEL_RDT is enabled and intel SKU does not
>  support the feature.
>  - When feature is available and enabled, never does MSR write till the
>  user manually starts using one of the capacity bit masks.
>  - MSR write is only done when there is a task with different Closid is
>  scheduled on the CPU. Typically if the task groups are bound to be
>  scheduled on a set of CPUs, the number of MSR writes is greatly
>  reduced.
>  - A per CPU cache of CLOSids is maintained to do the check so that we
>  don't have to do a rdmsr which actually costs a lot of cycles.
>

I was thinking more about this software caching of CLOSids.  How
likely do you think these CLOSids would be found cached?  I think the
software cache would be accessed very infrequently, so it seems you
are likely to miss it in all levels of the cache hierarchy and have to
fetch it from main memory, which itself might cost ~250 cycles.
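
For reference, the idiom being discussed is roughly the following (a
sketch based only on the commit message; the struct and function names
here are my own, not necessarily what the patch uses):

    struct intel_pqr_state {
        u32 rmid;       /* IA32_PQR_ASSOC bits 0:9 */
        u32 closid;     /* IA32_PQR_ASSOC bits 32:63 */
    };
    static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

    static inline void pqr_write(u32 rmid, u32 closid)
    {
        struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

        /*
         * Compare against the cached values instead of doing a ~250 cycle
         * rdmsr; only write the MSR when something actually changed.
         */
        if (state->rmid != rmid || state->closid != closid) {
            state->rmid = rmid;
            state->closid = closid;
            wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
        }
    }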

--
Nilay


Re: [PATCH 05/32] x86/intel_rdt: Implement scheduling support for Intel RDT

2016-07-25 Thread Nilay Vaish
On 12 July 2016 at 20:02, Fenghua Yu  wrote:
> diff --git a/arch/x86/include/asm/intel_rdt.h 
> b/arch/x86/include/asm/intel_rdt.h
> index 4f45dc8..afb6da3 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -3,14 +3,42 @@
>
>  #ifdef CONFIG_INTEL_RDT
>
> +#include 
> +
>  #define MAX_CBM_LENGTH 32
>  #define IA32_L3_CBM_BASE   0xc90
>  #define CBM_FROM_INDEX(x)  (IA32_L3_CBM_BASE + x)
>
> +extern struct static_key rdt_enable_key;
> +void __intel_rdt_sched_in(void *dummy);
> +
>  struct clos_cbm_table {
> unsigned long l3_cbm;
> unsigned int clos_refcnt;
>  };
>
> +/*
> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> + *
> + * Following considerations are made so that this has minimal impact
> + * on scheduler hot path:
> + * - This will stay as no-op unless we are running on an Intel SKU
> + * which supports L3 cache allocation.
> + * - Caches the per cpu CLOSid values and does the MSR write only
> + * when a task with a different CLOSid is scheduled in.
> + */
> +static inline void intel_rdt_sched_in(void)
> +{
> +   /*
> +* Call the schedule in code only when RDT is enabled.
> +*/
> +   if (static_key_false(&rdt_enable_key))
> +   __intel_rdt_sched_in(NULL);

static_key_false() is deprecated.  I think this should be
static_branch_unlikely().

> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 6ad5b48..8379df8 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -44,12 +46,33 @@ static cpumask_t rdt_cpumask;
>   */
>  static cpumask_t tmp_cpumask;
>  static DEFINE_MUTEX(rdt_group_mutex);
> +struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
>

Similarly, this should be DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
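
In other words, roughly (just a sketch of the newer static-branch API):

    /* in intel_rdt.h */
    DECLARE_STATIC_KEY_FALSE(rdt_enable_key);

    static inline void intel_rdt_sched_in(void)
    {
        if (static_branch_unlikely(&rdt_enable_key))
            __intel_rdt_sched_in(NULL);
    }

    /* in intel_rdt.c */
    DEFINE_STATIC_KEY_FALSE(rdt_enable_key);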


Thanks
Nilay


Re: [PATCH 3/5] perf tests: Fix thread map test for -F option

2016-06-30 Thread Nilay Vaish
On 29 June 2016 at 11:06, Nilay Vaish  wrote:
> On 28 June 2016 at 06:29, Jiri Olsa  wrote:
>> I hit a bug when running test suite without forking
>> each test (-F option):
>>
>>   $ perf test -Fv
>>   ...
>>   34: Test thread map  :
>>   --- start ---
>>   FAILED tests/thread-map.c:24 wrong comm
>>    end 
>>   Test thread map: FAILED!
>>
>> The reason was the process name wasn't 'perf' as expected
>> by the test, because other tests set the name as well.
>>
>> Setting it explicitly now.
>>
>> Link: http://lkml.kernel.org/n/tip-bqapag0ljaiwmb7hlkw09...@git.kernel.org
>> Signed-off-by: Jiri Olsa 
>
>
> This patch also seems fine.  Again, I confirmed that test 34 fails
> without the patch and that this patch fixed it.
>

Tested-by: Nilay Vaish 


Re: [PATCH 2/5] perf tools: Allow to reset open files counter

2016-06-30 Thread Nilay Vaish
On 29 June 2016 at 11:05, Nilay Vaish  wrote:
> On 28 June 2016 at 06:29, Jiri Olsa  wrote:
>> I hit a bug when running test suite without forking
>> each test (-F option):
>>
>>   $ perf test -F dso
>>8: Test dso data read   : Ok
>>9: Test dso data cache  : FAILED!
>>   10: Test dso data reopen : FAILED!
>>
>> The reason the session file limit is set just once for
>> perf process so we need to reset it for each test,
>> otherwise wrong limit is taken into account.
>>
>> Link: http://lkml.kernel.org/n/tip-bqapag0ljaiwmb7hlkw09...@git.kernel.org
>> Signed-off-by: Jiri Olsa 
>
>
> The patch seems fine to me.  I confirmed that we see those failed
> tests without this patch and applying it fixes them.
>

Tested-by: Nilay Vaish 


Re: [PATCH 1/5] perf test: Add -F/--dont-fork option

2016-06-30 Thread Nilay Vaish
On 29 June 2016 at 11:07, Arnaldo Carvalho de Melo
 wrote:
> Em Wed, Jun 29, 2016 at 11:04:01AM -0500, Nilay Vaish escreveu:
>> On 28 June 2016 at 06:29, Jiri Olsa  wrote:
>> > Adding -F/--dont-fork option to bypass forking
>> > for each test. It's useful for debugging test.
>> >
>> > Link: http://lkml.kernel.org/n/tip-yq9gy0fcr8nl70986gwnl...@git.kernel.org
>> > Signed-off-by: Jiri Olsa 
>> > ---
>> >  tools/perf/Documentation/perf-test.txt |  4 +++
>> >  tools/perf/tests/builtin-test.c| 55 
>> > --
>> >  2 files changed, 37 insertions(+), 22 deletions(-)
>> >
>>
>> This patch seems fine to me.
>
> In these cases, please state that more formally, i.e. please read
> Documentation/SubmittingPatches, specially this section:
>
> 12) When to use Acked-by:
>
> and:
>
> 13) Using Reported-by:, Tested-by:, Reviewed-by:, Suggested-by: and
> Fixes:
>

Thanks for the info.

--
Nilay


Re: [PATCH 3/5] perf tests: Fix thread map test for -F option

2016-06-29 Thread Nilay Vaish
On 28 June 2016 at 06:29, Jiri Olsa  wrote:
> I hit a bug when running test suite without forking
> each test (-F option):
>
>   $ perf test -Fv
>   ...
>   34: Test thread map  :
>   --- start ---
>   FAILED tests/thread-map.c:24 wrong comm
>    end 
>   Test thread map: FAILED!
>
> The reason was the process name wasn't 'perf' as expected
> by the test, because other tests set the name as well.
>
> Setting it explicitly now.
>
> Link: http://lkml.kernel.org/n/tip-bqapag0ljaiwmb7hlkw09...@git.kernel.org
> Signed-off-by: Jiri Olsa 


This patch also seems fine.  Again, I confirmed that test 34 fails
without the patch and that this patch fixed it.

--
Nilay


Re: [PATCH 2/5] perf tools: Allow to reset open files counter

2016-06-29 Thread Nilay Vaish
On 28 June 2016 at 06:29, Jiri Olsa  wrote:
> I hit a bug when running test suite without forking
> each test (-F option):
>
>   $ perf test -F dso
>8: Test dso data read   : Ok
>9: Test dso data cache  : FAILED!
>   10: Test dso data reopen : FAILED!
>
> The reason the session file limit is set just once for
> perf process so we need to reset it for each test,
> otherwise wrong limit is taken into account.
>
> Link: http://lkml.kernel.org/n/tip-bqapag0ljaiwmb7hlkw09...@git.kernel.org
> Signed-off-by: Jiri Olsa 


The patch seems fine to me.  I confirmed that we see those failed
tests without this patch and applying it fixes them.

--
Nilay


Re: [PATCH 1/5] perf test: Add -F/--dont-fork option

2016-06-29 Thread Nilay Vaish
On 28 June 2016 at 06:29, Jiri Olsa  wrote:
> Adding -F/--dont-fork option to bypass forking
> for each test. It's useful for debugging test.
>
> Link: http://lkml.kernel.org/n/tip-yq9gy0fcr8nl70986gwnl...@git.kernel.org
> Signed-off-by: Jiri Olsa 
> ---
>  tools/perf/Documentation/perf-test.txt |  4 +++
>  tools/perf/tests/builtin-test.c| 55 
> --
>  2 files changed, 37 insertions(+), 22 deletions(-)
>

This patch seems fine to me.

--
Nilay


Re: [RFC PATCH v2 05/26] tools include: Sync math64.h and div64.h

2016-06-26 Thread Nilay Vaish
On 26 June 2016 at 06:20, He Kuang  wrote:
> From: Wang Nan 
>
> This patch copies "include/linux/math64.h" into
> "tools/include/linux/math64.h" and copies
> "include/asm-generic/div64.h" into
> "tools/include/asm-generic/div64.h", to enable other libraries use
> arithmetic operation defined in them.

This probably is a newbie question.  What prevents us from using the
header files directly?

--
Nilay


Re: [PATCH v10 07/10] perf record: Read from overwritable ring buffer

2016-06-23 Thread Nilay Vaish
On 23 June 2016 at 09:31, pi3orama  wrote:
>
>
> Sent from my iPhone
>
>> On 23 June 2016, at 22:27, Nilay Vaish  wrote:
>>
>>> On 23 June 2016 at 00:27, Wang Nan  wrote:
>>> @@ -542,6 +568,79 @@ static struct perf_event_header finished_round_event = 
>>> {
>>>.type = PERF_RECORD_FINISHED_ROUND,
>>> };
>>>
>>> +static void
>>> +record__toggle_overwrite_evsels(struct record *rec,
>>> +   enum overwrite_evt_state state)
>>> +{
>>> +   struct perf_evlist *evlist = rec->overwrite_evlist;
>>> +   enum overwrite_evt_state old_state = rec->overwrite_evt_state;
>>> +   enum action {
>>> +   NONE,
>>> +   PAUSE,
>>> +   RESUME,
>>> +   } action = NONE;
>>> +
>>> +   switch (old_state) {
>>> +   case OVERWRITE_EVT_RUNNING: {
>>> +   switch (state) {
>>> +   case OVERWRITE_EVT_DATA_PENDING:
>>> +   action = PAUSE;
>>> +   break;
>>> +   case OVERWRITE_EVT_RUNNING:
>>> +   case OVERWRITE_EVT_EMPTY:
>>> +   default:
>>> +   goto state_err;
>>> +   }
>>> +   break;
>>> +   }
>>> +   case OVERWRITE_EVT_DATA_PENDING: {
>>> +   switch (state) {
>>> +   case OVERWRITE_EVT_EMPTY:
>>> +   break;
>>> +   case OVERWRITE_EVT_RUNNING:
>>> +   case OVERWRITE_EVT_DATA_PENDING:
>>> +   default:
>>> +   goto state_err;
>>> +   }
>>> +   break;
>>> +   }
>>> +   case OVERWRITE_EVT_EMPTY: {
>>> +   switch (state) {
>>> +   case OVERWRITE_EVT_RUNNING:
>>> +   action = RESUME;
>>> +   break;
>>> +   case OVERWRITE_EVT_EMPTY:
>>> +   case OVERWRITE_EVT_DATA_PENDING:
>>> +   default:
>>> +   goto state_err;
>>
>>
>> Wang, thanks for making the changes I suggested.
>> The patch overall looks fine to me.  Just as a matter
>> of style, I probably would not write case labels that
>> do not have any statements associated with them.
>> I'll let default take care of those labels.
>>
>
> I don't like them either.
>
> You have to do this when you start working
> on perf code. It turns on -Wall and -Werror;
> without these case labels gcc will complain and
> compiling will fail.
>

Oops!  I did not realize the compiler would complain.
I guess we have to leave them as they are.  The patch is good to be committed.
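
For my own future reference, a minimal illustration of why the "empty"
labels are needed (I am assuming the warning perf enables is
-Wswitch-enum or equivalent, which complains about unhandled enum values
even when a default label is present):

    enum overwrite_evt_state { RUNNING, DATA_PENDING, EMPTY };

    static int need_pause(enum overwrite_evt_state s)
    {
        switch (s) {
        case RUNNING:
            return 1;
        default:
            /*
             * With -Wswitch-enum, DATA_PENDING and EMPTY are still
             * reported as not handled despite this default, so they
             * have to be listed explicitly as empty cases.
             */
            return 0;
        }
    }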

Thanks
Nilay


Re: [PATCH v10 07/10] perf record: Read from overwritable ring buffer

2016-06-23 Thread Nilay Vaish
On 23 June 2016 at 00:27, Wang Nan  wrote:
> @@ -542,6 +568,79 @@ static struct perf_event_header finished_round_event = {
> .type = PERF_RECORD_FINISHED_ROUND,
>  };
>
> +static void
> +record__toggle_overwrite_evsels(struct record *rec,
> +   enum overwrite_evt_state state)
> +{
> +   struct perf_evlist *evlist = rec->overwrite_evlist;
> +   enum overwrite_evt_state old_state = rec->overwrite_evt_state;
> +   enum action {
> +   NONE,
> +   PAUSE,
> +   RESUME,
> +   } action = NONE;
> +
> +   switch (old_state) {
> +   case OVERWRITE_EVT_RUNNING: {
> +   switch (state) {
> +   case OVERWRITE_EVT_DATA_PENDING:
> +   action = PAUSE;
> +   break;
> +   case OVERWRITE_EVT_RUNNING:
> +   case OVERWRITE_EVT_EMPTY:
> +   default:
> +   goto state_err;
> +   }
> +   break;
> +   }
> +   case OVERWRITE_EVT_DATA_PENDING: {
> +   switch (state) {
> +   case OVERWRITE_EVT_EMPTY:
> +   break;
> +   case OVERWRITE_EVT_RUNNING:
> +   case OVERWRITE_EVT_DATA_PENDING:
> +   default:
> +   goto state_err;
> +   }
> +   break;
> +   }
> +   case OVERWRITE_EVT_EMPTY: {
> +   switch (state) {
> +   case OVERWRITE_EVT_RUNNING:
> +   action = RESUME;
> +   break;
> +   case OVERWRITE_EVT_EMPTY:
> +   case OVERWRITE_EVT_DATA_PENDING:
> +   default:
> +   goto state_err;


Wang, thanks for making the changes I suggested.
The patch overall looks fine to me.  Just as a matter
of style, I probably would not write case labels that
do not have any statements associated with them.
I'll let default take care of those labels.

--
Nilay


Re: [PATCH v10 04/10] perf evlist: Introduce aux evlist

2016-06-23 Thread Nilay Vaish
On 23 June 2016 at 00:27, Wang Nan  wrote:
> An auxiliary evlist is created by perf_evlist__new_aux() using an
> existing evlist as its parent. An auxiliary evlist can have its own
> 'struct perf_mmap', but can't have any other data. User should use its
> parent instead when accessing other data.
>
> Auxiliary evlists are containers of 'struct perf_mmap'. It is introduced
> to allow its parent evlist to map different events into separated mmaps.
>
> Following commits create an auxiliary evlist for overwritable
> events, because overwritable events need a read only and backwards ring
> buffer, which is different from normal events.
>
> To achieve this goal, this patch carefully changes 'evlist' to
> 'evlist->parent' in all functions in the path of 'perf_evlist__mmap_ex',
> except 'evlist->mmap' related operations, to make sure all evlist
> modifications (like pollfd and event id hash tables) goes to original
> evlist.
>
> A 'evlist->parent' pointer is added to 'struct perf_evlist' and points to
> the evlist itself for normal evlists.
>
> Children of one evlist are linked into it so one can find all children
> from its parent.
>
> To avoid potential complexity, forbid creating aux evlist from another
> aux evlist.
>
> Improve perf_evlist__munmap_filtered(), so when recording, if an event
> is terminated, unmap mmaps, from parent and children.
>
> Signed-off-by: Wang Nan 
> Cc: He Kuang 
> Cc: Jiri Olsa 
> Cc: Masami Hiramatsu 
> Cc: Namhyung Kim 
> Cc: Zefan Li 
> Cc: pi3or...@163.com
> ---
>  tools/perf/util/evlist.c | 49 
> +---
>  tools/perf/util/evlist.h | 12 
>  2 files changed, 50 insertions(+), 11 deletions(-)


Wang, thanks for making the change.  The patch seems perfectly fine to me.

--
Nilay


Re: [PATCH v9 5/8] perf record: Toggle overwrite ring buffer for reading

2016-06-22 Thread Nilay Vaish
On 22 June 2016 at 04:08, Wang Nan  wrote:
> @@ -549,17 +573,72 @@ static struct perf_event_header finished_round_event = {
> .type = PERF_RECORD_FINISHED_ROUND,
>  };
>
> -static int record__mmap_read_all(struct record *rec)
> +static void
> +record__toggle_overwrite_evsels(struct record *rec,
> +   enum overwrite_evt_state state)
> +{
> +   struct perf_evlist *evlist = rec->overwrite_evlist;
> +   enum overwrite_evt_state old_state = rec->overwrite_evt_state;
> +   enum action {
> +   NONE,
> +   PAUSE,
> +   RESUME,
> +   } action = NONE;
> +
> +   switch (old_state) {
> +   case OVERWRITE_EVT_RUNNING:
> +   if (state != OVERWRITE_EVT_RUNNING)
> +   action = PAUSE;
> +   break;
> +   case OVERWRITE_EVT_DATA_PENDING:
> +   if (state == OVERWRITE_EVT_RUNNING)
> +   action = RESUME;
> +   break;
> +   case OVERWRITE_EVT_EMPTY:
> +   if (state == OVERWRITE_EVT_RUNNING)
> +   action = RESUME;
> +   if (state == OVERWRITE_EVT_DATA_PENDING)
> +   state = OVERWRITE_EVT_EMPTY;

else if (state == OVERWRITE_EVT_DATA_PENDING)

> +   break;
> +   default:
> +   WARN_ONCE(1, "Shouldn't get there\n");
> +   }
> +
> +   rec->overwrite_evt_state = state;
> +
> +   if (action == NONE)
> +   return;

I think the above two lines are not required.  The switch below should
be enough.

> +
> +   if (!evlist)
> +   return;
> +
> +   switch (action) {
> +   case PAUSE:
> +   perf_evlist__pause(evlist);
> +   break;
> +   case RESUME:
> +   perf_evlist__resume(evlist);
> +   break;
> +   case NONE:
> +   default:
> +   break;
> +   }
> +}
> +
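
To make the two comments above concrete, here is roughly how I would
write the whole function.  Everything is taken from your patch as
quoted above; the only changes are the 'else if', the dropped
'if (action == NONE)' check, and the dropped empty NONE case label.
Untested, only meant to show the shape:

static void
record__toggle_overwrite_evsels(struct record *rec,
				enum overwrite_evt_state state)
{
	struct perf_evlist *evlist = rec->overwrite_evlist;
	enum overwrite_evt_state old_state = rec->overwrite_evt_state;
	enum action {
		NONE,
		PAUSE,
		RESUME,
	} action = NONE;

	switch (old_state) {
	case OVERWRITE_EVT_RUNNING:
		if (state != OVERWRITE_EVT_RUNNING)
			action = PAUSE;
		break;
	case OVERWRITE_EVT_DATA_PENDING:
		if (state == OVERWRITE_EVT_RUNNING)
			action = RESUME;
		break;
	case OVERWRITE_EVT_EMPTY:
		if (state == OVERWRITE_EVT_RUNNING)
			action = RESUME;
		else if (state == OVERWRITE_EVT_DATA_PENDING)
			state = OVERWRITE_EVT_EMPTY;
		break;
	default:
		WARN_ONCE(1, "Shouldn't get there\n");
	}

	rec->overwrite_evt_state = state;

	if (!evlist)
		return;

	/* action == NONE simply hits the default case and does nothing. */
	switch (action) {
	case PAUSE:
		perf_evlist__pause(evlist);
		break;
	case RESUME:
		perf_evlist__resume(evlist);
		break;
	default:
		break;
	}
}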

--
Nilay


Re: [PATCH v9 1/8] perf evlist: Introduce aux evlist

2016-06-22 Thread Nilay Vaish
On 22 June 2016 at 04:08, Wang Nan  wrote:

> @@ -1916,3 +1922,24 @@ perf_evlist__find_evsel_by_str(struct perf_evlist 
> *evlist,
>
> return NULL;
>  }
> +
> +struct perf_evlist *perf_evlist__new_aux(struct perf_evlist *parent)
> +{
> +   struct perf_evlist *evlist;
> +
> +   if (perf_evlist__is_aux(parent)) {
> +   pr_err("Internal error: create aux evlist from another aux 
> evlist\n");
> +   return NULL;
> +   }
> +
> +   evlist = zalloc(sizeof(*evlist));
> +   if (!evlist)
> +   return NULL;
> +
> +   perf_evlist__init(evlist, parent->cpus, parent->threads);
> +   evlist->parent = parent->parent;

A very minor suggestion: I think evlist->parent should be set to
'parent' and not 'parent->parent'.  I agree the two values are equal,
but setting it to parent->parent just does not seem right.
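
That is, just (untested; the only change is the second line):

	perf_evlist__init(evlist, parent->cpus, parent->threads);
	evlist->parent = parent;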

--
Nilay


Re: [PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF

2016-06-08 Thread Nilay Vaish
Naveen, can you point out where in the patch you update the variable
'idx', a member of the codegen_context structure?  Somehow I am unable
to figure it out.  I can only see that we set it to 0 in the
bpf_int_jit_compile() function.  Since all your test cases pass, I am
clearly overlooking something.

Thanks
Nilay

On 7 June 2016 at 08:32, Naveen N. Rao  wrote:
> PPC64 eBPF JIT compiler.
>
> Enable with:
> echo 1 > /proc/sys/net/core/bpf_jit_enable
> or
> echo 2 > /proc/sys/net/core/bpf_jit_enable
>
> ... to see the generated JIT code. This can further be processed with
> tools/net/bpf_jit_disasm.
>
> With CONFIG_TEST_BPF=m and 'modprobe test_bpf':
> test_bpf: Summary: 305 PASSED, 0 FAILED, [297/297 JIT'ed]
>
> ... on both ppc64 BE and LE.
>
> The details of the approach are documented through various comments in
> the code.
>
> Cc: Matt Evans 
> Cc: Denis Kirjanov 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: "David S. Miller" 
> Cc: Ananth N Mavinakayanahalli 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/Kconfig  |   3 +-
>  arch/powerpc/include/asm/asm-compat.h |   2 +
>  arch/powerpc/include/asm/ppc-opcode.h |  20 +-
>  arch/powerpc/net/Makefile |   4 +
>  arch/powerpc/net/bpf_jit.h|  53 +-
>  arch/powerpc/net/bpf_jit64.h  | 102 
>  arch/powerpc/net/bpf_jit_asm64.S  | 180 +++
>  arch/powerpc/net/bpf_jit_comp64.c | 956 
> ++
>  8 files changed, 1317 insertions(+), 3 deletions(-)
>  create mode 100644 arch/powerpc/net/bpf_jit64.h
>  create mode 100644 arch/powerpc/net/bpf_jit_asm64.S
>  create mode 100644 arch/powerpc/net/bpf_jit_comp64.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 01f7464..ee82f9a 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -128,7 +128,8 @@ config PPC
> select IRQ_FORCED_THREADING
> select HAVE_RCU_TABLE_FREE if SMP
> select HAVE_SYSCALL_TRACEPOINTS
> -   select HAVE_CBPF_JIT
> +   select HAVE_CBPF_JIT if !PPC64
> +   select HAVE_EBPF_JIT if PPC64
> select HAVE_ARCH_JUMP_LABEL
> select ARCH_HAVE_NMI_SAFE_CMPXCHG
> select ARCH_HAS_GCOV_PROFILE_ALL
> diff --git a/arch/powerpc/include/asm/asm-compat.h 
> b/arch/powerpc/include/asm/asm-compat.h
> index dc85dcb..cee3aa0 100644
> --- a/arch/powerpc/include/asm/asm-compat.h
> +++ b/arch/powerpc/include/asm/asm-compat.h
> @@ -36,11 +36,13 @@
>  #define PPC_MIN_STKFRM 112
>
>  #ifdef __BIG_ENDIAN__
> +#define LHZX_BEstringify_in_c(lhzx)
>  #define LWZX_BEstringify_in_c(lwzx)
>  #define LDX_BE stringify_in_c(ldx)
>  #define STWX_BEstringify_in_c(stwx)
>  #define STDX_BEstringify_in_c(stdx)
>  #else
> +#define LHZX_BEstringify_in_c(lhbrx)
>  #define LWZX_BEstringify_in_c(lwbrx)
>  #define LDX_BE stringify_in_c(ldbrx)
>  #define STWX_BEstringify_in_c(stwbrx)
> diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
> b/arch/powerpc/include/asm/ppc-opcode.h
> index fd8d640..6a77d130 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -142,9 +142,11 @@
>  #define PPC_INST_ISEL  0x7c1e
>  #define PPC_INST_ISEL_MASK 0xfc3e
>  #define PPC_INST_LDARX 0x7ca8
> +#define PPC_INST_STDCX 0x7c0001ad
>  #define PPC_INST_LSWI  0x7c0004aa
>  #define PPC_INST_LSWX  0x7c00042a
>  #define PPC_INST_LWARX 0x7c28
> +#define PPC_INST_STWCX 0x7c00012d
>  #define PPC_INST_LWSYNC0x7c2004ac
>  #define PPC_INST_SYNC  0x7c0004ac
>  #define PPC_INST_SYNC_MASK 0xfc0007fe
> @@ -211,8 +213,11 @@
>  #define PPC_INST_LBZ   0x8800
>  #define PPC_INST_LD0xe800
>  #define PPC_INST_LHZ   0xa000
> -#define PPC_INST_LHBRX 0x7c00062c
>  #define PPC_INST_LWZ   0x8000
> +#define PPC_INST_LHBRX 0x7c00062c
> +#define PPC_INST_LDBRX 0x7c000428
> +#define PPC_INST_STB   0x9800
> +#define PPC_INST_STH   0xb000
>  #define PPC_INST_STD   0xf800
>  #define PPC_INST_STDU  0xf801
>  #define PPC_INST_STW   0x9000
> @@ -221,22 +226,34 @@
>  #define PPC_INST_MTLR  0x7c0803a6
>  #define PPC_INST_CMPWI 0x2c00
>  #define PPC_INST_CMPDI 0x2c20
> +#define PPC_INST_CMPW  0x7c00
> +#define PPC_INST_CMPD  0x7c20
>  #define PPC_INST_CMPLW 0x7c40
> +#define PPC_INST_CMPLD 0x7c200040
>  #define PPC_INST_CMPLWI0x2800
> +#define PPC_INST_CMPLDI0x2820
>  #

Re: [PATCH 2/4] perf stat: Add computation of TopDown formulas

2016-06-02 Thread Nilay Vaish
Andi, I am talking about the if statement.  I don't see why it would
happen that nothing got measured; I am guessing you saw it happen.
Maybe we can add a comment in the patch noting that all counter values
can be zero, and that is why we need the if statement.
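
Something along these lines would be enough (only a sketch; the
function is yours as quoted below, the comment is the only addition):

static double td_be_bound(int ctx, int cpu)
{
	double sum = (td_fe_bound(ctx, cpu) +
		      td_bad_spec(ctx, cpu) +
		      td_retiring(ctx, cpu));

	/*
	 * If none of the topdown events were actually measured, all
	 * components are zero.  Return 0 here instead of reporting
	 * everything as backend bound.
	 */
	if (sum == 0)
		return 0;
	return sanitize_val(1.0 - sum);
}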

--
Nilay

On 1 June 2016 at 09:56, Andi Kleen  wrote:
> On Wed, Jun 01, 2016 at 09:50:07AM -0500, Nilay Vaish wrote:
>> On 24 May 2016 at 14:52, Andi Kleen  wrote:
>> > +static double td_be_bound(int ctx, int cpu)
>> > +{
>> > +   double sum = (td_fe_bound(ctx, cpu) +
>> > + td_bad_spec(ctx, cpu) +
>> > + td_retiring(ctx, cpu));
>> > +   if (sum == 0)
>> > +   return 0;
>> > +   return sanitize_val(1.0 - sum);
>> > +}
>> > +
>>
>> Can you explain why we need the check on sum?
>
> You mean the if statement?
>
> Otherwise if nothing was measured it would always report everything backend 
> bound,
> which wouldn't be correct.
>
> -Andi
>
> --
> a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat

2016-06-02 Thread Nilay Vaish
This patch looks fine to me.

--
Nilay

On 1 June 2016 at 10:24, Andi Kleen  wrote:
>
> Here's an updated patch addresses the duplicated issue and explicitly frees 
> memory.
>
> ---
>
> Add basic plumbing for TopDown in perf stat
>
> TopDown is intended to replace the frontend cycles idle/
> backend cycles idle metrics in standard perf stat output.
> These metrics are not reliable in many workloads,
> due to out of order effects.
>
> This implements a new --topdown mode in perf stat
> (similar to --transaction) that measures the pipe line
> bottlenecks using standardized formulas. The measurement
> can be all done with 5 counters (one fixed counter)
>
> The result are four metrics:
> FrontendBound, BackendBound, BadSpeculation, Retiring
>
> that describe the CPU pipeline behavior on a high level.
>
> FrontendBound and BackendBound
> BadSpeculation is a higher
>
> The full top down methology has many hierarchical metrics.
> This implementation only supports level 1 which can be
> collected without multiplexing. A full implementation
> of top down on top of perf is available in pmu-tools toplev.
> (http://github.com/andikleen/pmu-tools)
>
> The current version works on Intel Core CPUs starting
> with Sandy Bridge, and Atom CPUs starting with Silvermont.
> In principle the generic metrics should be also implementable
> on other out of order CPUs.
>
> TopDown level 1 uses a set of abstracted metrics which
> are generic to out of order CPU cores (although some
> CPUs may not implement all of them):
>
> topdown-total-slots   Available slots in the pipeline
> topdown-slots-issued  Slots issued into the pipeline
> topdown-slots-retired Slots successfully retired
> topdown-fetch-bubbles Pipeline gaps in the frontend
> topdown-recovery-bubbles  Pipeline gaps during recovery
>   from misspeculation
>
> These metrics then allow to compute four useful metrics:
> FrontendBound, BackendBound, Retiring, BadSpeculation.
>
> Add a new --topdown options to enable events.
> When --topdown is specified set up events for all topdown
> events supported by the kernel.
> Add topdown-* as a special case to the event parser, as is
> needed for all events containing -.
>
> The actual code to compute the metrics is in follow-on patches.
>
> v2: Use standard sysctl read function.
> v3: Move x86 specific code to arch/
> v4: Enable --metric-only implicitly for topdown.
> v5: Add --single-thread option to not force per core mode
> v6: Fix output order of topdown metrics
> v7: Allow combining with -d
> v8: Remove --single-thread again
> v9: Rename functions, adding arch_ and topdown_.
> v10: Expand man page and describe TopDown better
> Paste intro into commit description.
> Print error when malloc fails.
> v11:
> Free memory before exit
> Remove duplicate include.
> Acked-by: Jiri Olsa 
> Signed-off-by: Andi Kleen 
> ---
>  tools/perf/Documentation/perf-stat.txt |  32 +
>  tools/perf/arch/x86/util/Build |   1 +
>  tools/perf/arch/x86/util/group.c   |  27 
>  tools/perf/builtin-stat.c  | 119 
> -
>  tools/perf/util/group.h|   7 ++
>  tools/perf/util/parse-events.l |   1 +
>  6 files changed, 184 insertions(+), 3 deletions(-)
>  create mode 100644 tools/perf/arch/x86/util/group.c
>  create mode 100644 tools/perf/util/group.h
>
> diff --git a/tools/perf/Documentation/perf-stat.txt 
> b/tools/perf/Documentation/perf-stat.txt
> index 04f23b404bbc..d96ccd4844df 100644
> --- a/tools/perf/Documentation/perf-stat.txt
> +++ b/tools/perf/Documentation/perf-stat.txt
> @@ -204,6 +204,38 @@ Aggregate counts per physical processor for system-wide 
> mode measurements.
>  --no-aggr::
>  Do not aggregate counts across all monitored CPUs.
>
> +--topdown::
> +Print top down level 1 metrics if supported by the CPU. This allows to
> +determine bottle necks in the CPU pipeline for CPU bound workloads,
> +by breaking the cycles consumed down into frontend bound, backend bound,
> +bad speculation and retiring.
> +
> +Frontend bound means that the CPU cannot fetch and decode instructions fast
> +enough. Backend bound means that computation or memory access is the bottle
> +neck. Bad Speculation means that the CPU wasted cycles due to branch
> +mispredictions and similar issues. Retiring means that the CPU computed 
> without
> +an apparently bottleneck. The bottleneck is only the real bottleneck
> +if the workload is actually bound by the CPU and not by something else.
> +
> +For best results it is usually a good idea to use it with interval
> +mode like -I 1000, as the bottleneck of workloads can change often.
> +
> +The top down metrics are collected per core instead of per
> +CPU thread. Per core mode is automatically enabled
> +and -a (global monitoring) is needed, requiring root rights or
> +perf.perf_event_paranoid=-1.
> +
> +Topdown uses the full Performance Monitoring Unit, and needs
> +disabling of the NMI watch

Re: [PATCH 2/4] perf stat: Add computation of TopDown formulas

2016-06-01 Thread Nilay Vaish
On 24 May 2016 at 14:52, Andi Kleen  wrote:
> +static double td_be_bound(int ctx, int cpu)
> +{
> +   double sum = (td_fe_bound(ctx, cpu) +
> + td_bad_spec(ctx, cpu) +
> + td_retiring(ctx, cpu));
> +   if (sum == 0)
> +   return 0;
> +   return sanitize_val(1.0 - sum);
> +}
> +

Can you explain why we need the check on sum?

--
Nilay


Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat

2016-06-01 Thread Nilay Vaish
On 24 May 2016 at 14:52, Andi Kleen  wrote:
> From: Andi Kleen 
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 715a1128daeb..dab184d86816 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -59,10 +59,13 @@
>  #include "util/thread.h"
>  #include "util/thread_map.h"
>  #include "util/counts.h"
> +#include "util/group.h"
>  #include "util/session.h"
>  #include "util/tool.h"
> +#include "util/group.h"
>  #include "asm/bug.h"

You have included util/group.h twice.  Is this intentional?

> +static int topdown_filter_events(const char **attr, char **str, bool 
> use_group)
> +{
> +   int off = 0;
> +   int i;
> +   int len = 0;
> +   char *s;
> +
> +   for (i = 0; attr[i]; i++) {
> +   if (pmu_have_event("cpu", attr[i])) {
> +   len += strlen(attr[i]) + 1;
> +   attr[i - off] = attr[i];
> +   } else
> +   off++;
> +   }
> +   attr[i - off] = NULL;
> +
> +   *str = malloc(len + 1 + 2);
> +   if (!*str)
> +   return -1;
> +   s = *str;
> +   if (i - off == 0) {
> +   *s = 0;
> +   return 0;
> +   }

I think we are leaking some memory here.  If i == off, then we set
attr[0] = NULL and do not free the memory allocated to str.


> @@ -1909,6 +1982,46 @@ static int add_default_attributes(void)
> return 0;
> }
>
> +   if (topdown_run) {
> +   char *str = NULL;
> +   bool warn = false;
> +
> +   if (stat_config.aggr_mode != AGGR_GLOBAL &&
> +   stat_config.aggr_mode != AGGR_CORE) {
> +   pr_err("top down event configuration requires 
> --per-core mode\n");
> +   return -1;
> +   }
> +   stat_config.aggr_mode = AGGR_CORE;
> +   if (nr_cgroups || !target__has_cpu(&target)) {
> +   pr_err("top down event configuration requires 
> system-wide mode (-a)\n");
> +   return -1;
> +   }
> +
> +   if (!force_metric_only)
> +   metric_only = true;
> +   if (topdown_filter_events(topdown_attrs, &str,
> +   arch_topdown_check_group(&warn)) < 0) {
> +   pr_err("Out of memory\n");
> +   return -1;
> +   }
> +   if (topdown_attrs[0] && str) {
> +   if (warn)
> +   arch_topdown_group_warn();
> +   err = parse_events(evsel_list, str, NULL);
> +   if (err) {
> +   fprintf(stderr,
> +   "Cannot set up top down events %s: 
> %d\n",
> +   str, err);
> +   free(str);
> +   return -1;
> +   }
> +   } else {
> +   fprintf(stderr, "System does not support topdown\n");
> +   return -1;
> +   }
> +   free(str);
> +   }
> +

Continuing with my comment above about the memory leak: if i == off,
topdown_attrs[0] will be NULL.  So we would enter the else portion
here and return -1, but we never free the string we allocated in
topdown_filter_events().  I think we are leaking some memory, though it
seems to be only about 3 bytes.
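
A sketch of the small change I have in mind in add_default_attributes()
(untested; everything is from your patch as quoted above except the
added free(str) in the else branch):

		} else {
			fprintf(stderr, "System does not support topdown\n");
			free(str);	/* added: str was allocated by topdown_filter_events() */
			return -1;
		}
		free(str);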

--
Nilay


Re: UBSAN: Undefined behaviour in arch/x86/events/intel/p6.c:115:29

2016-05-18 Thread Nilay Vaish
On 16 May 2016 at 13:41, Meelis Roos  wrote:
> Not sure if this is a genuine warning or a false positive but since some
> UBSAN warnings have been real and google does not find report about this
> specific warning, I'll send it in anyway.
>
> I have seen similar amd pmu warnings from UBSAN but I do not have any
> amd machines from that time frame online for now, so p6 only.
>
> [0.15] Performance Events: p6 PMU driver.
> [0.15] 
> 
> [0.15] UBSAN: Undefined behaviour in arch/x86/events/intel/p6.c:115:29
> [0.15] index 8 is out of range for type 'u64 [8]'
> [0.15] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0 #21
> [0.15] Hardware name: Dell Computer Corporation PowerEdge 1550/933
>   , BIOS A09 12/10/2004
> [0.15]   c13a4bcc 0046 f605de88 0008 c13d5188 
> c17ddfd4 c13d5725
> [0.15]  c176ae8c f605de8c c17ddfec 0202 0038 00752101 
> 0002 
> [0.15]   0297 00c2 c17d2b60  c102b14f 
> 0008 
> [0.15] Call Trace:
> [0.15]  [] ? dump_stack+0x45/0x69
> [0.15]  [] ? ubsan_epilogue+0x8/0x30
> [0.15]  [] ? __ubsan_handle_out_of_bounds+0x55/0x60
> [0.15]  [] ? __register_nmi_handler+0xbf/0x300
> [0.15]  [] ? p4_pmu_schedule_events+0x740/0x740
> [0.15]  [] ? p6_pmu_event_map+0x3d/0x50
> [0.15]  [] ? p4_pmu_schedule_events+0x740/0x740
> [0.15]  [] ? init_hw_perf_events+0x493/0x688
> [0.15]  [] ? merge_attr+0x1d5/0x1d5
> [0.15]  [] ? do_one_initcall+0x82/0x230
> [0.15]  [] ? vprintk_default+0xf/0x20
> [0.15]  [] ? printk+0x11/0x12
> [0.15]  [] ? print_cpu_info+0x86/0x130
> [0.15]  [] ? native_smp_prepare_cpus+0x40e/0x453
> [0.15]  [] ? kernel_init_freeable+0x117/0x2fd
> [0.15]  [] ? kernel_init+0x6/0x100
> [0.15]  [] ? ret_from_kernel_thread+0x21/0x38
> [0.15]  [] ? rest_init+0x60/0x60
> [0.15] 
> 
> [0.15] 
> 
> [0.15] UBSAN: Undefined behaviour in arch/x86/events/intel/p6.c:115:9
> [0.15] load of address c16adf20 with insufficient space
> [0.15] for an object of type 'const u64'
> [0.15] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0 #21
> [0.15] Hardware name: Dell Computer Corporation PowerEdge 1550/933
>   , BIOS A09 12/10/2004
> [0.15]   c13a4bcc 0046 f605deb4 c16adf20 c13d5188 
> c17ddfac c13d5229
> [0.15]  c176a901 c17ddfc8 c176aca4 c176a95e c16adf20 0202 
> 0008 
> [0.15]  c101841d 0008 c10183d0 0008 c1af0c78 c17d2c00 
> f605df08 0001
> [0.15] Call Trace:
> [0.15]  [] ? dump_stack+0x45/0x69
> [0.15]  [] ? ubsan_epilogue+0x8/0x30
> [0.15]  [] ? __ubsan_handle_type_mismatch+0x79/0x150
> [0.15]  [] ? p6_pmu_event_map+0x4d/0x50
> [0.15]  [] ? p4_pmu_schedule_events+0x740/0x740
> [0.15]  [] ? init_hw_perf_events+0x493/0x688
> [0.15]  [] ? merge_attr+0x1d5/0x1d5
> [0.15]  [] ? do_one_initcall+0x82/0x230
> [0.15]  [] ? vprintk_default+0xf/0x20
> [0.15]  [] ? printk+0x11/0x12
> [0.15]  [] ? print_cpu_info+0x86/0x130
> [0.15]  [] ? native_smp_prepare_cpus+0x40e/0x453
> [0.15]  [] ? kernel_init_freeable+0x117/0x2fd
> [0.15]  [] ? kernel_init+0x6/0x100
> [0.15]  [] ? ret_from_kernel_thread+0x21/0x38
> [0.15]  [] ? rest_init+0x60/0x60
> [0.15] 
> 
> [0.15] ... version:0
> [0.15] ... bit width:  32
> [0.15] ... generic registers:  2
> [0.15] ... value mask: 
> [0.15] ... max period: 7fff
> [0.15] ... fixed-purpose events:   0
> [0.15] ... event mask: 0003
>


I think UBSAN has correctly identified a bug.  I looked at the code in
v4.6.  In file arch/x86/events/core.c, the loop in filter_events()
starting at line 1554 iterates over 10 generic events, but the event
map declared at line 9 of arch/x86/events/intel/p6.c has only 8
entries, so indexing it with 8 or 9 runs off the end of the array.

I have a fix, but I am not sure if it is reasonable.  I think we should
pass the pmu's max_events to the filter_events() function and change
the loop condition accordingly.  Can you apply the patch below and test
again?  It compiles, but I have not tested it.


diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 041e442..3fd33e6 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@