Re: [PATCH] kthread: kthread_bind fails to enforce CPU affinity (fixes kernel BUG at kernel/smpboot.c:134!)

2014-12-08 Thread Ingo Molnar

* Anton Blanchard an...@samba.org wrote:

 I have a busy ppc64le KVM box where guests sometimes hit the 
 infamous kernel BUG at kernel/smpboot.c:134! issue during 
 boot:
 
 BUG_ON(td->cpu != smp_processor_id());
 
 Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
 output confirms it:
 
 CPU: 0
 Comm: watchdog/130
 
 The issue is in kthread_bind where we set the cpus_allowed
 mask, but do not touch task_thread_info(p)->cpu. The scheduler
 assumes the previously scheduled CPU is in the cpus_allowed 
 mask, but in this case we are moving a thread to another CPU so 
 it is not.
 
 We used to call set_task_cpu which sets 
 task_thread_info(p)->cpu (in fact kthread_bind still has a
 comment suggesting this). That was removed in e2912009fb7b 
 (sched: Ensure set_task_cpu() is never called on blocked 
 tasks).
 
 Since we cannot call set_task_cpu (the task is in a sleeping 
 state), just do an explicit set of task_thread_info(p)->cpu.

So we cannot call set_task_cpu() because in the normal lifetime
of a task the ->cpu value gets set on wakeup. So if a task is
blocked right now, and its affinity changes, it ought to get a
correct ->cpu selected on wakeup. The affinity mask and the
current value of ->cpu getting out of sync is thus 'normal'.

(Check for example how set_cpus_allowed_ptr() works: we first set
the new allowed mask, then we migrate the task away if
necessary.)

In the kthread_bind() case this is explicitly assumed: it only 
calls do_set_cpus_allowed().
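
To illustrate that ordering, here is a small user-space sketch. It is a
toy model and not the scheduler code: struct toy_task, the toy_*()
helpers and the 'pick the lowest allowed CPU' policy are all invented
for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy task: a 64-bit affinity mask and the last CPU it ran on. */
struct toy_task {
	uint64_t cpus_allowed;
	int cpu;
	bool running;
};

static inline bool toy_cpu_allowed(const struct toy_task *t, int cpu)
{
	return cpu >= 0 && cpu < 64 && ((t->cpus_allowed >> cpu) & 1);
}

/* Assumed policy: lowest allowed CPU, or -1 if the mask is empty. */
static inline int toy_first_allowed(const struct toy_task *t)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (toy_cpu_allowed(t, cpu))
			return cpu;
	return -1;
}

/*
 * Step 1: set the new mask.
 * Step 2: migrate only if the task is currently running on a CPU that
 * is no longer allowed. A blocked task keeps its stale ->cpu.
 */
static inline void toy_set_cpus_allowed(struct toy_task *t, uint64_t new_mask)
{
	t->cpus_allowed = new_mask;
	if (t->running && !toy_cpu_allowed(t, t->cpu))
		t->cpu = toy_first_allowed(t);
}

/* Wakeup is where a blocked task gets a ->cpu that matches its mask. */
static inline void toy_wake_up(struct toy_task *t)
{
	if (!toy_cpu_allowed(t, t->cpu))
		t->cpu = toy_first_allowed(t);
	t->running = true;
}
```

In this model a blocked task keeps its old ->cpu after the mask changes,
and only the wakeup path picks a CPU from the new mask.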

But obviously the bug triggers in kernel/smpboot.c, and that 
assert shows a real bug - and your patch makes the assert go 
away, so the question is, how did the kthread get woken up and 
put on a runqueue without its ->cpu getting set?

One possibility is a generic scheduler bug in ttwu(), resulting 
in ->cpu not getting set properly. If this was the case then
other places would be blowing up as well, and I don't think we 
are seeing this currently, especially not over such a long 
timespan.

Another possibility would be that kthread_bind()'s assumption 
that the task is inactive is false: if the task activates when we 
think it's blocked and we just hotplug-migrate it away while it's
running (setting its td->cpu?), the assert could trigger I think
- and the patch would make the assert go away.

A third possibility would be, if this is a freshly created 
thread, some sort of initialization race - either in the kthread 
or in the scheduler code.

Weird.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs

2015-02-18 Thread Ingo Molnar

* Scott Wood scottw...@freescale.com wrote:

 On Mon, 2015-02-09 at 21:40 +0100, Andi Kleen wrote:
   I'll NAK any external 'download area' (and I told that Andi 
   before): tools/perf/event-tables/ or so is a good enough 
   'download area' with fast enough update cycles.
  
  The proposal was to put it on kernel.org, similar to how
  external firmware blobs are distributed. [...]

Fortunately perf is not an external firmware blob ...

  [...] CPU event lists are data sheets, so are like 
  firmware. [...]

What an absolute, idiotic, nonsense argument!

CPU event lists are human readable descriptions for events. 
If they aren't then they have no place in tooling.

Treating them like firmware is as backwards as it gets.

  [...]  They do not follow the normal kernel code 
  licenses. They are not source code. They cannot be 
  reviewed in the normal way.
 
 How is it different from describing registers and bits in 
 driver header files?  What does it mean to talk about a 
 license on information, rather than the expression of 
 information?

Andi is making idiotic arguments, instead of implementing 
the technically sane solution.

Thanks,

Ingo

Re: [GIT PULL 00/26] perf/core improvements and fixes

2015-01-28 Thread Ingo Molnar

* Arnaldo Carvalho de Melo a...@kernel.org wrote:

 Hi Ingo,
 
   Please consider pulling, it has my latest perf/urgent pull content,
 please let me know if you don't want it to be submitted like that, i.e. if
 you have any problems with my latest perf/urgent pull request and I'll try
 to address it ASAP.
 
 - Arnaldo
 
 The following changes since commit 25dd9171f51c482eb7c4dc8618766ae733756e2d:
 
   perf probe: Fix probing kretprobes (2015-01-21 10:06:24 -0300)
 
 are available in the git repository at:
 
   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo
 
 for you to fetch changes up to 3d199b5be53348bef84883013c484b414adf0a2e:
 
   tools lib traceevent: Add support for IP address formats (2015-01-26 12:04:41 -0300)
 
 
 perf/core improvements and fixes:
 
 User visible:
 
 - Enable sampling loads and stores simultaneously in 'perf mem' (Stephane Eranian)
 
 - 'perf diff' output improvements (Namhyung Kim)
 
 - Fix error reporting for evsel pgfault constructor (Arnaldo Carvalho de Melo)
 
 Infrastructure:
 
 - Move debugfs strerrno like method to tools/lib/ so that it may be used by
   other tools, as 'perf probe' will be soon (Arnaldo Carvalho de Melo)
 
 - Introduce function for deleting/removing hist_entry to avoid code duplication
   (Arnaldo Carvalho de Melo)
 
 - Support parsing parameterized events (Cody P Schafer)
 
 - Add support for IP address formats in libtraceevent (David Ahern)
 
 - Fix typo in sample-parsing.c 'perf test' entry (Rasmus Villemoes)
 
 - Remove some unused functions from color.c (Rickard Strandqvist)
 
 Signed-off-by: Arnaldo Carvalho de Melo a...@redhat.com
 
 
 Arnaldo Carvalho de Melo (9):
   perf mem: Move the mem_operations global to struct perf_mem
   perf tools: Remove EOL whitespaces
   perf hists: Rename hist_entry__free to __delete
   perf hists: Introduce function for deleting/removing hist_entry
   tools lib fs: Adopt debugfs open strerrno method
   tools lib fs: Pass filename to debugfs__strerror_open
   perf trace: Fix error reporting for evsel pgfault constructor
   tools lib fs debugfs: Introduce debugfs__strerror_open_tp
   tools lib fs debugfs: Check if debugfs is mounted when handling ENOENT
 
 Cody P Schafer (4):
   perf tools: Support parsing parameterized events
   perf tools: Extend format_alias() to include event parameters
   perf Documentation: Add event parameters
   perf tools: Document parameterized and symbolic events
 
 David Ahern (1):
   tools lib traceevent: Add support for IP address formats
 
 Namhyung Kim (9):
   perf report: Get rid of report__inc_stat()
   perf tools: Allow use of an exclusive option more than once
   perf diff: Get rid of hists__compute_resort()
   perf diff: Print diff result more precisely
   perf diff: Introduce fmt_to_data_file() helper
   perf tools: Pass struct perf_hpp_fmt to its callbacks
   perf diff: Fix output ordering to honor next column
   perf diff: Fix -o/--order option behavior
   perf ui/tui: Show fatal error message only if exists
 
 Rasmus Villemoes (1):
   perf tests: Fix typo in sample-parsing.c
 
 Rickard Strandqvist (1):
   perf tools: Remove some unused functions from color.c
 
 Stephane Eranian (1):
   perf mem: Enable sampling loads and stores simultaneously
 
  .../testing/sysfs-bus-event_source-devices-events  |   6 +
  tools/lib/api/fs/debugfs.c |  43 +++
  tools/lib/api/fs/debugfs.h |   3 +
  tools/lib/traceevent/event-parse.c | 328 +
  tools/perf/Documentation/perf-buildid-cache.txt|   2 +-
  tools/perf/Documentation/perf-list.txt |  13 +
  tools/perf/Documentation/perf-mem.txt  |   9 +-
  tools/perf/Documentation/perf-record.txt   |  12 +
  tools/perf/Documentation/perf-script.txt   |  28 +-
  tools/perf/Documentation/perf-stat.txt |  20 +-
  tools/perf/builtin-buildid-cache.c |   4 +-
  tools/perf/builtin-diff.c  | 248 ++--
  tools/perf/builtin-mem.c   | 131 ++--
  tools/perf/builtin-report.c|  16 +-
  tools/perf/builtin-stat.c  |   2 +-
  tools/perf/builtin-top.c   |   2 +-
  tools/perf/builtin-trace.c | 106 ---
  tools/perf/tests/attr.py   |   1 -
  tools/perf/tests/hists_cumulate.c  |   2 +-
  tools/perf/tests/hists_output.c|   2 +-
  tools/perf/tests/make  |   1 -
  tools/perf/tests/parse-events.c|   2 +-
  tools/perf/tests/sample-parsing.c  |   2 +-
  

Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs

2015-02-09 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 On Fri, Feb 06, 2015 at 04:43:54PM -0600, Tom Huynh wrote:
   arch/powerpc/perf/e6500-events-list.h | 289 ++
 
 That's a lot of events to stuff in the kernel, would a 
 userspace list not be more convenient?
 
 ISTR there being various discussions on providing support 
 for that in tools/perf, Jiri?

As long as it's in a single well organized place in tools/, 
I'd be fine with that solution as well.

What doesn't work very well is disjunct, disorganized, 
inconsistent event descriptions all across the tooling and 
platform landscape - putting static tables into sysfs is a 
marked improvement over that, despite its memory usage.

Thanks,

Ingo

Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs

2015-02-09 Thread Ingo Molnar

* Jiri Olsa jo...@redhat.com wrote:

 On Mon, Feb 09, 2015 at 11:07:38AM +0100, Ingo Molnar wrote:
  
  * Peter Zijlstra pet...@infradead.org wrote:
  
   On Fri, Feb 06, 2015 at 04:43:54PM -0600, Tom Huynh wrote:
    arch/powerpc/perf/e6500-events-list.h | 289 ++
   
   That's a lot of events to stuff in the kernel, would a 
   userspace list not be more convenient?
   
   ISTR there being various discussions on providing support 
   for that in tools/perf, Jiri?
  
  As long as it's in a single well organized place in tools/, 
  I'd be fine with that solution as well.
  
  What doesn't work very well is disjunct, disorganized, 
  inconsistent event descriptions all across the tooling and 
  platform landscape - putting static tables into sysfs is a 
  marked improvement over that, despite its memory usage.
 
 the last version is in here:
 http://marc.info/?l=linux-kernel&m=140676269017820&w=2
 
 AFAIK Andi is setting up the download area as discussed 
 in the thread and should repost at some point

I'll NAK any external 'download area' (and I told that Andi 
before): tools/perf/event-tables/ or so is a good enough 
'download area' with fast enough update cycles.

If any 'update' of event descriptions is needed it can 
happen through the distro package mechanism, or via a 
simple 'git pull' if it's compiled directly.

Let's not overengineer this with any dependence on an
external site and with a separate update mechanism - let's
just get the tables into tools/ and see it from there...

Thanks,

Ingo

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Ingo Molnar

* Mel Gorman mgor...@suse.de wrote:

 Elapsed time is primarily worse on one benchmark -- numa01 which is 
 an adverse workload. The user time differences are also dominated by 
 that benchmark
 
                                      4.0.0-rc1           4.0.0-rc1              3.19.0
                                        vanilla       slowscan-v2r7             vanilla
 Time User-NUMA01            32883.59 (  0.00%)  35288.00 ( -7.31%)  25695.96 ( 21.86%)
 Time User-NUMA01_THEADLOCAL 17453.20 (  0.00%)  17765.79 ( -1.79%)  17404.36 (  0.28%)
 Time User-NUMA02             2063.70 (  0.00%)   2063.22 (  0.02%)   2037.65 (  1.26%)
 Time User-NUMA02_SMT          983.70 (  0.00%)    976.01 (  0.78%)    981.02 (  0.27%)

But even for 'numa02', the simplest of the workloads, there appears to
be something of a regression relative to v3.19, which ought to be beyond
the noise of the measurement (which would be below 1% I suspect), and 
as such relevant, right?

And the XFS numbers still show significant regression compared to 
v3.19 - and that cannot be ignored as artificial, 'adversarial' 
workload, right?

For example, from your numbers:

xfsrepair
                             4.0.0-rc1         4.0.0-rc1            3.19.0
                               vanilla       slowscan-v2           vanilla
...
Amean real-xfsrepair  507.85 (  0.00%)  459.58 (  9.50%)  447.66 ( 11.85%)
Amean syst-xfsrepair  519.88 (  0.00%)  281.63 ( 45.83%)  202.93 ( 60.97%)

if I interpret the numbers correctly, it shows that compared to v3.19, 
system time increased by 38% - which is rather significant!
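
As a quick check of that arithmetic, a minimal sketch (pct_increase() is
a made-up helper; the inputs are the Amean syst-xfsrepair values quoted
above):

```c
/*
 * Relative increase of 'value' over 'base', in percent.
 * Hypothetical helper, for illustration only.
 */
static inline double pct_increase(double base, double value)
{
	return (value - base) / base * 100.0;
}
```

With base = 202.93 (v3.19) and value = 281.63 (slowscan-v2) this comes
out at roughly 38.8%.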

  So what worries me is that Dave bisected the regression to:
  
    4d9424669946 (mm: convert p[te|md]_mknonnuma and remaining page table
    manipulations)
  
  And clearly your patch #4 just tunes balancing/migration intensity 
  - is that a workaround for the real problem/bug?
 
 The patch makes NUMA hinting faults use standard page table handling 
 routines and protections to trap the faults. Fundamentally it's 
 safer even though it appears to cause more traps to be handled. I've 
 been assuming this is related to the different permissions PTEs get 
 and when they are visible on all CPUs. This patch is addressing the
 symptom that more faults are being handled and that it needs to be 
 less aggressive.

But the whole cleanup ought to have been close to an identity 
transformation from the CPU's point of view - and your measurements 
seem to confirm Dave's findings.

And your measurement was on bare metal, while Dave's is on a VM, and 
both show a significant slowdown on the xfs tests even with your 
slow-tuning patch applied, so it's unlikely to be a measurement fluke 
or some weird platform property.

 I've gone through that patch and didn't spot anything else that it is
 doing wrong that is not already handled in this series. Did you spot
 anything obviously wrong in that patch that isn't addressed in this 
 series?

I didn't spot anything wrong, but is that a basis to go forward and 
work around the regression, in a way that doesn't even recover lost 
performance?

  And the patch Dave bisected to is a relatively simple patch. Why 
  not simply revert it to see whether that cures much of the 
  problem?
 
 Because it also means reverting all the PROT_NONE handling and going 
 back to _PAGE_NUMA tricks which I expect would be NAKed by Linus.

Yeah, I realize that (and obviously I support the PROT_NONE direction 
that Peter Zijlstra prototyped with the original sched/numa series), 
but can we leave this much of a regression on the table?

I hate to be such a pain in the neck, but especially the 'down tuning' 
of the scanning intensity will make an apples to apples comparison 
harder!

I'd rather not do the slow-tuning part and leave sucky performance in 
place for now and have an easy method plus the motivation to find and 
fix the real cause of the regression, than to partially hide it this 
way ...

Thanks,

Ingo

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Ingo Molnar

* Mel Gorman mgor...@suse.de wrote:

 xfsrepair
                                  4.0.0-rc1          4.0.0-rc1             3.19.0
                                    vanilla        slowscan-v2            vanilla
 Min      real-fsmark     1157.41 (  0.00%)  1150.38 (  0.61%)  1164.44 ( -0.61%)
 Min      syst-fsmark     3998.06 (  0.00%)  3988.42 (  0.24%)  4016.12 ( -0.45%)
 Min      real-xfsrepair   497.64 (  0.00%)   456.87 (  8.19%)   442.64 ( 11.05%)
 Min      syst-xfsrepair   500.61 (  0.00%)   263.41 ( 47.38%)   194.97 ( 61.05%)
 Amean    real-fsmark     1166.63 (  0.00%)  1155.97 (  0.91%)  1166.28 (  0.03%)
 Amean    syst-fsmark     4020.94 (  0.00%)  4004.19 (  0.42%)  4025.87 ( -0.12%)
 Amean    real-xfsrepair   507.85 (  0.00%)   459.58 (  9.50%)   447.66 ( 11.85%)
 Amean    syst-xfsrepair   519.88 (  0.00%)   281.63 ( 45.83%)   202.93 ( 60.97%)
 Stddev   real-fsmark        6.55 (  0.00%)     3.97 ( 39.30%)     1.44 ( 77.98%)
 Stddev   syst-fsmark       16.22 (  0.00%)    15.09 (  6.96%)     9.76 ( 39.86%)
 Stddev   real-xfsrepair    11.17 (  0.00%)     3.41 ( 69.43%)     5.57 ( 50.17%)
 Stddev   syst-xfsrepair    13.98 (  0.00%)    19.94 (-42.60%)     5.69 ( 59.31%)
 CoeffVar real-fsmark        0.56 (  0.00%)     0.34 ( 38.74%)     0.12 ( 77.97%)
 CoeffVar syst-fsmark        0.40 (  0.00%)     0.38 (  6.57%)     0.24 ( 39.93%)
 CoeffVar real-xfsrepair     2.20 (  0.00%)     0.74 ( 66.22%)     1.24 ( 43.47%)
 CoeffVar syst-xfsrepair     2.69 (  0.00%)     7.08 (-163.23%)     2.80 ( -4.23%)
 Max      real-fsmark     1171.98 (  0.00%)  1159.25 (  1.09%)  1167.96 (  0.34%)
 Max      syst-fsmark     4033.84 (  0.00%)  4024.53 (  0.23%)  4039.20 ( -0.13%)
 Max      real-xfsrepair   523.40 (  0.00%)   464.40 ( 11.27%)   455.42 ( 12.99%)
 Max      syst-xfsrepair   533.37 (  0.00%)   309.38 ( 42.00%)   207.94 ( 61.01%)

Btw., I think it would be nice if these numbers listed v3.19 
performance in the first column, to make it clear at a glance
how much regression we still have?

Thanks,

Ingo

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

 On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar mi...@kernel.org wrote:
 
  And the patch Dave bisected to is a relatively simple patch. Why 
  not simply revert it to see whether that cures much of the 
  problem?
 
 So the problem with that is that pmd_set_numa() and friends simply 
 no longer exist. So we can't just revert that one patch, it's the 
 whole series, and the whole point of the series.

Yeah.

 What confuses me is that the only real change that I can see in that 
 patch is the change to change_huge_pmd(). Everything else is 
 pretty much a 100% equivalent transformation, afaik. Of course, I 
 may be wrong about that, and missing something silly.

Well, there's a difference in what we write to the pte:

 #define _PAGE_BIT_NUMA  (_PAGE_BIT_GLOBAL+1)
 #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

and our expectation was that the two should be equivalent methods from 
the POV of the NUMA balancing code, right?

 And the changes to change_huge_pmd() were basically re-done
 differently by subsequent patches anyway.
 
 The *only* change I see remaining is that change_huge_pmd() now does
 
     entry = pmdp_get_and_clear_notify(mm, addr, pmd);
     entry = pmd_modify(entry, newprot);
     set_pmd_at(mm, addr, pmd, entry);
 
 for all changes. It used to do that pmdp_set_numa() for the
 prot_numa case, which did just
 
     pmd_t pmd = *pmdp;
     pmd = pmd_mknuma(pmd);
     set_pmd_at(mm, addr, pmdp, pmd);
 
 instead.
 
 I don't like the old pmdp_set_numa() because it can drop dirty bits,
 so I think the old code was actively buggy.

Could we, as a silly testing hack not to be applied, write a 
hack-patch that re-introduces the racy way of setting the NUMA bit, to 
confirm that it is indeed this difference that changes pte visibility 
across CPUs enough to create so many more faults?

Because if the answer is 'yes', then we can safely say: 'we regressed 
performance because correctness [not dropping dirty bits] comes before 
performance'.

If the answer is 'no', then we still have a mystery (and a regression) 
to track down.

As a second hack (not to be applied), could we change:

 #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

to:

 #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)

to double check that the position of the bit does not matter?

I don't think we've exhausted all avenues of analysis here.

Thanks,

Ingo

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-07 Thread Ingo Molnar

* Mel Gorman mgor...@suse.de wrote:

 Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
 
 Across the board the 4.0-rc1 numbers are much slower, and the 
 degradation is far worse when using the large memory footprint 
 configs. Perf points straight at the cause - this is from 4.0-rc1 on 
 the -o bhash=101073 config:
 
 [...]

              4.0.0-rc1    4.0.0-rc1       3.19.0
                vanilla  slowscan-v2      vanilla
 User          53384.29     56093.11     46119.12
 System          692.14       311.64       306.41
 Elapsed        1236.87      1328.61      1039.88
 
 Note that the system CPU usage is now similar to 3.19-vanilla.

Similar, but still worse, and also the elapsed time is still much 
worse. User time is much higher, although it's the same amount of work 
done on every kernel, right?

 I also tested with a workload very similar to Dave's. The machine 
 configuration and storage is completely different so it's not an 
 equivalent test unfortunately. It's reporting the elapsed time and 
 CPU time while fsmark is running to create the inodes and when 
 running xfsrepair afterwards
 
 xfsrepair
                                  4.0.0-rc1          4.0.0-rc1             3.19.0
                                    vanilla        slowscan-v2            vanilla
 Min      real-fsmark     1157.41 (  0.00%)  1150.38 (  0.61%)  1164.44 ( -0.61%)
 Min      syst-fsmark     3998.06 (  0.00%)  3988.42 (  0.24%)  4016.12 ( -0.45%)
 Min      real-xfsrepair   497.64 (  0.00%)   456.87 (  8.19%)   442.64 ( 11.05%)
 Min      syst-xfsrepair   500.61 (  0.00%)   263.41 ( 47.38%)   194.97 ( 61.05%)
 Amean    real-fsmark     1166.63 (  0.00%)  1155.97 (  0.91%)  1166.28 (  0.03%)
 Amean    syst-fsmark     4020.94 (  0.00%)  4004.19 (  0.42%)  4025.87 ( -0.12%)
 Amean    real-xfsrepair   507.85 (  0.00%)   459.58 (  9.50%)   447.66 ( 11.85%)
 Amean    syst-xfsrepair   519.88 (  0.00%)   281.63 ( 45.83%)   202.93 ( 60.97%)
 Stddev   real-fsmark        6.55 (  0.00%)     3.97 ( 39.30%)     1.44 ( 77.98%)
 Stddev   syst-fsmark       16.22 (  0.00%)    15.09 (  6.96%)     9.76 ( 39.86%)
 Stddev   real-xfsrepair    11.17 (  0.00%)     3.41 ( 69.43%)     5.57 ( 50.17%)
 Stddev   syst-xfsrepair    13.98 (  0.00%)    19.94 (-42.60%)     5.69 ( 59.31%)
 CoeffVar real-fsmark        0.56 (  0.00%)     0.34 ( 38.74%)     0.12 ( 77.97%)
 CoeffVar syst-fsmark        0.40 (  0.00%)     0.38 (  6.57%)     0.24 ( 39.93%)
 CoeffVar real-xfsrepair     2.20 (  0.00%)     0.74 ( 66.22%)     1.24 ( 43.47%)
 CoeffVar syst-xfsrepair     2.69 (  0.00%)     7.08 (-163.23%)     2.80 ( -4.23%)
 Max      real-fsmark     1171.98 (  0.00%)  1159.25 (  1.09%)  1167.96 (  0.34%)
 Max      syst-fsmark     4033.84 (  0.00%)  4024.53 (  0.23%)  4039.20 ( -0.13%)
 Max      real-xfsrepair   523.40 (  0.00%)   464.40 ( 11.27%)   455.42 ( 12.99%)
 Max      syst-xfsrepair   533.37 (  0.00%)   309.38 ( 42.00%)   207.94 ( 61.01%)
 
 The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
 is almost cut in half. It's still not as low as 3.19-vanilla but it's
 much closer
 
                             4.0.0-rc1   4.0.0-rc1      3.19.0
                               vanilla slowscan-v2     vanilla
 NUMA alloc hit              146138883   121929782   104019526
 NUMA alloc miss              13146328    11456356     7806370
 NUMA interleave hit                 0           0           0
 NUMA alloc local            146060848   121865921   103953085
 NUMA base PTE updates       242201535   117237258   216624143
 NUMA huge PMD updates          113270       52121      127782
 NUMA page range updates     300195775   143923210   282048527
 NUMA hint faults            180388025    87299060   147235021
 NUMA hint local faults       72784532    32939258    61866265
 NUMA hint local percent            40          37          42
 NUMA pages migrated          71175262    41395302    23237799
 
 Note the big differences in faults trapped and pages migrated. 
 3.19-vanilla still migrated fewer pages but if necessary the 
 threshold at which we start throttling migrations can be lowered.

This too is still worse than what v3.19 had.

So what worries me is that Dave bisected the regression to:

  4d9424669946 (mm: convert p[te|md]_mknonnuma and remaining page table
  manipulations)

And clearly your patch #4 just tunes balancing/migration intensity - 
is that a workaround for the real problem/bug?

And the patch Dave bisected to is a relatively simple patch.
Why not simply revert it to see whether that cures much of the 
problem?

Am I missing something fundamental?

Thanks,

Ingo

Re: [PATCH v2 2/2] powerpc/mm: Tracking vDSO remap

2015-03-25 Thread Ingo Molnar

* Laurent Dufour lduf...@linux.vnet.ibm.com wrote:

 Some processes (CRIU) are moving the vDSO area using the mremap system
 call. As a consequence the kernel reference to the vDSO base address is
 no more valid and the signal return frame built once the vDSO has been
 moved is not pointing to the new sigreturn address.
 
 This patch handles vDSO remapping and unmapping.
 
 Signed-off-by: Laurent Dufour lduf...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/mmu_context.h | 36 +-
  1 file changed, 35 insertions(+), 1 deletion(-)
 
 diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
 index 73382eba02dc..be5dca3f7826 100644
 --- a/arch/powerpc/include/asm/mmu_context.h
 +++ b/arch/powerpc/include/asm/mmu_context.h
 @@ -8,7 +8,6 @@
  #include <linux/spinlock.h>
  #include <asm/mmu.h>
  #include <asm/cputable.h>
 -#include <asm-generic/mm_hooks.h>
  #include <asm/cputhreads.h>
  
  /*
 @@ -109,5 +108,40 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
  #endif
  }
  
 +static inline void arch_dup_mmap(struct mm_struct *oldmm,
 +  struct mm_struct *mm)
 +{
 +}
 +
 +static inline void arch_exit_mmap(struct mm_struct *mm)
 +{
 +}
 +
 +static inline void arch_unmap(struct mm_struct *mm,
 + struct vm_area_struct *vma,
 + unsigned long start, unsigned long end)
 +{
 + if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
 + mm->context.vdso_base = 0;
 +}
 +
 +static inline void arch_bprm_mm_init(struct mm_struct *mm,
 +  struct vm_area_struct *vma)
 +{
 +}
 +
 +#define __HAVE_ARCH_REMAP
 +static inline void arch_remap(struct mm_struct *mm,
 +   unsigned long old_start, unsigned long old_end,
 +   unsigned long new_start, unsigned long new_end)
 +{
 + /*
 +  * mremap don't allow moving multiple vma so we can limit the check
 +  * to old_start == vdso_base.

s/mremap don't allow moving multiple vma
  mremap() doesn't allow moving multiple vmas

right?

Thanks,

Ingo

Re: [PATCH v3 2/2] powerpc/mm: Tracking vDSO remap

2015-03-25 Thread Ingo Molnar

* Laurent Dufour lduf...@linux.vnet.ibm.com wrote:

 +static inline void arch_unmap(struct mm_struct *mm,
 + struct vm_area_struct *vma,
 + unsigned long start, unsigned long end)
 +{
 + if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
 + mm->context.vdso_base = 0;
 +}

So AFAICS PowerPC can have multi-page vDSOs, right?

So what happens if I munmap() the middle or end of the vDSO? The above 
condition only seems to cover unmaps that affect the first page. I 
think 'affects any page' ought to be the right condition? (But I know 
nothing about PowerPC so I might be wrong.)
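
An 'affects any page' condition could be sketched as a plain range
overlap test. This is a user-space sketch under assumptions: struct
toy_vdso and its size field are made up (the patch only tracks
vdso_base), and the toy_*() names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the vDSO bookkeeping kept in mm->context. */
struct toy_vdso {
	unsigned long base;	/* 0 means "no usable vDSO" */
	unsigned long size;	/* length in bytes; assumed to be tracked */
};

/* True if [start, end) overlaps any byte of [base, base + size). */
static inline bool toy_unmap_hits_vdso(const struct toy_vdso *v,
				       unsigned long start, unsigned long end)
{
	if (!v->base)
		return false;
	return start < v->base + v->size && v->base < end;
}

/* Clear the base if any part of the vDSO goes away, not just page 0. */
static inline void toy_arch_unmap(struct toy_vdso *v,
				  unsigned long start, unsigned long end)
{
	if (toy_unmap_hits_vdso(v, start, end))
		v->base = 0;
}
```

With a two-page vDSO, unmapping only the second page clears the base
here, while the quoted condition would leave it set.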


 +#define __HAVE_ARCH_REMAP
 +static inline void arch_remap(struct mm_struct *mm,
 +   unsigned long old_start, unsigned long old_end,
 +   unsigned long new_start, unsigned long new_end)
 +{
 + /*
 +  * mremap() doesn't allow moving multiple vmas so we can limit the
 +  * check to old_start == vdso_base.
 +  */
 + if (old_start == mm->context.vdso_base)
 + mm->context.vdso_base = new_start;
 +}

mremap() doesn't allow moving multiple vmas, but it allows the 
movement of multi-page vmas and it also allows partial mremap()s, 
where it will split up a vma.

In particular, what happens if an mremap() is done with 
old_start == vdso_base, but a shorter end than the end of the vDSO? 
(i.e. a partial mremap() with fewer pages than the vDSO size)

Thanks,

Ingo

Re: [PATCH v3 2/2] powerpc/mm: Tracking vDSO remap

2015-03-25 Thread Ingo Molnar

* Ingo Molnar mi...@kernel.org wrote:

  +#define __HAVE_ARCH_REMAP
  +static inline void arch_remap(struct mm_struct *mm,
  + unsigned long old_start, unsigned long old_end,
  + unsigned long new_start, unsigned long new_end)
  +{
  +   /*
  +* mremap() doesn't allow moving multiple vmas so we can limit the
  +* check to old_start == vdso_base.
  +*/
  +   if (old_start == mm->context.vdso_base)
  +   mm->context.vdso_base = new_start;
  +}
 
 mremap() doesn't allow moving multiple vmas, but it allows the 
 movement of multi-page vmas and it also allows partial mremap()s, 
 where it will split up a vma.

I.e. mremap() supports the shrinking (and growing) of vmas. In that 
case mremap() will unmap the end of the vma and will shrink the 
remaining vDSO vma.

Doesn't that result in a non-working vDSO that should zero out 
vdso_base?
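
One way to express that, as a user-space sketch only: follow the vDSO
when it is moved whole, and zero the base on any partial move or
resize. struct vdso_track, its size field and toy_arch_remap() are
assumptions made for illustration; the patch itself only has vdso_base.

```c
/* Hypothetical bookkeeping; the size field is an assumption. */
struct vdso_track {
	unsigned long base;	/* 0 means "no usable vDSO" */
	unsigned long size;	/* length in bytes */
};

static inline void toy_arch_remap(struct vdso_track *v,
				  unsigned long old_start, unsigned long old_end,
				  unsigned long new_start, unsigned long new_end)
{
	unsigned long vdso_end = v->base + v->size;

	if (!v->base)
		return;

	/* The moved range does not touch the vDSO: nothing to do. */
	if (old_end <= v->base || vdso_end <= old_start)
		return;

	/* Whole vDSO moved, same length: follow it to the new address. */
	if (old_start == v->base && old_end == vdso_end &&
	    new_end - new_start == v->size)
		v->base = new_start;
	else
		v->base = 0;	/* partial move, shrink or grow */
}
```

Whether a shrunken vDSO should really be treated as gone is the open
question above; the sketch simply takes the conservative answer.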

Thanks,

Ingo

Re: [PATCH v3 2/2] powerpc/mm: Tracking vDSO remap

2015-03-26 Thread Ingo Molnar

* Benjamin Herrenschmidt b...@kernel.crashing.org wrote:

   +#define __HAVE_ARCH_REMAP
   +static inline void arch_remap(struct mm_struct *mm,
   +   unsigned long old_start, unsigned long old_end,
   +   unsigned long new_start, unsigned long new_end)
   +{
   + /*
   +  * mremap() doesn't allow moving multiple vmas so we can limit the
   +  * check to old_start == vdso_base.
   +  */
   + if (old_start == mm->context.vdso_base)
   + mm->context.vdso_base = new_start;
   +}
  
  mremap() doesn't allow moving multiple vmas, but it allows the 
  movement of multi-page vmas and it also allows partial mremap()s, 
  where it will split up a vma.
  
  In particular, what happens if an mremap() is done with 
  old_start == vdso_base, but a shorter end than the end of the vDSO? 
  (i.e. a partial mremap() with fewer pages than the vDSO size)
 
 Is there a way to forbid splitting ? Does x86 deal with that case at 
 all or it doesn't have to for some other reason ?

So we use _install_special_mapping() - maybe PowerPC does that too? 
That adds VM_DONTEXPAND which ought to prevent some - but not all - of 
the VM API weirdnesses.

On x86 we'll just dump core if someone unmaps the vdso.

Thanks,

Ingo

Re: [PATCH v3 2/2] powerpc/mm: Tracking vDSO remap

2015-03-26 Thread Ingo Molnar

* Benjamin Herrenschmidt b...@kernel.crashing.org wrote:

 On Wed, 2015-03-25 at 19:36 +0100, Ingo Molnar wrote:
  * Ingo Molnar mi...@kernel.org wrote:
  
    +#define __HAVE_ARCH_REMAP
    +static inline void arch_remap(struct mm_struct *mm,
    + unsigned long old_start, unsigned long old_end,
    + unsigned long new_start, unsigned long new_end)
    +{
    +   /*
    +* mremap() doesn't allow moving multiple vmas so we can limit the
    +* check to old_start == vdso_base.
    +*/
    +   if (old_start == mm->context.vdso_base)
    +   mm->context.vdso_base = new_start;
    +}
   
   mremap() doesn't allow moving multiple vmas, but it allows the 
   movement of multi-page vmas and it also allows partial mremap()s, 
   where it will split up a vma.
  
  I.e. mremap() supports the shrinking (and growing) of vmas. In that 
  case mremap() will unmap the end of the vma and will shrink the 
  remaining vDSO vma.
  
  Doesn't that result in a non-working vDSO that should zero out 
  vdso_base?
 
 Right. Now we can't completely prevent the user from shooting itself 
 in the foot I suppose, though there is a legit usage scenario which 
 is to move the vDSO around which it would be nice to support. I 
 think it's reasonable to put the onus on the user here to do the 
 right thing.

I argue we should use the right condition to clear vdso_base: if the 
vDSO gets at least partially unmapped. Otherwise there's little point 
in the whole patch: either correctly track whether the vDSO is OK, or 
don't ...

There's also the question of mprotect(): can users mprotect() the vDSO 
on PowerPC?

Thanks,

Ingo

Re: [PATCH v2 0/5] split ET_DYN ASLR from mmap ASLR

2015-03-02 Thread Ingo Molnar

* Kees Cook keesc...@chromium.org wrote:

 To address the offset2lib ASLR weakness[1], this separates ET_DYN
 ASLR from mmap ASLR, as already done on s390. The architectures
 that are already randomizing mmap (arm, arm64, mips, powerpc, s390,
 and x86), have their various forms of arch_mmap_rnd() made available
 via the new CONFIG_ARCH_HAS_ELF_RANDOMIZE. For these architectures,
 arch_randomize_brk() is collapsed as well.
 
 This is an alternative to the solutions in:
 https://lkml.org/lkml/2015/2/23/442

Looks good so far:

Reviewed-by: Ingo Molnar mi...@kernel.org

While reviewing this series I also noticed that the following code 
could be factored out from architecture mmap code as well:

  - arch_pick_mmap_layout() uses very similar patterns across the 
platforms, with only few variations. Many architectures use 
the same duplicated mmap_is_legacy() helper as well. There's 
usually just trivial differences between mmap_legacy_base() 
approaches as well.

  - arch_mmap_rnd(): the PF_RANDOMIZE checks are needlessly
exposed to the arch routine - the arch routine should only 
concentrate on arch details, not generic flags like
PF_RANDOMIZE.

In theory the mmap layout could be fully parametrized as well: i.e. no 
callback functions to architectures by default at all: just 
declarations of bits of randomization desired (or, available address 
space bits), and perhaps an arch helper to allow 32-bit vs. 64-bit 
address space distinctions.

'Weird' architectures could provide special routines, but only by 
overriding the default behavior, which should be generic, safe and 
robust.
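A minimal user-space sketch of what such a parametrized layout might look like. Everything here is an assumption made for illustration: `struct mmap_rnd_params`, `generic_mmap_rnd()`, `SKETCH_PAGE_SHIFT` and `example_params` are hypothetical names, not existing kernel interfaces. The architecture only declares how many random bits it wants; the generic helper does the masking and scaling, and the decision whether to randomize at all stays with the generic caller.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K pages */

/* Hypothetical per-arch declaration: only the entropy sizes differ. */
struct mmap_rnd_params {
	unsigned int bits_native;	/* random bits for native tasks */
	unsigned int bits_compat;	/* random bits for 32-bit compat tasks */
};

/* Hypothetical values for one architecture, for illustration only. */
static const struct mmap_rnd_params example_params = {
	.bits_native = 16,
	.bits_compat = 8,
};

/*
 * Generic helper: mask the supplied random value down to the declared
 * number of bits and scale it to a page-aligned offset. The generic
 * caller passes 'randomize', so the arch side never has to look at
 * flags like PF_RANDOMIZE.
 */
unsigned long generic_mmap_rnd(const struct mmap_rnd_params *p,
			       bool is_compat, bool randomize,
			       unsigned long random)
{
	unsigned int bits = is_compat ? p->bits_compat : p->bits_native;

	/* The shifted value must still fit into an unsigned long. */
	assert(bits + SKETCH_PAGE_SHIFT < sizeof(unsigned long) * 8);

	if (!randomize)
		return 0;

	return (random & ((1UL << bits) - 1)) << SKETCH_PAGE_SHIFT;
}
```

A 'weird' architecture would then be the exception that supplies its own routine, instead of every architecture carrying a near-identical copy.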

Thanks,

Ingo

Re: [PATCH 0/5] split ET_DYN ASLR from mmap ASLR

2015-02-26 Thread Ingo Molnar

* Kees Cook keesc...@chromium.org wrote:

 This separates ET_DYN ASLR from mmap ASLR, as already 
 done on s390. The various architectures that are already 
 randomizing mmap (arm, arm64, mips, powerpc, s390, and 
 x86), have their various forms of arch_mmap_rnd() made 
 available via the new CONFIG_ARCH_HAS_ELF_RANDOMIZE. For 
 these architectures, arch_randomize_brk() is collapsed as 
 well.
 
 This is an alternative to the solutions in: 
 https://lkml.org/lkml/2015/2/23/442

Nice!

Acked-by: Ingo Molnar mi...@kernel.org

Thanks,

Ingo

Re: [PATCH v4 0/10] split ET_DYN ASLR from mmap ASLR

2015-03-04 Thread Ingo Molnar

* Kees Cook keesc...@chromium.org wrote:

 To address the offset2lib ASLR weakness[1], this separates ET_DYN
 ASLR from mmap ASLR, as already done on s390. The architectures
 that are already randomizing mmap (arm, arm64, mips, powerpc, s390,
 and x86), have their various forms of arch_mmap_rnd() made available
 via the new CONFIG_ARCH_HAS_ELF_RANDOMIZE. For these architectures,
 arch_randomize_brk() is collapsed as well.
 
 This is an alternative to the solutions in:
 https://lkml.org/lkml/2015/2/23/442
 
 I've been able to test x86 and arm, and the buildbot (so far) seems
 happy with building the rest.

Ok, this looks really good - for all patches:

   Reviewed-by: Ingo Molnar mi...@kernel.org

Thanks,

Ingo

Re: [PATCH v3 8/8] x86: switch to using asm-generic for seccomp.h

2015-03-04 Thread Ingo Molnar

* Kees Cook keesc...@chromium.org wrote:

 Switch to using the newly created asm-generic/seccomp.h for the 
 seccomp strict mode syscall definitions. The obsolete sigreturn 
 syscall override is retained in 32-bit mode, and the ia32 syscall 
 overrides are used in the compat case. Remaining definitions were 
 identical.
 
 Signed-off-by: Kees Cook keesc...@chromium.org

Acked-by: Ingo Molnar mi...@kernel.org

Thanks,

Ingo

Re: [PATCH v2 0/5] split ET_DYN ASLR from mmap ASLR

2015-03-03 Thread Ingo Molnar

* Kees Cook keesc...@chromium.org wrote:

 On Mon, Mar 2, 2015 at 11:31 PM, Ingo Molnar mi...@kernel.org wrote:
 
  * Kees Cook keesc...@chromium.org wrote:
 
  To address the offset2lib ASLR weakness[1], this separates ET_DYN
  ASLR from mmap ASLR, as already done on s390. The architectures
  that are already randomizing mmap (arm, arm64, mips, powerpc, s390,
  and x86), have their various forms of arch_mmap_rnd() made available
  via the new CONFIG_ARCH_HAS_ELF_RANDOMIZE. For these architectures,
  arch_randomize_brk() is collapsed as well.
 
  This is an alternative to the solutions in:
  https://lkml.org/lkml/2015/2/23/442
 
  Looks good so far:
 
  Reviewed-by: Ingo Molnar mi...@kernel.org
 
  While reviewing this series I also noticed that the following code
  could be factored out from architecture mmap code as well:
 
- arch_pick_mmap_layout() uses very similar patterns across the
  platforms, with only few variations. Many architectures use
  the same duplicated mmap_is_legacy() helper as well. There's
  usually just trivial differences between mmap_legacy_base()
  approaches as well.
 
 I was nervous to start refactoring this code, but it's true: most of 
 it is the same.

Well, it still needs to be done if we want to add new randomization 
features: code fractured over multiple architectures is a recipe for 
bugs, as this series demonstrates. So it first has to be made more 
maintainable.

- arch_mmap_rnd(): the PF_RANDOMIZE checks are needlessly
  exposed to the arch routine - the arch routine should only
  concentrate on arch details, not generic flags like
  PF_RANDOMIZE.
 
 Yeah, excellent point. I will send a follow-up patch to move this 
 into binfmt_elf instead. I'd like to avoid removing it in any of the 
 other patches since each was attempting a single step in the 
 refactoring.

Fine-grained patches are ideal!

  In theory the mmap layout could be fully parametrized as well: 
  i.e. no callback functions to architectures by default at all: 
  just declarations of bits of randomization desired (or, available 
  address space bits), and perhaps an arch helper to allow 32-bit 
  vs. 64-bit address space distinctions.
 
 Yeah, I was considering that too, since each architecture has a 
 nearly identical arch_mmap_rnd() at this point. Only the size of the 
 entropy was changing.

  'Weird' architectures could provide special routines, but only by 
  overriding the default behavior, which should be generic, safe and 
  robust.
 
 Yeah, quite true. Should entropy size be a #define like 
 ELF_ET_DYN_BASE? Something like ASLR_MMAP_ENTROPY and 
 ASLR_MMAP_ENTROPY_32? [...]

That would work I suspect.

 [...] Is there a common function for determining a compat task? That 
 seemed to be per-arch too. Maybe arch_mmap_entropy()?

Compat flags are a bit of a mess, and since they often tie into arch 
low level assembly code, they are hard to untangle. So maybe as an 
intermediate step add an is_compat() generic method, and make that 
obvious and self-defined function a per arch thing?

But I'm just handwaving here - I suspect it has to be tried to see all 
the complications and to determine whether that's the best structure 
and whether it's a win ... Only one thing is certain: the current code 
is not compact and reviewable enough, and VM bits hiding in 
arch/*/mm/mmap.c tends to reduce net attention paid to these details.

Thanks,

Ingo

Re: [PATCH v2] seccomp: switch to using asm-generic for seccomp.h

2015-03-03 Thread Ingo Molnar

* Kees Cook keesc...@chromium.org wrote:

 Most architectures don't need to do anything special for the strict
 seccomp syscall entries. Remove the redundant headers and reduce the
 others.

  19 files changed, 27 insertions(+), 137 deletions(-)

Lovely cleanup factor.

Just to make sure, are you sure the 32-bit details are identical 
across architectures?

For example some architectures did this:

 --- a/arch/microblaze/include/asm/seccomp.h
 +++ /dev/null
 @@ -1,16 +0,0 @@
 -#ifndef _ASM_MICROBLAZE_SECCOMP_H
 -#define _ASM_MICROBLAZE_SECCOMP_H
 -
 -#include <linux/unistd.h>
 -
 -#define __NR_seccomp_read		__NR_read
 -#define __NR_seccomp_write		__NR_write
 -#define __NR_seccomp_exit		__NR_exit
 -#define __NR_seccomp_sigreturn		__NR_sigreturn
 -
 -#define __NR_seccomp_read_32		__NR_read
 -#define __NR_seccomp_write_32		__NR_write
 -#define __NR_seccomp_exit_32		__NR_exit
 -#define __NR_seccomp_sigreturn_32	__NR_sigreturn

others did this:

 diff --git a/arch/x86/include/asm/seccomp_64.h b/arch/x86/include/asm/seccomp_64.h
 deleted file mode 100644
 index 84ec1bd161a5..
 --- a/arch/x86/include/asm/seccomp_64.h
 +++ /dev/null
 @@ -1,17 +0,0 @@
 -#ifndef _ASM_X86_SECCOMP_64_H
 -#define _ASM_X86_SECCOMP_64_H
 -
 -#include <linux/unistd.h>
 -#include <asm/ia32_unistd.h>
 -
 -#define __NR_seccomp_read __NR_read
 -#define __NR_seccomp_write __NR_write
 -#define __NR_seccomp_exit __NR_exit
 -#define __NR_seccomp_sigreturn __NR_rt_sigreturn
 -
 -#define __NR_seccomp_read_32 __NR_ia32_read
 -#define __NR_seccomp_write_32 __NR_ia32_write
 -#define __NR_seccomp_exit_32 __NR_ia32_exit
 -#define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
 -
 -#endif /* _ASM_X86_SECCOMP_64_H */

While in yet another case you kept the syscall mappings:

 --- a/arch/x86/include/asm/seccomp.h
 +++ b/arch/x86/include/asm/seccomp.h
 @@ -1,5 +1,20 @@
 +#ifndef _ASM_X86_SECCOMP_H
 +#define _ASM_X86_SECCOMP_H
 +
 +#include <asm/unistd.h>
 +
 +#ifdef CONFIG_COMPAT
 +#include <asm/ia32_unistd.h>
 +#define __NR_seccomp_read_32		__NR_ia32_read
 +#define __NR_seccomp_write_32		__NR_ia32_write
 +#define __NR_seccomp_exit_32		__NR_ia32_exit
 +#define __NR_seccomp_sigreturn_32	__NR_ia32_sigreturn
 +#endif
 +
  #ifdef CONFIG_X86_32
 -# include <asm/seccomp_32.h>
 -#else
 -# include <asm/seccomp_64.h>
 +#define __NR_seccomp_sigreturn		__NR_sigreturn
  #endif
 +
 +#include <asm-generic/seccomp.h>
 +
 +#endif /* _ASM_X86_SECCOMP_H */

It might all be correct, but it's not obvious to me.
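One way such an equivalence could be made checkable instead of being verified by eye is a compile-time comparison of the old per-arch mapping against the new override-plus-fallback mapping. The following is only a sketch under stated assumptions: the `SK_NR_*`, `OLD_*` and `NEW_*` names and the numbers behind them are hypothetical placeholders, not the real syscall tables or the contents of asm-generic/seccomp.h.

```c
#include <assert.h>

/* Hypothetical syscall numbers, for illustration only. */
#define SK_NR_read		3
#define SK_NR_sigreturn		119
#define SK_NR_rt_sigreturn	173

/* What a removed per-arch header used to define. */
#define OLD_seccomp_read	SK_NR_read
#define OLD_seccomp_sigreturn	SK_NR_sigreturn

/* The arch header keeps an override only for what differs ... */
#define NEW_seccomp_sigreturn	SK_NR_sigreturn

/* ... and the generic header fills in whatever is still undefined. */
#ifndef NEW_seccomp_read
#define NEW_seccomp_read	SK_NR_read
#endif
#ifndef NEW_seccomp_sigreturn
#define NEW_seccomp_sigreturn	SK_NR_rt_sigreturn
#endif

/* The build fails if the conversion changed any mapping. */
static_assert(OLD_seccomp_read == NEW_seccomp_read,
	      "read mapping changed");
static_assert(OLD_seccomp_sigreturn == NEW_seccomp_sigreturn,
	      "sigreturn mapping changed");
```

If the arch override were dropped by mistake, the sigreturn fallback would be picked up and the second assertion would stop the build.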

Thanks,

Ingo

Re: [PATCH 0/7] Serialise oopses, BUGs, WARNs, dump_stack, soft lockups and hard lockups

2015-02-23 Thread Ingo Molnar

* Anton Blanchard an...@samba.org wrote:

 Every now and then I end up with an undebuggable issue 
 because multiple CPUs hit something at the same time and 
 everything is interleaved:
 
 CR: 4882  XER: 
 ,RI
 c003dc72fd10
 ,LE
 d65b84e8
 Instruction dump:
 MSR: 800100029033
 
 Very annoying.
 
 Some architectures already have their own recursive 
 locking for oopses and we have another version for 
 serialising dump_stack.
 
 Create a common version and use it everywhere (oopses, 
 BUGs, WARNs, dump_stack, soft lockups and hard lockups). 

Dunno. I've had cases where the simultaneity of the oopses 
(i.e. their garbled nature) gave me the clue about the type 
of race to expect.

To still get that information: instead of taking a 
serializing spinlock (or in addition to it), it would be 
nice to at least preserve the true time order of the 
incidents, at minimum by generating a global count for 
oopses/warnings (a bit like the oops count # currently), 
and to gather it first - before taking any spinlocks.
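A rough user-space sketch of that ordering, using C11 atomics as a stand-in for kernel primitives. The names `incident_seq`, `report_lock`, `incident_begin()` and `incident_report()` are assumptions made up for this illustration. The point is only the sequence: take a global ticket first, serialize the output afterwards, so the ticket reflects when the incident happened and not when its CPU won the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical global counter, bumped before any serialization. */
static atomic_ulong incident_seq;

/* Stand-in for the serializing lock discussed in this thread. */
static atomic_flag report_lock = ATOMIC_FLAG_INIT;

/* Step 1: take a ticket. This records the true order of incidents. */
unsigned long incident_begin(void)
{
	return atomic_fetch_add(&incident_seq, 1) + 1;
}

/* Step 2: serialize only the output and tag it with the ticket. */
int incident_report(unsigned long seq, const char *msg, char *buf, size_t len)
{
	int n;

	assert(buf != NULL && len > 0);

	while (atomic_flag_test_and_set(&report_lock))
		;	/* spin; a real implementation would want a timeout */

	n = snprintf(buf, len, "[incident #%lu] %s", seq, msg);

	atomic_flag_clear(&report_lock);
	return n;
}
```

Reports may still be printed out of order when CPUs race for the lock, but the ticket numbers let a reader reconstruct the original sequence.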

Thanks,

Ingo

Re: [PATCH] Fix offset2lib issue for x86*, ARM*, PowerPC and MIPS

2015-02-23 Thread Ingo Molnar

* Hector Marco Gisbert hecma...@upv.es wrote:

 +unsigned long randomize_et_dyn(unsigned long base)
 +{
 +	unsigned long ret;
 +	if ((current->personality & ADDR_NO_RANDOMIZE) ||
 +		!(current->flags & PF_RANDOMIZE))
 +		return base;
 +	ret = base + mmap_rnd();
 +	return (ret > base) ? ret : base;
 +}

 +unsigned long randomize_et_dyn(unsigned long base)
 +{
 +	unsigned long ret;
 +	if ((current->personality & ADDR_NO_RANDOMIZE) ||
 +		!(current->flags & PF_RANDOMIZE))
 +		return base;
 +	ret = base + mmap_rnd();
 +	return (ret > base) ? ret : base;
 +}

 +unsigned long randomize_et_dyn(unsigned long base)
 +{
 +	unsigned long ret;
 +	if ((current->personality & ADDR_NO_RANDOMIZE) ||
 +		!(current->flags & PF_RANDOMIZE))
 +		return base;
 +	ret = base + brk_rnd();
 +	return (ret > base) ? ret : base;
 +}

 +unsigned long randomize_et_dyn(unsigned long base)
 +{
 +	unsigned long ret;
 +	if ((current->personality & ADDR_NO_RANDOMIZE) ||
 +		!(current->flags & PF_RANDOMIZE))
 +		return base;
 +	ret = base + mmap_rnd();
 +	return (ret > base) ? ret : base;
 +}

 +unsigned long randomize_et_dyn(unsigned long base)
 +{
 +	unsigned long ret;
 +	if ((current->personality & ADDR_NO_RANDOMIZE) ||
 +		!(current->flags & PF_RANDOMIZE))
 +		return base;
 +	ret = base + mmap_rnd();
 +	return (ret > base) ? ret : base;
 +}

That pointless repetition should be avoided.
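As a sketch of one possible way to factor it: a single generic helper that takes the architecture's random offset as an argument, so only the source of randomness stays per-arch. This is a user-space illustration under assumptions, not the form the kernel ended up with; `randomize_et_dyn_generic()` and the `randomize` parameter (standing in for the personality and PF_RANDOMIZE checks) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Generic helper shared by all architectures. 'rnd' is the only
 * arch-specific input (for example the result of mmap_rnd() or
 * brk_rnd()); 'randomize' stands in for the ADDR_NO_RANDOMIZE and
 * PF_RANDOMIZE checks, which would live in generic code.
 */
unsigned long randomize_et_dyn_generic(unsigned long base, unsigned long rnd,
				       bool randomize)
{
	unsigned long ret;

	if (!randomize)
		return base;

	ret = base + rnd;

	/* Fall back to the unrandomized base if the addition wrapped. */
	return (ret > base) ? ret : base;
}

/* Usage sketch: each caller passes only its own random offset. */
void et_dyn_example(void)
{
	assert(randomize_et_dyn_generic(0x1000, 0x200, true) == 0x1200);
	assert(randomize_et_dyn_generic(0x1000, 0x200, false) == 0x1000);
}
```

With this shape the five copies collapse into one function plus a one-line call per architecture.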

Thanks,

Ingo

Re: [PATCH 1/7] Add die_spin_lock_{irqsave,irqrestore}

2015-02-23 Thread Ingo Molnar

* Ingo Molnar mi...@kernel.org wrote:

 
 * Anton Blanchard an...@samba.org wrote:
 
  +static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
  +static int die_owner = -1;
  +static unsigned int die_nest_count;
  +
  +unsigned long __die_spin_lock_irqsave(void)
  +{
  +   unsigned long flags;
  +   int cpu;
  +
  +   /* racy, but better than risking deadlock. */
  +   raw_local_irq_save(flags);
  +
  +   cpu = smp_processor_id();
  +	if (!arch_spin_trylock(&die_lock)) {
  +		if (cpu != die_owner)
  +			arch_spin_lock(&die_lock);
 
 So why not trylock and time out here after a few seconds, 
  instead of indefinitely suppressing some potentially vital 
 output due to some other CPU crashing/locking with the lock 
 held?

[...]

 If we fix the deadlock potential, and get a true global 
 ordering of various oopses/warnings as they triggered (or 
 at least timestamping them), [...]

If we had a global 'trouble counter' we could use that to 
refine the spin-looping timeout: instead of using a pure 
timeout of a few seconds, we could say 'a timeout of a few 
seconds while the counter does not increase'.

I.e. only override the locking/ordering if the owner CPU 
does not seem to be able to make progress with printing the 
oops/warning.
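A user-space sketch of that refinement, with C11 atomics standing in for the arch spinlock and iteration counts standing in for wall-clock seconds. All names (`die_lock_sketch`, `trouble_counter`, `die_lock_or_override()`, `die_unlock()`) are hypothetical and chosen for this illustration only. The waiter restarts its timeout whenever the counter moves and overrides the lock only when the owner shows no progress.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag die_lock_sketch = ATOMIC_FLAG_INIT;
static atomic_ulong trouble_counter;	/* bumped as the owner prints */

/*
 * Spin for the lock, but give up once 'max_idle_spins' iterations pass
 * without the trouble counter moving. Returns true if the lock was
 * taken normally, false if the caller is overriding a stuck owner.
 */
bool die_lock_or_override(unsigned long max_idle_spins)
{
	unsigned long last = atomic_load(&trouble_counter);
	unsigned long idle = 0;

	assert(max_idle_spins > 0);

	while (atomic_flag_test_and_set(&die_lock_sketch)) {
		unsigned long now = atomic_load(&trouble_counter);

		if (now != last) {
			/* Owner is making progress: restart the timeout. */
			last = now;
			idle = 0;
		} else if (++idle >= max_idle_spins) {
			return false;	/* owner looks stuck: proceed anyway */
		}
	}
	return true;
}

void die_unlock(void)
{
	atomic_flag_clear(&die_lock_sketch);
}
```

A caller that gets `false` back would print its report without owning the lock, accepting possibly interleaved output over losing the report entirely.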

Thanks,

Ingo

Re: [PATCH 1/7] Add die_spin_lock_{irqsave,irqrestore}

2015-02-23 Thread Ingo Molnar

* Anton Blanchard an...@samba.org wrote:

 +static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
 +static int die_owner = -1;
 +static unsigned int die_nest_count;
 +
 +unsigned long __die_spin_lock_irqsave(void)
 +{
 + unsigned long flags;
 + int cpu;
 +
 + /* racy, but better than risking deadlock. */
 + raw_local_irq_save(flags);
 +
 + cpu = smp_processor_id();
 +	if (!arch_spin_trylock(&die_lock)) {
 +		if (cpu != die_owner)
 +			arch_spin_lock(&die_lock);

So why not trylock and time out here after a few seconds, 
instead of indefinitely suppressing some potentially vital 
output due to some other CPU crashing/locking with the lock 
held?

 + }
 + die_nest_count++;
 + die_owner = cpu;
 +
 + return flags;

I suspect this would work in most cases.

If we fix the deadlock potential, and get a true global 
ordering of various oopses/warnings as they triggered (or 
at least timestamping them), then I'm sold on this I guess, 
it will likely improve things.

Thanks,

Ingo

Re: [PATCH v4 2/2] powerpc/mm: Tracking vDSO remap

2015-03-26 Thread Ingo Molnar

* Laurent Dufour lduf...@linux.vnet.ibm.com wrote:

 +{
 +	unsigned long vdso_end, vdso_start;
 +
 +	if (!mm->context.vdso_base)
 +		return;
 +	vdso_start = mm->context.vdso_base;
 +
 +#ifdef CONFIG_PPC64
 +	/* Calling is_32bit_task() implies that we are dealing with the
 +	 * current process memory. If there is a call path where mm is not
 +	 * owned by the current task, then we'll have need to store the
 +	 * vDSO size in the mm->context.
 +	 */
 +	BUG_ON(current->mm != mm);
 +	if (is_32bit_task())
 +		vdso_end = vdso_start + (vdso32_pages << PAGE_SHIFT);
 +	else
 +		vdso_end = vdso_start + (vdso64_pages << PAGE_SHIFT);
 +#else
 +	vdso_end = vdso_start + (vdso32_pages << PAGE_SHIFT);
 +#endif
 +	vdso_end += (1<<PAGE_SHIFT); /* data page */
 +
 +	/* Check if the vDSO is in the range of the remapped area */
 +	if ((vdso_start <= old_start && old_start < vdso_end) ||
 +	    (vdso_start < old_end && old_end <= vdso_end)  ||
 +	    (old_start <= vdso_start && vdso_start < old_end)) {
 +		/* Update vdso_base if the vDSO is entirely moved. */
 +		if (old_start == vdso_start && old_end == vdso_end &&
 +		    (old_end - old_start) == (new_end - new_start))
 +			mm->context.vdso_base = new_start;
 +		else
 +			mm->context.vdso_base = 0;
 +	}
 +}

Oh my, that really looks awfully complex, as you predicted, and right 
in every mremap() call.

I'm fine with your original, imperfect, KISS approach. Sorry about 
this detour ...

Reviewed-by: Ingo Molnar mi...@kernel.org

Thanks,

Ingo

Re: [PATCH v3 2/2] powerpc/mm: Tracking vDSO remap

2015-03-26 Thread Ingo Molnar

* Laurent Dufour lduf...@linux.vnet.ibm.com wrote:

  I argue we should use the right condition to clear vdso_base: if 
  the vDSO gets at least partially unmapped. Otherwise there's 
  little point in the whole patch: either correctly track whether 
  the vDSO is OK, or don't ...
 
 That's a good option, but it may be hard to achieve in the case the 
 vDSO area has been split into multiple pieces.

 Not sure there is a right way to handle that, here this is a best 
 effort, allowing a process to unmap its vDSO and having the 
 sigreturn call done through the stack area (it has to make it 
 executable).
 
 Anyway I'll dig into that, assuming that the vdso_base pointer 
 should be clear if a part of the vDSO is moved or unmapped. The 
 patch will be larger since I'll have to get the vDSO size which is 
 private to the vdso.c file.

At least for munmap() I don't think that's a worry: once unmapped 
(even if just partially), vdso_base becomes zero and won't ever be set 
again.

So no need to track the zillion pieces, should there be any: Humpty 
Dumpty won't be whole again, right?
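The munmap() side of that argument can be sketched in a few lines of user-space C. This assumes the vDSO size is available next to the base address, which the thread notes is not currently the case; `struct vdso_ctx` and `vdso_note_unmap()` are hypothetical names for illustration. Any overlap between the unmapped range and the vDSO clears the base, and once it is zero nothing sets it again.

```c
#include <assert.h>

/* Hypothetical stand-in for the relevant part of mm->context. */
struct vdso_ctx {
	unsigned long vdso_base;	/* 0 means "no usable vDSO" */
	unsigned long vdso_size;	/* assumption: size is tracked too */
};

/*
 * Called for an unmapped range [start, end). If it overlaps the vDSO
 * at all, the vDSO is no longer whole, so forget about it for good.
 */
void vdso_note_unmap(struct vdso_ctx *ctx, unsigned long start,
		     unsigned long end)
{
	unsigned long vdso_end;

	assert(start <= end);

	if (!ctx->vdso_base)
		return;

	vdso_end = ctx->vdso_base + ctx->vdso_size;

	/* Two half-open ranges overlap iff each starts before the other ends. */
	if (start < vdso_end && ctx->vdso_base < end)
		ctx->vdso_base = 0;
}
```

No bookkeeping of the remaining pieces is needed, because the only state kept is whether the vDSO is still intact.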

  There's also the question of mprotect(): can users mprotect() the 
  vDSO on PowerPC?
 
 Yes, mprotect() the vDSO is allowed on PowerPC, as it is on x86, and 
 certainly all the other architectures. Furthermore, if it is done on 
 a partial part of the vDSO it is splitting the vma...

btw., CRIU's main purpose here is to reconstruct a vDSO that was 
originally randomized, but whose address must now be reproduced as-is, 
right?

In that sense detecting the 'good' mremap() as your patch does should 
do the trick and is certainly not objectionable IMHO - I was just 
wondering whether we could make a perfect job very simply.

Thanks,

Ingo

Re: [PATCH 1/2] mm: Introducing arch_remap hook

2015-03-23 Thread Ingo Molnar

* Laurent Dufour lduf...@linux.vnet.ibm.com wrote:

 Some architectures would like to be notified when a memory area is moved
 through the mremap system call.
 
 This patch is introducing a new arch_remap mm hook which is placed in the
 path of mremap, and is called before the old area is unmapped (and the
 arch_unmap hook is called).
 
 To not break the build, this patch adds the empty hook definition to the
 architectures that were not using the generic hook's definition.
 
 Signed-off-by: Laurent Dufour lduf...@linux.vnet.ibm.com
 ---
  arch/s390/include/asm/mmu_context.h  | 6 ++
  arch/um/include/asm/mmu_context.h| 5 +
  arch/unicore32/include/asm/mmu_context.h | 6 ++
  arch/x86/include/asm/mmu_context.h   | 6 ++
  include/asm-generic/mm_hooks.h   | 6 ++
  mm/mremap.c  | 9 +++--
  6 files changed, 36 insertions(+), 2 deletions(-)
 
 diff --git a/arch/s390/include/asm/mmu_context.h 
 b/arch/s390/include/asm/mmu_context.h
 index 8fb3802f8fad..ddd861a490ba 100644
 --- a/arch/s390/include/asm/mmu_context.h
 +++ b/arch/s390/include/asm/mmu_context.h
 @@ -131,4 +131,10 @@ static inline void arch_bprm_mm_init(struct mm_struct 
 *mm,
  {
  }
  
 +static inline void arch_remap(struct mm_struct *mm,
 +   unsigned long old_start, unsigned long old_end,
 +   unsigned long new_start, unsigned long new_end)
 +{
 +}
 +
  #endif /* __S390_MMU_CONTEXT_H */
 diff --git a/arch/um/include/asm/mmu_context.h 
 b/arch/um/include/asm/mmu_context.h
 index 941527e507f7..f499b017c1f9 100644
 --- a/arch/um/include/asm/mmu_context.h
 +++ b/arch/um/include/asm/mmu_context.h
 @@ -27,6 +27,11 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
struct vm_area_struct *vma)
  {
  }
 +static inline void arch_remap(struct mm_struct *mm,
 +   unsigned long old_start, unsigned long old_end,
 +   unsigned long new_start, unsigned long new_end)
 +{
 +}
  /*
   * end asm-generic/mm_hooks.h functions
   */
 diff --git a/arch/unicore32/include/asm/mmu_context.h 
 b/arch/unicore32/include/asm/mmu_context.h
 index 1cb5220afaf9..39a0a553172e 100644
 --- a/arch/unicore32/include/asm/mmu_context.h
 +++ b/arch/unicore32/include/asm/mmu_context.h
 @@ -97,4 +97,10 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
  {
  }
  
 +static inline void arch_remap(struct mm_struct *mm,
 +   unsigned long old_start, unsigned long old_end,
 +   unsigned long new_start, unsigned long new_end)
 +{
 +}
 +
  #endif
 diff --git a/arch/x86/include/asm/mmu_context.h 
 b/arch/x86/include/asm/mmu_context.h
 index 883f6b933fa4..75cb71f4be1e 100644
 --- a/arch/x86/include/asm/mmu_context.h
 +++ b/arch/x86/include/asm/mmu_context.h
 @@ -172,4 +172,10 @@ static inline void arch_unmap(struct mm_struct *mm, 
 struct vm_area_struct *vma,
   mpx_notify_unmap(mm, vma, start, end);
  }
  
 +static inline void arch_remap(struct mm_struct *mm,
 +   unsigned long old_start, unsigned long old_end,
 +   unsigned long new_start, unsigned long new_end)
 +{
 +}
 +
  #endif /* _ASM_X86_MMU_CONTEXT_H */

So instead of spreading these empty prototypes around mmu_context.h 
files, why not add something like this to the PPC definition:

 #define __HAVE_ARCH_REMAP

and define the empty prototype for everyone else? It's a bit like how 
the __HAVE_ARCH_PTEP_* namespace works.
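A compact sketch of that pattern, written as a single user-space file so it can be compiled on its own. In a real tree the two halves would live in the PPC header and the generic header respectively; `struct mm_sketch` is a hypothetical stand-in for `struct mm_struct` and its context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct and mm->context. */
struct mm_sketch {
	unsigned long vdso_base;
};

/* --- what the architecture header would contain (sketch) --- */
#define __HAVE_ARCH_REMAP
static inline void arch_remap(struct mm_sketch *mm,
			      unsigned long old_start, unsigned long old_end,
			      unsigned long new_start, unsigned long new_end)
{
	assert(mm != NULL);
	(void)old_end;
	(void)new_end;

	if (old_start == mm->vdso_base)
		mm->vdso_base = new_start;
}

/* --- what the generic header would contain (sketch) --- */
#ifndef __HAVE_ARCH_REMAP
static inline void arch_remap(struct mm_sketch *mm,
			      unsigned long old_start, unsigned long old_end,
			      unsigned long new_start, unsigned long new_end)
{
	/* Default: nothing to do for architectures that do not care. */
}
#endif
```

Only the architecture that needs the hook defines the macro; every other architecture picks up the empty default without touching its own mmu_context.h.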

That should shrink this patch considerably.

Thanks,

Ingo

Re: [PATCH V2] clockevents: Fix cpu down race for hrtimer based broadcasting

2015-04-02 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 On Thu, Apr 02, 2015 at 12:42:27PM +0200, Ingo Molnar wrote:
  So why not use a suitable CPU_DOWN* notifier for this, instead of open 
  coding it all into a random place in the hotplug machinery?
 
 Because notifiers are crap? ;-) [...]

No doubt - but I didn't feel this poorly named random call into the 
hotplug code, with no comments was any better.

 [...] It's entirely impossible to figure out what's happening to core 
 code in hotplug. You need to go chase down and random order notifier 
 things.
 
 I'm planning on taking out many of the core hotplug notifiers and 
 hard coding their callbacks into the hotplug code.

That's very welcome news - but please let's also put in place a proper 
namespace for all these callbacks, to make them easy to find and 
change: hotplug_cpu__*() or so, which in this case would turn into 
hotplug_cpu__tick_pull() or so?

 That way at least it's clear wtf happens when.

Okay. I'll resurrect the fix with a hotplug_cpu__tick_pull() name - 
agreed?

  Also, I improved the changelog (attached below), but decided 
  against applying it until these questions are cleared - please use 
  that for future versions of this patch.
 
 
  Fixes: 
  http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
 
 You forgot to fix the Fixes line ;-)
 
 My copy has:
 
  Fixes: 5d1638acb9f6 (tick: Introduce hrtimer based broadcast)

Hm, not sure how that got lost - my git-log of the patch ported on top 
of timers/core still has it:

==
From 413fbf5193b330c5f478ef7aaeaaee08907a993e Mon Sep 17 00:00:00 2001
From: Preeti U Murthy pre...@linux.vnet.ibm.com
Date: Mon, 30 Mar 2015 14:59:19 +0530
Subject: [PATCH] clockevents: Fix cpu_down() race for hrtimer based broadcasting

It was found when doing a hotplug stress test on POWER, that the
machine either hit softlockups or rcu_sched stall warnings.  The
issue was traced to commit:

  7cba160ad789 (powernv/cpuidle: Redesign idle states management)

which exposed the cpu_down() race with hrtimer based broadcast mode:

  5d1638acb9f6 (tick: Introduce hrtimer based broadcast)

The race is the following:

Assume CPU1 is the CPU which holds the hrtimer broadcasting duty
before it is taken down.

CPU0                                    CPU1

cpu_down()                              take_cpu_down()
                                        disable_interrupts()

cpu_die()

while (CPU1 != CPU_DEAD) {
        msleep(100);
        switch_to_idle();
        stop_cpu_timer();
        schedule_broadcast();
}

tick_cleanup_cpu_dead()
        take_over_broadcast()

So after CPU1 disabled interrupts it cannot handle the broadcast
hrtimer anymore, so CPU0 will be stuck forever.

Fix this by explicitly taking over broadcast duty before cpu_die().

This is a temporary workaround. What we really want is a callback
in the clockevent device which allows us to do that from the dying
CPU by pushing the hrtimer onto a different cpu. That might involve
an IPI and is definitely more complex than this immediate fix.

Changelog was picked up from:

https://lkml.org/lkml/2015/2/16/213

Suggested-by: Thomas Gleixner t...@linutronix.de
Tested-by: Nicolas Pitre n...@linaro.org
Signed-off-by: Preeti U. Murthy pre...@linux.vnet.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: m...@ellerman.id.au
Cc: nicolas.pi...@linaro.org
Cc: pet...@infradead.org
Cc: r...@rjwysocki.net
Fixes: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
Link: 
http://lkml.kernel.org/r/20150330092410.24979.59887.st...@preeti.in.ibm.com
[ Merged it to the latest timer tree, tidied up the changelog. ]
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 kernel/cpu.c |  2 ++
 kernel/time/tick-broadcast.c | 19 +++
 kernel/time/tick-internal.h  |  2 ++
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 1972b161c61e..f9ca351a404a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -20,6 +20,7 @@
 #include <linux/gfp.h>
 #include <linux/suspend.h>
 #include <linux/lockdep.h>
+#include <linux/tick.h>
 #include <trace/events/power.h>
 
 #include "smpboot.h"
@@ -411,6 +412,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	while (!idle_cpu(cpu))
 		cpu_relax();
 
+	tick_takeover(cpu);
 	/* This actually kills the CPU. */
 	__cpu_die(cpu);
 
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 19cfb381faa9..81174cd9a29c 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -680,14 +680,19 @@ static void broadcast_shutdown_local(struct clock_event_device *bc,
 	clockevents_set_state(dev, CLOCK_EVT_STATE_SHUTDOWN);
 }
 
-static void broadcast_move_bc(int deadcpu)
+void tick_takeover(int deadcpu)
 {
-   struct clock_event_device *bc

Re: [PATCH V2] clockevents: Fix cpu down race for hrtimer based broadcasting

2015-04-02 Thread Ingo Molnar

* Preeti U Murthy pre...@linux.vnet.ibm.com wrote:

 On 04/02/2015 04:12 PM, Ingo Molnar wrote:
  
  * Preeti U Murthy pre...@linux.vnet.ibm.com wrote:
  
  It was found when doing a hotplug stress test on POWER, that the machine
  either hit softlockups or rcu_sched stall warnings.  The issue was
  traced to commit 7cba160ad789a ("powernv/cpuidle: Redesign idle states
  management"), which exposed the cpu down race with hrtimer based broadcast
  mode (commit 5d1638acb9f6 ("tick: Introduce hrtimer based broadcast")). This
  is explained below.
 
  Assume CPU1 is the CPU which holds the hrtimer broadcasting duty before
  it is taken down.
 
  CPU0                                    CPU1
 
  cpu_down()                              take_cpu_down()
                                          disable_interrupts()
 
  cpu_die()
 
  while (CPU1 != CPU_DEAD) {
          msleep(100);
          switch_to_idle();
          stop_cpu_timer();
          schedule_broadcast();
  }
 
  tick_cleanup_cpu_dead()
          take_over_broadcast()
 
  So after CPU1 disabled interrupts it cannot handle the broadcast hrtimer
  anymore, so CPU0 will be stuck forever.
 
  Fix this by explicitly taking over broadcast duty before cpu_die().
  This is a temporary workaround. What we really want is a callback in the
  clockevent device which allows us to do that from the dying CPU by
  pushing the hrtimer onto a different cpu. That might involve an IPI and
  is definitely more complex than this immediate fix.
  
  So why not use a suitable CPU_DOWN* notifier for this, instead of open 
  coding it all into a random place in the hotplug machinery?
 
 This is because each of them is unsuitable for a reason:
 
 1. CPU_DOWN_PREPARE stage allows for a fail. The cpu in question may not
 successfully go down. So we may pull the hrtimer unnecessarily.

Failure is really rare - and as long as things will continue to work 
afterwards it's not a problem to pull the hrtimer to this CPU. Right?

Thanks,

Ingo

Re: [PATCH V2] clockevents: Fix cpu down race for hrtimer based broadcasting

2015-04-02 Thread Ingo Molnar

* Preeti U Murthy pre...@linux.vnet.ibm.com wrote:

 It was found when doing a hotplug stress test on POWER, that the machine
 either hit softlockups or rcu_sched stall warnings.  The issue was
 traced to commit 7cba160ad789a ("powernv/cpuidle: Redesign idle states
 management"), which exposed the cpu down race with hrtimer based broadcast
 mode (commit 5d1638acb9f6 ("tick: Introduce hrtimer based broadcast")). This
 is explained below.
 
 Assume CPU1 is the CPU which holds the hrtimer broadcasting duty before
 it is taken down.
 
 CPU0                                    CPU1
 
 cpu_down()                              take_cpu_down()
                                         disable_interrupts()
 
 cpu_die()
 
 while (CPU1 != CPU_DEAD) {
         msleep(100);
         switch_to_idle();
         stop_cpu_timer();
         schedule_broadcast();
 }
 
 tick_cleanup_cpu_dead()
         take_over_broadcast()
 
 So after CPU1 disabled interrupts it cannot handle the broadcast hrtimer
 anymore, so CPU0 will be stuck forever.
 
 Fix this by explicitly taking over broadcast duty before cpu_die().
 This is a temporary workaround. What we really want is a callback in the
 clockevent device which allows us to do that from the dying CPU by
 pushing the hrtimer onto a different cpu. That might involve an IPI and
 is definitely more complex than this immediate fix.

So why not use a suitable CPU_DOWN* notifier for this, instead of open 
coding it all into a random place in the hotplug machinery?

Also, I improved the changelog (attached below), but decided against 
applying it until these questions are cleared - please use that for 
future versions of this patch.

Thanks,

Ingo

===
From 413fbf5193b330c5f478ef7aaeaaee08907a993e Mon Sep 17 00:00:00 2001
From: Preeti U Murthy pre...@linux.vnet.ibm.com
Date: Mon, 30 Mar 2015 14:59:19 +0530
Subject: [PATCH] clockevents: Fix cpu_down() race for hrtimer based broadcasting

It was found when doing a hotplug stress test on POWER, that the
machine either hit softlockups or rcu_sched stall warnings.  The
issue was traced to commit:

  7cba160ad789 (powernv/cpuidle: Redesign idle states management)

which exposed the cpu_down() race with hrtimer based broadcast mode:

  5d1638acb9f6 (tick: Introduce hrtimer based broadcast)

The race is the following:

Assume CPU1 is the CPU which holds the hrtimer broadcasting duty
before it is taken down.

CPU0                                    CPU1

cpu_down()                              take_cpu_down()
                                        disable_interrupts()

cpu_die()

while (CPU1 != CPU_DEAD) {
        msleep(100);
        switch_to_idle();
        stop_cpu_timer();
        schedule_broadcast();
}

tick_cleanup_cpu_dead()
        take_over_broadcast()

So after CPU1 disabled interrupts it cannot handle the broadcast
hrtimer anymore, so CPU0 will be stuck forever.

Fix this by explicitly taking over broadcast duty before cpu_die().

This is a temporary workaround. What we really want is a callback
in the clockevent device which allows us to do that from the dying
CPU by pushing the hrtimer onto a different cpu. That might involve
an IPI and is definitely more complex than this immediate fix.

Changelog was picked up from:

https://lkml.org/lkml/2015/2/16/213

Suggested-by: Thomas Gleixner t...@linutronix.de
Tested-by: Nicolas Pitre n...@linaro.org
Signed-off-by: Preeti U. Murthy pre...@linux.vnet.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: m...@ellerman.id.au
Cc: nicolas.pi...@linaro.org
Cc: pet...@infradead.org
Cc: r...@rjwysocki.net
Fixes: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)

2015-04-14 Thread Ingo Molnar

* Sukadev Bhattiprolu suka...@linux.vnet.ibm.com wrote:

 This is another attempt to resurrect Andi Kleen's patchset so users
 can specify perf events by their event names rather than raw codes.
 
 This is a rebase of Andi Kleen's patchset from Jul 30, 2014[1] to 4.0.
 (I fixed minor and not so minor conflicts).

So this series shows some progress, but instead of this limited 
checkout ability I'd still prefer it if 'perf download' downloaded the 
latest perf code itself and built it - it shouldn't be limited to just 
a small subset of the perf source code!

There's a few Git tricks we can use to make this palatable on most 
systems.

To save disk space and network bandwidth this could be done using Git's 
'shallow clone' feature:

   git clone --depth 1 --bare

The initial checkout finishes in 1.5 minutes here, a continent away 
from korg. The resulting repository size is just 143MB.

The code could also check whether the user has already a ~/linux.git 
repository around, and use --reference ~/linux.git in that case. In 
such a case the cloning takes just 2 seconds.

To get the source code we could use Git's 'sparse checkout' feature:

   git config core.sparsecheckout true
   echo tools/ > .git/info/sparse-checkout
   git checkout tools/

Note that a sparse checkout build will need a few relatively simple 
other files as well, for the few files not in tools/ that the perf 
build needs - mostly include files.

I've attached below a working test script that will create a buildable 
tools/perf/ repository into ~/.perf/git/ of the latest tip/perf/core 
tree from kernel.org.

With ~/linux.git primed it takes under 10 seconds to execute. perf 
builds fine in the directory. The whole directory together with the 
Git repo is 53 MB - that could be shrunk some more if needed.

If there's no local Linux repository to fall back on to then the 
initial step takes 1.5 minutes (depending on network bandwidth) and 
another 140MB for the Git objects. It's a lot faster after that.

That way 'perf download' could also be renamed to 'perf upgrade'.

Building perf ought to be possible even on fairly limited development 
systems and our auto-detection and library install suggestions are 
pretty good.

And note that once we have that there's no reason to move the event 
descriptions into a separate file - it can be part of the perf binary 
itself.

 This patchset includes the perf-download tool that was dropped and 
 sets the default download location to the 
 (tools/perf/pmu-events/arch/... directory in Linus's tree.

So the way I think this would work best is for 'perf upgrade' to have 
different levels, similar to what the kernel has:

  perf upgrade devel   # pick files up from Arnaldo's korg tree
  perf upgrade next    # pick files up from tip.git on korg
  perf upgrade rc      # pick files up from Linus's Git tree
  perf upgrade stable  # pick files up from the latest -stable version

'perf upgrade' would default to the most conservative, -stable 
variant, but of course users could pick whichever variant they prefer.
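
As a rough illustration of the levels above, the level-to-tree mapping could 
be a simple lookup. This is only a sketch: the upgrade_source helper, the 
branch names and the -stable URL are assumptions made for illustration, not 
part of any existing perf command. The acme, tip and torvalds URLs are the 
ones that appear elsewhere in this thread.

```shell
#!/bin/bash
# Hypothetical helper: map a 'perf upgrade' level to "<repo> <branch>".
# Branch names and the stable URL are illustrative assumptions.

KORG=git://git.kernel.org/pub/scm/linux/kernel/git

upgrade_source() {
    case "$1" in
    devel)     echo "$KORG/acme/linux.git perf/core" ;;
    next)      echo "$KORG/tip/tip.git perf/core" ;;
    rc)        echo "$KORG/torvalds/linux.git master" ;;
    # An empty argument falls back to the most conservative level.
    stable|"") echo "$KORG/stable/linux-stable.git master" ;;
    *)         echo "error: unknown upgrade level '$1'" >&2; return 1 ;;
    esac
}
```

A caller could split the result with 'read -r REPO BRANCH' and feed both 
values into the clone step of a script like the one attached below. Picking 
the latest -stable tag rather than a branch is left out of this sketch.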

There's some limitations (such as if the build fails - but we want to 
fix such cases anyway), but note the fundamental power of this 
approach: 'perf upgrade' could turn any developer who has a perf 
binary into a perf tester and even into a contributor.

There's no need to even know Git or other development details - the 
latest code could be picked up and built.

'perf upgrade distro' could be offered as a way to downgrade to 
whatever previous (distro) perf version the user was using.

Likewise, later on this approach could be generalized into 
tools/build/ to enable self-hosting for other tools in tools/ as well, 
if they so desire.

Thanks,

Ingo


#!/bin/bash
#
# Simple script that fetches the 'perf' utility from Linus's tree, builds and
# installs it, by creating a shallow clone and a sparse checkout for Linux's
# tools/ directory and related dependencies:
#

DIR=~/.perf/git

rm -rf $DIR
mkdir -p $DIR || { echo 'error: No write permissions in the current directory?'; exit -1; }
cd $DIR

#REPO=git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
#BRANCH=HEAD
REPO=git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
BRANCH=perf/core

REFERENCE=.
[ -d ~/linux/.git ] && REFERENCE=~/linux/
[ -d ~/linux.git/.git ] && REFERENCE=~/linux.git/
[ -d ~/tip/.git   ] && REFERENCE=~/tip/
[ -d ~/tip.git/.git   ] && REFERENCE=~/tip.git/

git clone --reference $REFERENCE --depth 1 --bare $REPO --branch $BRANCH .git
git config --local core.bare false

git config --local core.sparseCheckout true
git remote remove origin
git remote add -f origin $REPO -t $BRANCH

(
  echo '/tools/*'
  echo '/lib/*'
  echo '/include/*'
  echo '/arch/x86/lib/*'
  echo '/arch/x86/include/*'
  echo 'Makefile'
  echo '/scripts/*'

) > .git/info/sparse-checkout

git checkout $BRANCH

make -C tools/perf/ install


Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc

2015-05-07 Thread Ingo Molnar

* Hemant Kumar hem...@linux.vnet.ibm.com wrote:

  # perf kvm stat report -p 60515
 Analyze events for pid(s) 60515, all VCPUs:
 
        VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time    Avg time
 
 H_DATA_STORAGE       5006    35.30%     0.13%      1.94us     49.46us     12.37us ( +-   0.52% )
 HV_DECREMENTER       4457    31.43%     0.02%      0.72us     16.14us      1.91us ( +-   0.96% )
        SYSCALL       2690    18.97%     0.10%      2.84us    528.24us     18.29us ( +-   3.75% )
 RETURN_TO_HOST       1789    12.61%    99.76%      1.58us 672791.91us  27470.23us ( +-   3.00% )
       EXTERNAL        240     1.69%     0.00%        0.69us     10.67us      1.33us ( +-   5.34% )

Why is the last line misaligned? Copy & paste error or does perf kvm 
produce it in such a way?

Thanks,

Ingo

Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc

2015-05-08 Thread Ingo Molnar

* Hemant Kumar hem...@linux.vnet.ibm.com wrote:

 
 On 05/08/2015 09:58 AM, Ingo Molnar wrote:
 * Hemant Kumar hem...@linux.vnet.ibm.com wrote:
 
   # perf kvm stat report -p 60515
 Analyze events for pid(s) 60515, all VCPUs:
 
        VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time    Avg time
 
 H_DATA_STORAGE       5006    35.30%     0.13%      1.94us     49.46us     12.37us ( +-   0.52% )
 HV_DECREMENTER       4457    31.43%     0.02%      0.72us     16.14us      1.91us ( +-   0.96% )
        SYSCALL       2690    18.97%     0.10%      2.84us    528.24us     18.29us ( +-   3.75% )
 RETURN_TO_HOST       1789    12.61%    99.76%      1.58us 672791.91us  27470.23us ( +-   3.00% )
       EXTERNAL        240     1.69%     0.00%        0.69us     10.67us      1.33us ( +-   5.34% )
 Why is the last line misaligned? Copy & paste error or does perf kvm
 produce it in such a way?
 
 Its a copy-paste error. Thanks for pointing this out.
 
 Shall I resend the patches with the correct alignment of the o/p?

I don't think that's necessary, as long as the code is fine.

Thanks,

Ingo

Re: 'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)

2015-04-15 Thread Ingo Molnar

* Michael Ellerman m...@ellerman.id.au wrote:

  We just merged a patch series that was first sent in 2013. Some 
  things take time to get right.
 
 The first attempt to get symbolic event name support into perf was 
 sent in 2010, that's FIVE years ago [1].

kgdb took even longer, I think it was first proposed before 2000, over 
10 years before it got merged?

fs/overlayfs/ took similarly long I think, the first (Unionfs) patches 
I remember were from around 2003 - 11 years before the functionality 
was merged?

 And what complicated feature are we asking for? The ability to map a 
 human readable name to a hex code, it has the complexity of a first 
 year programming assignment.

No, what you are asking for, and which I NAK-ed, is:

 - to add a very limited 'update perf' capability which only updates a
   single issue that you care about, with no ability to do more.
   The 'perf upgrade' prototype I sent can update all or part of perf. 
   (The latest version is attached further below.)

 - to break the 'single binary' property of perf that many people rely on

 - to add JSON parsing overhead to every invocation of perf instead of
   pre-parsing the event tables at build time and putting them into 
   a nice data structure.

 - to blindly follow some poorly constructed vendor format with no 
   high level structure, that IMHO didn't work very well when OProfile 
   was written, and misrepresenting it as 'symbolic event names'.

   Take a look at:

 https://download.01.org/perfmon/HSW/Haswell_core_V17.json

   and weep. How is one supposed to see the higher level structure of
   the events with a format like that?

 - to add an ABI to support those vendor files

And those are IMHO five good technical reasons to disagree with your 
approach.

My suggestion to resolve the technical objections and lift the NAK 
would be:

 - to add the tables to the source code, in a more human readable 
   format and (optionally) structure the event names better into a 
   higher level hierarchy, than the humungous linear dumps with no 
   explanations that you propose - while still supporting the 'raw' 
   vendor event names you want to use, for those people who are used 
   to them.

 - to pre-parse the event descriptions at build time - beyond the 
   speedup this also keeps the 'single binary' property of perf.

 - to upgrade perf as a whole unit: this helps not just your usecase
   but many other usecases as well.

   For those who only want to update event tables: with
   'perf upgrade stable' basically only new event tables (backported 
   to -stable) would be picked up, plus regression fixes: it pretty 
   much does what your proposed 'perf download' solution does, except 
   it's much more capable.

Saying 'no' and suggesting better solutions is my job as a maintainer.

Thanks,

Ingo

==={ perf-upgrade.sh }===

#!/bin/bash
#
# Simple script that fetches the 'perf' utility from Linus's tree, builds and
# installs it, by creating a shallow clone and a sparse checkout for Linux's
# tools/ directory and related dependencies:
#

DIR=~/.perf/git

rm -rf $DIR
mkdir -p $DIR || { echo 'error: No write permissions in the current directory?'; exit -1; }
cd $DIR

#REPO=git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
#BRANCH=HEAD
REPO=git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
BRANCH=perf/core

REFERENCE=.
[ -d ~/linux/.git ] && REFERENCE=~/linux/
[ -d ~/linux.git/.git ] && REFERENCE=~/linux.git/
[ -d ~/tip/.git   ] && REFERENCE=~/tip/
[ -d ~/tip.git/.git   ] && REFERENCE=~/tip.git/
[ -d ~/git/linux  ] && REFERENCE=~/git/linux/

git clone --reference $REFERENCE --depth 1 --bare $REPO --branch $BRANCH .git
git config --local core.bare false

git config --local core.sparseCheckout true
git remote remove origin
git remote add -f origin $REPO -t $BRANCH

(
  echo '/tools/*'
  echo '/lib/*'
  echo '/include/*'
  echo '/arch/x86/lib/*'
  echo '/arch/x86/include/*'
  echo 'Makefile'
  echo '/scripts/*'

) > .git/info/sparse-checkout

git checkout $BRANCH

make -C tools/perf/ install


Re: 'perf upgrade' (was: Re: [PATCH v9 00/11] Add support for JSON event files.)

2015-04-14 Thread Ingo Molnar

* Michael Ellerman m...@ellerman.id.au wrote:

 On Tue, 2015-04-14 at 10:55 +0200, Ingo Molnar wrote:
  * Sukadev Bhattiprolu suka...@linux.vnet.ibm.com wrote:
  
   This is another attempt to resurrect Andi Kleen's patchset so users
   can specify perf events by their event names rather than raw codes.
   
   This is a rebase of Andi Kleen's patchset from Jul 30, 2014[1] to 4.0.
   (I fixed minor and not so minor conflicts).
  
  So this series shows some progress, but instead of this limited 
  checkout ability I'd still prefer it if 'perf download' downloaded 
  the latest perf code itself and built it - it shouldn't be limited 
  to just a small subset of the perf source code!
 
 Ingo, can you please stop blocking this? It's getting ridiculous.
 
 We've been waiting over 8 months for this to go in.

We just merged a patch series that was first sent in 2013. Some things 
take time to get right.

 While we've been waiting most of our users have learnt to use operf 
 instead, which doesn't require raw codes.
 
 I would also add that exactly zero users have asked for a feature 
 where perf downloads and rebuilds itself. In fact many of them would 
 consider that a security breach.

Fetching tracing scripts, plugins or other instrumentation scripts can 
be considered a 'security breach' as well. Fetching external tables 
(or network access to begin with) can be considered a 'security 
breach' as well, depending on how restricted an environment is.

But we don't design our code based on the most restrictive 
environments that are hostile to open source concepts!

Unfortunate users that are not allowed to update open source code that 
they are using should probably not update it. The other 99.9% of perf 
users would benefit from a properly done upgrading/updating feature.

Please stop thinking in terms of restricted, closed environments. 
Packaged perf will still work fine for them, and changes will still 
trickle down to them.

Thanks,

Ingo

Re: [GIT PULL 00/13] perf/core improvements and fixes

2015-06-25 Thread Ingo Molnar

* Arnaldo Carvalho de Melo a...@kernel.org wrote:

 Hi Ingo,
 
   Please consider pulling,
 
 - Arnaldo
 
 The following changes since commit a9a3cd900fbbcbf837d65653105e7bfc583ced09:
 
   Merge tag 'perf-core-for-mingo' of 
 git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
 (2015-06-20 01:11:11 +0200)
 
 are available in the git repository at:
 
   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
 tags/perf-core-for-mingo
 
 for you to fetch changes up to 83b2ea257eb1d43e52f76d756722aeb899a2852c:
 
   perf tools: Allow auxtrace data alignment (2015-06-23 18:28:37 -0300)
 
 
 perf/core improvements and fixes:
 
 User visible:
 
 - Move toggling event logic from 'perf top' and into hists browser, allowing
   freeze/unfreeze with event lists with more than one entry (Namhyung Kim)
 
 - Add missing newlines when dumping PERF_RECORD_FINISHED_ROUND and
   showing the Aggregated stats in 'perf report -D' (Adrian Hunter)
 
 Infrastructure:
 
 - Allow auxtrace data alignment (Adrian Hunter)
 
 - Allow events with dot (Andi Kleen)
 
 - Fix failure to 'perf probe' events on arm (He Kuang)
 
 - Add testing for Makefile.perf (Jiri Olsa)
 
 - Add test for make install with prefix (Jiri Olsa)
 
 - Fix single target build dependency check (Jiri Olsa)
 
 - Access thread_map entries via accessors, prep patch to hold more info per
   entry, for ongoing 'perf stat --per-thread' work (Jiri Olsa)
 
 - Use __weak definition from compiler.h (Sukadev Bhattiprolu)
 
 - Split perf_pmu__new_alias() (Sukadev Bhattiprolu)
 
 Signed-off-by: Arnaldo Carvalho de Melo a...@redhat.com
 
 
 Adrian Hunter (3):
   perf session: Print a newline when dumping PERF_RECORD_FINISHED_ROUND
   perf tools: Print a newline before dumping Aggregated stats
   perf tools: Allow auxtrace data alignment
 
 Andi Kleen (1):
   perf tools: Allow events with dot
 
 He Kuang (1):
   perf probe: Fix failure to probe events on arm
 
 Jiri Olsa (5):
   perf tests: Add testing for Makefile.perf
   perf tests: Add test for make install with prefix
   perf build: Fix single target build dependency check
   perf thread_map: Don't access the array entries directly
   perf thread_map: Change map entries into a struct
 
 Namhyung Kim (1):
   perf top: Move toggling event logic into hists browser
 
 Sukadev Bhattiprolu (2):
   perf pmu: Use __weak definition from linux/compiler.h
   perf pmu: Split perf_pmu__new_alias()
 
  tools/perf/Makefile                         |  4 +--
  tools/perf/builtin-top.c                    | 24 ++-
  tools/perf/builtin-trace.c                  |  4 +--
  tools/perf/tests/make                       | 31 ++--
  tools/perf/tests/openat-syscall-tp-fields.c |  2 +-
  tools/perf/ui/browsers/hists.c              | 19 ++--
  tools/perf/util/auxtrace.c                  | 11 +--
  tools/perf/util/auxtrace.h                  |  1 +
  tools/perf/util/event.c                     |  6 ++--
  tools/perf/util/evlist.c                    |  4 +--
  tools/perf/util/evsel.c                     |  2 +-
  tools/perf/util/parse-events.l              |  5 ++--
  tools/perf/util/pmu.c                       | 45 +++--
  tools/perf/util/probe-event.c               |  6 +++-
  tools/perf/util/session.c                   |  4 ++-
  tools/perf/util/thread_map.c                | 24 ---
  tools/perf/util/thread_map.h                | 16 +-
  17 files changed, 136 insertions(+), 72 deletions(-)

Pulled, thanks a lot Arnaldo!

Btw., one small thing I noticed about the status line in perf top: if I ever 
use 'f' to freeze/unfreeze events, the following message:

  Press 'f' to disable the events or 'h' to see other hotkeys

sticks around forever, even after I look into annotation and exit it, etc.

So I don't mind some default, helpful message there (such as 'Press 'h' to see 
hotkeys'), but it appears this particular message is context and usage 
sensitive, which wasn't really the goal, right?

Thanks,

Ingo

Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file

2015-05-29 Thread Ingo Molnar

* Andi Kleen a...@linux.intel.com wrote:

  So instead of this flat structure, there should at minimum be broad 
  categorization of the various parts of the hardware they relate to: 
  whether they relate to the branch predictor, memory caches, TLB caches, 
  memory ops, offcore, decoders, execution units, FPU ops, etc., etc. - so 
  that they can be queried via 'perf list'.
 
 The categorization is generally on the stem name, which already works fine 
 with the existing perf list wildcard support. So for example you only want 
 branches.

 perf list br*
 ...
   br_inst_exec.all_branches 
[Speculative and retired branches]
   br_inst_exec.all_conditional  
[Speculative and retired macro-conditional branches]
   br_inst_exec.all_direct_jmp   
[Speculative and retired macro-unconditional branches excluding calls 
 and indirects]
   br_inst_exec.all_direct_near_call 
[Speculative and retired direct near calls]
   br_inst_exec.all_indirect_jump_non_call_ret   
[Speculative and retired indirect branches excluding calls and returns]
   br_inst_exec.all_indirect_near_return 
[Speculative and retired indirect return branches]
 ...
 
 Or mid level cache events:
 
 perf list l2*
 ...
   l2_l1d_wb_rqsts.all   
[Not rejected writebacks from L1D to L2 cache lines in any state]
   l2_l1d_wb_rqsts.hit_e 
[Not rejected writebacks from L1D to L2 cache lines in E state]
   l2_l1d_wb_rqsts.hit_m 
[Not rejected writebacks from L1D to L2 cache lines in M state]
   l2_l1d_wb_rqsts.miss  
[Count the number of modified Lines evicted from L1 and missed L2. 
 (Non-rejected WBs from the DCU.)]
   l2_lines_in.all   
[L2 cache lines filling L2]
 ...
 
 There are some exceptions, but generally it works this way.

You are missing my point in several ways:

1)

Firstly, there are _tons_ of 'exceptions' to the 'stem name' grouping, to the 
level that makes it unusable for high level grouping of events.

Here's the 'stem name' histogram on the SandyBridge event list:

  $ grep EventName pmu-events/arch/x86/SandyBridge_core.json | cut -d\. -f1 |
      cut -d\" -f4 | cut -d\_ -f1 | sort | uniq -c | sort -n

  1 AGU
  1 BACLEARS
  1 EPT
  1 HW
  1 ICACHE
  1 INSTS
  1 PAGE
  1 ROB
  1 RS
  1 SQ
  2 ARITH
  2 DSB2MITE
  2 ILD
  2 LOAD
  2 LOCK
  2 LONGEST
  2 MISALIGN
  2 SIMD
  2 TLB
  3 CPL
  3 DSB
  3 INST
  3 INT
  3 LSD
  3 MACHINE
  4 CPU
  4 OTHER
  4 PARTIAL
  5 CYCLE
  5 ITLB
  6 LD
  7 L1D
  8 DTLB
 10 FP
 12 RESOURCE
 21 UOPS
 24 IDQ
 25 MEM
 37 BR
 37 L2
131 OFFCORE

Out of 386 events. This grouping has the following severe problems:

  - that's 41 'stem name' groups, way too many as a first hop high level 
    structure. We want the kind of high level categorization I suggested: 
    cache, decoding, branches, execution pipeline, memory events, vector unit 
    events - broad categories which exist in all CPUs and are 
    microarchitecture independent.

  - even these 'stem names' are mostly unstructured and unreadable. The two 
    examples you cited are the best cases that are borderline readable, but 
    they cover less than 20% of all events.

  - the 'stem name' concept is not even used consistently, the names are 
    essentially a random collection of Intel internal acronyms, which 
    occasionally match up with high level concepts. These vendor defined 
    names have very poor high level structure.

  - the 'stem names' are totally imbalanced: there's one 'super' category 
    'stem name': OFFCORE_RESPONSE, with 131 events in it and then there are 
    super small groups in the list above. Not well suited to get a good 
    overview about what measurement capabilities the hardware has.

So forget about using 'stem names' as the high level structure. These events 
have no high level structure and we should provide that, instead of dumping 
380+ events on the unsuspecting user.

2)

Secondly, categorization and higher level hierarchy should be used to keep the 
list manageable. The fact that if _you_ know what to search for you can list 
just a subset does not mean anything to the new user trying to discover events.

A simple 'perf list' should list the high level categories by default, with a 
count displayed that shows how many further events are within that category. 
(compacted tree output would be usable as well.)
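
To make the idea concrete, here is a minimal sketch of a category listing 
with counts. It is not how 'perf list' works today: the category_of and 
list_categories helpers and the whole stem-to-category table are 
hypothetical, chosen only to illustrate the broad categories named above.

```shell
#!/bin/bash
# Hypothetical mapping from a vendor event name to a broad category.
# The table is an illustrative assumption, not an official classification.
category_of() {
    case "$1" in
    BR*|BACLEARS*)                  echo branches ;;
    L1D*|L2*|ICACHE*)               echo cache ;;
    MEM*|LD*|LOAD*|OFFCORE*)        echo memory ;;
    DTLB*|ITLB*|TLB*|PAGE*|EPT*)    echo memory ;;
    IDQ*|DSB*|ILD*|LSD*)            echo decoding ;;
    FP*|SIMD*)                      echo vector ;;
    UOPS*|RS*|ROB*|RESOURCE*)       echo execution ;;
    *)                              echo other ;;
    esac
}

# Read event names on stdin, one per line, and print "<count> <category>".
list_categories() {
    while read -r event; do
        category_of "$event"
    done | sort | uniq -c
}
```

Feeding it the EventName values extracted from a JSON file would print one 
line per category instead of several hundred event names. A real 
implementation would carry the category in the event tables themselves 
rather than guess it from the name.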

 The stem could be put into a separate header, but it would seem redundant 
 to me.

Higher level categories simply don't exist in these names in any usable form, 
so it has to be created. Just redundantly 

Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file

2015-05-28 Thread Ingo Molnar

* Ingo Molnar mi...@kernel.org wrote:

 
 * Jiri Olsa jo...@redhat.com wrote:
 
  On Wed, May 27, 2015 at 11:59:04PM +0900, Namhyung Kim wrote:
   Hi Andi,
   
   On Wed, May 27, 2015 at 11:40 PM, Andi Kleen a...@linux.intel.com wrote:
So we build tables of all models in the architecture, and choose
matching one when compiling perf, right?  Can't we do that when
building the tables?  IOW, why don't we check the VFM and discard
non-matching tables?  Those non-matching tables are also needed?
   
We build it for all cpus in an architecture, not all architectures.
So e.g. for an x86 binary power is not included, and vice versa.
   
   OK.
   
It always includes all CPUs for a given architecture, so it's possible
to use the perf binary on other systems than just the one it was
build on.
   
   So it selects one at run-time not build-time, good.  But I worry about
   the size of the intel tables.  How large are they?  Maybe we can make
   it dynamic-loadable if needed..
  
  just compiled Sukadev's new version with Andi's events list
  and stripped binary size is:
  
  [jolsa@krava perf]$ ls -l perf
  -rwxrwxr-x 1 jolsa jolsa 2772640 May 28 13:49 perf
  
  
  while perf on Arnaldo's perf/core is:
  
  [jolsa@krava perf]$ ls -l perf
  -rwxrwxr-x 1 jolsa jolsa 2334816 May 28 13:49 perf
  
  seems not that bad
 
 It's not bad at all.
 
 Do you have a Git tree URI where I could take a look at its current state? A 
 tree would be nice that has as many of these patches integrated as possible.

A couple of observations:

1)

The x86 JSON files are unnecessarily large, and for no good reason, for example:

 triton:~/tip/tools/perf/pmu-events/arch/x86 grep -h EdgeDetect * | sort | uniq -c
   5534 EdgeDetect: 0,
 57 EdgeDetect: 1,

it's ridiculous to repeat EdgeDetect: 0 more than 5 thousand times, just so 
that in 57 cases we can say '1'. Those lines should be omitted, and the default 
value should be 0.

This would reduce the source code line count of the JSON files by 40% already:

 triton:~/tip/tools/perf/pmu-events/arch/x86 grep ': 0,' * | wc -l
 42127
 triton:~/tip/tools/perf/pmu-events/arch/x86 cat * | wc -l
 103702

And no, I don't care if manufacturers release crappy JSON files - they need to 
be fixed/stripped before being applied to our source tree.
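
A minimal sketch of such a stripping step, assuming one field per line and 
assuming that 0 is the agreed default for every field it touches. The 
strip_default_fields name is hypothetical, and the pattern accepts both 
quoted and unquoted zeros since the exact quoting in the files is not shown 
here:

```shell
#!/bin/bash
# Drop JSON lines of the form   "Field": 0,   or   "Field": "0",
# Only lines ending in a comma are removed, so the last field of an
# object is never dropped and the remaining JSON stays well-formed.
strip_default_fields() {
    grep -v -E '^[[:space:]]*"[A-Za-z_]+":[[:space:]]*"?0"?,[[:space:]]*$'
}
```

It would be used as a filter, e.g. 'strip_default_fields < in.json > 
out.json'. The consumer of the files then has to treat a missing field as 0.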

2)

Also, the JSON files should carry more high level structure than they do today. 
Let's take SandyBridge_core.json as an example: it defines 386 events, but they 
are all in a 'flat' hierarchy, which is almost impossible for all but the most 
expert users to overview.

So instead of this flat structure, there should at minimum be broad 
categorization of the various parts of the hardware they relate to: whether 
they relate to the branch predictor, memory caches, TLB caches, memory ops, 
offcore, decoders, execution units, FPU ops, etc., etc. - so that they can be 
queried via 'perf list'.

We don't just want to import the unstructured mess that these event files are - 
we want to turn them into real structure. We can still keep the messy vendor 
names as well, like IDQ.DSB_CYCLES, but we want to impose structure as well.

3)

There should be good 'perf list' visualization for these events: grouping, 
individual names, with a good interface to query details if needed. I.e. it 
should be possible to browse and discover events relevant to the CPU the tool 
is executing on.

Thanks,

Ingo

Re: [PATCH 2/4] perf: jevents: Program to convert JSON file to C style file

2015-05-28 Thread Ingo Molnar

* Jiri Olsa jo...@redhat.com wrote:

 On Wed, May 27, 2015 at 11:59:04PM +0900, Namhyung Kim wrote:
  Hi Andi,
  
  On Wed, May 27, 2015 at 11:40 PM, Andi Kleen a...@linux.intel.com wrote:
   So we build tables of all models in the architecture, and choose
   matching one when compiling perf, right?  Can't we do that when
   building the tables?  IOW, why don't we check the VFM and discard
   non-matching tables?  Those non-matching tables are also needed?
  
   We build it for all cpus in an architecture, not all architectures.
   So e.g. for an x86 binary power is not included, and vice versa.
  
  OK.
  
   It always includes all CPUs for a given architecture, so it's possible
   to use the perf binary on other systems than just the one it was
   build on.
  
  So it selects one at run-time not build-time, good.  But I worry about
  the size of the intel tables.  How large are they?  Maybe we can make
  it dynamic-loadable if needed..
 
 just compiled Sukadev's new version with Andi's events list
 and stripped binary size is:
 
 [jolsa@krava perf]$ ls -l perf
 -rwxrwxr-x 1 jolsa jolsa 2772640 May 28 13:49 perf
 
 
 while perf on Arnaldo's perf/core is:
 
 [jolsa@krava perf]$ ls -l perf
 -rwxrwxr-x 1 jolsa jolsa 2334816 May 28 13:49 perf
 
 seems not that bad

It's not bad at all.
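
For reference, the difference between the two 'ls -l' sizes quoted above can 
be worked out as follows. The size_delta helper is a hypothetical name used 
only for this calculation; the two numbers are taken from the listing above.

```shell
#!/bin/bash
with_events=2772640   # stripped perf binary including the event tables
baseline=2334816      # stripped perf binary built from perf/core

# Print the size increase in bytes, in KiB and relative to the baseline.
size_delta() {
    awk -v new="$1" -v old="$2" 'BEGIN {
        d = new - old
        printf "%d bytes (%.1f KiB, +%.1f%%)\n", d, d / 1024, d * 100 / old
    }'
}

size_delta "$with_events" "$baseline"
```

That is an increase of a bit under 430 KiB, or just under 19% of the 
baseline binary, for the event tables of all CPUs of the architecture.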

Do you have a Git tree URI where I could take a look at its current state? A 
tree would be nice that has as many of these patches integrated as possible.

Thanks,

Ingo

Re: provide more common DMA API functions V2

2015-08-18 Thread Ingo Molnar

* Andrew Morton a...@linux-foundation.org wrote:

 On Tue, 18 Aug 2015 07:38:25 +0200 Christoph Hellwig h...@lst.de wrote:
 
  On Mon, Aug 17, 2015 at 02:24:29PM -0700, Andrew Morton wrote:
   110254 bytes saved, shrinking the kernel by a whopping 0.17%. 
   Thoughts?
  
  Sounds fine to me.
 
 OK, I'll clean it up a bit, check that each uninlining actually makes
 sense and then I'll see how it goes.
 
   
   I'll merge these 5 patches for 4.3.  That means I'll release them into
   linux-next after 4.2 is released.
  
  So you only add for-4.3 code to -next after 4.2 is odd?  Isn't that the
  wrong way around?
 
 Linus will be releasing 4.2 in 1-2 weeks and until then, linux-next is
 supposed to contain only 4.2 material.  Once 4.2 is released,
 linux-next is open for 4.3 material.

Isn't that off by one?

I.e. shouldn't this be:

 I'll merge these 5 patches for 4.4.  That means I'll release them into 
 linux-next after 4.2 is released.

 [...]
 
 Linus will be releasing 4.2 in 1-2 weeks and until then, linux-next is 
 supposed to contain only 4.3 material.  Once 4.2 is released and the 4.3 
 merge window opens, linux-next is open for 4.4 material.

?

Thanks,

Ingo

Re: [PATCH 3/8] jump_label: introduce DEFINE_STATIC_KEY_{TRUE,FALSE}_ARRAY macros

2015-08-21 Thread Ingo Molnar

* Kevin Hao haoke...@gmail.com wrote:

 On Fri, Aug 21, 2015 at 08:28:26AM +0200, Ingo Molnar wrote:
  
  * Kevin Hao haoke...@gmail.com wrote:
  
   These are used to define a static_key_{true,false} array.
   
   Signed-off-by: Kevin Hao haoke...@gmail.com
   ---
include/linux/jump_label.h | 6 ++
1 file changed, 6 insertions(+)
   
   diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
   index 7f653e8f6690..5c1d6a49dd6b 100644
   --- a/include/linux/jump_label.h
   +++ b/include/linux/jump_label.h
   @@ -267,6 +267,12 @@ struct static_key_false {
#define DEFINE_STATIC_KEY_FALSE(name)\
 struct static_key_false name = STATIC_KEY_FALSE_INIT

   +#define DEFINE_STATIC_KEY_TRUE_ARRAY(name, n)    \
   + struct static_key_true name[n] = { [0 ... n - 1] = STATIC_KEY_TRUE_INIT }
   +
   +#define DEFINE_STATIC_KEY_FALSE_ARRAY(name, n)   \
   + struct static_key_false name[n] = { [0 ... n - 1] = STATIC_KEY_FALSE_INIT }
  
  I think the define makes the code more obfuscated and less clear, the 
  open-coded initialization is pretty dense and easy to read to begin with.
 
 OK, I will drop this patch and move the initialization of the array to the 
 corresponding patch.

Please also Cc: peterz and me to the next submission of the series - static key 
(and jump label) changes go through the locking tree normally, and there's a 
number of changes pending already for v4.3:

20f9ed1568c0 locking/static_keys: Make verify_keys() static
412758cb2670 jump label, locking/static_keys: Update docs
2bf9e0ab08c6 locking/static_keys: Provide a selftest
ed79e946732e s390/uaccess, locking/static_keys: employ static_branch_likely()
3bbfafb77a06 x86, tsc, locking/static_keys: Employ static_branch_likely()
1987c947d905 locking/static_keys: Add selftest
11276d5306b8 locking/static_keys: Add a new static_key interface
706249c222f6 locking/static_keys: Rework update logic
e33886b38cc8 locking/static_keys: Add static_key_{en,dis}able() helpers
7dcfd915bae5 jump_label: Add jump_entry_key() helper
a1efb01feca5 jump_label, locking/static_keys: Rename JUMP_LABEL_TYPE_* and 
related helpers to the static_key* pattern

Thanks,

Ingo

Re: [PATCH 3/8] jump_label: introduce DEFINE_STATIC_KEY_{TRUE,FALSE}_ARRAY macros

2015-08-21 Thread Ingo Molnar

* Kevin Hao haoke...@gmail.com wrote:

 These are used to define a static_key_{true,false} array.
 
 Signed-off-by: Kevin Hao haoke...@gmail.com
 ---
  include/linux/jump_label.h | 6 ++
  1 file changed, 6 insertions(+)
 
 diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
 index 7f653e8f6690..5c1d6a49dd6b 100644
 --- a/include/linux/jump_label.h
 +++ b/include/linux/jump_label.h
 @@ -267,6 +267,12 @@ struct static_key_false {
  #define DEFINE_STATIC_KEY_FALSE(name)\
   struct static_key_false name = STATIC_KEY_FALSE_INIT
  
 +#define DEFINE_STATIC_KEY_TRUE_ARRAY(name, n)\
 + struct static_key_true name[n] = { [0 ... n - 1] = STATIC_KEY_TRUE_INIT }
 +
 +#define DEFINE_STATIC_KEY_FALSE_ARRAY(name, n)   \
 + struct static_key_false name[n] = { [0 ... n - 1] = STATIC_KEY_FALSE_INIT }

I think the define makes the code more obfuscated and less clear, the
open-coded initialization is pretty dense and easy to read to begin with.
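
For reference, the open-coded form amounts to roughly the following. This is a userspace sketch only: struct demo_key_true, its .enabled field, DEMO_KEY_TRUE_INIT and DEMO_NR_FEATURES are made-up stand-ins, not the real static_key_true definitions. The range designator is the GNU C extension the patch itself relies on.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct static_key_true; NOT the real layout. */
struct demo_key_true {
	int enabled;
};

#define DEMO_KEY_TRUE_INIT	{ .enabled = 1 }
#define DEMO_NR_FEATURES	4

/* Open-coded range initializer (GNU C extension): every slot starts enabled. */
static struct demo_key_true demo_feature_keys[DEMO_NR_FEATURES] = {
	[0 ... DEMO_NR_FEATURES - 1] = DEMO_KEY_TRUE_INIT
};

/* Count how many keys in the array are currently enabled. */
static size_t demo_count_enabled(void)
{
	size_t i, n = 0;

	for (i = 0; i < DEMO_NR_FEATURES; i++)
		n += demo_feature_keys[i].enabled != 0;
	return n;
}
```

The array definition is a single statement at the point of use, which is the density argument above: a wrapper macro would only hide the element count and the initializer.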

Thanks,

Ingo

Re: [GIT PULL 0/1] perf/urgent fix

2015-10-07 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> From: Arnaldo Carvalho de Melo 
> 
> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> The following changes since commit 097f70b3c4d84ffccca15195bdfde3a37c0a7c0f:
> 
>   Merge branch 'upstream' of 
> git://git.linux-mips.org/pub/scm/ralf/upstream-linus (2015-09-27 18:22:34 
> -0400)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-urgent-for-mingo
> 
> for you to fetch changes up to 9fb4765451f22c5e782c1590747717550bff34b2:
> 
>   perf tools: Fix build break on powerpc due to sample_reg_masks (2015-10-07 
> 10:20:08 -0300)
> 
> 
> perf/urgent fix:
> 
> - Fix build break on (at least) powerpc due to sample_reg_masks, not being
>   available for linking (Sukadev Bhattiprolu)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Sukadev Bhattiprolu (1):
>   perf tools: Fix build break on powerpc due to sample_reg_masks
> 
>  tools/perf/util/Build   | 2 +-
>  tools/perf/util/perf_regs.c | 2 ++
>  tools/perf/util/perf_regs.h | 1 +
>  3 files changed, 4 insertions(+), 1 deletion(-)

Pulled, thanks Arnaldo!

Ingo

Re: [PATCH v2 0/6] powerpc: use jump label for {cpu,mmu}_has_feature()

2015-08-25 Thread Ingo Molnar

* Kevin Hao haoke...@gmail.com wrote:

 Hi,
 
 v2:
 Drop the following two patches as suggested by Ingo and Peter:
 jump_label: no need to acquire the jump_label_mutex in jump_lable_init()
 jump_label: introduce DEFINE_STATIC_KEY_{TRUE,FALSE}_ARRAY macros
 
 v1:
 I tried to change {cpu,mmu}_has_feature() to use jump labels two years
 ago [1], but that code seemed a bit ugly. This is a reimplementation that moves
 jump_label_init() much earlier so that jump labels can be used at a very early
 stage. Boot tested on p4080ds, t2080rdb and powermac (qemu). This patch series
 is against linux-next.
 
 [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2013-September/111026.html
 
 Kevin Hao (6):
   jump_label: make it possible for the archs to invoke jump_label_init()
 much earlier
   powerpc: invoke jump_label_init() in a much earlier stage
   powerpc: kill mfvtb()
   powerpc: move the cpu_has_feature to a separate file
   powerpc: use the jump label for cpu_has_feature
   powerpc: use jump label for mmu_has_feature
 
  arch/powerpc/include/asm/cacheflush.h   |  1 +
  arch/powerpc/include/asm/cpufeatures.h  | 34 ++
  arch/powerpc/include/asm/cputable.h | 16 +++---
  arch/powerpc/include/asm/cputime.h  |  1 +
  arch/powerpc/include/asm/dbell.h|  1 +
  arch/powerpc/include/asm/dcr-native.h   |  1 +
  arch/powerpc/include/asm/mman.h |  1 +
  arch/powerpc/include/asm/mmu.h  | 29 ++
  arch/powerpc/include/asm/reg.h  |  9 
  arch/powerpc/include/asm/time.h |  3 ++-
  arch/powerpc/include/asm/xor.h  |  1 +
  arch/powerpc/kernel/align.c |  1 +
  arch/powerpc/kernel/cputable.c  | 37 +
  arch/powerpc/kernel/irq.c   |  1 +
  arch/powerpc/kernel/process.c   |  1 +
  arch/powerpc/kernel/setup-common.c  |  1 +
  arch/powerpc/kernel/setup_32.c  |  5 +
  arch/powerpc/kernel/setup_64.c  |  4 
  arch/powerpc/kernel/smp.c   |  1 +
  arch/powerpc/platforms/cell/pervasive.c |  1 +
  arch/powerpc/xmon/ppc-dis.c |  1 +
  kernel/jump_label.c |  3 +++
  22 files changed, 135 insertions(+), 18 deletions(-)
  create mode 100644 arch/powerpc/include/asm/cpufeatures.h

Looks good to me!

Thanks,

Ingo

Re: [GIT PULL 00/16] perf/core improvements and fixes

2015-10-01 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> The following changes since commit 9c17dbc6eb73bdd8a6aaea1baefd37ff78d86148:
> 
>   Merge tag 'perf-core-for-mingo' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2015-09-29 09:43:46 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo
> 
> for you to fetch changes up to 7f8d1ade1b19f684ed3a7c4fb1dc5d347127b438:
> 
>   perf tools: By default use the most precise "cycles" hw counter available 
> (2015-09-30 18:34:39 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> User visible:
> 
> - By default use the most precise "cycles" hw counter available, i.e.
>   when the user doesn't specify any event, it will try using cycles:ppp,
>   cycles:pp, etc (Arnaldo Carvalho de Melo)

That looks really useful!
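
The fallback order described there can be sketched like this. It is a minimal model, not the perf tools code: try_open() is a hypothetical callback standing in for whatever actually opens the event, the two stubs are invented for illustration, and the level 3 ceiling is taken from the three 'p's in cycles:ppp.

```c
#include <stddef.h>

/* Highest precise level, matching the three 'p's in cycles:ppp. */
#define DEMO_MAX_PRECISE	3

/*
 * Try the most precise level first and fall back one level at a time.
 * try_open() is a hypothetical callback: 0 on success, nonzero on failure.
 * Returns the level that worked, or -1 if even plain "cycles" failed.
 */
static int demo_pick_precise(int (*try_open)(int precise))
{
	int level;

	for (level = DEMO_MAX_PRECISE; level >= 0; level--) {
		if (try_open(level) == 0)
			return level;
	}
	return -1;
}

/* Example stub: pretend the hardware only supports up to cycles:p. */
static int demo_hw_supports_up_to_p(int precise)
{
	return precise <= 1 ? 0 : -1;
}

/* Example stub: pretend no cycles event can be opened at all. */
static int demo_hw_supports_nothing(int precise)
{
	(void)precise;
	return -1;
}
```

With the first stub the loop rejects levels 3 and 2 and settles on 1, i.e. the equivalent of cycles:p.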

> - Remove blank lines, headers when piping output in 'perf list', so that it can
>   be sanely used with 'wc -l', etc (Arnaldo Carvalho de Melo)
> 
> - Amend documentation about max_stack and synthesized callchains (Adrian Hunter)
> 
> - Fix 'perf probe -l' for probes added to kernel module functions (Masami Hiramatsu)
> 
> Build fixes:
> 
> - Fix shadowed declarations that break the build on older distros (Jiri Olsa)
> 
> - Fix build break on powerpc due to sample_reg_masks (Sukadev Bhattiprolu)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Adrian Hunter (1):
>   perf report: Amend documentation about max_stack and synthesized 
> callchains
> 
> Arnaldo Carvalho de Melo (7):
>   perf maps: Introduce maps__find_symbol_by_name()
>   perf machine: Use machine__kernel_map() thoroughly
>   perf machine: Add method for common kernel_map(FUNCTION) operation
>   tools lib symbol: Rename kallsyms2elf_type to kallsyms2elf_binding
>   tools lib symbol: Introduce kallsyms2elf_type
>   perf list: Remove blank lines, headers when piping output
>   perf tools: By default use the most precise "cycles" hw counter 
> available
> 
> Jiri Olsa (2):
>   tools: Fix shadowed declaration in err.h
>   perf tools: Fix shadowed declaration in parse-events.c
> 
> Masami Hiramatsu (5):
>   perf probe: Fix to remove dot suffix from second or latter events
>   perf probe: Begin and end libdwfl report session correctly
>   perf probe: Show correct source lines of probes on kmodules
>   perf probe: Fix a segfault bug in debuginfo_cache
>   perf probe: Improve error message when %return is on inlined function
> 
> Sukadev Bhattiprolu (1):
>   perf tools: Fix build break on powerpc due to sample_reg_masks
> 
>  tools/include/linux/err.h|  4 +-
>  tools/lib/symbol/kallsyms.c  |  6 ++
>  tools/lib/symbol/kallsyms.h  |  4 +-
>  tools/perf/Documentation/perf-report.txt |  2 +
>  tools/perf/builtin-kmem.c|  2 +-
>  tools/perf/builtin-list.c|  2 +-
>  tools/perf/builtin-report.c  |  2 +-
>  tools/perf/tests/code-reading.c  |  2 +-
>  tools/perf/tests/vmlinux-kallsyms.c  |  4 +-
>  tools/perf/util/Build|  2 +-
>  tools/perf/util/event.c  |  7 +--
>  tools/perf/util/evlist.c | 22 +++-
>  tools/perf/util/intel-pt.c   |  2 +-
>  tools/perf/util/machine.c| 26 -
>  tools/perf/util/machine.h|  8 ++-
>  tools/perf/util/map.c| 21 ---
>  tools/perf/util/map.h|  2 +
>  tools/perf/util/parse-events.c   | 53 +-
>  tools/perf/util/perf_regs.c  |  2 +
>  tools/perf/util/perf_regs.h  |  1 +
>  tools/perf/util/pmu.c|  2 +-
>  tools/perf/util/probe-event.c| 96
>  tools/perf/util/probe-finder.c   | 58 +--
>  tools/perf/util/symbol.c |  2 +-
>  24 files changed, 224 insertions(+), 108 deletions(-)

Pulled, thanks a lot Arnaldo!

Ingo

Re: [PATCH v3 0/2] Consolidate redundant register/stack access code

2016-02-09 Thread Ingo Molnar

* Michael Ellerman  wrote:

> On Tue, 2016-02-09 at 00:38 -0500, David Long wrote:
> 
> > From: "David A. Long" 
> >
> > Move duplicate and functionally equivalent code for accessing registers
> > and stack (CONFIG_HAVE_REGS_AND_STACK_ACCESS_API) from arch subdirs into
> > common kernel files.
> >
> > I'm sending this out again (with updated distribution list) because v2
> > just never got pulled in, even though I don't think there were any
> > outstanding issues.
> 
> A big cross arch patch like this would often get taken by Andrew Morton, but
> AFAICS you didn't CC him - so I just added him, perhaps he'll pick it up for
> us :D

The other problem is that the second patch is commingling changes to 6 separate
architectures:

 16 files changed, 106 insertions(+), 343 deletions(-)

that should probably be 6 separate patches. Easier to review, easier to bisect
to, easier to revert, etc.

Thanks,

Ingo

Re: [GIT PULL 00/16] perf/core improvements and fixes

2016-02-03 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   This is on top of the previously submitted perf-core-for-mingo tag,
> please consider applying,
> 
> - Arnaldo
> 
> The following changes since commit 5ac76283b32b116c58e362e99542182ddcfc8262:
> 
>   perf cpumap: Auto initialize cpu__max_{node,cpu} (2016-01-26 16:08:36 -0300)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-2
> 
> for you to fetch changes up to 814568db641f6587c1e98a3a85f214cb6a30fe10:
> 
>   perf build: Align the names of the build tests: (2016-01-29 17:51:04 -0300)
> 
> 
> New features:
> 
> - Port 'perf kvm stat' to PowerPC (Hemant Kumar)
> 
> Infrastructure:
> 
> - Use the 'feature-dump' target to do the feature checks just once and then
>   add code to reuse that in the tests/make makefile, speeding up the
>   'make -C tools/perf build-test' target (Wang Nan)
> 
> - Reduce the number of tests the 'build-test' target does to those that don't
>   pollute the source tree (Arnaldo Carvalho de Melo)
> 
> - Improve the output of the build tests a bit by aligning the name of the
>   tests, more can be done to filter out uninteresting info in the output
>   (Arnaldo Carvalho de Melo)
> 
> - Add perf_evlist pointer to *info_priv_size(), more prep work for
>   supporting the coresight architecture (Mathieu Poirier)
> 
> - Improve the 'perf test bp_signal' test (Wang Nan)
> 
> - Check environment before starting the BPF 'perf test', so that we can just
>   'Skip' older kernels instead of 'FAIL'ing them (Wang Nan)
> 
> - Fix cpumode of synthesized buildid event (Wang Nan)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (2):
>   perf tools: Speed up build-tests by reducing the number of builds tested
>   perf build: Align the names of the build tests:
> 
> Hemant Kumar (4):
>   perf kvm/{x86,s390}: Remove dependency on uapi/kvm_perf.h
>   perf kvm/{x86,s390}: Remove const from kvm_events_tp
>   perf kvm/powerpc: Port perf kvm stat to powerpc
>   perf kvm/powerpc: Add support for HCALL reasons
> 
> Jiri Olsa (1):
>   perf build: Fix feature-dump checks, we need to test all features
> 
> Mathieu Poirier (1):
>   perf auxtrace: Add perf_evlist pointer to *info_priv_size()
> 
> Wang Nan (8):
>   tools build: Check basic headers for test-compile feature checker
>   perf build: Remove all condition feature check {C,LD}FLAGS
>   perf build: Use feature dump file for build-test
>   perf buildid: Fix cpumode of buildid event
>   perf test: Check environment before start real BPF test
>   perf test: Improve bp_signal
>   perf tools: Move timestamp creation to util
>   perf record: Use OPT_BOOLEAN_SET for buildid cache related options
> 
>  tools/build/Makefile.feature   |   8 ++
>  tools/build/feature/test-compile.c |   2 +
>  tools/perf/Makefile|  11 +-
>  tools/perf/arch/powerpc/Makefile   |   2 +
>  tools/perf/arch/powerpc/util/Build |   1 +
>  tools/perf/arch/powerpc/util/book3s_hcalls.h   | 123 ++
>  tools/perf/arch/powerpc/util/book3s_hv_exits.h |  33 +
>  tools/perf/arch/powerpc/util/kvm-stat.c| 170 +
>  tools/perf/arch/s390/util/kvm-stat.c   |  10 +-
>  tools/perf/arch/x86/util/intel-bts.c   |   4 +-
>  tools/perf/arch/x86/util/intel-pt.c|   4 +-
>  tools/perf/arch/x86/util/kvm-stat.c|  16 ++-
>  tools/perf/builtin-buildid-cache.c |  14 +-
>  tools/perf/builtin-kvm.c   |  38 --
>  tools/perf/builtin-record.c|  12 +-
>  tools/perf/config/Makefile | 101 +++
>  tools/perf/tests/bp_signal.c   | 140 
>  tools/perf/tests/bpf.c |  37 ++
>  tools/perf/tests/make  |  39 +-
>  tools/perf/util/auxtrace.c |   7 +-
>  tools/perf/util/auxtrace.h |   6 +-
>  tools/perf/util/build-id.c |   6 +-
>  tools/perf/util/kvm-stat.h |   8 +-
>  tools/perf/util/util.c |  17 +++
>  tools/perf/util/util.h |   1 +
>  25 files changed, 688 insertions(+), 122 deletions(-)
>  create mode 100644 tools/perf/arch/powerpc/util/book3s_hcalls.h
>  create mode 100644 tools/perf/arch/powerpc/util/book3s_hv_exits.h
>  create mode 100644 tools/perf/arch/powerpc/util/kvm-stat.c

Pulled, thanks a lot Arnaldo!

Ingo

Re: [PATCH 0/4] support for text-relative kallsyms table

2016-01-20 Thread Ingo Molnar

* Ard Biesheuvel  wrote:

> This implements text-relative kallsyms address tables. This was developed as
> part of my series to implement KASLR/CONFIG_RELOCATABLE for arm64, but I think
> it may be beneficial to other architectures as well, so I am presenting it as a
> separate series.
> 
> The idea is that on 64-bit builds, it is rather wasteful to use absolute
> addressing for kernel symbols since they are all within a couple of MBs of each
> other. On top of that, the absolute addressing implies that, when the kernel is
> relocated at runtime, each address in the table needs to be fixed up
> individually.
> 
> Since all section-relative addresses are already emitted relative to _text, it
> is quite straight-forward to record only the offset, and add the absolute
> address of _text at runtime when referring to the address table.
> 
> The reduction ranges from around 250 KB uncompressed vmlinux size and 10 KB
> compressed size (s390) to 3 MB/500 KB for ppc64 (although, in the latter case,
> the reduction in uncompressed size is primarily __init data)

So since kallsyms is in unswappable kernel RAM, the uncompressed size reduction
is what we care about mostly. How much bootloader load times are impacted is a
third order concern.

IOW a nice change!
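
To make the idea concrete, here is a small sketch of offset-based lookup. It assumes 32-bit offsets from the start of text; demo_sym_offsets, demo_sym_address() and the sample values are invented for illustration and are not the actual kallsyms table layout from the series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: 32-bit offsets from the start of text. */
static const uint32_t demo_sym_offsets[] = { 0x0000, 0x0040, 0x1a80 };

#define DEMO_NR_SYMS (sizeof(demo_sym_offsets) / sizeof(demo_sym_offsets[0]))

/*
 * Rebuild the absolute address at lookup time: text base plus stored offset.
 * Returns 0 if idx is out of range.
 */
static uint64_t demo_sym_address(uint64_t text_base, size_t idx)
{
	if (idx >= DEMO_NR_SYMS)
		return 0;
	return text_base + demo_sym_offsets[idx];
}

/* Bytes saved compared with a table of absolute 64-bit addresses. */
static size_t demo_bytes_saved(void)
{
	return DEMO_NR_SYMS * (sizeof(uint64_t) - sizeof(uint32_t));
}
```

Relocating the kernel only changes the text_base argument; the table itself needs no per-entry fixup, and each entry is half the size of an absolute 64-bit address.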

Thanks,

Ingo

Re: [PATCH v3 0/2] Consolidate redundant register/stack access code

2016-02-17 Thread Ingo Molnar

* David Long <dave.l...@linaro.org> wrote:

> On 02/09/2016 04:45 AM, Ingo Molnar wrote:
> >
> >* Michael Ellerman <m...@ellerman.id.au> wrote:
> >
> >>On Tue, 2016-02-09 at 00:38 -0500, David Long wrote:
> >>
> >>>From: "David A. Long" <dave.l...@linaro.org>
> >>>
> >>>Move duplicate and functionally equivalent code for accessing registers
> >>>and stack (CONFIG_HAVE_REGS_AND_STACK_ACCESS_API) from arch subdirs into
> >>>common kernel files.
> >>>
> >>>I'm sending this out again (with updated distribution list) because v2
> >>>just never got pulled in, even though I don't think there were any
> >>>outstanding issues.
> >>
> >>A big cross arch patch like this would often get taken by Andrew Morton, but
> >>AFAICS you didn't CC him - so I just added him, perhaps he'll pick it up for
> >>us :D
> >
> >The other problem is that the second patch is commingling changes to 6
> >separate architectures:
> >
> >  16 files changed, 106 insertions(+), 343 deletions(-)
> >
> >that should probably be 6 separate patches. Easier to review, easier to
> >bisect to, easier to revert, etc.
> >
> >Thanks,
> >
> > Ingo
> >
> 
> I see your point but I'm not sure it could have been broken into separate 
> successive patches that would each build for all architectures.

Why? AFAICS all the functionality appears to be conditional on
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API, so it ought to build standalone as well,
on a per arch basis, as long as the core kernel patch is applied first.

Thanks,

Ingo

Re: [GIT PULL 0/2] perf/urgent fixes

2016-03-31 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> The following changes since commit f6343be96ebbae38a07e0878810f5bbc0c38cade:
> 
>   Merge tag 'perf-urgent-for-mingo-20160329' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent 
> (2016-03-30 12:31:03 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-urgent-for-mingo-20160330
> 
> for you to fetch changes up to 9f56c092b99b40ce3cf4c6d0134ff7e513c9f1a6:
> 
>   perf jit: genelf makes assumptions about endian (2016-03-30 18:12:06 -0300)
> 
> 
> perf/urgent fixes:
> 
> - Fix determination of a callchain node's childlessness in
>   the top/report TUI, which was preventing navigating some
>   callchains, --stdio unaffected (Andres Freund)
> 
> - Fix jitdump's genelf assumption that PowerPC is big endian
>   only (Anton Blanchard)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Andres Freund (1):
>   perf hists: Fix determination of a callchain node's childlessness
> 
> Anton Blanchard (1):
>   perf jit: genelf makes assumptions about endian
> 
>  tools/perf/ui/browsers/hists.c |  2 +-
>  tools/perf/util/genelf.h   | 24 ++--
>  2 files changed, 11 insertions(+), 15 deletions(-)

Pulled, thanks a lot Arnaldo!

Ingo

Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()

2016-04-13 Thread Ingo Molnar

* Srikar Dronamraju  wrote:

> * Anton Blanchard  [2016-04-06 21:59:50]:
> 
> > Looks good, and the patch below does fix the oops for me.
> > 
> > Anton
> > --
> > 
> > task_pt_regs() can return NULL for kernel threads, so add a check.
> > This fixes an oops at boot on ppc64.
> > 
> > Signed-off-by: Anton Blanchard 
> 
> Works for me too.
> 
> Reported-and-Tested-by: Srikar Dronamraju 

Could someone please re-send the fix, because it has not reached me nor lkml.
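
Going only by the quoted changelog, the check being discussed amounts to something like the sketch below. It is not the patch itself: struct demo_task, struct demo_pt_regs, demo_task_pt_regs() and the DEMO_STAT_* indices are mock stand-ins so the logic can be read in isolation.

```c
#include <stddef.h>

/* Mock stand-ins for the kernel types; the fields are invented. */
struct demo_pt_regs {
	int user_mode;
};

struct demo_task {
	struct demo_pt_regs *regs;	/* NULL for kernel threads */
};

enum { DEMO_STAT_USER = 0, DEMO_STAT_SYSTEM = 1 };

/* Mock of task_pt_regs(): may legitimately return NULL. */
static struct demo_pt_regs *demo_task_pt_regs(const struct demo_task *t)
{
	return t->regs;
}

/* Pick the accounting bucket without dereferencing a NULL regs pointer. */
static int demo_charge_index(const struct demo_task *t)
{
	struct demo_pt_regs *regs = demo_task_pt_regs(t);

	/* No register frame (kernel thread): account as system time. */
	if (!regs || !regs->user_mode)
		return DEMO_STAT_SYSTEM;
	return DEMO_STAT_USER;
}
```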

Thanks,

Ingo

Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()

2016-04-13 Thread Ingo Molnar

* Michael Ellerman <m...@ellerman.id.au> wrote:

> On Wed, 2016-04-13 at 09:43 +0200, Ingo Molnar wrote:
> > * Srikar Dronamraju <sri...@linux.vnet.ibm.com> wrote:
> > 
> > > * Anton Blanchard <an...@samba.org> [2016-04-06 21:59:50]:
> > > 
> > > > Looks good, and the patch below does fix the oops for me.
> > > > 
> > > > Anton
> > > > --
> > > > 
> > > > task_pt_regs() can return NULL for kernel threads, so add a check.
> > > > This fixes an oops at boot on ppc64.
> > > > 
> > > > Signed-off-by: Anton Blanchard <an...@samba.org>
> > > 
> > > Works for me too.
> > > 
> > > Reported-and-Tested-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
> > 
> > Could someone please re-send the fix, because it has not reached me nor lkml.
> 
> It did hit LKML:
> 
> http://lkml.kernel.org/r/20160406215950.04bc3f0b@kryten
> 
> But that did have some verbiage at the top.
> 
> Anton's also resent it directly To you.

So it was in my Spam folder, due to the following SPF softfail:

  Received-SPF: softfail (google.com: domain of transitioning an...@samba.org does not designate 198.145.29.136 as permitted sender) client-ip=198.145.29.136;

have the patch now.

Thanks,

Ingo

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-05-02 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > Another idea to detect missing frames: for each return address on the stack,
> > ensure there's a corresponding "call " instruction immediately preceding
> > the return location, where  matches what's on the stack.
> 
> Hmm, interesting.
> 
> I hope your plans include rewriting the current stack unwinder completely. The
> thing in print_context_stack is (a) hard-to-understand and hard-to-modify crap
> and (b) is called in a loop from another file using totally ridiculous
> conventions.

So we had several attempts at making it better, any further improvements
(including radical rewrites) are more than welcome!

The generalization between the various stack walking methods certainly didn't
make things easier to read - we might want to eliminate that by using better
primitives to iterate over the stack frame.
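
As an illustration of the "call precedes the return address" idea quoted above, here is a sketch for the simplest case only, a direct x86 'call rel32' (opcode 0xE8 followed by a 4-byte little-endian displacement). demo_ret_addr_follows_call() and its offset-based interface are assumptions made for the example; indirect calls use other encodings and are not handled.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CALL_REL32_OPCODE	0xE8	/* x86 direct 'call rel32' */
#define DEMO_CALL_REL32_LEN	5	/* opcode + 4-byte displacement */

/*
 * text:       buffer holding the code
 * ret_off:    offset of the return address within that buffer
 * target_off: offset of the function the frame claims was called
 *
 * Returns 1 if a 'call rel32' ends exactly at ret_off and targets
 * target_off, 0 otherwise.
 */
static int demo_ret_addr_follows_call(const uint8_t *text, size_t text_len,
				      size_t ret_off, size_t target_off)
{
	const uint8_t *insn;
	uint32_t raw;
	int64_t dest;

	if (ret_off < DEMO_CALL_REL32_LEN || ret_off > text_len)
		return 0;

	insn = text + ret_off - DEMO_CALL_REL32_LEN;
	if (insn[0] != DEMO_CALL_REL32_OPCODE)
		return 0;

	/* The displacement is little-endian and relative to the return address. */
	raw = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
	      ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
	dest = (int64_t)ret_off + (int32_t)raw;

	return dest >= 0 && (uint64_t)dest == target_off;
}
```

A real unwinder would additionally have to decide what to do when the preceding bytes only happen to look like a call, which this sketch does not attempt.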

Thanks,

Ingo

Re: [GIT PULL 00/17] perf/core improvements and fixes

2016-05-06 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> 
> The following changes since commit 1b6de5917172967acd8db4d222df4225d23a8a60:
> 
>   perf/x86/intel/pt: Convert ACCESS_ONCE()s (2016-05-05 10:16:29 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-20160505
> 
> for you to fetch changes up to b6b85dad30ad7e7394990e2317a780577974a4e6:
> 
>   perf evlist: Rename variable in perf_mmap__read() (2016-05-05 21:04:04 
> -0300)
> 
> 
> perf/core improvements and fixes:
> 
> User visible:
> 
> - Order output of 'perf trace --summary' better, now the threads will
>   appear in ascending order of number of events, and then, for each, in
>   descending order of syscalls by the time spent in the syscalls, so
>   that the last page produced can be the one about the most interesting
>   thread straced, suggested by Milian Wolff (Arnaldo Carvalho de Melo)
> 
> - Do not show the runtime_ms for a thread when not collecting it, that
>   is done so far only with 'perf trace --sched' (Arnaldo Carvalho de Melo)
> 
> - Fix kallsyms perf test on ppc64le (Naveen N. Rao)
> 
> Infrastructure:
> 
> - Move global variables related to presence of some keys in the sort order to a
>   per hist struct, to allow code like the hists browser to work with multiple
>   hists with different lists of columns (Jiri Olsa)
> 
> - Add support for generating bpf prologue in powerpc (Naveen N. Rao)
> 
> - Fix kprobe and kretprobe handling with kallsyms on ppc64le (Naveen N. Rao)
> 
> - evlist mmap changes, prep work for supporting reading backwards (Wang Nan)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (5):
>   perf machine: Introduce number of threads member
>   perf tools: Add template for generating rbtree resort class
>   perf trace: Sort summary output by number of events
>   perf trace: Sort syscalls stats by msecs in --summary
>   perf trace: Do not show the runtime_ms for a thread when not collecting 
> it
> 
> Jiri Olsa (7):
>   perf hists: Move sort__need_collapse into struct perf_hpp_list
>   perf hists: Move sort__has_parent into struct perf_hpp_list
>   perf hists: Move sort__has_sym into struct perf_hpp_list
>   perf hists: Move sort__has_dso into struct perf_hpp_list
>   perf hists: Move sort__has_socket into struct perf_hpp_list
>   perf hists: Move sort__has_thread into struct perf_hpp_list
>   perf hists: Move sort__has_comm into struct perf_hpp_list
> 
> Naveen N. Rao (3):
>   perf tools powerpc: Add support for generating bpf prologue
>   perf powerpc: Fix kprobe and kretprobe handling with kallsyms on ppc64le
>   perf symbols: Fix kallsyms perf test on ppc64le
> 
> Wang Nan (2):
>   perf evlist: Extract perf_mmap__read()
>   perf evlist: Rename variable in perf_mmap__read()
> 
>  tools/perf/arch/powerpc/Makefile|   1 +
>  tools/perf/arch/powerpc/util/dwarf-regs.c   |  40 +---
>  tools/perf/arch/powerpc/util/sym-handling.c |  43 ++--
>  tools/perf/builtin-diff.c   |   4 +-
>  tools/perf/builtin-report.c |   4 +-
>  tools/perf/builtin-top.c|   8 +-
>  tools/perf/builtin-trace.c  |  87 ++--
>  tools/perf/tests/hists_common.c |   2 +-
>  tools/perf/tests/hists_cumulate.c   |   2 +-
>  tools/perf/tests/hists_link.c   |   4 +-
>  tools/perf/tests/hists_output.c |   2 +-
>  tools/perf/ui/browsers/hists.c  |  32 +++---
>  tools/perf/ui/gtk/hists.c   |   2 +-
>  tools/perf/ui/hist.c|   2 +-
>  tools/perf/util/annotate.c  |   2 +-
>  tools/perf/util/callchain.c |   2 +-
>  tools/perf/util/evlist.c|  56 ++-
>  tools/perf/util/hist.c  |  14 +--
>  tools/perf/util/hist.h  |  10 ++
>  tools/perf/util/machine.c   |   9 +-
>  tools/perf/util/machine.h   |   1 +
>  tools/perf/util/probe-event.c   |   5 +-
>  tools/perf/util/probe-event.h   |   3 +-
>  tools/perf/util/rb_resort.h | 149
>  tools/perf/util/sort.c  |  35 +++
>  tools/perf/util/sort.h  |   7 --
>  tools/perf/util/symbol-elf.c|   7 +-
>  tools/perf/util/symbol.h|   3 +-
>  28 files changed, 382 insertions(+), 154 deletions(-)
>  create mode 100644 tools/perf/util/rb_resort.h

Pulled, thanks a lot Arnaldo!

Ingo

Re: [patch V2 30/67] powerpc/numa: Convert to hotplug state machine

2016-07-15 Thread Ingo Molnar

* Anton Blanchard  wrote:

> Hi Anna-Maria,
> 
> > >> Install the callbacks via the state machine and let the core invoke
> > >> the callbacks on the already online CPUs.  
> > >
> > > This is causing an oops on ppc64le QEMU, looks like a NULL
> > > pointer:  
> > 
> > Did you test it against tip WIP.hotplug?
> 
> I noticed tip started failing in my CI environment which tests on QEMU.
> The failure bisected to commit 425209e0abaf2c6e3a90ce4fedb935c10652bf80

That's very useful, thanks Anton!

I have removed this commit from the series for the time being, and refactored
the followup commits (there was one trivial conflict). We can re-try this patch
when a fix is found.

Thanks,

Ingo

Re: [PATCH 0/9] mm: Hardened usercopy

2016-07-08 Thread Ingo Molnar

* Linus Torvalds <torva...@linux-foundation.org> wrote:

> On Fri, Jul 8, 2016 at 1:46 AM, Ingo Molnar <mi...@kernel.org> wrote:
> >
> > Could you please try to find some syscall workload that does many small user
> > copies and thus exercises this code path aggressively?
> 
> Any stat()-heavy path will hit cp_new_stat() very heavily. Think the
> usual kind of "traverse the whole tree looking for something". "git
> diff" will do it, just checking that everything is up-to-date.
> 
> That said, other things tend to dominate.

So I think a cached 'find /usr >/dev/null' might be a good one as well:

 triton:~/tip> strace -c find /usr >/dev/null
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
  47.09    0.006518           0    254697           newfstatat
  26.20    0.003627           0    254795           getdents
  14.45    0.002000           0   1147411           fcntl
   7.33    0.001014           0    509811           close
   3.28    0.000454           0    128220         1 openat
   1.52    0.000210           0    128230           fstat
   0.27    0.16               0     12810           write
   0.00    0.00               0        10           read
 triton:~/tip> perf stat --repeat 3 -e cycles:u,cycles:k,cycles find /usr >/dev/null

 Performance counter stats for 'find /usr' (3 runs):

     1,594,437,143      cycles:u                  ( +-  2.76% )
     2,570,544,009      cycles:k                  ( +-  2.50% )
     4,164,981,152      cycles                    ( +-  2.59% )

       0.929883686 seconds time elapsed           ( +-  2.57% )

... and it's dominated by kernel overhead, with a fair amount of memcpy overhead
as well:

   1.22%  find [kernel.kallsyms]   [k] copy_user_enhanced_fast_string

But maybe there are simple shell commands that are even more user-memcpy
intense?
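
Failing a shell one-liner, a trivial C loop would also do as a micro-benchmark. The sketch below just hammers stat() on one path, relying on Linus's point above that stat-heavy paths hit cp_new_stat(); demo_stat_loop() is a made-up helper name, and the iteration count is whatever gives a stable measurement.

```c
#include <sys/stat.h>

/*
 * Call stat() on the same path 'iterations' times. Each successful call
 * has the kernel copy one struct stat out to user space.
 * Returns the number of successful calls.
 */
static long demo_stat_loop(const char *path, long iterations)
{
	struct stat st;
	long i, ok = 0;

	for (i = 0; i < iterations; i++) {
		if (stat(path, &st) == 0)
			ok++;
	}
	return ok;
}
```

Wrapped in a small main() and run under 'perf stat', as with the find example, it should keep the dentry cached so that the per-call cost is mostly syscall entry plus the small user copy.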

Thanks,

Ingo

Re: [PATCH 0/9] mm: Hardened usercopy

2016-07-09 Thread Ingo Molnar

* Rik van Riel  wrote:

> On Fri, 2016-07-08 at 19:22 -0700, Laura Abbott wrote:
> > 
> > Even with the SLUB fixup I'm still seeing this blow up on my arm64
> > system. This is a
> > Fedora rawhide kernel + the patches
> > 
> > [0.666700] usercopy: kernel memory exposure attempt detected from
> > fc0008b4dd58 () (8 bytes)
> > [0.666720] CPU: 2 PID: 79 Comm: modprobe Tainted:
> > GW   4.7.0-0.rc6.git1.1.hardenedusercopy.fc25.aarch64 #1
> > [0.666733] Hardware name: AppliedMicro Mustang/Mustang, BIOS
> > 1.1.0 Nov 24 2015
> > [0.666744] Call trace:
> > [0.666756] [] dump_backtrace+0x0/0x1e8
> > [0.666765] [] show_stack+0x24/0x30
> > [0.666775] [] dump_stack+0xa4/0xe0
> > [0.666785] [] __check_object_size+0x6c/0x230
> > [0.666795] [] create_elf_tables+0x74/0x420
> > [0.666805] [] load_elf_binary+0x828/0xb70
> > [0.666814] [] search_binary_handler+0xb4/0x240
> > [0.666823] [] do_execveat_common+0x63c/0x950
> > [0.666832] [] do_execve+0x3c/0x50
> > [0.666841] []
> > call_usermodehelper_exec_async+0xe8/0x148
> > [0.666850] [] ret_from_fork+0x10/0x50
> > 
> > This happens on every call to execve. This seems to be the first
> > copy_to_user in
> > create_elf_tables. I didn't get a chance to debug and I'm going out
> > of town
> > all of next week so all I have is the report unfortunately. config
> > attached.
> 
> That's odd, this should be copying a piece of kernel data (not text)
> to userspace.
> 
> from fs/binfmt_elf.c
> 
> 	const char *k_platform = ELF_PLATFORM;
> 
> ...
> 		size_t len = strlen(k_platform) + 1;
> 
> 		u_platform = (elf_addr_t __user *)STACK_ALLOC(p, len);
> 		if (__copy_to_user(u_platform, k_platform, len))
> 			return -EFAULT;
> 
> from arch/arm/include/asm/elf.h:
> 
> #define ELF_PLATFORM_SIZE 8
> #define ELF_PLATFORM (elf_platform)
> 
> extern char elf_platform[];
> 
> from arch/arm/kernel/setup.c:
> 
> char elf_platform[ELF_PLATFORM_SIZE];
> EXPORT_SYMBOL(elf_platform);
> 
> ...
> 
> snprintf(elf_platform, ELF_PLATFORM_SIZE, "%s%c",
>  list->elf_name, ENDIANNESS);
> 
> How does that end up in the .text section of the
> image, instead of in one of the various data sections?
> 
> What kind of linker oddity is going on with ARM?

I think the crash happened on ARM64, not ARM.

Thanks,

Ingo

Re: [PATCH 0/9] mm: Hardened usercopy

2016-07-10 Thread Ingo Molnar

* PaX Team  wrote:

> On 9 Jul 2016 at 14:27, Andy Lutomirski wrote:
> 
> > I like the series, but I have one minor nit to pick. The effect of this
> > series is to harden usercopy, but most of the code is really about
> > infrastructure to validate that a pointed-to object is valid.
> 
> actually USERCOPY has never been about validating pointers. its sole purpose is
> to validate the *size* argument of copy*user calls, a very specific form of
> runtime bounds checking.

What this code has been about originally is largely immaterial, unless you can 
formulate it into a technical argument.

There are a number of cheap tests we can do and there are a number of ways a
'pointer' can be validated at runtime, without any 'size' information:

 - for example if a pointer points into a red zone straight away then we know
   it's bogus.

 - or if a kernel pointer points outside the valid kernel virtual memory range
   we know it's bogus as well.

So while only doing a bounds check might have been the original purpose of the
patch set, Andy's point is that it might make sense to treat this facility as
more generic 'object validation' code for (pointer,size) objects and not limit
it to 'runtime bounds checking'. That kind of extended purpose behind a
facility should be reflected in the naming.

Confusing names are often the source of misunderstandings and bugs.

The 9-patch series as submitted here is neither just 'bounds checking' nor just
pure 'pointer checking', it's about validating that a (pointer,size) range of
memory passed to a (user) memory copy function is fully within a valid object
the kernel might know about (in a fast-to-check fashion).

This necessarily means:

 - the start of the range points to a valid object to begin with (if known)

 - the range itself does not point beyond the end of the object (if known)

 - even if the kernel does not know anything about the pointed to object it can 
   do a pointer check (for example is it pointing inside kernel virtual memory) 
   and do a bounds check on the size.

Do you disagree with that?
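
To make the three conditions above concrete, here is a minimal userspace
sketch. It is not the code from the patch series: struct known_object and both
helper names are hypothetical, and the "valid window" bounds are simply passed
in by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an object the checker knows about. */
struct known_object {
	uintptr_t start;	/* address of the first valid byte */
	size_t    size;		/* object length in bytes */
};

/*
 * Return true if [ptr, ptr + len) lies entirely within obj.
 * Written so that ptr + len is never computed and cannot overflow.
 */
static bool range_within_object(const struct known_object *obj,
				uintptr_t ptr, size_t len)
{
	if (ptr < obj->start)
		return false;			/* starts before the object */

	size_t offset = ptr - obj->start;

	if (offset > obj->size)
		return false;			/* starts past the end */

	return len <= obj->size - offset;	/* must not run off the end */
}

/*
 * Fallback when nothing is known about the pointed-to object: only check
 * that the range stays inside an assumed valid address window [lo, hi).
 */
static bool range_within_window(uintptr_t lo, uintptr_t hi,
				uintptr_t ptr, size_t len)
{
	if (ptr < lo || ptr >= hi)
		return false;			/* pointer itself is bogus */

	return len <= hi - ptr;			/* bounds check on the size */
}
```

The first helper covers the "start is valid" and "range does not run past the
end" cases; the second is the cheaper pointer-plus-size check that remains
possible when the object is unknown.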

> > Might it make sense to call the infrastructure part something else?
> 
> yes, more bikeshedding will surely help, [...]

Insulting and ridiculing a reviewer who explicitly qualified his comments with 
"one minor nit to pick" sure does not help upstream integration either. (Unless 
the goal is to prevent upstream integration.)

> [...] like the renaming of .data..read_only to .data..ro_after_init which also
> had nothing to do with init but everything to do with objects being
> conceptually read-only...

.data..ro_after_init objects get written to during bootup so it's conceptually 
quite confusing to name it "read-only" without any clear qualifiers.

That it's named consistently with its role of "read-write before init and read 
only after init" on the other hand is not confusing at all. Not sure what your 
problem is with the new name.
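
As a rough userspace analogy of "read-write before init and read only after
init" (not the kernel mechanism itself, which places the objects in a dedicated
section and write-protects it once init has finished), the sketch below writes
a value during a setup phase and then drops write permission with mprotect().
The names boot_param and setup_boot_param() are hypothetical.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on strict libc modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical value that is set once during setup and only read later. */
static int *boot_param;

/*
 * "Init" phase: map one page read-write, store the value, then make the
 * page read-only. Returns 0 on success, -1 on failure.
 */
static int setup_boot_param(int value)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	boot_param = p;
	*boot_param = value;		/* still writable: we are in "init" */

	/* From here on the page is read-only; a later write would fault. */
	return mprotect(p, (size_t)page, PROT_READ);
}
```

After setup_boot_param() returns, reads of *boot_param keep working while any
write traps, which is the property the "ro_after_init" name spells out.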

Names within submitted patches get renamed on a routine basis during review.
It's often only minor improvements in naming (which you can consider bike
shedding), but in this particular case the rename was clearly useful in not
just improving the name but in avoiding an actively confusing name. So I
disagree not just with the hostile tone of your reply but with your underlying
technical point as well.

Thanks,

Ingo

Re: [PATCH 0/9] mm: Hardened usercopy

2016-07-08 Thread Ingo Molnar

* Kees Cook  wrote:

> - I couldn't detect a measurable performance change with these features
>   enabled. Kernel build times were unchanged, hackbench was unchanged,
>   etc. I think we could flip this to "on by default" at some point.

Could you please try to find some syscall workload that does many small user
copies and thus exercises this code path aggressively?

If that measurement works out fine then I'd prefer to enable these security
checks by default.
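
One possible shape for such a workload, as a sketch only: push a few bytes
through a pipe in a tight loop, so that every iteration performs one small copy
from user space and one small copy back. The function name and the 64-byte cap
are arbitrary choices for illustration, and whether this particular path is the
most sensitive one for the new checks is an assumption that would need
measuring.

```c
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() */
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/*
 * Do 'iterations' rounds of write() + read() of 'size' bytes through a
 * pipe and return the elapsed time in nanoseconds, or -1 on error.
 * Each round makes the kernel perform two small user copies.
 */
static int64_t time_small_pipe_copies(long iterations, size_t size)
{
	char buf[64] = { 0 };
	struct timespec t0, t1;
	int fds[2];

	if (size == 0 || size > sizeof(buf))
		return -1;
	if (pipe(fds) != 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iterations; i++) {
		if (write(fds[1], buf, size) != (ssize_t)size ||
		    read(fds[0], buf, size) != (ssize_t)size) {
			close(fds[0]);
			close(fds[1]);
			return -1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(fds[0]);
	close(fds[1]);

	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (int64_t)(t1.tv_nsec - t0.tv_nsec);
}
```

Calling it with a large iteration count and an 8-byte size on kernels built
with and without the checks would give a per-copy cost comparison.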

Thanks,

Ingo

Re: [PATCH 1/3] kprobes: introduce weak variant of kprobe_exceptions_notify

2017-02-09 Thread Ingo Molnar

* Michael Ellerman <m...@ellerman.id.au> wrote:

> "Naveen N. Rao" <naveen.n@linux.vnet.ibm.com> writes:
> 
> > kprobe_exceptions_notify() is not used on some of the architectures such
> > as arm[64] and powerpc anymore. Introduce a weak variant for such
> > architectures.
> 
> I'll merge patch 1 & 3 via the powerpc tree for v4.11.

Acked-by: Ingo Molnar <mi...@kernel.org>
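
For readers unfamiliar with weak symbols, a small sketch of the general
technique follows. It uses the GCC/Clang weak attribute directly; the function
name, the struct and the return code are hypothetical stand-ins and not the
kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's notifier type and return code. */
struct demo_notifier_block;
#define DEMO_NOTIFY_DONE 0

/*
 * Weak default: the linker uses this definition only when no other object
 * file provides a strong definition of the same symbol. An architecture
 * that still needs the hook supplies its own non-weak version.
 */
__attribute__((weak))
int demo_exceptions_notify(struct demo_notifier_block *self,
			   unsigned long val, void *data)
{
	(void)self;
	(void)val;
	(void)data;
	return DEMO_NOTIFY_DONE;	/* nothing to handle by default */
}
```

Architectures that no longer use the hook then need no code at all, while the
others override the default simply by defining the function normally.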

Thanks,

Ingo


Re: [PATCH tip/core/rcu 2/3] srcu: Force full grace-period ordering

2017-01-15 Thread Ingo Molnar

* Paul E. McKenney  wrote:

> > [sounds of rummaging around in the Git tree]
> > 
> > I found this commit of yours from ancient history (more than a year ago!):
> > 
> >   commit 12d560f4ea87030667438a169912380be00cea4b
> >   Author: Paul E. McKenney 
> >   Date:   Tue Jul 14 18:35:23 2015 -0700
> > 
> >       rcu,locking: Privatize smp_mb__after_unlock_lock()
> > 
> >       RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> >       likely the only thing that ever will use it, so this commit makes this
> >       macro private to RCU.
> > 
> >       Signed-off-by: Paul E. McKenney 
> >       Cc: Will Deacon 
> >       Cc: Peter Zijlstra 
> >       Cc: Benjamin Herrenschmidt 
> >       Cc: "linux-a...@vger.kernel.org" 
> > 
> > So I concur and I'm fine with your patch - or with the status quo code as 
> > well.
> 
> I already have the patch queued, so how about I keep it if I get an ack
> from the powerpc guys and drop it otherwise?

Yeah, sounds good! Your patch made me look up 'RelAcq' so it has documentation 
value as well ;-)

Thanks,

Ingo


Re: [PATCH v20 00/20] perf, tools: Add support for PMU events in JSON format

2016-09-13 Thread Ingo Molnar

* Michael Ellerman <m...@ellerman.id.au> wrote:

> Jiri Olsa <jo...@redhat.com> writes:
> 
> > On Wed, Aug 31, 2016 at 09:15:30AM -0700, Andi Kleen wrote:
> >> > > 
> >> > > > 
> >> > > > I've already made some changes in pmu-events/* to support
> >> > > > this hierarchy to see how bad the change would be.. and
> >> > > > it's not that bad ;-)
> >> > > 
> >> > > Everything has to be automated, please no manual changes.
> >> > 
> >> > sure
> >> > 
> >> > so, if you're ok with the layout, how do you want to proceed further?
> >> 
> >> If the split version is acceptable it's fine for me to merge it.
> >> 
> >> I'll add split-json to my scripting, so the next update would
> >> be split too.
> >
> > ook, I'll wait for patches then
> 
> Who are you waiting for patches from?
> 
> Would be great if this could go in for 4.9 still.

No objections from me - the latest bits were good:

  Acked-by: Ingo Molnar <mi...@kernel.org>

Thanks,

Ingo


Re: [GIT PULL 00/27] perf/core improvements and fixes

2016-09-29 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, more to come soon,
> 
> - Arnaldo
> 
> Build and test results at the end of this message.
> 
> The following changes since commit 6b652de2b27c0a4020ce0e8f277e782b6af76096:
> 
>   Merge tag 'perf-core-for-mingo-20160922' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2016-09-23 07:21:38 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-20160929
> 
> for you to fetch changes up to d18019a53a07e009899ff6b8dc5ec30f249360d9:
> 
>   perf tests: Add dwarf unwind test for powerpc (2016-09-29 11:18:21 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> User visible:
> -------------
> 
> New features:
> 
> - Add support for using symbols in address filters with Intel PT and ARM
>   CoreSight (hardware assisted tracing facilities) (Adrian Hunter, Mathieu Poirier)
> 
> Fixes:
> 
> - Fix MMAP event synthesis for pre-existing threads when no hugetlbfs
>   mount is in place (Adrian Hunter)
> 
> - Don't ignore kernel idle symbols in 'perf script' (Adrian Hunter)
> 
> - Assorted Intel PT fixes (Adrian Hunter)
> 
> Improvements:
> 
> - Fix handling of C++ symbols in 'perf probe' (Masami Hiramatsu)
> 
> - Beautify sched_[gs]et_attr return value in 'perf trace' (Arnaldo Carvalho de Melo)
> 
> Infrastructure:
> ---------------
> 
> New features:
> 
> - Add dwarf unwind 'perf test' for powerpc (Ravi Bangoria)
> 
> Fixes:
> 
> - Fix error paths in 'perf record' (Adrian Hunter)
> 
> Documentation:
> 
> - Update documentation info about quipper, a C++ parser for converting
>   to/from perf.data/chromium profiling format (Simon Que)
> 
> Build Fixes:
> 
>   Fix building in 32 bit platform with libbabeltrace (Wang Nan)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Adrian Hunter (16):
>   perf record: Fix documentation 'event_sources' -> 'event_source'
>   perf tools: Fix MMAP event synthesis broken by MAP_HUGETLB change
>   perf script: Fix vanished idle symbols
>   perf record: Rename label 'out_symbol_exit'
>   perf record: Fix error paths
>   perf symbols: Add dso__last_symbol()
>   perf record: Add support for using symbols in address filters
>   perf probe: Increase debug level of SDT debug messages
>   perf intel-pt: Fix snapshot overlap detection decoder errors
>   perf intel-pt: Add support for recording the max non-turbo ratio
>   perf intel-pt: Fix missing error codes processing auxtrace_info
>   perf intel-pt: Add a helper function for processing AUXTRACE_INFO
>   perf intel-pt: Record address filter in AUXTRACE_INFO event
>   perf intel-pt: Read address filter from AUXTRACE_INFO event
>   perf intel-pt: Enable decoder to handle TIP.PGD with missing IP
>   perf intel-pt: Fix decoding when there are address filters
> 
> Arnaldo Carvalho de Melo (1):
>   perf trace: Beautify sched_[gs]et_attr return value
> 
> Masami Hiramatsu (4):
>   perf probe: Ignore the error of finding inline instance
>   perf probe: Skip if the function address is 0
>   perf probe: Fix to cut off incompatible chars from group name
>   perf probe: Match linkage name with mangled name
> 
> Mathieu Poirier (3):
>   perf tools: Make perf_evsel__append_filter() generic
>   perf evsel: New tracepoint specific function
>   perf evsel: Add support for address filters
> 
> Ravi Bangoria (1):
>   perf tests: Add dwarf unwind test for powerpc
> 
> Simon Que (1):
>   perf tools: Update documentation info about quipper
> 
> Wang Nan (1):
>   perf data: Fix building in 32 bit platform with libbabeltrace
> 
>  tools/perf/Documentation/perf-record.txt   |  61 +-
>  tools/perf/Documentation/perf.data-file-format.txt |   6 +-
>  tools/perf/arch/powerpc/Build  |   1 +
>  tools/perf/arch/powerpc/include/arch-tests.h   |  13 +
>  tools/perf/arch/powerpc/include/perf_regs.h|   2 +
>  tools/perf/arch/powerpc/tests/Build|   4 +
>  tools/perf/arch/powerpc/tests/arch-tests.c |  15 +
>  tools/perf/arch/powerpc/tests/dwarf-unwind.c   |  62 ++
>  tools/perf/arch/powerpc/tests/regs_load.S  |  94 +++
>  tools/perf/arch/x86/util/intel-pt.c|  57 +-
>  tools/perf/builtin-record.c|  32 +-
>  tools/perf/builtin-trace.c |  10 +-
>  tools/perf/tests/Build |   2 +-
>  tools/perf/tests/dwarf-unwind.c|   2 +-
>  tools/perf/util/auxtrace.c | 737 +
>  tools/perf/util/auxtrace.h |  54 ++
>  tools/perf/util/build-id.c |   

Re: [GIT PULL 00/22] perf/core improvements and fixes

2016-10-04 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Build and test stats at the end of the message.
> 
> The following changes since commit 41aad2a6d4fcdda8d73c9739daf7a9f3f49499d6:
> 
>   Merge tag 'perf-core-for-mingo-20160929' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2016-09-29 19:09:58 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-20161003
> 
> for you to fetch changes up to b42c7369e3f451e22c2b0be5d193955498d37546:
> 
>   perf pmu-events: Add Skylake frontend MSR support (2016-10-03 21:52:01 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> - Allow vendors to provide JSON files describing PMU events, that then
>   get parsed to generate C tables that are linked against perf, allowing
>   the use of the names in their documentations, such as:
> 
>   # perf list l1d
> 
>   List of pre-defined events (to be used in -e):
> 
>   Cache:
> l1d.replacement
>  [L1D data line replacements]
> l1d_pend_miss.fb_full
>  [Cycles a demand request was blocked due to Fill Buffers inavailability]
> l1d_pend_miss.pending
>  [L1D miss oustandings duration in cycles]
> l1d_pend_miss.pending_cycles
>  [Cycles with L1D load Misses outstanding]
> l1d_pend_miss.pending_cycles_any
>  [Cycles with L1D load Misses outstanding from any thread on physical core]
> l2_trans.l1d_wb
>  [L1D writebacks that access L2 cache]
> 
>   Pipeline:
> cycle_activity.cycles_l1d_miss
>  [Cycles while L1 cache miss demand load is outstanding]
> cycle_activity.cycles_l1d_pending
>  [Cycles while L1 cache miss demand load is outstanding]
> cycle_activity.stalls_l1d_miss
>  [Execution stalls while L1 cache miss demand load is outstanding]
> cycle_activity.stalls_l1d_pending
>  [Execution stalls while L1 cache miss demand load is outstanding]
> 
>   The above example was done on a Broadwell based ThinkPad t450s after
>   downloading and installing such JSON files which will be added to the
>   tools/perf/pmu-events/ directory in a subsequent patchkit.
> 
>   Now one can use those names with -e/--event in all 'perf tools'.
>   (Andi Kleen, Sukadev Bhattiprolu)
> 
> - Add a missing pointer dereference in 'perf probe' (Colin Ian King)
> 
> - Add support for building host programs to be used in generating files
>   to be used in the build process, such as fixdep and jevents, fixing
>   the usage of these features in a cross compilation setup (Jiri Olsa)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Andi Kleen (12):
>   perf tools: Add jsmn `jasmine' JSON parser
>   perf jevents: Program to convert JSON file
>   perf tools: Support CPU id matching for x86 v2
>   perf jevents: Handle header line in mapfile
>   perf pmu: Support alias descriptions
>   perf tools: Query terminal width and use in perf list
>   perf list: Add a --no-desc flag
>   perf pmu: Add override support for event list CPUID
>   perf list jevents: Add support for event list topics
>   perf tools: Make alias matching case-insensitive
>   perf pmu-events: Fix fixed counters on Intel
>   perf pmu-events: Add Skylake frontend MSR support
> 
> Arnaldo Carvalho de Melo (1):
>   perf tools: Experiment with cppcheck
> 
> Colin Ian King (1):
>   perf probe: Check if *ptr2 is zero and not ptr2
> 
> Jiri Olsa (2):
>   tools build: Add support for host programs format
>   tools build: Make fixdep a hostprog
> 
> Sukadev Bhattiprolu (6):
>   perf pmu: Use pmu_events table to create aliases
>   perf powerpc: Support CPU ID matching for Powerpc
>   perf jevents: Add support for long descriptions
>   perf list: Support long jevents descriptions
>   perf tools: Add README for info on parsing JSON/map files
>   perf tools: Allow period= in perf stat CPU event descriptions.
> 
>  tools/build/Build  |   2 +
>  tools/build/Build.include  |   5 +
>  tools/build/Makefile   |   8 +-
>  tools/build/Makefile.build |  19 +-
>  tools/build/Makefile.include   |   4 -
>  tools/lib/subcmd/pager.c   |  16 +
>  tools/lib/subcmd/pager.h   |   1 +
>  tools/perf/Documentation/perf-list.txt |  12 +-
>  tools/perf/Makefile.perf   |  34 +-
>  tools/perf/arch/powerpc/util/header.c  |  11 +
>  tools/perf/arch/x86/util/header.c  |  24 +-
>  tools/perf/builtin-list.c  |  20 +-
>  tools/perf/pmu-events/Build|  13 +
>  tools/perf/pmu-events/README   | 147 ++
>  tools/perf/pmu-events/jevents.c| 812 
> 

Re: [GIT PULL 00/20] perf/core improvements and fixes

2016-12-06 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit e7af7b15121ca08c31a0ab9df71a41b4c53365b4:
> 
>   Merge tag 'perf-core-for-mingo-20161201' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2016-12-02 10:08:03 +0100)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-20161205
> 
> for you to fetch changes up to bec60e50af83741cde1786ab475d4bf472aed6f9:
> 
>   perf annotate: Show raw form for jump instruction with indirect target (2016-12-05 17:21:57 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> Fixes:
> 
> - Do not show a bogus target address in 'perf annotate' for targetless powerpc
>   jump instructions such as 'bctr' (Ravi Bangoria)
> 
> - tools/build fixes related to race conditions with the fixdep utility (Jiri Olsa)
> 
> - Fix building objtool with clang (Peter Foley)
> 
> Infrastructure:
> 
> - Support linking perf with clang and LLVM libraries, initially statically, but
>   this limitation will be lifted and shared libraries, when available, will
>   be preferred to the static build, that should, as with other features, be
>   enabled explicitly (Wang Nan)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Jiri Olsa (7):
>   tools build: Make fixdep parsing wait for last target
>   tools build: Make the .cmd file more readable
>   tools build: Move tabs to spaces where suitable
>   perf tools: Move install-gtk target into rules area
>   perf tools: Move python/perf.so target into rules area
>   perf tools: Cleanup build directory before each test
>   perf tools: Add non config targets
> 
> Peter Foley (1):
>   tools build: Fix objtool build with clang
> 
> Ravi Bangoria (1):
>   perf annotate: Show raw form for jump instruction with indirect target
> 
> Wang Nan (11):
>   perf tools: Pass context to perf hook functions
>   perf llvm: Extract helpers in llvm-utils.c
>   tools build: Add feature detection for LLVM
>   tools build: Add feature detection for clang
>   perf build: Add clang and llvm compile and linking support
>   perf clang: Add builtin clang support ant test case
>   perf clang: Use real file system for #include
>   perf clang: Allow passing CFLAGS to builtin clang
>   perf clang: Update test case to use real BPF script
>   perf clang: Support compile IR to BPF object and add testcase
>   perf clang: Compile BPF script using builtin clang support
> 
>  tools/build/Build.include  |  20 ++--
>  tools/build/Makefile.feature   | 138 +-
>  tools/build/feature/Makefile   | 120 +--
>  tools/build/feature/test-clang.cpp |  21 
>  tools/build/feature/test-llvm.cpp  |   8 ++
>  tools/build/fixdep.c   |   5 +-
>  tools/perf/Makefile.config |  62 +---
>  tools/perf/Makefile.perf   |  56 +++
>  tools/perf/tests/Build |   1 +
>  tools/perf/tests/builtin-test.c|   9 ++
>  tools/perf/tests/clang.c   |  46 +
>  tools/perf/tests/llvm.h|   7 ++
>  tools/perf/tests/make  |   4 +-
>  tools/perf/tests/perf-hooks.c  |  14 ++-
>  tools/perf/tests/tests.h   |   3 +
>  tools/perf/util/Build  |   2 +
>  tools/perf/util/annotate.c |   3 +
>  tools/perf/util/bpf-loader.c   |  19 +++-
>  tools/perf/util/c++/Build  |   2 +
>  tools/perf/util/c++/clang-c.h  |  43 
>  tools/perf/util/c++/clang-test.cpp |  62 
>  tools/perf/util/c++/clang.cpp  | 195 +
>  tools/perf/util/c++/clang.h|  26 +
>  tools/perf/util/llvm-utils.c   |  76 +++
>  tools/perf/util/llvm-utils.h   |   6 ++
>  tools/perf/util/perf-hooks.c   |  10 +-
>  tools/perf/util/perf-hooks.h   |   6 +-
>  tools/perf/util/util-cxx.h |  26 +
>  28 files changed, 795 insertions(+), 195 deletions(-)
>  create mode 100644 tools/build/feature/test-clang.cpp
>  create mode 100644 tools/build/feature/test-llvm.cpp
>  create mode 100644 tools/perf/tests/clang.c
>  create mode 100644 tools/perf/util/c++/Build
>  create mode 100644 tools/perf/util/c++/clang-c.h
>  create mode 100644 tools/perf/util/c++/clang-test.cpp
>  create mode 100644 tools/perf/util/c++/clang.cpp
>  create mode 100644 tools/perf/util/c++/clang.h
>  create mode 100644 tools/perf/util/util-cxx.h
> 
>   # uname -a
>   Linux jouet 4.8.8-300.fc25.x86_64 #1 SMP Tue Nov 15 18:10:06 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>   # perf test
>1: vmlinux symtab 

Re: [GIT PULL 00/29] perf/core improvements and fixes

2016-12-20 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
> Please consider pulling, I had most of this queued before your first
> pull req to Linus for 4.10, most are fixes, with 'perf sched timehist --idle'
> as a followup new feature to the 'perf sched timehist' command introduced in
> this window.
>   
>   One other thing that delayed this was the samples/bpf/ switch to
> tools/lib/bpf/ that involved fixing up merge clashes with net.git and also
> to properly test it, after more rounds than anticipated, but all seems ok
> now and would be good to get these merge issues past us ASAP.
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba:
> 
>   Merge tag 'docs-4.10' of git://git.lwn.net/linux (2016-12-12 21:58:13 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-20161220
> 
> for you to fetch changes up to 9899694a7f67714216665b87318eb367e2c5c901:
> 
>   samples/bpf: Move open_raw_sock to separate header (2016-12-20 12:00:40 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Introduce 'perf sched timehist --idle', to analyse processes
>   going to/from idle state (Namhyung Kim)
> 
> Fixes:
> 
> - Allow 'perf record -u user' to continue when facing races with threads
>   going away after having scanned them via /proc (Jiri Olsa)
> 
> - Fix 'perf mem' --all-user/--all-kernel options (Jiri Olsa)
> 
> - Support jumps with multiple arguments (Ravi Bangoria)
> 
> - Fix jumps to before the function where they are located (Ravi Bangoria)
> 
> - Fix lock-pi help string (Davidlohr Bueso)
> 
> - Fix build of 'perf trace' in odd systems such as a RHEL PPC one (Jiri Olsa)
> 
> - Do not overwrite valid build id in 'perf diff' (Kan Liang)
> 
> - Don't throw error for zero length symbols, allowing the use of the TUI
>   in PowerPC, where such symbols became more common recently (Ravi Bangoria)
> 
> Infrastructure:
> 
> - Switch of samples/bpf/ to use tools/lib/bpf, removing libbpf
>   duplication (Joe Stringer)
> 
> - Move headers check into bash script (Jiri Olsa)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (3):
>   perf tools: Remove some needless __maybe_unused
>   samples/bpf: Make perf_event_read() static
>   samples/bpf: Be consistent with bpf_load_program bpf_insn parameter
> 
> Davidlohr Bueso (1):
>   perf bench futex: Fix lock-pi help string
> 
> Jiri Olsa (7):
>   perf tools: Move headers check into bash script
>   perf mem: Fix --all-user/--all-kernel options
>   perf evsel: Use variable instead of repeating lengthy FD macro
>   perf thread_map: Add thread_map__remove function
>   perf evsel: Allow to ignore missing pid
>   perf record: Force ignore_missing_thread for uid option
>   perf trace: Check if MAP_32BIT is defined (again)
> 
> Joe Stringer (8):
>   tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
>   tools lib bpf: use __u32 from linux/types.h
>   tools lib bpf: Add flags to bpf_create_map()
>   samples/bpf: Make samples more libbpf-centric
>   samples/bpf: Switch over to libbpf
>   tools lib bpf: Add bpf_prog_{attach,detach}
>   samples/bpf: Remove perf_event_open() declaration
>   samples/bpf: Move open_raw_sock to separate header
> 
> Kan Liang (1):
>   perf diff: Do not overwrite valid build id
> 
> Namhyung Kim (6):
>   perf sched timehist: Split is_idle_sample()
>   perf sched timehist: Introduce struct idle_time_data
>   perf sched timehist: Save callchain when entering idle
>   perf sched timehist: Skip non-idle events when necessary
>   perf sched timehist: Add -I/--idle-hist option
>   perf sched timehist: Show callchains for idle stat
> 
> Ravi Bangoria (3):
>   perf annotate: Support jump instruction with target as second operand
>   perf annotate: Fix jump target outside of function address range
>   perf annotate: Don't throw error for zero length symbols
> 
>  samples/bpf/Makefile  |  70 +--
>  samples/bpf/README.rst|   4 +-
>  samples/bpf/bpf_load.c|  21 +-
>  samples/bpf/bpf_load.h|   3 +
>  samples/bpf/fds_example.c |  13 +-
>  samples/bpf/lathist_user.c|   2 +-
>  samples/bpf/libbpf.c  | 176 ---
>  samples/bpf/libbpf.h  |  28 +-
>  samples/bpf/lwt_len_hist_user.c   |   6 +-
>  samples/bpf/offwaketime_user.c|   8 +-
>  samples/bpf/sampleip_user.c   |   7 +-
>  

Re: [PATCH tip/core/rcu 2/3] srcu: Force full grace-period ordering

2017-01-14 Thread Ingo Molnar

* Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:

> On Sun, Jan 15, 2017 at 08:11:23AM +0100, Ingo Molnar wrote:
> > 
> > * Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> > 
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index 357b32aaea48..5fdfe874229e 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -1175,11 +1175,11 @@ do { \
> > >   * if the UNLOCK and LOCK are executed by the same CPU or if the
> > >   * UNLOCK and LOCK operate on the same lock variable.
> > >   */
> > > -#ifdef CONFIG_PPC
> > > +#ifdef CONFIG_ARCH_WEAK_RELACQ
> > >  #define smp_mb__after_unlock_lock()  smp_mb()  /* Full ordering for lock. */
> > > -#else /* #ifdef CONFIG_PPC */
> > > +#else /* #ifdef CONFIG_ARCH_WEAK_RELACQ */
> > >  #define smp_mb__after_unlock_lock()  do { } while (0)
> > > -#endif /* #else #ifdef CONFIG_PPC */
> > > +#endif /* #else #ifdef CONFIG_ARCH_WEAK_RELACQ */
> > >  
> > >  
> > 
> > So at the risk of sounding totally pedantic, why not structure it like the 
> > existing smp_mb__before/after*() primitives in barrier.h?
> > 
> > That allows asm-generic/barrier.h to pick up the definition - for example
> > in the case of smp_acquire__after_ctrl_dep() we do:
> > 
> >  #ifndef smp_acquire__after_ctrl_dep
> >  #define smp_acquire__after_ctrl_dep()   smp_rmb()
> >  #endif
> > 
> > Which allows Tile to relax it:
> > 
> >   arch/tile/include/asm/barrier.h:#define smp_acquire__after_ctrl_dep()   barrier()
> > 
> > I.e. I'd move the API definition out of rcupdate.h and into barrier.h -
> > even though tree-RCU is the only user of this barrier type.
> 
> I wouldn't have any problem with that, however, some time back it was
> moved into RCU because (you guessed it!) RCU is the only user.  ;-)

Indeed ...

[sounds of rummaging around in the Git tree]

I found this commit of yours from ancient history (more than a year ago!):

  commit 12d560f4ea87030667438a169912380be00cea4b
  Author: Paul E. McKenney <paul...@linux.vnet.ibm.com>
  Date:   Tue Jul 14 18:35:23 2015 -0700

      rcu,locking: Privatize smp_mb__after_unlock_lock()

      RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
      likely the only thing that ever will use it, so this commit makes this
      macro private to RCU.

      Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
      Cc: Will Deacon <will.dea...@arm.com>
      Cc: Peter Zijlstra <pet...@infradead.org>
      Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
      Cc: "linux-a...@vger.kernel.org" <linux-a...@vger.kernel.org>

So I concur and I'm fine with your patch - or with the status quo code as well.

Thanks,

Ingo


Re: [PATCH tip/core/rcu 2/3] srcu: Force full grace-period ordering

2017-01-14 Thread Ingo Molnar

* Paul E. McKenney  wrote:

> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 357b32aaea48..5fdfe874229e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1175,11 +1175,11 @@ do { \
>   * if the UNLOCK and LOCK are executed by the same CPU or if the
>   * UNLOCK and LOCK operate on the same lock variable.
>   */
> -#ifdef CONFIG_PPC
> +#ifdef CONFIG_ARCH_WEAK_RELACQ
>  #define smp_mb__after_unlock_lock()  smp_mb()  /* Full ordering for lock. */
> -#else /* #ifdef CONFIG_PPC */
> +#else /* #ifdef CONFIG_ARCH_WEAK_RELACQ */
>  #define smp_mb__after_unlock_lock()  do { } while (0)
> -#endif /* #else #ifdef CONFIG_PPC */
> +#endif /* #else #ifdef CONFIG_ARCH_WEAK_RELACQ */
>  
>  

So at the risk of sounding totally pedantic, why not structure it like the 
existing smp_mb__before/after*() primitives in barrier.h?

That allows asm-generic/barrier.h to pick up the definition - for example in
the case of smp_acquire__after_ctrl_dep() we do:

 #ifndef smp_acquire__after_ctrl_dep
 #define smp_acquire__after_ctrl_dep()   smp_rmb()
 #endif

Which allows Tile to relax it:

  arch/tile/include/asm/barrier.h:#define smp_acquire__after_ctrl_dep()   barrier()

I.e. I'd move the API definition out of rcupdate.h and into barrier.h - even 
though tree-RCU is the only user of this barrier type.
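
Applied to smp_mb__after_unlock_lock(), that structure could look roughly like
the userspace sketch below. The DEMO_ARCH_WEAK_RELACQ switch, the counter and
the helper function are hypothetical and exist only to make the selection
observable; smp_mb() is replaced by a compiler builtin fence as a stand-in.

```c
/* Counter used only by this demo to observe whether a full barrier ran. */
static int full_barriers_issued;

/* Userspace stand-in for the kernel's smp_mb(). */
#define smp_mb()						\
	do {							\
		__atomic_thread_fence(__ATOMIC_SEQ_CST);	\
		full_barriers_issued++;				\
	} while (0)

/*
 * Stand-in for an architecture's asm/barrier.h. Building with the
 * hypothetical -DDEMO_ARCH_WEAK_RELACQ selects the stronger definition.
 */
#ifdef DEMO_ARCH_WEAK_RELACQ
#define smp_mb__after_unlock_lock()	smp_mb()
#endif

/* Stand-in for asm-generic/barrier.h: fallback if the arch stayed silent. */
#ifndef smp_mb__after_unlock_lock
#define smp_mb__after_unlock_lock()	do { } while (0)
#endif

/* Callers do not need to know which definition was selected. */
static int barriers_after_handoff(void)
{
	full_barriers_issued = 0;
	smp_mb__after_unlock_lock();
	return full_barriers_issued;
}
```

With this layout the generic header carries the default and only the
architectures that need something different say so, without a config symbol
in the common code.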

Thanks,

Ingo


Re: [PATCH tip/core/rcu 2/3] srcu: Force full grace-period ordering

2017-01-15 Thread Ingo Molnar

* Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:

> On Sun, Jan 15, 2017 at 10:40:58AM +0100, Ingo Molnar wrote:
> > 
> > * Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> > 
> > > > [sounds of rummaging around in the Git tree]
> > > > 
> > > > I found this commit of yours from ancient history (more than a year 
> > > > ago!):
> > > > 
> > > >   commit 12d560f4ea87030667438a169912380be00cea4b
> > > >   Author: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > > >   Date:   Tue Jul 14 18:35:23 2015 -0700
> > > > 
> > > >       rcu,locking: Privatize smp_mb__after_unlock_lock()
> > > > 
> > > >       RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
> > > >       likely the only thing that ever will use it, so this commit makes
> > > >       this macro private to RCU.
> > > > 
> > > >       Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > > >       Cc: Will Deacon <will.dea...@arm.com>
> > > >       Cc: Peter Zijlstra <pet...@infradead.org>
> > > >       Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> > > >       Cc: "linux-a...@vger.kernel.org" <linux-a...@vger.kernel.org>
> > > > 
> > > > So I concur and I'm fine with your patch - or with the status quo code 
> > > > as well.
> > > 
> > > I already have the patch queued, so how about I keep it if I get an ack
> > > from the powerpc guys and drop it otherwise?
> > 
> > Yeah, sounds good! Your patch made me look up 'RelAcq' so it has
> > documentation value as well ;-)
> 
> ;-) ;-) ;-)
> 
> Looking forward, my guess would be that if some other code needs
> smp_mb__after_unlock_lock() or if some other architecture needs
> non-smp_mb() special handling, I should consider making it work the
> same as smp_mb__after_atomic() and friends.  Does that seem like a
> reasonable thought?

Yeah, absolutely - it's just that the pattern triggered the 'this looks a bit
too specialized' response in me, but after seeing the details (again ...) I
agree that this time is different!

Thanks,

Ingo


Re: [GIT PULL 0/6] perf/core improvements and fixes

2017-03-16 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit ffa86c2f1a8862cf58c873f6f14d4b2c3250fb48:
> 
>   Merge tag 'perf-core-for-mingo-4.12-20170314' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2017-03-15 19:27:27 +0100)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-4.12-20170316
> 
> for you to fetch changes up to 61f35d750683b21e9e3836e309195c79c1daed74:
> 
>   uprobes: Default UPROBES_EVENTS to Y (2017-03-16 12:42:02 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Add 'brstackinsn' field in 'perf script' to reuse the x86 instruction
>   decoder used in the Intel PT code to study hot paths to samples (Andi Kleen)
> 
> Kernel:
> 
> - Default UPROBES_EVENTS to Y (Alexei Starovoitov)
> 
> - Fix check for kretprobe offset within function entry (Naveen N. Rao)
> 
> Infrastructure:
> 
> - Introduce util func is_sdt_event() (Ravi Bangoria)
> 
> - Make perf_event__synthesize_mmap_events() scale on older kernels where
>   reading /proc/pid/maps is way slower than reading /proc/pid/task/pid/maps (Stephane Eranian)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Andi Kleen (1):
>   perf script: Add 'brstackinsn' for branch stacks
> 
> Arnaldo Carvalho de Melo (2):
>   tools headers: Sync {tools/,}arch/x86/include/asm/cpufeatures.h
>   uprobes: Default UPROBES_EVENTS to Y
> 
> Naveen N. Rao (1):
>   trace/kprobes: Fix check for kretprobe offset within function entry
> 
> Ravi Bangoria (1):
>   perf probe: Introduce util func is_sdt_event()
> 
> Stephane Eranian (1):
>   perf tools: Make perf_event__synthesize_mmap_events() scale
> 
>  include/linux/kprobes.h                           |   1 +
>  kernel/kprobes.c                                  |  40 ++--
>  kernel/trace/Kconfig                              |   2 +-
>  kernel/trace/trace_kprobe.c                       |   2 +-
>  tools/arch/x86/include/asm/cpufeatures.h          |   5 +-
>  tools/perf/Documentation/perf-script.txt          |  13 +-
>  tools/perf/builtin-script.c                       | 264 -
>  tools/perf/util/Build                             |   1 +
>  tools/perf/util/dump-insn.c                       |  14 ++
>  tools/perf/util/dump-insn.h                       |  22 ++
>  tools/perf/util/event.c                           |   4 +-
>  .../util/intel-pt-decoder/intel-pt-insn-decoder.c |  24 ++
>  tools/perf/util/parse-events.h                    |  20 ++
>  tools/perf/util/probe-event.c                     |   9 +-
>  14 files changed, 381 insertions(+), 40 deletions(-)
>  create mode 100644 tools/perf/util/dump-insn.c
>  create mode 100644 tools/perf/util/dump-insn.h

Pulled, thanks a lot Arnaldo!

Ingo


Re: [GIT PULL 00/19] perf/core improvements and fixes

2017-03-15 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 84e5b549214f2160c12318aac549de85f600c79a:
> 
>   Merge tag 'perf-core-for-mingo-4.11-20170306' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2017-03-07 08:14:14 +0100)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.12-20170314
> 
> for you to fetch changes up to 5f6bee34707973ea7879a7857fd63ddccc92fff3:
> 
>   kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL 
> (2017-03-14 15:17:40 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Add PERF_RECORD_NAMESPACES so that the kernel can record information
>   required to associate samples to namespaces, helping in container
>   problem characterization.
> 
>   Now 'perf record' has a --namespace option to ask for such info,
>   and when present, it can be used, initially, via a new sort order,
>   'cgroup_id', allowing histogram entry bucketization by a (device, inode)
>   based cgroup identifier (Hari Bathini)
> 
> - Add --next option to 'perf sched timehist', showing what is the next
>   thread to run (Brendan Gregg)
> 
> Fixes:
> 
> - Fix segfault with basic block 'cycles' sort dimension (Changbin Du)
> 
> - Add c2c to command-list.txt, making it appear in the 'perf help'
>   output (Changbin Du)
> 
> - Fix zeroing of 'abs_path' variable in the perf hists browser switch
>   file code (Changbin Du)
> 
> - Hide tips messages when -q/--quiet is given to 'perf report' (Namhyung Kim)
> 
> Infrastructure:
> 
> - Use ref_reloc_sym + offset to setup kretprobes (Naveen Rao)
> 
> - Ignore generated files pmu-events/{jevents,pmu-events.c} for git
>   (Changbin Du)
> 
> Documentation:
> 
> - Document +field style argument support for --field option (Changbin Du)
> 
> - Clarify 'perf c2c --stats' help message (Namhyung Kim)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Brendan Gregg (1):
>   perf sched timehist: Add --next option
> 
> Changbin Du (5):
>   perf tools: Missing c2c command in command-list
>   perf tools: Ignore generated files pmu-events/{jevents,pmu-events.c} 
> for git
>   perf sort: Fix segfault with basic block 'cycles' sort dimension
>   perf report: Document +field style argument support for --field option
>   perf hists browser: Fix typo in function switch_data_file
> 
> Hari Bathini (5):
>   perf: Add PERF_RECORD_NAMESPACES to include namespaces related info
>   perf tools: Add PERF_RECORD_NAMESPACES to include namespaces related 
> info
>   perf record: Synthesize namespace events for current processes
>   perf script: Add script print support for namespace events
>   perf tools: Add 'cgroup_id' sort order keyword
> 
> Namhyung Kim (3):
>   perf report: Hide tip message when -q option is given
>   perf c2c: Clarify help message of --stats option
>   perf c2c: Fix display bug when using pipe
> 
> Naveen N. Rao (5):
>   perf probe: Factor out the ftrace README scanning
>   perf kretprobes: Offset from reloc_sym if kernel supports it
>   perf powerpc: Choose local entry point with kretprobes
>   doc: trace/kprobes: add information about NOKPROBE_SYMBOL
>   kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL
> 
>  Documentation/trace/kprobetrace.txt |   5 +-
>  include/linux/perf_event.h  |   2 +
>  include/uapi/linux/perf_event.h |  32 +-
>  kernel/events/core.c| 139 ++
>  kernel/fork.c   |   2 +
>  kernel/kprobes.c|   5 +-
>  kernel/nsproxy.c|   3 +
>  tools/include/uapi/linux/perf_event.h   |  32 +-
>  tools/perf/.gitignore   |   2 +
>  tools/perf/Documentation/perf-record.txt|   3 +
>  tools/perf/Documentation/perf-report.txt|   7 +-
>  tools/perf/Documentation/perf-sched.txt |   4 +
>  tools/perf/Documentation/perf-script.txt|   3 +
>  tools/perf/arch/powerpc/util/sym-handling.c |  14 ++-
>  tools/perf/builtin-annotate.c   |   1 +
>  tools/perf/builtin-c2c.c|   4 +-
>  tools/perf/builtin-diff.c   |   1 +
>  tools/perf/builtin-inject.c |  13 +++
>  tools/perf/builtin-kmem.c   |   1 +
>  tools/perf/builtin-kvm.c|   2 +
>  tools/perf/builtin-lock.c   |   1 +
>  tools/perf/builtin-mem.c|   1 +
>  tools/perf/builtin-record.c |  35 ++-
>  tools/perf/builtin-report.c   
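
The 'cgroup_id' sort key described in the pull request above buckets samples
by a (device, inode) pair. As a rough illustration of why that pair works as an
identifier: namespaces(7) documents that two processes share a namespace exactly
when the /proc/<pid>/ns/<type> entries have the same st_dev and st_ino. The
sketch below is not the perf implementation; struct demo_ns_id and both helper
names are made up for this example.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical identifier: the (device, inode) pair of a namespace file. */
struct demo_ns_id {
	dev_t dev;
	ino_t ino;
};

/*
 * Fill 'id' from the file at 'path' (e.g. "/proc/self/ns/cgroup").
 * Returns 0 on success, -1 if stat(2) fails.
 */
static int demo_ns_id_read(const char *path, struct demo_ns_id *id)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;

	id->dev = st.st_dev;
	id->ino = st.st_ino;
	return 0;
}

/* Two identifiers name the same object iff both fields match. */
static int demo_ns_id_equal(const struct demo_ns_id *a,
			    const struct demo_ns_id *b)
{
	return a->dev == b->dev && a->ino == b->ino;
}
```

Reading "/proc/self/ns/cgroup" and "/proc/<other pid>/ns/cgroup" with
demo_ns_id_read() and comparing the results with demo_ns_id_equal() would tell
whether the two tasks are in the same cgroup namespace, on kernels that expose
that file.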

Re: [PATCH v5 00/15] livepatch: hybrid consistency model

2017-03-07 Thread Ingo Molnar

* Josh Poimboeuf <jpoim...@redhat.com> wrote:

>  arch/Kconfig |   6 +
>  arch/powerpc/include/asm/thread_info.h   |   4 +-
>  arch/powerpc/kernel/signal.c |   4 +
>  arch/s390/include/asm/thread_info.h  |  24 +-
>  arch/s390/kernel/entry.S |  31 +-
>  arch/x86/Kconfig |   1 +
>  arch/x86/entry/common.c  |   9 +-
>  arch/x86/include/asm/thread_info.h   |  13 +-
>  arch/x86/include/asm/unwind.h|   6 +
>  arch/x86/kernel/stacktrace.c |  96 +++-
>  arch/x86/kernel/unwind_frame.c   |   2 +

for the x86 and scheduler changes:

Acked-by: Ingo Molnar <mi...@kernel.org>

Thanks,

Ingo


Re: [GIT PULL 00/35] perf/core improvements and fixes

2017-03-06 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> From: Arnaldo Carvalho de Melo 
> 
> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 9d020d33fc1b2faa0eb35859df1381ca5dc94ffe:
> 
>   Merge branch 'linus' into perf/urgent, to resolve conflict (2017-03-02 
> 08:05:45 +0100)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.11-20170306
> 
> for you to fetch changes up to 001916b94a04809a94abb07daba6f9ace01906ba:
> 
>   perf bench numa: Add more comment for -c option (2017-03-06 12:39:30 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Allow sorting by symbol_size in 'perf report' and 'perf top'
>   (Charles Baylis)
> 
>   E.g.:
> 
>   # perf report -s symbol_size,symbol
> 
>   Samples: 9K of event 'cycles:k', Event count (approx.): 2870461623
>   Overhead  Symbol size  Symbol
> 14.55%  326  [k] flush_tlb_mm_range
>  7.20% 1045  [k] filemap_map_pages
>  5.82%  124  [k] vma_interval_tree_insert
>  5.18% 2430  [k] unmap_page_range
>  2.57%  571  [k] vma_interval_tree_remove
>  1.94%  494  [k] page_add_file_rmap
>  1.82%  740  [k] page_remove_rmap
>  1.66% 1017  [k] release_pages
>  1.57% 1636  [k] update_blocked_averages
>  1.57%   76  [k] unlock_page
> 
> - Add support for -p/--pid, -a/--all-cpus and -C/--cpu in 'perf ftrace'
>   (Namhyung Kim)
> 
> Change in behaviour:
> 
> - Make system wide (-a) the default option if no target was specified and one
>   of following conditions is met:
> 
>   - No workload specified (current behaviour)
> 
>   - A workload is specified but all requested events are system wide ones,
> like uncore ones. (Jiri Olsa)
> 
> Fixes:
> 
> - Add missing initialization to the instruction decoder used in the
>   intel PT/BTS code, which was causing lots of failures in 'perf test',
>   looking for a value when there was none (Adrian Hunter)
> 
> Infrastructure:
> 
> - Add arch code needed to adopt the kernel's refcount_t to aid in
>   catching bugs when using atomic_t as a reference counter, basically
>   cmpxchg related functions (Arnaldo Carvalho de Melo)
> 
> - Convert the code using atomic_t as reference counts to refcount_t
>   (Elena Reshetova)
> 
> - Add feature test for sched_getcpu() to more easily check for its
>   presence in the many libc implementations and across different
>   versions of such C libraries (Arnaldo Carvalho de Melo)
> 
> - Issue a HW watchdog disable hint in 'perf stat' for when some of the
>   requested events can't get counted because a PMU counter is taken by that
>   watchdog (Borislav Petkov).
> 
> - Add mapping for Intel's KnightsMill PMU events (Karol Wachowski)
> 
> Documentation:
> 
> - Clarify the term 'convergence' in:
> 
>perf bench numa numa-mem -h --show_convergence (Jiri Olsa)
> 
> Kernel code:
> 
> - Ensure probe location is at function entry in kretprobes (Naveen N. Rao)
> 
> - Allow return probes with offsets and absolute addresses (Naveen N. Rao)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Adrian Hunter (1):
>   perf intel-PT/BTS: Add missing initialization
> 
> Arnaldo Carvalho de Melo (12):
>   tools include: Adopt __compiletime_error
>   tools arch x86: Include asm/cmpxchg.h
>   tools arch x86: Introduce atomic_cmpxchg()
>   tools include: Introduce atomic_cmpxchg_{relaxed,release}()
>   tools include: Provide gcc based cmpxchg fallback for !x86
>   tools include: Add UINT_MAX def to kernel.h
>   tools include: Adopt kernel's refcount.h
>   perf evlist: Clarify a bit the use of perf_mmap->refcnt
>   tools build: Add test for sched_getcpu()
>   perf bench futex: Use __maybe_unused
>   perf bench futex: Fix build on musl + clang
>   tools build: Use the same CC for feature detection and actual build
> 
> Borislav Petkov (1):
>   perf stat: Issue a HW watchdog disable hint
> 
> Charles Baylis (1):
>   perf tools: Allow sorting by symbol size
> 
> Elena Reshetova (9):
>   perf cgroup: Convert cgroup_sel.refcnt from atomic_t to refcount_t
>   perf cpumap: Convert cpu_map.refcnt from atomic_t to refcount_t
>   perf comm: Convert comm_str.refcnt from atomic_t to refcount_t
>   perf dso: Convert dso.refcnt from atomic_t to refcount_t
>   perf map: Convert map.refcnt from atomic_t to refcount_t
>   perf map: Convert map_groups.refcnt from atomic_t to refcount_t
>   perf evlist: Convert perf_map.refcnt from atomic_t to refcount_t
>   perf thread: convert thread.refcnt from atomic_t to refcount_t
>   perf thread_map: Convert 
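
The atomic_t to refcount_t conversions listed above rely on cmpxchg-style
primitives, as the pull request notes. The following is only a sketch of the
general idea (refuse to resurrect a zero count, saturate instead of wrapping),
written with C11 <stdatomic.h> rather than the kernel or tools/ headers. All
demo_* names are hypothetical and the exact kernel semantics may differ.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for refcount_t. */
struct demo_refcount {
	atomic_uint refs;
};

static inline void demo_refcount_set(struct demo_refcount *r, unsigned int n)
{
	atomic_store(&r->refs, n);
}

/*
 * Take a reference unless the count is already zero (object being freed).
 * A saturated counter (UINT_MAX) stays pinned instead of wrapping to 0.
 */
static inline bool demo_refcount_inc_not_zero(struct demo_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	for (;;) {
		if (old == 0)
			return false;
		if (old == UINT_MAX)
			return true;
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->refs, &old, old + 1))
			return true;
	}
}

/*
 * Drop a reference. Returns true only for the caller that moved the
 * count from 1 to 0; saturated or already-zero counters are left alone.
 */
static inline bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	for (;;) {
		if (old == 0 || old == UINT_MAX)
			return false;
		if (atomic_compare_exchange_weak(&r->refs, &old, old - 1))
			return old == 1;
	}
}
```

The point of the design is that an overflow or use-after-free bug turns into a
leak or a refused increment, which is easier to catch than a silently wrapped
atomic_t.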

Re: [GIT PULL 00/19] perf/core improvements and fixes

2017-08-14 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> 
> The following changes since commit 82119cbe8e1e32cc2a941393e59816e731681310:
> 
>   Merge tag 'perf-core-for-mingo-4.14-20170801' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2017-08-10 17:07:02 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.14-20170814
> 
> for you to fetch changes up to 8fc375d7d36c72b4c2d55f5c24be022a939295d4:
> 
>   perf test shell: Add uprobes + backtrace ping test (2017-08-11 16:18:49 
> -0300)
> 
> 
> perf/core improvements and fixes:
> 
> Infrastructure:
> 
> - Do not consider empty files as valid srclines (Milian Wolff)
> 
> - Fix wrong size in perf_record_mmap for last kernel module,
>   which resulted in erroneous symbol resolution in at least s390x
>   (Thomas Richter)
> 
> - Add missing newline to expr parser error messages (Andi Kleen)
> 
> - Fix saved values rbtree lookup in 'perf stat' (Andi Kleen)
> 
> - Add support for shell based tests in 'perf test', add a few that
>   run 'perf probe', 'perf trace', using kprobes, uprobes to check
>   the output of those tools and the effects on the system, checking,
>   for instance, DWARF backtraces from uprobes (Arnaldo Carvalho de Melo)
> 
> Arch specific:
> 
> - Add ppc64le to audit uname list in the python scripting support
>   (Naveen N. Rao)
> 
> - Update POWER9 vendor events tables (Sukadev Bhattiprolu)
> 
> - Fix module symbol adjustment for s390x (Thomas Richter)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Andi Kleen (2):
>   perf stat: Fix saved values rbtree lookup
>   perf tools: Add missing newline to expr parser error messages
> 
> Arnaldo Carvalho de Melo (10):
>   perf test: Make 'list' subcommand match main 'perf test' 
> numbering/matching
>   perf test: Add 'struct test *' to the test functions
>   perf test: Add infrastructure to run shell based tests
>   perf test: Make 'list' use same filtering code as main 'perf test'
>   perf test shell: Add 'probe_vfs_getname' shell test
>   perf test shell: Install shell tests
>   perf test shell: Move vfs_getname probe function to lib
>   perf test shell: Add test using probe:vfs_getname and verifying results
>   perf test shell: Add test using vfs_getname + 'perf trace'
>   perf test shell: Add uprobes + backtrace ping test
> 
> Milian Wolff (2):
>   perf util: Take elf_name as const string in dso__demangle_sym
>   perf srcline: Do not consider empty files as valid srclines
> 
> Naveen N. Rao (1):
>   perf scripting python: Add ppc64le to audit uname list
> 
> Sukadev Bhattiprolu (2):
>   perf vendor events powerpc: remove suffix in mapfile
>   perf vendor events powerpc: Update POWER9 events
> 
> Thomas Richter (2):
>   perf record: Fix wrong size in perf_record_mmap for last kernel module
>   perf report: Fix module symbol adjustment for s390x
> 
>  tools/perf/Makefile.perf   |6 +-
>  tools/perf/arch/s390/util/sym-handling.c   |7 +
>  tools/perf/arch/x86/include/arch-tests.h   |   11 +-
>  tools/perf/arch/x86/tests/insn-x86.c   |2 +-
>  tools/perf/arch/x86/tests/intel-cqm.c  |2 +-
>  tools/perf/arch/x86/tests/perf-time-to-tsc.c   |2 +-
>  tools/perf/arch/x86/tests/rdpmc.c  |2 +-
>  tools/perf/pmu-events/arch/powerpc/mapfile.csv |   20 +-
>  .../perf/pmu-events/arch/powerpc/power9/cache.json |  191 +-
>  .../arch/powerpc/power9/floating-point.json|   42 +-
>  .../pmu-events/arch/powerpc/power9/frontend.json   |  517 ++--
>  .../pmu-events/arch/powerpc/power9/marked.json |  905 +++
>  .../pmu-events/arch/powerpc/power9/memory.json |  178 +-
>  .../perf/pmu-events/arch/powerpc/power9/other.json | 2768 
> 
>  .../pmu-events/arch/powerpc/power9/pipeline.json   |  779 +++---
>  tools/perf/pmu-events/arch/powerpc/power9/pmc.json |  167 +-
>  .../arch/powerpc/power9/translation.json   |  314 +--
>  .../python/Perf-Trace-Util/lib/Perf/Trace/Util.py  |1 +
>  tools/perf/tests/attr.c|2 +-
>  tools/perf/tests/backward-ring-buffer.c|2 +-
>  tools/perf/tests/bitmap.c  |2 +-
>  tools/perf/tests/bp_signal.c   |2 +-
>  tools/perf/tests/bp_signal_overflow.c  |2 +-
>  tools/perf/tests/bpf.c |4 +-
>  tools/perf/tests/builtin-test.c|  184 +-
>  tools/perf/tests/clang.c   |4 +-
>  

Re: [GIT PULL 00/25] perf/core improvements and fixes

2017-06-21 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 007b811b4041989ec2dc91b9614aa2c41332723e:
> 
>   Merge tag 'perf-core-for-mingo-4.13-20170719' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2017-06-20 10:49:08 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.13-20170621
> 
> for you to fetch changes up to 701516ae3dec801084bc913d21e03fce15c61a0b:
> 
>   perf script: Fix message because field list option is -F not -f (2017-06-21 
> 11:35:53 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Add support to measure SMI cost in 'perf stat' (Kan Liang)
> 
> - Add support for unwinding callchains in powerpc with libdw (Paolo Bonzini)
> 
> Fixes:
> 
> - Fix message: cpu list option is -C not -c (Adrian Hunter)
> 
> - Fix 'perf script' message: field list option is -F not -f (Adrian Hunter)
> 
> - Intel PT fixes: (Adrian Hunter)
> 
>   o Fix missing stack clear
>   o Ensure IP is zero when state is INTEL_PT_STATE_NO_IP
>   o Fix last_ip usage
>   o Ensure never to set 'last_ip' when packet 'count' is zero
>   o Clear FUP flag on error
>   o Fix transactions_sample_type
> 
> Infrastructure:
> 
> - Intel PT cleanups/refactorings (Adrian Hunter)
> 
>   o Use FUP always when scanning for an IP
>   o Add missing __fallthrough
>   o Remove redundant initial_skip checks
>   o Allow decoding with branch tracing disabled
>   o Add default config for pass-through branch enable
>   o Add documentation for new config terms
>   o Add decoder support for ptwrite and power event packets
>   o Add reserved byte to CBR packet payload
>   o Add decoder support for CBR events
> 
> - Move find_process() to the only place that uses it, skimming some
>   more fat from util.[ch] (Arnaldo Carvalho de Melo)
> 
> - Do parameter validation earlier on fetch_kernel_version()
>   (Arnaldo Carvalho de Melo)
> 
> - Remove unused _ALL_SOURCE define (Arnaldo Carvalho de Melo)
> 
> - Add sysfs__write_int function (Kan Liang)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Adrian Hunter (19):
>   perf intel-pt: Move decoder error setting into one condition
>   perf intel-pt: Improve sample timestamp
>   perf intel-pt: Fix missing stack clear
>   perf intel-pt: Ensure IP is zero when state is INTEL_PT_STATE_NO_IP
>   perf intel-pt: Fix last_ip usage
>   perf intel-pt: Ensure never to set 'last_ip' when packet 'count' is zero
>   perf intel-pt: Use FUP always when scanning for an IP
>   perf intel-pt: Clear FUP flag on error
>   perf intel-pt: Add missing __fallthrough
>   perf intel-pt: Allow decoding with branch tracing disabled
>   perf intel-pt: Add default config for pass-through branch enable
>   perf intel-pt: Add documentation for new config terms
>   perf intel-pt: Add decoder support for ptwrite and power event packets
>   perf intel-pt: Add reserved byte to CBR packet payload
>   perf intel-pt: Add decoder support for CBR events
>   perf intel-pt: Remove redundant initial_skip checks
>   perf intel-pt: Fix transactions_sample_type
>   perf tools: Fix message because cpu list option is -C not -c
>   perf script: Fix message because field list option is -F not -f
> 
> Arnaldo Carvalho de Melo (3):
>   perf evsel: Adopt find_process()
>   perf tools: Do parameter validation earlier on fetch_kernel_version()
>   perf tools: Remove unused _ALL_SOURCE define
> 
> Kan Liang (2):
>   tools lib api fs: Add sysfs__write_int function
>   perf stat: Add support to measure SMI cost
> 
> Paolo Bonzini (1):
>   perf unwind: Support for powerpc
> 
>  tools/lib/api/fs/fs.c  |  30 +++
>  tools/lib/api/fs/fs.h  |   4 +
>  tools/perf/Documentation/intel-pt.txt  |  36 +++
>  tools/perf/Documentation/perf-stat.txt |  14 +
>  tools/perf/Makefile.config |   2 +-
>  tools/perf/arch/powerpc/util/Build |   2 +
>  tools/perf/arch/powerpc/util/unwind-libdw.c|  73 ++
>  tools/perf/arch/x86/util/intel-pt.c|   5 +
>  tools/perf/builtin-script.c|   2 +-
>  tools/perf/builtin-stat.c  |  49 
>  tools/perf/util/evsel.c|  39 +++
>  .../perf/util/intel-pt-decoder/intel-pt-decoder.c  | 290 
> +++--
>  .../perf/util/intel-pt-decoder/intel-pt-decoder.h  |  13 +
>  .../util/intel-pt-decoder/intel-pt-pkt-decoder.c   | 110 +++-
>  

Re: [PATCH 5/6] mm, x86: Add ARCH_HAS_ZONE_DEVICE

2017-05-23 Thread Ingo Molnar

* Oliver O'Halloran <ooh...@gmail.com> wrote:

> Currently ZONE_DEVICE depends on X86_64. This is fine for now, but it
> will get unwieldy as new platforms get ZONE_DEVICE support. Move it
> to an arch-selected Kconfig option to save us some trouble in the
> future.
> 
> Cc: x...@kernel.org
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>

Acked-by: Ingo Molnar <mi...@kernel.org>

Thanks,

Ingo


Re: [GIT PULL 00/13] perf/core improvements and fixes

2017-09-04 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 1b2f76d77a277bb70d38ad0991ed7f16bbc115a9:
> 
>   Merge tag 'perf-core-for-mingo-4.14-20170829' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2017-08-29 23:13:56 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.14-20170901
> 
> for you to fetch changes up to eba9fac017617e685d648339e29a1453a30cb065:
> 
>   perf annotate browser: Help for cycling thru hottest instructions with 
> TAB/shift+TAB (2017-09-01 14:55:40 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> - Support syscall name glob matching in 'perf trace'
>   (Arnaldo Carvalho de Melo)
> 
>   e.g.:
> 
>    # perf trace -e pkey_*
>    32.784 (0.006 ms): pkey/16018 pkey_alloc(init_val: DISABLE_WRITE) = -1 EINVAL Invalid argument
>    32.795 (0.004 ms): pkey/16018 pkey_mprotect(start: 0x7f380d0a6000, len: 4096, prot: READ|WRITE, pkey: -1) = 0
>    32.801 (0.002 ms): pkey/16018 pkey_free(pkey: -1) = -1 EINVAL Invalid argument
>    ^C#
> 
> - Do not auto merge counts for explicitly specified events in
>   'perf stat' (Arnaldo Carvalho de Melo)
> 
> - Fix syntax in documentation of .perfconfig intel-pt option (Jack Henschel)
> 
> - Calculate the average cycles of iterations for loops detected by the
>   branch history support in 'perf report' (Jin Yao)
> 
> - Support PERF_SAMPLE_PHYS_ADDR as a sort key "phys_daddr" in the 'script',
>   'mem', 'top' and 'report'. Also add a test entry for it in 'perf test'
>   (Kan Liang)
> 
> - Fix 'Object code reading' 'perf test' entry in PowerPC (Ravi Bangoria)
> 
> - Remove some duplicate Power9 vendor events (described in JSON
>   files) (Sukadev Bhattiprolu)
> 
> - Add help entry in the TUI annotate browser about cycling thru hottest
>   instructions with TAB/shift+TAB (Arnaldo Carvalho de Melo)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (4):
>   perf syscalltbl: Support glob matching on syscall names
>   perf trace: Support syscall name globbing
>   perf stat: Only auto-merge events that are PMU aliases
>   perf annotate browser: Help for cycling thru hottest instructions with 
> TAB/shift+TAB
> 
> Jack Henschel (1):
>   perf intel-pt: Fix syntax in documentation of config option
> 
> Jin Yao (1):
>   perf report: Calculate the average cycles of iterations
> 
> Kan Liang (5):
>   perf tools: Support new sample type for physical address
>   perf sort: Add sort option for physical address
>   perf mem: Support physical address
>   perf script: Support physical address
>   perf test: Add test case for PERF_SAMPLE_PHYS_ADDR
> 
> Ravi Bangoria (1):
>   perf test powerpc: Fix 'Object code reading' test
> 
> Sukadev Bhattiprolu (1):
>   perf vendor events powerpc: Remove duplicate events
> 
>  tools/include/uapi/linux/perf_event.h  |   4 +-
>  tools/perf/Documentation/intel-pt.txt  |   2 +-
>  tools/perf/Documentation/perf-mem.txt  |   4 +
>  tools/perf/Documentation/perf-record.txt   |   5 +-
>  tools/perf/Documentation/perf-report.txt   |   1 +
>  tools/perf/Documentation/perf-script.txt   |   2 +-
>  tools/perf/Documentation/perf-trace.txt|   2 +-
>  tools/perf/builtin-mem.c   |  97 -
>  tools/perf/builtin-record.c|   2 +
>  tools/perf/builtin-script.c|  15 ++-
>  tools/perf/builtin-stat.c  |   2 +-
>  tools/perf/builtin-trace.c |  39 ++-
>  tools/perf/perf.h  |   1 +
>  .../pmu-events/arch/powerpc/power9/frontend.json   |   7 +-
>  .../perf/pmu-events/arch/powerpc/power9/other.json | 120 
> -
>  .../pmu-events/arch/powerpc/power9/pipeline.json   |   7 +-
>  tools/perf/pmu-events/arch/powerpc/power9/pmc.json |   7 +-
>  tools/perf/tests/code-reading.c|   5 +
>  tools/perf/tests/sample-parsing.c  |   6 +-
>  tools/perf/ui/browsers/annotate.c  |   3 +-
>  tools/perf/ui/browsers/hists.c |   8 +-
>  tools/perf/ui/stdio/hist.c |  10 +-
>  tools/perf/util/callchain.c|  49 -
>  tools/perf/util/callchain.h|   9 +-
>  tools/perf/util/event.h|   1 +
>  tools/perf/util/evsel.c|  19 +++-
>  tools/perf/util/evsel.h|   1 +
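
The 'perf trace -e pkey_*' example in the pull request above selects every
syscall whose name matches a glob. A minimal sketch of that kind of lookup,
assuming a small hard-coded table and POSIX fnmatch(3) in place of perf's own
syscall table and glob helper (the table contents and the demo_* names are
invented for illustration):

```c
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical, abbreviated syscall name table. */
static const char *const demo_syscalls[] = {
	"mprotect",
	"pkey_alloc",
	"pkey_free",
	"pkey_mprotect",
	"read",
	"write",
};

/*
 * Return the index of the first table entry at or after 'start' whose
 * name matches 'glob', or -1 when there are no more matches. Calling it
 * again with (result + 1) walks all matches.
 */
static int demo_syscall_glob_next(const char *glob, int start)
{
	size_t n = sizeof(demo_syscalls) / sizeof(demo_syscalls[0]);
	size_t i;

	for (i = start < 0 ? 0 : (size_t)start; i < n; i++) {
		if (fnmatch(glob, demo_syscalls[i], 0) == 0)
			return (int)i;
	}
	return -1;
}
```

With this table, repeatedly calling demo_syscall_glob_next("pkey_*", ...)
visits pkey_alloc, pkey_free and pkey_mprotect, while a plain name such as
"mprotect" only matches itself.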
>  

Re: [PATCH 0/4] PCI: Cleanup unused stuff

2017-10-07 Thread Ingo Molnar

* Bjorn Helgaas <helg...@kernel.org> wrote:

> Sorry for the long cc list.  These are pretty trivial; they just remove
> some unnecessary declarations across several arches.
> 
> ---
> 
> Bjorn Helgaas (4):
>   PCI: Remove redundant pcibios_set_master() declarations
>   PCI: Remove redundant pci_dev, pci_bus, resource declarations
>   PCI: Remove unused declarations
>   alpha/PCI: Make pdev_save_srm_config() static
> 
> 
>  arch/alpha/include/asm/pci.h            |  5 -
>  arch/alpha/kernel/pci.c                 | 11 ++-
>  arch/alpha/kernel/pci_impl.h            |  8
>  arch/cris/include/asm/pci.h             |  9 -
>  arch/frv/include/asm/pci.h              |  4
>  arch/ia64/include/asm/pci.h             |  4
>  arch/mips/include/asm/pci.h             |  4
>  arch/mn10300/include/asm/pci.h          |  4
>  arch/mn10300/unit-asb2305/pci-asb2305.h |  3 ---
>  arch/parisc/include/asm/pci.h           |  8
>  arch/powerpc/include/asm/pci.h          |  2 --
>  arch/sh/include/asm/pci.h               |  4
>  arch/sparc/include/asm/pci_32.h         |  2 --
>  arch/x86/include/asm/pci.h              |  2 --
>  arch/xtensa/include/asm/pci.h           |  2 --
>  15 files changed, 10 insertions(+), 62 deletions(-)

Nice cleanups! For the whole series:

  Reviewed-by: Ingo Molnar <mi...@kernel.org>

Thanks,

Ingo



Re: [PATCH v9 0/6] add support for relative references in special sections

2018-07-03 Thread Ingo Molnar


* Ard Biesheuvel  wrote:

> On 27 June 2018 at 17:15, Will Deacon  wrote:
> > Hi Ard,
> >
> > On Tue, Jun 26, 2018 at 08:27:55PM +0200, Ard Biesheuvel wrote:
> >> This adds support for emitting special sections such as initcall arrays,
> >> PCI fixups and tracepoints as relative references rather than absolute
> >> references. This reduces the size by 50% on 64-bit architectures, but
> >> more importantly, it removes the need for carrying relocation metadata
> >> for these sections in relocatable kernels (e.g., for KASLR) that needs
> >> to be fixed up at boot time. On arm64, this reduces the vmlinux footprint
> >> of such a reference by 8x (8 byte absolute reference + 24 byte RELA entry
> >> vs 4 byte relative reference)
> >>
> >> Patch #3 was sent out before as a single patch. This series supersedes
> >> the previous submission. This version makes relative ksymtab entries
> >> dependent on the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS rather
> >> than trying to infer from kbuild test robot replies for which architectures
> >> it should be blacklisted.
> >>
> >> Patch #1 introduces the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS,
> >> and sets it for the main architectures that are expected to benefit the
> >> most from this feature, i.e., 64-bit architectures or ones that use
> >> runtime relocations.
> >>
> >> Patch #2 adds support for #define'ing __DISABLE_EXPORTS to get rid of
> >> ksymtab/kcrctab sections in decompressor and EFI stub objects when
> >> rebuilding existing C files to run in a different context.
> >
> > I had a small question on patch 3, but it's really for my understanding.
> > So, for patches 1-3:
> >
> > Reviewed-by: Will Deacon 
> >
> 
> Thanks all.
> 
> Thomas, Ingo,
> 
> Except for the below tweak against patch #3 for powerpc, which may
> apparently get confused by an input section called .discard without
> any suffixes, this series is good to go, but requires your ack to
> proceed, so I would like to ask you to share your comments and/or
> objections. Also, any suggestions or recommendations regarding the
> route these patches should take are highly appreciated.

LGTM:

Acked-by: Ingo Molnar 

Regarding route - I suspect -mm would be good, or any other tree that does a
lot of cross-arch testing?

Thanks,

Ingo
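
The cover letter quoted above compares an 8 byte absolute reference plus a
24 byte RELA entry (32 bytes in total) against a single 4 byte relative
reference, which is where the 8x figure comes from. A place-relative reference
stores the signed distance from the entry itself to its target, so it needs no
fixup when the image is moved as a whole. The sketch below shows the idea in
plain C; struct rel_ref and both helpers are hypothetical names for this
example, not the interfaces added by the series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: 32-bit offset from this field to the target. */
struct rel_ref {
	int32_t offset;
};

/* Recover the target: address of the offset field plus the stored offset. */
static inline void *rel_ref_resolve(const struct rel_ref *ref)
{
	return (void *)((uintptr_t)&ref->offset + (intptr_t)ref->offset);
}

/*
 * Store a reference to 'target'. Returns 0 on success, or -1 when the
 * target is further than a signed 32-bit offset can express.
 */
static inline int rel_ref_encode(struct rel_ref *ref, const void *target)
{
	intptr_t delta = (intptr_t)((uintptr_t)target -
				    (uintptr_t)&ref->offset);

	if (delta < INT32_MIN || delta > INT32_MAX)
		return -1;

	ref->offset = (int32_t)delta;
	return 0;
}
```

Because only the difference between two addresses inside the same image is
stored, moving the whole image by a constant (as KASLR does) leaves every
entry valid, which is why no relocation metadata has to be carried for such
sections.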


Re: [PATCH v11 0/3] mm, x86, powerpc: Enhancements to Memory Protection Keys.

2018-01-30 Thread Ingo Molnar

* Ram Pai <linux...@us.ibm.com> wrote:

> This patch series provides arch-neutral enhancements to
> enable memory-keys on new architectures, and the corresponding
> changes in x86 and powerpc specific code to support that.
> 
> a) Provides ability to support up to 32 keys.  PowerPC
>    can handle 32 keys and hence needs this.
> 
> b) Arch-neutral code, and not the arch-specific code,
>    determines the format of the string that displays the key
>    for each vma in smaps.
> 
> PowerPC implementation of memory-keys is now in powerpc/next tree.
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=92e3da3cf193fd27996909956c12a23c0333da44

All three patches look sane to me. If you would like to carry these generic
bits in the PowerPC tree as well then:

  Reviewed-by: Ingo Molnar <mi...@kernel.org>

Thanks,

Ingo


Re: [PATCH] headers: untangle kmemleak.h from mm.h

2018-02-11 Thread Ingo Molnar

* Randy Dunlap <rdun...@infradead.org> wrote:

> From: Randy Dunlap <rdun...@infradead.org>
> 
> Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
> reason. It looks like it's only a convenience, so remove kmemleak.h
> from slab.h and add <linux/kmemleak.h> to any users of kmemleak_*
> that don't already #include it.
> Also remove <linux/kmemleak.h> from source files that do not use it.
> 
> This is tested on i386 allmodconfig and x86_64 allmodconfig. It
> would be good to run it through the 0day bot for other $ARCHes.
> I have neither the horsepower nor the storage space for the other
> $ARCHes.
> 
> [slab.h is the second most used header file after module.h; kernel.h
> is right there with slab.h. There could be some minor error in the
> counting due to some #includes having comments after them and I
> didn't combine all of those.]
> 
> This is Lingchi patch #1 (death by a thousand cuts, applied to kernel
> header files).
> 
> Signed-off-by: Randy Dunlap <rdun...@infradead.org>

Nice find:

Reviewed-by: Ingo Molnar <mi...@kernel.org>

I agree that it needs to go through 0-day to find any hidden dependencies we
might have grown due to this.

Thanks,

Ingo


Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

2018-02-05 Thread Ingo Molnar

* Mathieu Desnoyers  wrote:

>  
> +config ARCH_HAS_MEMBARRIER_HOOKS
> + bool

Yeah, so I have renamed this to ARCH_HAS_MEMBARRIER_CALLBACKS, and propagated
it through the rest of the patches. "Callback" is the canonical name, and I
also cringe every time I see 'hook'.

Please let me know if there are any big objections against this minor cleanup.

Thanks,

Ingo


Re: [GIT PULL 00/41] perf/core improvements and fixes

2018-02-17 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, this is on top of tip/perf/urgent.
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 297f9233b53a08fd457815e19f1d6f2c3389857b:
> 
>   kprobes: Propagate error from disarm_kprobe_ftrace() (2018-02-16 09:12:58 
> +0100)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.17-20180216
> 
> for you to fetch changes up to 21316ac6803d4a1aadd74b896db8d60a92cd1140:
> 
>   perf tests shell lib: Use a wildcard to remove the vfs_getname probe 
> (2018-02-16 15:31:12 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> - Fix wrong jump arrow in systems with branch records with cycles,
>   i.e. Intel's >= Skylake (Jin Yao)
> 
> - Fix 'perf record --per-thread' problem introduced when
>   implementing 'perf stat --per-thread' (Jin Yao)
> 
> - Use arch__compare_symbol_names() to fix 'perf test vmlinux',
>   that was using strcmp(symbol names) while the dso routines
>   doing symbol lookups used the arch overridable one, making
>   this test fail in architectures that overrode that function
>   with something other than strcmp() (Jiri Olsa)
> 
> - Add 'perf script --show-round-event' to display
>   PERF_RECORD_FINISHED_ROUND entries (Jiri Olsa)
> 
> - Fix dwarf unwind for stripped binaries in 'perf test' (Jiri Olsa)
> 
> - Use ordered_events for 'perf report --tasks', otherwise we may get
>   artifacts when PERF_RECORD_FORK gets processed before PERF_RECORD_COMM
>   (when they got recorded in different CPUs) (Jiri Olsa)
> 
> - Add support to display group output for non group events, i.e.
>   now when one uses 'perf report --group' on a perf.data file
>   recorded without explicitly grouping events with {} (e.g.
>   "perf record -e '{cycles,instructions}'"), one gets the same output
>   that grouping would produce, i.e. sees all those non-grouped events in
>   multiple columns, at the same time (Jiri Olsa)
> 
> - Skip non-address kallsyms entries, e.g. '(null)' for !root (Jiri Olsa)
> 
> - Kernel maps fixes wrt perf.data(report) versus live system (top)
>   (Jiri Olsa)
> 
> - Fix memory corruption when using 'perf record -j call -g -a '
>   followed by 'perf report --branch-history' (Jiri Olsa)
> 
> - ARM CoreSight fixes (Mathieu Poirier)
> 
> - Add inject capability for CoreSight Traces (Robert Walker)
> 
> - Update documentation for use of 'perf' + ARM CoreSight (Robert Walker)
> 
> - Man pages fixes (Sangwon Hong, Jaecheol Shin)
> 
> - Fix some 'perf test' cases on s/390 and x86_64 (some backtraces
>   changed with a glibc update) (Thomas Richter)
> 
> - Add detailed CPUID info in the 'perf.data' headers for s/390 to
>   then use it in 'perf annotate' (Thomas Richter)
> 
> - Add '--interval-count N' to 'perf stat', to use with -I, i.e.
>   'perf stat -I 1000 --interval-count 2' will show stats every
>   1000ms, two times (yuzhoujian)
> 
> - Add 'perf stat --timeout Nms', that will run for that many
>   milliseconds and then stop, printing the counters (yuzhoujian)
> 
> - Fix description for 'perf report --mem-mode' (Andi Kleen)
> 
> - Use a wildcard to remove the vfs_getname probe in the
>   'perf test' shell based test cases (Arnaldo Carvalho de Melo)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Andi Kleen (1):
>   perf report: Fix description for --mem-mode
> 
> Arnaldo Carvalho de Melo (1):
>   perf tests shell lib: Use a wildcard to remove the vfs_getname probe
> 
> Jaecheol Shin (1):
>   perf annotate: Add missing arguments in Man page
> 
> Jin Yao (2):
>   perf tools: Use target->per_thread and target->system_wide flags
>   perf report: Fix wrong jump arrow
> 
> Jiri Olsa (18):
>   perf record: Put new line after target override warning
>   perf script: Add --show-round-event to display PERF_RECORD_FINISHED_ROUND
>   tools lib api fs: Add filename__read_xll function
>   tools lib api fs: Add sysfs__read_xll function
>   perf tests: Fix dwarf unwind for stripped binaries
>   perf tools: Fix comment for sort__* compare functions
>   perf report: Ask for ordered events for --tasks option
>   perf report: Add support to display group output for non group events
>   tools lib symbol: Skip non-address kallsyms line
>   perf symbols: Check if we read regular file in dso__load()
>   perf machine: Free root_dir in machine__init() error path
>   perf machine: Move kernel mmap name into struct machine
>   perf machine: Generalize machine__set_kernel_mmap()
>   perf machine: Don't search for active kernel start in __machine__create_kernel_maps
>   perf machine: Remove machine__load_kallsyms()
>   perf tools: Do not create kernel maps in 

Re: [PATCH v12 01/22] selftests/x86: Move protecton key selftest to arch neutral directory

2018-02-21 Thread Ingo Molnar

* Ram Pai <linux...@us.ibm.com> wrote:

> cc: Dave Hansen <dave.han...@intel.com>
> cc: Florian Weimer <fwei...@redhat.com>
> Signed-off-by: Ram Pai <linux...@us.ibm.com>
> ---
>  tools/testing/selftests/vm/Makefile           |    1 +
>  tools/testing/selftests/vm/pkey-helpers.h     |  223 
>  tools/testing/selftests/vm/protection_keys.c  | 1407 +
>  tools/testing/selftests/x86/Makefile          |    2 +-
>  tools/testing/selftests/x86/pkey-helpers.h    |  223 
>  tools/testing/selftests/x86/protection_keys.c | 1407 -
>  6 files changed, 1632 insertions(+), 1631 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
>  create mode 100644 tools/testing/selftests/vm/protection_keys.c
>  delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
>  delete mode 100644 tools/testing/selftests/x86/protection_keys.c

Acked-by: Ingo Molnar <mi...@kernel.org>

Thanks,

Ingo


Re: [PATCH v6 2/8] module: use relative references for __ksymtab entries

2017-12-28 Thread Ingo Molnar

* Ard Biesheuvel  wrote:

> Annoyingly, we need this because there is a single instance of a
> special section that ends up in the EFI stub code: we build lib/sort.c
> again as a EFI libstub object, and given that sort() is exported, we
> end up with a ksymtab section in the EFI stub. The sort() thing has
> caused issues before [0], so perhaps I should just clone sort.c into
> drivers/firmware/efi/libstub and get rid of that hack.
> 
> [0] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=29f9007b3182ab3f328a31da13e6b1c9072f7a95

If the root problem is early bootstrap code randomly using a generic facility that
isn't __init, then we should definitely improve tooling to at least detect these
problems.

As bootstrap code gets improved (KASLR, more complex decompression, etc. etc.) we
keep using new bits of generic facilities...

So this should definitely not be hidden by open coding that function (which has 
various other disadvantages as well), but should be turned from silent breakage 
either into non-breakage (and do so not only for sort() but for other generic 
functions as well), or should be turned into a build failure.

Thanks,

Ingo


Re: [GIT PULL 0/5] perf/urgent fixes

2018-07-30 Thread Ingo Molnar


* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, just to get the build without warnings
> and finishing successfully in all my test environments,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 7f635ff187ab6be0b350b3ec06791e376af238ab:
> 
>   perf/core: Fix crash when using HW tracing kernel filters (2018-07-25 11:46:22 +0200)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-urgent-for-mingo-4.18-20180730
> 
> for you to fetch changes up to 44fe619b1418ff4e9d2f9518a940fbe2fb686a08:
> 
>   perf tools: Fix the build on the alpine:edge distro (2018-07-30 13:15:03 -0300)
> 
> 
> perf/urgent fixes: (Arnaldo Carvalho de Melo)
> 
> - Update the tools copy of several files, including perf_event.h,
>   powerpc's asm/unistd.h (new io_pgetevents syscall), bpf.h and
>   x86's memcpy_64.s (used in 'perf bench mem'), silencing the
>   respective warnings during the perf tools build.
> 
> - Fix the build on the alpine:edge distro.
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (5):
>   tools headers uapi: Update tools's copy of linux/perf_event.h
>   tools headers powerpc: Update asm/unistd.h copy to pick new
>   tools headers uapi: Refresh linux/bpf.h copy
>   tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
>   perf tools: Fix the build on the alpine:edge distro
> 
>  tools/arch/powerpc/include/uapi/asm/unistd.h |   1 +
>  tools/arch/x86/include/asm/mcsafe_test.h |  13 
>  tools/arch/x86/lib/memcpy_64.S   | 112 +--
>  tools/include/uapi/linux/bpf.h   |  28 +--
>  tools/include/uapi/linux/perf_event.h|   2 +
>  tools/perf/arch/x86/util/pmu.c   |   1 +
>  tools/perf/arch/x86/util/tsc.c   |   1 +
>  tools/perf/bench/Build   |   1 +
>  tools/perf/bench/mem-memcpy-x86-64-asm.S |   1 +
>  tools/perf/bench/mem-memcpy-x86-64-lib.c |  24 ++
>  tools/perf/perf.h|   1 +
>  tools/perf/util/header.h |   1 +
>  tools/perf/util/namespaces.h |   1 +
>  13 files changed, 124 insertions(+), 63 deletions(-)
>  create mode 100644 tools/arch/x86/include/asm/mcsafe_test.h
>  create mode 100644 tools/perf/bench/mem-memcpy-x86-64-lib.c

Pulled, thanks a lot Arnaldo!

Ingo


Re: [GIT PULL 00/21] perf/core improvements and fixes

2018-08-02 Thread Ingo Molnar


* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, contains a recently merged
> tip/perf/urgent,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit c2586cfbb905939b79b49a9121fb0a59a5668fd6:
> 
>   Merge remote-tracking branch 'tip/perf/urgent' into perf/core (2018-07-31 09:55:45 -0300)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-4.19-20180801
> 
> for you to fetch changes up to b912885ab75c7c8aa841c615108afd755d0b97f8:
> 
>   perf trace: Do not require --no-syscalls to suppress strace like output (2018-08-01 16:20:28 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> perf trace: (Arnaldo Carvalho de Melo)
> 
> - Do not require --no-syscalls to suppress strace like output, i.e.
> 
>  # perf trace -e sched:*switch
> 
>   will show just sched:sched_switch events, not strace-like formatted
>   syscall events, use --syscalls to get the previous behaviour.
> 
>   If instead:
> 
>  # perf trace
> 
>   is used, i.e. no events specified, then --syscalls is implied and
>   system wide strace like formatting will be applied to all syscalls.
> 
>   The behaviour when just a syscall subset is used with '-e' is unchanged:
> 
>  # perf trace -e *sleep,sched:*switch
> 
>   will work as before: just the 'nanosleep' syscall will be strace-like
>   formatted plus the sched:sched_switch tracepoint event, system wide.
> 
> - Allow string table generators to use a default header dir, allowing
>   use of them without parameters to see the table it generates on
>   stdout, e.g.:
> 
> $ tools/perf/trace/beauty/kvm_ioctl.sh
> static const char *kvm_ioctl_cmds[] = {
> [0x00] = "GET_API_VERSION",
> [0x01] = "CREATE_VM",
> [0x02] = "GET_MSR_INDEX_LIST",
> [0x03] = "CHECK_EXTENSION",
> 
> [0xe0] = "CREATE_DEVICE",
> [0xe1] = "SET_DEVICE_ATTR",
> [0xe2] = "GET_DEVICE_ATTR",
> [0xe3] = "HAS_DEVICE_ATTR",
> };
> $
> 
>   See 'ls tools/perf/trace/beauty/*.sh' to see the available string
>   table generators.
> 
> - Add a generator for IPPROTO_ socket's protocol constants.
> 
> perf record: (Kan Liang)
> 
> - Fix error out while applying initial delay and using LBR, due to
>   the use of a PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY event to track
>   PERF_RECORD_MMAP events while waiting for the initial delay. Such
>   events fail when configured asking PERF_SAMPLE_BRANCH_STACK in
>   perf_event_attr.sample_type.
> 
> perf c2c: (Jiri Olsa)
> 
> - Fix report crash for empty browser, when processing a perf.data file
>   without events of interest, either because not asked for in
>   'perf record' or because the workload didn't trigger such events.
> 
> perf list: (Michael Petlan)
> 
> - Align metric group description format with PMU event description.
> 
> perf tests: (Sandipan Das)
> 
> - Fix indexing when invoking subtests, which caused BPF tests to
>   get results for the next test in the list, with the last one
>   reporting a failure.
> 
> eBPF:
> 
> - Fix installation directory for header files included from eBPF proggies,
>   avoiding clashing with relative paths used to build other software projects
>   such as glibc. (Thomas Richter)
> 
> - Show better message when failing to load an object. (Arnaldo Carvalho de Melo)
> 
> General: (Christophe Leroy)
> 
> - Allow overriding MAX_NR_CPUS at compile time, to make the tooling
>   usable in systems with less memory, in time this has to be changed
>   to properly allocate based on _NPROCESSORS_ONLN.
> 
> Architecture specific:
> 
> - Update arm64's ThunderX2 implementation defined pmu core events (Ganapatrao Kulkarni)
> 
> - Fix complex event name parsing in 'perf test' for PowerPC, where the 'umask' event
>   modifier isn't present. (Sandipan Das)
> 
> CoreSight ARM hardware tracing: (Leo Yan)
> 
> - Fix start tracing packet handling.
> 
> - Support dummy address value for CS_ETM_TRACE_ON packet.
> 
> - Generate branch sample when receiving a CS_ETM_TRACE_ON packet.
> 
> - Generate branch sample for CS_ETM_TRACE_ON packet.
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (9):
>   perf trace beauty: Default header_dir to cwd to work without parms
>   tools include uapi: Grab a copy of linux/in.h
>   perf beauty: Add a generator for IPPROTO_ socket's protocol constants
>   perf trace beauty: Do not print NULL strarray entries
>   perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
>   perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' args
>   perf bpf: Show better message when failing to load an object
>   perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's 

Re: [PATCH v6 00/11] hugetlb: Factorize hugetlb architecture primitives

2018-08-07 Thread Ingo Molnar
  | 54 ++---
>  arch/sparc/include/asm/hugetlb.h | 40 +++--
>  arch/x86/include/asm/hugetlb.h   | 69 --
>  include/asm-generic/hugetlb.h| 88 +++-
>  15 files changed, 135 insertions(+), 394 deletions(-)

The x86 bits look good to me (assuming it's all tested on all relevant 
architectures, etc.)

Acked-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH] watchdog/softlockup: Fix SOFTLOCKUP_DETECTOR=n build

2018-07-10 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Mon, Jul 09, 2018 at 11:40:14PM +0530, Abdul Haleem wrote:
> 
> > Thanks Peter for the patch, build and boot is fine.
> > 
> > Reported-and-tested-by: Abdul Haleem 
> 
> Excellent, Ingo can you stick this in?

Sure, done!

Thanks,

Ingo


Re: [PATCH] x86, powerpc : pkey-mprotect must allow pkey-0

2018-03-09 Thread Ingo Molnar

* Ram Pai <linux...@us.ibm.com> wrote:

> Once an address range is associated with an allocated pkey, it cannot be
> reverted back to key-0. There is no valid reason for the above behavior.  On
> the contrary applications need the ability to do so.
> 
> The patch relaxes the restriction.
> 
> Tested on powerpc and x86_64.
> 
> cc: Dave Hansen <dave.han...@intel.com>
> cc: Michael Ellermen <m...@ellerman.id.au>
> cc: Ingo Molnar <mi...@kernel.org>
> Signed-off-by: Ram Pai <linux...@us.ibm.com>
> ---
>  arch/powerpc/include/asm/pkeys.h | 19 ++-
>  arch/x86/include/asm/pkeys.h |  5 +++--
>  2 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
> index 0409c80..3e8abe4 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -101,10 +101,18 @@ static inline u16 pte_to_pkey_bits(u64 pteflags)
>  
>  static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
>  {
> - /* A reserved key is never considered as 'explicitly allocated' */
> - return ((pkey < arch_max_pkey()) &&
> - !__mm_pkey_is_reserved(pkey) &&
> - __mm_pkey_is_allocated(mm, pkey));
> + /* pkey 0 is allocated by default. */
> + if (!pkey)
> +return true;
> +
> + if (pkey < 0 || pkey >= arch_max_pkey())
> +return false;
> +
> + /* reserved keys are never allocated. */
> + if (__mm_pkey_is_reserved(pkey))
> +return false;

Please capitalize in comments consistently, i.e.:

/* Reserved keys are never allocated: */

> +
> + return(__mm_pkey_is_allocated(mm, pkey));

'return' is not a function.
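
With both review comments applied, the new check could look roughly like the userspace sketch below. This is only an illustration: struct mm_sketch, SKETCH_MAX_PKEY and sketch_pkey_is_allocated() are hypothetical stand-ins for the mm_struct state, arch_max_pkey() and the __mm_pkey_is_reserved()/__mm_pkey_is_allocated() helpers, which are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for arch_max_pkey(), so the sketch builds in userspace. */
#define SKETCH_MAX_PKEY 32

/* Hypothetical stand-in for the per-mm pkey state. */
struct mm_sketch {
	unsigned int allocated_mask;	/* bit N set: pkey N was explicitly allocated */
	unsigned int reserved_mask;	/* bit N set: pkey N is reserved */
};

static bool sketch_pkey_is_allocated(const struct mm_sketch *mm, int pkey)
{
	assert(mm);

	/* Pkey 0 is allocated by default: */
	if (!pkey)
		return true;

	if (pkey < 0 || pkey >= SKETCH_MAX_PKEY)
		return false;

	/* Reserved keys are never allocated: */
	if (mm->reserved_mask & (1u << pkey))
		return false;

	/* No parentheses around the return value: 'return' is not a function. */
	return mm->allocated_mask & (1u << pkey);
}
```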

Thanks,

Ingo


Re: [RFC] new SYSCALL_DEFINE/COMPAT_SYSCALL_DEFINE wrappers

2018-03-30 Thread Ingo Molnar

* John Paul Adrian Glaubitz  wrote:

> On 03/27/2018 12:40 PM, Linus Torvalds wrote:
> > On Mon, Mar 26, 2018 at 4:37 PM, John Paul Adrian Glaubitz
> >  wrote:
> >>
> >> What about a tarball with a minimal Debian x32 chroot? Then you can
> >> install interesting packages you would like to test yourself.
> > 
> > That probably works fine.
> 
> I just created a fresh Debian x32 unstable chroot using this command:
> 
> $ debootstrap --no-check-gpg --variant=minbase --arch=x32 unstable debian-x32-unstable http://ftp.ports.debian.org/debian-ports
> 
> It can be downloaded from my Debian webspace along with checksum files for
> verification:
> 
> > https://people.debian.org/~glaubitz/chroots/
> 
> Let me know if you run into any issues.

Here's the direct download link:

  $ wget https://people.debian.org/~glaubitz/chroots/debian-x32-unstable.tar.gz

Checksum should be:

  $ sha256sum debian-x32-unstable.tar.gz
  010844bcc76bd1a3b7a20fe47f7067ed8e429a84fa60030a2868626e8fa7ec3b  debian-x32-unstable.tar.gz

Seems to work fine here (on a distro kernel) even if I extract all the files as a
non-root user and do:

  ~/s/debian-x32-unstable> fakechroot /usr/sbin/chroot . /usr/bin/dpkg -l | tail -2

  ERROR: ld.so: object 'libfakechroot.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
  ii  util-linux:x32 2.31.1-0.5   x32  miscellaneous system utilities
  ii  zlib1g:x32 1:1.2.8.dfsg-5   x32  compression library - runtime

So that 'dpkg' instance appears to be running inside the chroot environment and is
listing x32 installed packages.

Although I did get this warning:

  ERROR: ld.so: object 'libfakechroot.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

Even with that warning, is this still a sufficiently complex test of x32 syscall
code paths?

BTW., "fakechroot /usr/sbin/chroot ." crashes instead of giving me a bash shell.

Thanks,

Ingo


Re: [PATCH] Extract initrd free logic from arch-specific code.

2018-03-30 Thread Ingo Molnar

* Shea Levy  wrote:

> Now only those architectures that have custom initrd free requirements
> need to define free_initrd_mem.
> 
> Signed-off-by: Shea Levy 

Please put the Kconfig symbol name this patch introduces into the title, so
that people know what to grep for.

> ---
>  arch/alpha/mm/init.c  |  8 
>  arch/arc/mm/init.c|  7 ---
>  arch/arm/Kconfig  |  1 +
>  arch/arm64/Kconfig|  1 +
>  arch/blackfin/Kconfig |  1 +
>  arch/c6x/mm/init.c|  7 ---
>  arch/cris/Kconfig |  1 +
>  arch/frv/mm/init.c| 11 ---
>  arch/h8300/mm/init.c  |  7 ---
>  arch/hexagon/Kconfig  |  1 +
>  arch/ia64/Kconfig |  1 +
>  arch/m32r/Kconfig |  1 +
>  arch/m32r/mm/init.c   | 11 ---
>  arch/m68k/mm/init.c   |  7 ---
>  arch/metag/Kconfig|  1 +
>  arch/microblaze/mm/init.c |  7 ---
>  arch/mips/Kconfig |  1 +
>  arch/mn10300/Kconfig  |  1 +
>  arch/nios2/mm/init.c  |  7 ---
>  arch/openrisc/mm/init.c   |  7 ---
>  arch/parisc/mm/init.c |  7 ---
>  arch/powerpc/mm/mem.c |  7 ---
>  arch/riscv/mm/init.c  |  6 --
>  arch/s390/Kconfig |  1 +
>  arch/score/Kconfig|  1 +
>  arch/sh/mm/init.c |  7 ---
>  arch/sparc/Kconfig|  1 +
>  arch/tile/Kconfig |  1 +
>  arch/um/kernel/mem.c  |  7 ---
>  arch/unicore32/Kconfig|  1 +
>  arch/x86/Kconfig  |  1 +
>  arch/xtensa/Kconfig   |  1 +
>  init/initramfs.c  |  7 +++
>  usr/Kconfig   |  4 
>  34 files changed, 28 insertions(+), 113 deletions(-)

Please also put it into Documentation/features/.

> diff --git a/usr/Kconfig b/usr/Kconfig
> index 43658b8a975e..7a94f6df39bf 100644
> --- a/usr/Kconfig
> +++ b/usr/Kconfig
> @@ -233,3 +233,7 @@ config INITRAMFS_COMPRESSION
>   default ".lzma" if RD_LZMA
>   default ".bz2"  if RD_BZIP2
>   default ""
> +
> +config HAVE_ARCH_FREE_INITRD_MEM
> + bool
> + default n

Help text would be nice, to tell arch maintainers what the purpose of this switch
is.

Also, a nit, I think this should be named "ARCH_HAS_FREE_INITRD_MEM", which is the
dominant pattern:

triton:~/tip> git grep 'select.*ARCH' arch/x86/Kconfig* | cut -f2 | cut -d_ -f1-2 | sort | uniq -c | sort -n
...
  2 select ARCH_USES
  2 select ARCH_WANTS
  3 select ARCH_MIGHT
  3 select ARCH_WANT
  4 select ARCH_SUPPORTS
  4 select ARCH_USE
 16 select HAVE_ARCH
 23 select ARCH_HAS

It also reads nicely in English:

  "arch has free_initrd_mem()"

While the other makes little sense:

  "have arch free_initrd_mem()"

?

Thanks,

Ingo


Re: [PATCH] Extract initrd free logic from arch-specific code.

2018-04-02 Thread Ingo Molnar

* Shea Levy  wrote:

> > Please also put it into Documentation/features/.
> 
> I switched this patch series (the latest revision v6 was just posted) to
> using weak symbols instead of Kconfig. Does it still warrant documentation?

Probably not.

Thanks,

Ingo


Re: [RFC PATCH 4/6] mm: provide generic compat_sys_readahead() implementation

2018-03-19 Thread Ingo Molnar

* Al Viro  wrote:

> On Sun, Mar 18, 2018 at 06:18:48PM +, Al Viro wrote:
> 
> > I'd done some digging in that area, will find the notes and post.
> 
> OK, found:

Very nice writeup - IMHO this should go into Documentation/!

> OTOH, consider arm.  There we have
>   * r0, r1, r2, r3, [sp,#8], [sp,#12], [sp,#16]... is the sequence
> of objects used to pass arguments
>   * 32bit and less - pick the next available slot
>   * 64bit - skip a slot if we'd already taken an odd number, then use
> the next two slots for lower and upper 32 bits of the argument.
> 
> So our classes take
>   simple n-argument:  0 to 6 slots
>   WD                  4 slots
>   DWW                 4 slots
>   WDW                 5 slots
>   WWDD                6 slots
>   WDWW                5 slots
>   WWWD                6 slots
>   WWDWW               6 slots
>   WDDW                7 slots (!)  Also , , !@#!@#!@#!# and other nice
> and well-deserved comments from arch maintainers, some of them even printable:
> /* It would be nice if people remember that not all the world's an i386
>when they introduce new system calls */
> SYSCALL_DEFINE4(sync_file_range2, int, fd, unsigned int, flags,
>  loff_t, offset, loff_t, nbytes)
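
The slot rule quoted above is small enough to write down as code. The following is a minimal sketch, assuming only the two rules stated in the quote (a 'W' argument takes the next slot, a 'D' argument skips a slot when an odd number is already taken and then takes two); the function name arm_arg_slots() is made up for illustration and is not an existing kernel helper.

```c
#include <assert.h>

/*
 * Count the argument slots a syscall signature needs under the rule quoted
 * above: 'W' (32 bits or less) takes the next slot, 'D' (64 bits) first
 * skips a slot if an odd number is already taken, then takes two.
 * Returns -1 on an unknown pattern character.
 */
static int arm_arg_slots(const char *pattern)
{
	int slots = 0;

	assert(pattern);

	for (; *pattern; pattern++) {
		switch (*pattern) {
		case 'W':
			slots += 1;
			break;
		case 'D':
			slots += slots & 1;	/* pad to an even slot */
			slots += 2;
			break;
		default:
			return -1;
		}
	}

	return slots;
}
```

For example, arm_arg_slots("WD") walks W into slot 0, skips slot 1 and puts D into slots 2 and 3, for 4 slots in total, while "WDDW" needs 7 because of the one skipped slot.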

Such idiosyncratic platform quirks that have an impact on generic code should be
as self-maintaining as possible: i.e. there should be a build time warning even on
x86 if someone introduces a new, suboptimally packed system call.

Otherwise we'll have such incidents again and again as new system calls get
added.

> [snip the preprocessor horrors - the sketches I've got there are downright obscene]

I still think we should consider creating a generic facility and a tool: which
would immediately and automatically add new system calls to *every* architecture -
or which would initially at least check these syscall ABI constraints.

I.e. this would start with a new generic kernel facility that warns about
suboptimal new system call argument layouts on every architecture, not just on the
affected ones.

That's a significant undertaking but should be possible to do.

Once such a facility is in place all the existing old mess is still a PITA, but 
should be manageable eventually - as no new mess is added to it.

IMHO that's the only thing that could break the somewhat deadly current dynamic of
system call mappings mess. Complaining about people not knowing about quirks won't
help.

One way to implement this would be to put the argument chain types (string) and
sizes (int) into a special debug section which isn't included in the final kernel
image but which can be checked at link time.

For example this attempt at creating a new system call:

  SYSCALL_DEFINE3(moron, int, fd, loff_t, offset, size_t, count)

... would translate into something like:

	.name = "moron", .pattern = "WWW", .type = "int",    .size = 4,
	.name = NULL,                      .type = "loff_t", .size = 8,
	.name = NULL,                      .type = "size_t", .size = 4,
	.name = NULL,                      .type = NULL,     .size = 0, /* end of parameter list */

i.e. "WDW". The build-time constraint checker could then warn about:

  # error: System call "moron" uses invalid 'WWW' argument mapping for a 'WDW' sequence
  #        please avoid long-long arguments or use 'SYSCALL_DEFINE3_WDW()' instead

Each architecture can provide its own syscall parameter checking logic. Both
'stack boundary' and parameter packing rules would be straightforward to express
if we had such a data structure.
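
A rough sketch of what such a checker could do with those records, assuming 32-bit sizing where anything up to 4 bytes is a 'W' and 8 bytes is a 'D'. The struct layout and the names syscall_arg_desc, derive_pattern() and pattern_matches() are hypothetical, modelled on the fields in the example above rather than on any existing kernel data structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-argument record, modelled on the example above. */
struct syscall_arg_desc {
	const char *type;	/* C type name, NULL terminates the list */
	size_t size;		/* size in bytes on the 32-bit target */
};

/*
 * Derive the real pattern from the argument sizes: up to 4 bytes is 'W',
 * 8 bytes is 'D'. Returns 0 on success, -1 on an unsupported size or if
 * the output buffer is too small.
 */
static int derive_pattern(const struct syscall_arg_desc *args, char *out, size_t outlen)
{
	size_t n = 0;

	assert(args && out);

	if (!outlen)
		return -1;

	for (; args->type; args++) {
		if (n + 1 >= outlen)
			return -1;
		if (args->size > 0 && args->size <= 4)
			out[n++] = 'W';
		else if (args->size == 8)
			out[n++] = 'D';
		else
			return -1;
	}
	out[n] = '\0';

	return 0;
}

/* Return 1 if the declared pattern matches the derived one, 0 if not, -1 on error. */
static int pattern_matches(const char *declared, const struct syscall_arg_desc *args)
{
	char real[16];

	if (derive_pattern(args, real, sizeof(real)))
		return -1;

	return strcmp(declared, real) == 0;
}
```

Fed the (int, loff_t, size_t) argument list from the example, derive_pattern() produces "WDW", so a declared "WWW" pattern is reported as a mismatch.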

Also note that this tool could also check for optimum packing, i.e. if the new 
system call is defined as:

  SYSCALL_DEFINE3_WDW(moron, int, fd, loff_t, offset, size_t, count)

... would translate to something like:

	.name = "moron", .pattern = "WDW", .type = "int",    .size = 4,
	.name = NULL,                      .type = "loff_t", .size = 8,
	.name = NULL,                      .type = "size_t", .size = 4,
	.name = NULL,                      .type = NULL,     .size = 0, /* end of parameter list */

where the tool would print out this error:

  # error: System call "moron" uses suboptimal 'WDW' argument mapping instead of 'WWD'

There would be a whitelist of existing system calls that are already using a
suboptimal argument order - but the warnings/errors would trigger for all new
system calls.
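
The "suboptimal" test itself could be as simple as counting padding slots. The sketch below assumes the ARM-style rule from earlier in the thread (a 64-bit argument must start on an even slot) and ignores any register/stack boundary effects; wasted_slots() is a made-up name. Under that assumption zero padding is always reachable by putting the 64-bit arguments first, so any non-zero result means a better order exists.

```c
#include <assert.h>

/*
 * Count the padding slots in a 'W'/'D' pattern, assuming the ARM-style rule
 * that a 64-bit 'D' argument must start on an even slot. Zero means the
 * order wastes nothing. Returns -1 on an unknown pattern character.
 */
static int wasted_slots(const char *pattern)
{
	int used = 0, wasted = 0;

	assert(pattern);

	for (; *pattern; pattern++) {
		if (*pattern == 'W') {
			used += 1;
		} else if (*pattern == 'D') {
			if (used & 1) {		/* odd: skip one slot */
				used += 1;
				wasted += 1;
			}
			used += 2;
		} else {
			return -1;
		}
	}

	return wasted;
}
```

With this, "WDW" wastes one slot while "WWD" wastes none, which is the case the error message above complains about.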

But adding non-straight-mapped system calls would be the exception in any case.

Such tooling could also do other things, such as limit the C types used for system
call defines to a well-chosen set of ABI-safe types, such as:

  3  key_t
  3  uint32_t
  4  aio_context_t
  4  mqd_t
  4  timer_t
 10  clockid_t
 10  gid_t
 10  loff_t
 10  long
 10  old_gid_t
 10  old_uid_t
 10  umode_t
 11  uid_t
 31  pid_t
 34  size_t
 69  unsigned int

Re: [RFC PATCH 4/6] mm: provide generic compat_sys_readahead() implementation

2018-03-20 Thread Ingo Molnar

* Al Viro  wrote:

> > For example this attempt at creating a new system call:
> > 
> >   SYSCALL_DEFINE3(moron, int, fd, loff_t, offset, size_t, count)
> > 
> > ... would translate into something like:
> > 
> > 	.name = "moron", .pattern = "WWW", .type = "int",    .size = 4,
> > 	.name = NULL,                      .type = "loff_t", .size = 8,
> > 	.name = NULL,                      .type = "size_t", .size = 4,
> > 	.name = NULL,                      .type = NULL,     .size = 0, /* end of parameter list */
> > 
> > i.e. "WDW". The build-time constraint checker could then warn about:
> > 
> >   # error: System call "moron" uses invalid 'WWW' argument mapping for a 'WDW' sequence
> >   #        please avoid long-long arguments or use 'SYSCALL_DEFINE3_WDW()' instead
> 
> ... if you do 32bit build.

Yeah - but the checking tool could do a 32-bit sizing of the types and thus the 
checks would work on all arches and on all bitness settings.

I don't think doing part of this in CPP is a good idea:

 - It won't be able to do the full range of checks

 - Wrappers should IMHO be trivial and open coded as much as possible - not hidden
   inside several layers of macros.

 - There should be a penalty for newly introduced, badly designed system call
   ABIs, while most CPP variants I can think of will just make bad but solvable
   decisions palatable, AFAICS.

I.e. I think the way out of this would be two steps:

 1) for new system calls: hard-enforce the highest quality at the development
    stage and hard-reject crap. No new 6-parameter system calls or badly ordered
    arguments. The tool would also check new extensions to existing system calls,
    i.e. no more "add a crappy 4th argument to an existing system call that works
    on x86 but hurts MIPS".

 2) for old legacies: cleanly open code all our existing legacies and weird
    wrappers. No new muck will be added to it so the line count does not matter.

... is there anything I'm missing?

Thanks,

Ingo


Re: [GIT PULL 00/14] perf/core improvements and fixes

2018-03-19 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling, this has those 31 patches that were
> blocked due to some problems (author not being the first S-o-B, build
> broken on ppc), those issues should all be fixed and then we have 14
> patches more, described in the signed tag.
> 
> Regards,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 10f354a36f9a9aa1b8bffe0abc1cd43822a85bcd:
> 
>   perf test: Fix exit code for record+probe_libc_inet_pton.sh (2018-03-16 13:56:31 -0300)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-4.17-20180319
> 
> for you to fetch changes up to 1cd618838b9703eabe4a75badf433382b12f6bef:
> 
>   perf tests bp_account: Fix build with clang-6 (2018-03-19 13:51:54 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> - Fixes for problems experienced with new gcc 8 warnings, that treated
>   as errors, broke the build, related to snprintf and casting issues.
>   (Arnaldo Carvalho de Melo, Jiri Olsa, Josh Poimboeuf)
> 
> - Fix build of new breakpoint 'perf test' entry with clang < 6, noticed
>   on fedora 25, 26 and 27 (Arnaldo Carvalho de Melo)
> 
> - Workaround problem with symbol resolution in 'perf annotate', using
>   the symbol name already present in the objdump output (Arnaldo Carvalho de Melo)
> 
> - Document 'perf top --ignore-vmlinux' (Arnaldo Carvalho de Melo)
> 
> - Fix out of bounds access on array fd when cnt is 100 in one of the
>   'perf test' entries, detected using 'cpptest' (Colin Ian King)
> 
> - Add support for the forced leader feature, i.e. 'perf report --group'
>   for a group of events not really grouped when scheduled (without using
>   {} to enclose the list of events in the command line) in pipe mode,
>   e.g.:
> 
>   $ perf record -e cycles,instructions -o - kill | perf report --group -i -
> 
> - Use right type to access array elements in 'perf probe' (Masami Hiramatsu)
> 
> - Update POWER9 vendor events (those described in JSON format) (Sukadev Bhattiprolu)
> 
> - Discard head in overwrite_rb_find_range() (Yisheng Xie)
> 
> - Avoid setting 'quiet' to 'true' unnecessarily (Yisheng Xie)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (4):
>   perf annotate: Use asprintf when formatting objdump command line
>   perf top: Document --ignore-vmlinux
>   perf annotate: Use ops->target.name when available for unresolved call targets
>   perf tests bp_account: Fix build with clang-6
> 
> Colin Ian King (1):
>   perf tests: Fix out of bounds access on array fd when cnt is 100
> 
> Jiri Olsa (4):
>   perf record: Synthesize features before events in pipe mode
>   perf report: Support forced leader feature in pipe mode
>   perf tools: Fix snprint warnings for gcc 8
>   perf tools: Fix python extension build for gcc 8
> 
> Josh Poimboeuf (1):
>   objtool, perf: Fix GCC 8 -Wrestrict error
> 
> Masami Hiramatsu (1):
>   perf probe: Use right type to access array elements
> 
> Sukadev Bhattiprolu (1):
>   perf vendor events: Update POWER9 events
> 
> Yisheng Xie (2):
>   perf mmap: Discard head in overwrite_rb_find_range()
>   perf debug: Avoid setting 'quiet' to 'true' unnecessarily
> 
>  tools/lib/str_error_r.c|   2 +-
>  tools/perf/Documentation/perf-top.txt  |   3 +
>  tools/perf/builtin-record.c|  18 +-
>  tools/perf/builtin-report.c|  57 +++--
>  tools/perf/builtin-script.c|  22 +-
>  .../perf/pmu-events/arch/powerpc/power9/cache.json |  25 ---
>  .../pmu-events/arch/powerpc/power9/frontend.json   |  10 -
>  .../pmu-events/arch/powerpc/power9/marked.json |   5 -
>  .../pmu-events/arch/powerpc/power9/memory.json |   5 -
>  .../perf/pmu-events/arch/powerpc/power9/other.json | 241 ++---
>  .../pmu-events/arch/powerpc/power9/pipeline.json   |  50 ++---
>  tools/perf/pmu-events/arch/powerpc/power9/pmc.json |   5 -
>  .../arch/powerpc/power9/translation.json   |  10 +-
>  tools/perf/tests/attr.c|   4 +-
>  tools/perf/tests/bp_account.c  |  10 +-
>  tools/perf/tests/mem.c |   2 +-
>  tools/perf/tests/pmu.c |   2 +-
>  tools/perf/util/annotate.c |  20 +-
>  tools/perf/util/cgroup.c   |   2 +-
>  tools/perf/util/debug.c|   1 -
>  tools/perf/util/header.c   |  11 +-
>  tools/perf/util/mmap.c |  15 +-
>  tools/perf/util/parse-events.c |   4 +-
>  

Re: [PATCH 2/2] x86, powerpc: remove -funit-at-a-time compiler option entirely

2018-11-11 Thread Ingo Molnar


* Masahiro Yamada wrote:

> GCC 4.6 manual says:
> 
> -funit-at-a-time
>   This option is left for compatibility reasons. -funit-at-a-time has
>   no effect, while -fno-unit-at-a-time implies -fno-toplevel-reorder
>   and -fno-section-anchors.
>   Enabled by default.
> 
> Signed-off-by: Masahiro Yamada 
> ---
> 
>  arch/powerpc/Makefile | 4 
>  arch/x86/Makefile | 4 
>  arch/x86/Makefile.um  | 5 -
>  3 files changed, 13 deletions(-)
> 
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 88398fd..3508049 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -130,10 +130,6 @@ else
>  
>  KBUILD_CFLAGS += -mno-red-zone
>  KBUILD_CFLAGS += -mcmodel=kernel
> -
> -# -funit-at-a-time shrinks the kernel .text considerably
> -# unfortunately it makes reading oopses harder.
> -KBUILD_CFLAGS += $(call cc-option,-funit-at-a-time)
>  endif
>  
>  ifdef CONFIG_X86_X32
> diff --git a/arch/x86/Makefile.um b/arch/x86/Makefile.um
> index 577976b..1db7913 100644
> --- a/arch/x86/Makefile.um
> +++ b/arch/x86/Makefile.um
> @@ -26,9 +26,6 @@ cflags-y += $(call cc-option,-mpreferred-stack-boundary=2)
>  # an unresolved reference.
>  cflags-y += -ffreestanding
>  
> -# gcc 4.3.0 needs -funit-at-a-time for extern inline functions.
> -KBUILD_CFLAGS += $(call cc-option,-funit-at-a-time)
> -
>  KBUILD_CFLAGS += $(cflags-y)
>  
>  else
> @@ -50,6 +47,4 @@ ELF_FORMAT := elf64-x86-64
>  LINK-$(CONFIG_LD_SCRIPT_DYN) += -Wl,-rpath,/lib64
>  LINK-y += -m64
>  
> -# Do unit-at-a-time unconditionally on x86_64, following the host
> -KBUILD_CFLAGS += $(call cc-option,-funit-at-a-time)
>  endif

Acked-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH 02/17] x86: Add support for ZSTD-compressed kernel

2018-11-11 Thread Ingo Molnar


* Adam Borowski wrote:

> From: Nick Terrell 
> 
> Integrates the ZSTD decompression code to the x86 pre-boot code.
> 
> Zstandard requires slightly more memory during the kernel decompression
> on x86 (192 KB vs 64 KB), and the memory usage is independent of the
> window size.
> 
> Zstandard requires memory proportional to the window size used during
> compression for decompressing the ramdisk image, since streaming mode is
> used. Newer versions of zstd (1.3.2+) list the window size of a file
> with `zstd -lv <file>'. The absolute maximum amount of memory required
> is just over 8 MB.
> 
> Signed-off-by: Nick Terrell 
> ---
>  Documentation/x86/boot.txt| 6 +++---
>  arch/x86/Kconfig  | 1 +
>  arch/x86/boot/compressed/Makefile | 5 -
>  arch/x86/boot/compressed/misc.c   | 4 
>  arch/x86/boot/header.S| 8 +++-
>  arch/x86/include/asm/boot.h   | 6 --
>  6 files changed, 23 insertions(+), 7 deletions(-)

Acked-by: Ingo Molnar 

> diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
> index 4c881c850125..af2efb256527 100644
> --- a/arch/x86/boot/header.S
> +++ b/arch/x86/boot/header.S
> @@ -526,8 +526,14 @@ pref_address: .quad LOAD_PHYSICAL_ADDR # preferred load addr
>  # the size-dependent part now grows so fast.
>  #
>  # extra_bytes = (uncompressed_size >> 8) + 65536
> +#
> +# ZSTD compressed data grows by at most 3 bytes per 128K, and only has a 22
> +# byte fixed overhead but has a maximum block size of 128K, so it needs a
> +# larger margin.
> +#
> +# extra_bytes = (uncompressed_size >> 8) + 131072
>  
> -#define ZO_z_extra_bytes ((ZO_z_output_len >> 8) + 65536)
> +#define ZO_z_extra_bytes ((ZO_z_output_len >> 8) + 131072)

This change would also affect other decompressors, not just ZSTD, 
correct?

Might want to split this change out into a separate preparatory patch to 
allow it to be bisected to, or at least mention it in the changelog more 
explicitly?
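
To make the size of the change concrete, here is a minimal sketch of the
margin arithmetic from the hunk above. The helper name extra_bytes() and
the 32 MiB image size are assumptions made up for illustration; only the
formula and the two constants come from the patch.

```c
#include <stdint.h>

/* Fixed parts of the margin, taken from the quoted hunk. */
#define OLD_FIXED_MARGIN  65536u   /* 64 KiB, before the patch */
#define NEW_FIXED_MARGIN 131072u   /* 128 KiB, after the patch */

/*
 * Hypothetical helper mirroring the comment in header.S:
 *   extra_bytes = (uncompressed_size >> 8) + fixed_margin
 */
static uint64_t extra_bytes(uint64_t uncompressed_size, uint64_t fixed_margin)
{
	return (uncompressed_size >> 8) + fixed_margin;
}
```

For an assumed 32 MiB uncompressed kernel, extra_bytes(32u << 20,
OLD_FIXED_MARGIN) is 196608 and extra_bytes(32u << 20, NEW_FIXED_MARGIN)
is 262144. The difference is a constant 64 KiB whatever the image size,
and since the macro is not conditional on the compressor in the hunk as
quoted, it would presumably apply to every decompressor.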

Thanks,

Ingo


Re: [PATCH v2 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-13 Thread Ingo Molnar


* Waiman Long wrote:

> I looked at the assembly code in arch/x86/include/asm/rwsem.h. For both
> trylocks (read & write), the count is read first before attempting to
> lock it. We did the same for all trylock functions in other locks.
> Depending on how the trylock is used and how contended the lock is, it
> may help or hurt performance. Changing down_read_trylock to do an
> unconditional cmpxchg will change the performance profile of existing
> code. So I would prefer keeping the current code.
> 
> I do notice now that the generic down_write_trylock() code is doing an
> unconditional cmpxchg. So I wonder if we should change it to read the
> lock first like other trylocks or just leave it as it is.

No, I think we should instead move the other trylocks to the 
try-for-ownership model as well, like Linus suggested.

That's the general assumption we make in locking primitives, that we 
optimize for the common, expected case - which would be that the trylock 
succeeds, and I don't see why trylock primitives should be different.

In fact I can see more ways for read-for-sharing to perform suboptimally 
on larger systems.
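
To illustrate the two models, here is a minimal userspace sketch using
C11 atomics. It is not the kernel's rwsem code: struct toy_rwsem, the
function names and the count encoding (count >= 0 is the number of
readers, count < 0 means a writer holds the lock) are all assumptions
made up for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy encoding (assumed): count >= 0 readers, count < 0 writer-held. */
struct toy_rwsem {
	atomic_long count;
};

/* Read-first model: load the count, then cmpxchg from the observed value. */
static bool toy_read_trylock_read_first(struct toy_rwsem *sem)
{
	long old = atomic_load_explicit(&sem->count, memory_order_relaxed);

	while (old >= 0) {
		/* On failure 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak_explicit(&sem->count, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return true;
	}
	return false;
}

/* Try-for-ownership model: guess "unlocked" and cmpxchg right away. */
static bool toy_read_trylock_optimistic(struct toy_rwsem *sem)
{
	long old = 0;	/* optimistic guess: nobody holds the lock */

	do {
		if (atomic_compare_exchange_strong_explicit(&sem->count, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return true;
	} while (old >= 0);	/* wrong guess, but no writer: retry */

	return false;
}

static void toy_read_unlock(struct toy_rwsem *sem)
{
	atomic_fetch_sub_explicit(&sem->count, 1, memory_order_release);
}
```

In the read-first variant the initial load may pull the cache line in
shared state before the cmpxchg has to upgrade it to exclusive, while
the optimistic variant asks for ownership of the line from the start.
When the trylock usually succeeds, that presumably saves one coherence
transaction per acquisition.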

Thanks,

Ingo


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-10 Thread Ingo Molnar


* Waiman Long wrote:

> On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> > On Thu, 07 Feb 2019, Waiman Long wrote:
> >> 30 files changed, 1197 insertions(+), 1594 deletions(-)
> >
> > Performance numbers on numerous workloads, pretty please.
> >
> > I'll go and throw this at my mmap_sem intensive workloads
> > I've collected.
> >
> > Thanks,
> > Davidlohr
> 
> Thanks for getting some of the performance numbers. This is the initial
> draft after more than 1 year of hibernation. I will also get other
> performance numbers in a subsequent revision of the patch.

If you could sort all the invariant preparatory patches to the head of 
the series I can merge them to reduce overall complexity and simplify 
performance testing and review of the rest.

Thanks,

Ingo

