Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers

2017-11-15 Thread Peter Zijlstra
On Tue, Nov 14, 2017 at 11:10:23AM -0800, Hugh Dickins wrote:
> I was about to agree, but now I'm not so sure.  I don't know much
> about these PMC things, but at a glance it looks like what is reserved
> by x86_reserve_hardware() may later be released by x86_release_hardware(),
> and then later reserved again by x86_reserve_hardware().  And although
> the static per-cpu area would be zeroed the first time, the second time
> it will contain data left over from before, so really needs the memset?

Ah, yes. It does get reused. I think it's still fine, but yes, let's keep
it. Better safe than sorry, and it's not a hot path in any case.

Thanks!
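
For reference, a minimal sketch of the path in question, using the names
from the patch below (nothing new here): the per-cpu debug_store is static,
so it is only implicitly zeroed once at boot, and after an
x86_release_hardware()/x86_reserve_hardware() cycle it still holds the old
buffer base addresses, hence the explicit clear.

static int alloc_ds_buffer(int cpu)
{
	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);

	/*
	 * Wipe stale PEBS/BTS base addresses left over from a previous
	 * reserve/release cycle; static per-cpu storage is only zeroed
	 * at boot, not on every reservation.
	 */
	memset(ds, 0, sizeof(*ds));
	per_cpu(cpu_hw_events, cpu).ds = ds;

	return 0;
}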


Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers

2017-11-14 Thread Andy Lutomirski
On Tue, Nov 14, 2017 at 11:10 AM, Hugh Dickins  wrote:
> On Tue, 14 Nov 2017, Dave Hansen wrote:
>> On 11/14/2017 10:20 AM, Peter Zijlstra wrote:
>> > On Fri, Nov 10, 2017 at 11:31:39AM -0800, Dave Hansen wrote:
>> >>  static int alloc_ds_buffer(int cpu)
>> >>  {
>> >> +  struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
>> >>
>> >> +  memset(ds, 0, sizeof(*ds));
>> > Still wondering about that memset...
>
> Sorry, my attention is far away at the moment.
>
>>
>> My guess is that it was done to mirror the zeroing done by the original
>> kzalloc().
>
> You guess right.
>
>> But, I think you're right that it's zero'd already by virtue
>> of being static:
>>
>> static
>> DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store,
>> cpu_debug_store);
>>
>> I'll queue a cleanup, or update it if I re-post the set.
>
> I was about to agree, but now I'm not so sure.  I don't know much
> about these PMC things, but at a glance it looks like what is reserved
> by x86_reserve_hardware() may later be released by x86_release_hardware(),
> and then later reserved again by x86_reserve_hardware().  And although
> the static per-cpu area would be zeroed the first time, the second time
> it will contain data left over from before, so really needs the memset?
>

For an upstream solution, I would really really like to see
DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED and friends completely gone
and to use cpu_entry_area instead.  I don't know whether this has any
material impact on this particular discussion, though.

--Andy

> Hugh
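
(Purely illustrative sketch of the cpu_entry_area idea, not part of this
series, with hypothetical names: if the DS area lived in a per-cpu
structure that is mapped into both page tables by construction, no special
USER_MAPPED per-cpu section would be needed for it.)

struct cpu_entry_area_sketch {			/* hypothetical layout */
	/* ... entry trampoline state, TSS, exception stacks ... */
	struct debug_store cpu_debug_store;	/* DS area, one per CPU */
};

/*
 * alloc_ds_buffer() would then take its pointer from this always
 * user-mapped area instead of per_cpu_ptr(&cpu_debug_store, cpu).
 */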


Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers

2017-11-14 Thread Hugh Dickins
On Tue, 14 Nov 2017, Dave Hansen wrote:
> On 11/14/2017 10:20 AM, Peter Zijlstra wrote:
> > On Fri, Nov 10, 2017 at 11:31:39AM -0800, Dave Hansen wrote:
> >>  static int alloc_ds_buffer(int cpu)
> >>  {
> >> +  struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
> >>  
> >> +  memset(ds, 0, sizeof(*ds));
> > Still wondering about that memset...

Sorry, my attention is far away at the moment.

> 
> My guess is that it was done to mirror the zeroing done by the original
> kzalloc().

You guess right.

> But, I think you're right that it's zero'd already by virtue
> of being static:
> 
> static
> DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store,
> cpu_debug_store);
> 
> I'll queue a cleanup, or update it if I re-post the set.

I was about to agree, but now I'm not so sure.  I don't know much
about these PMC things, but at a glance it looks like what is reserved
by x86_reserve_hardware() may later be released by x86_release_hardware(),
and then later reserved again by x86_reserve_hardware().  And although
the static per-cpu area would be zeroed the first time, the second time
it will contain data left over from before, so really needs the memset?

Hugh


Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers

2017-11-14 Thread Dave Hansen
On 11/14/2017 10:20 AM, Peter Zijlstra wrote:
> On Fri, Nov 10, 2017 at 11:31:39AM -0800, Dave Hansen wrote:
>>  static int alloc_ds_buffer(int cpu)
>>  {
>> +	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
>>  
>> +memset(ds, 0, sizeof(*ds));
> Still wondering about that memset...

My guess is that it was done to mirror the zeroing done by the original
kzalloc().  But, I think you're right that it's zero'd already by virtue
of being static:

static
DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store,
cpu_debug_store);

I'll queue a cleanup, or update it if I re-post the set.


Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers

2017-11-14 Thread Peter Zijlstra
On Fri, Nov 10, 2017 at 11:31:39AM -0800, Dave Hansen wrote:
>  static int alloc_ds_buffer(int cpu)
>  {
> +	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
>  
> + memset(ds, 0, sizeof(*ds));

Still wondering about that memset...

>   per_cpu(cpu_hw_events, cpu).ds = ds;
>  
>   return 0;


[PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers

2017-11-10 Thread Dave Hansen

From: Hugh Dickins 
[Dave] Add explicit _PAGE_GLOBAL
[Dave] remove KAISER #ifdefs by moving kmalloc() to plain page allocator
[Dave] reword the commit message a bit to be consistent with other patches

The BTS and PEBS buffers both have their virtual addresses
programmed into the hardware.  This means that any access to them
is performed via the page tables.  The times that the hardware
accesses these are entirely dependent on how the performance
monitoring hardware events are set up.  In other words, there is
no way for the kernel to tell when the hardware might access
these buffers.
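
(For concreteness: "programmed into the hardware" means the DS area's
virtual address is written into an MSR, roughly as the existing
init_debug_store_on_cpu() in this file does; the snippet below is a
paraphrase, not part of the diff.  Once the write happens, the CPU
dereferences that virtual address on its own schedule.)

	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;

	wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA,
		     (u32)((u64)(unsigned long)ds),
		     (u32)((u64)(unsigned long)ds >> 32));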

To avoid perf crashes, place 'debug_store' in the user-mapped
per-cpu area instead of dynamically allocating.  Also use the
page allocator plus kaiser_add_mapping() to keep the BTS and PEBS
buffers user-mapped (that is, present in the user mapping, though
visible only to kernel and hardware).  The PEBS fixup buffer does
not need this treatment.

The need for a user-mapped struct debug_store showed up before doing
any conscious perf testing: in a couple of kernel paging oopses on
Westmere, implicating the debug_store offset of the per-cpu area.

Signed-off-by: Hugh Dickins 
Signed-off-by: Dave Hansen 
Cc: Moritz Lipp 
Cc: Daniel Gruss 
Cc: Michael Schwarz 
Cc: Richard Fellner 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Kees Cook 
Cc: Hugh Dickins 
Cc: x...@kernel.org
---

 b/arch/x86/events/intel/ds.c |   49 ---
 1 file changed, 37 insertions(+), 12 deletions(-)

diff -puN arch/x86/events/intel/ds.c~kaiser-user-map-virtually-addressed-performance-monitoring-buffers arch/x86/events/intel/ds.c
--- a/arch/x86/events/intel/ds.c~kaiser-user-map-virtually-addressed-performance-monitoring-buffers	2017-11-10 11:22:14.866244935 -0800
+++ b/arch/x86/events/intel/ds.c	2017-11-10 11:22:14.869244935 -0800
@@ -2,11 +2,15 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 
 #include "../perf_event.h"
 
+static
+DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE		24
 
@@ -278,6 +282,31 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
+static void *dsalloc(size_t size, gfp_t flags, int node)
+{
+   unsigned int order = get_order(size);
+   struct page *page;
+   unsigned long addr;
+
+   page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+   if (!page)
+   return NULL;
+   addr = (unsigned long)page_address(page);
+   if (kaiser_add_mapping(addr, size, __PAGE_KERNEL | _PAGE_GLOBAL) < 0) {
+   __free_pages(page, order);
+   addr = 0;
+   }
+   return (void *)addr;
+}
+
+static void dsfree(const void *buffer, size_t size)
+{
+   if (!buffer)
+   return;
+   kaiser_remove_mapping((unsigned long)buffer, size);
+   free_pages((unsigned long)buffer, get_order(size));
+}
+
 static int alloc_pebs_buffer(int cpu)
 {
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -288,7 +317,7 @@ static int alloc_pebs_buffer(int cpu)
if (!x86_pmu.pebs)
return 0;
 
-   buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+   buffer = dsalloc(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
if (unlikely(!buffer))
return -ENOMEM;
 
@@ -299,7 +328,7 @@ static int alloc_pebs_buffer(int cpu)
if (x86_pmu.intel_cap.pebs_format < 2) {
ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
if (!ibuffer) {
-   kfree(buffer);
+   dsfree(buffer, x86_pmu.pebs_buffer_size);
return -ENOMEM;
}
per_cpu(insn_buffer, cpu) = ibuffer;
@@ -325,7 +354,8 @@ static void release_pebs_buffer(int cpu)
kfree(per_cpu(insn_buffer, cpu));
per_cpu(insn_buffer, cpu) = NULL;
 
-   kfree((void *)(unsigned long)ds->pebs_buffer_base);
+   dsfree((void *)(unsigned long)ds->pebs_buffer_base,
+   x86_pmu.pebs_buffer_size);
ds->pebs_buffer_base = 0;
 }
 
@@ -339,7 +369,7 @@ static int alloc_bts_buffer(int cpu)
if (!x86_pmu.bts)
return 0;
 
-   buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+   buffer = dsalloc(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
if (unlikely(!buffer)) {
WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
return -ENOMEM;
@@ -365,19 +395,15 @@ static void release_bts_buffer(int cpu)
if (!ds || !x86_pmu.bts)
return;
 
-   kfree((void *)(unsigned long)ds->bts_buffer_base);
+   dsfree((void *)(unsigned long)ds->bts_buffer_base, BTS_BUFFER_SIZE);
ds->bts_buffer_base = 0;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-   int node = 

Re: [PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers

2017-11-10 Thread Peter Zijlstra
On Wed, Nov 08, 2017 at 11:47:20AM -0800, Dave Hansen wrote:
> +static
> +DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, 
> cpu_debug_store);
> +
>  /* The size of a BTS record in bytes: */
>  #define BTS_RECORD_SIZE  24
>  
> @@ -278,6 +282,39 @@ void fini_debug_store_on_cpu(int cpu)
>  
>  static DEFINE_PER_CPU(void *, insn_buffer);
>  
> +static void *dsalloc(size_t size, gfp_t flags, int node)
> +{
> +#ifdef CONFIG_KAISER
> + unsigned int order = get_order(size);
> + struct page *page;
> + unsigned long addr;
> +
> + page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
> + if (!page)
> + return NULL;
> + addr = (unsigned long)page_address(page);
> + if (kaiser_add_mapping(addr, size, __PAGE_KERNEL | _PAGE_GLOBAL) < 0) {
> + __free_pages(page, order);
> + addr = 0;
> + }
> + return (void *)addr;
> +#else
> + return kmalloc_node(size, flags | __GFP_ZERO, node);
> +#endif
> +}
> +
> +static void dsfree(const void *buffer, size_t size)
> +{
> +#ifdef CONFIG_KAISER
> + if (!buffer)
> + return;
> + kaiser_remove_mapping((unsigned long)buffer, size);
> + free_pages((unsigned long)buffer, get_order(size));
> +#else
> + kfree(buffer);
> +#endif
> +}

You might as well use __alloc_pages_node() / free_pages()
unconditionally. Those buffers are at least one page in size.

That should also get rid of the #ifdef muck.
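
(Minimal sketch of the assumption behind dropping the #ifdef: with
CONFIG_KAISER=n the kaiser mapping helpers need to exist as no-op stubs,
roughly as below.  Stub bodies are illustrative; signatures are inferred
from the calls quoted above.)

#ifndef CONFIG_KAISER
static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
				     unsigned long flags)
{
	return 0;	/* nothing to do: there is only one page table */
}

static inline void kaiser_remove_mapping(unsigned long start,
					 unsigned long size)
{
}
#endif

With such stubs in place, dsalloc()/dsfree() can call __alloc_pages_node()
and free_pages() unconditionally, which is what the 2017-11-10 repost above
ends up doing.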

>  static int alloc_ds_buffer(int cpu)
>  {
> - int node = cpu_to_node(cpu);
> - struct debug_store *ds;
> -
> - ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
> - if (unlikely(!ds))
> - return -ENOMEM;
> +	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
>  
> + memset(ds, 0, sizeof(*ds));

Why the memset()? Isn't static per-cpu memory 0-initialized?


[PATCH 18/30] x86, kaiser: map virtually-addressed performance monitoring buffers

2017-11-08 Thread Dave Hansen

From: Hugh Dickins 
[Dave] Add explicit _PAGE_GLOBAL

The BTS and PEBS buffers both have their virtual addresses programmed
into the hardware.  This means that we have to access them via the page
tables.  The times that the hardware accesses these are entirely
dependent on how the performance monitoring hardware events are set up.
In other words, we have no idea when we might need to access these
buffers.

Avoid perf crashes: place debug_store in the user-mapped per-cpu area
instead of allocating, and use page allocator plus kaiser_add_mapping()
to keep the BTS and PEBS buffers user-mapped (that is, present in the
user mapping, though visible only to kernel and hardware).  The PEBS
fixup buffer does not need this treatment.

The need for a user-mapped struct debug_store showed up before doing
any conscious perf testing: in a couple of kernel paging oopses on
Westmere, implicating the debug_store offset of the per-cpu area.

Signed-off-by: Hugh Dickins 
Signed-off-by: Dave Hansen 
Cc: Moritz Lipp 
Cc: Daniel Gruss 
Cc: Michael Schwarz 
Cc: Richard Fellner 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Kees Cook 
Cc: Hugh Dickins 
Cc: x...@kernel.org
---

 b/arch/x86/events/intel/ds.c |   57 +--
 1 file changed, 45 insertions(+), 12 deletions(-)

diff -puN arch/x86/events/intel/ds.c~kaiser-user-map-virtually-addressed-performance-monitoring-buffers arch/x86/events/intel/ds.c
--- a/arch/x86/events/intel/ds.c~kaiser-user-map-virtually-addressed-performance-monitoring-buffers	2017-11-08 10:45:35.659681379 -0800
+++ b/arch/x86/events/intel/ds.c	2017-11-08 10:45:35.662681379 -0800
@@ -2,11 +2,15 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 
 #include "../perf_event.h"
 
+static
+DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE		24
 
@@ -278,6 +282,39 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
+static void *dsalloc(size_t size, gfp_t flags, int node)
+{
+#ifdef CONFIG_KAISER
+   unsigned int order = get_order(size);
+   struct page *page;
+   unsigned long addr;
+
+   page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+   if (!page)
+   return NULL;
+   addr = (unsigned long)page_address(page);
+   if (kaiser_add_mapping(addr, size, __PAGE_KERNEL | _PAGE_GLOBAL) < 0) {
+   __free_pages(page, order);
+   addr = 0;
+   }
+   return (void *)addr;
+#else
+   return kmalloc_node(size, flags | __GFP_ZERO, node);
+#endif
+}
+
+static void dsfree(const void *buffer, size_t size)
+{
+#ifdef CONFIG_KAISER
+   if (!buffer)
+   return;
+   kaiser_remove_mapping((unsigned long)buffer, size);
+   free_pages((unsigned long)buffer, get_order(size));
+#else
+   kfree(buffer);
+#endif
+}
+
 static int alloc_pebs_buffer(int cpu)
 {
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -288,7 +325,7 @@ static int alloc_pebs_buffer(int cpu)
if (!x86_pmu.pebs)
return 0;
 
-   buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+   buffer = dsalloc(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
if (unlikely(!buffer))
return -ENOMEM;
 
@@ -299,7 +336,7 @@ static int alloc_pebs_buffer(int cpu)
if (x86_pmu.intel_cap.pebs_format < 2) {
ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
if (!ibuffer) {
-   kfree(buffer);
+   dsfree(buffer, x86_pmu.pebs_buffer_size);
return -ENOMEM;
}
per_cpu(insn_buffer, cpu) = ibuffer;
@@ -325,7 +362,8 @@ static void release_pebs_buffer(int cpu)
kfree(per_cpu(insn_buffer, cpu));
per_cpu(insn_buffer, cpu) = NULL;
 
-   kfree((void *)(unsigned long)ds->pebs_buffer_base);
+   dsfree((void *)(unsigned long)ds->pebs_buffer_base,
+   x86_pmu.pebs_buffer_size);
ds->pebs_buffer_base = 0;
 }
 
@@ -339,7 +377,7 @@ static int alloc_bts_buffer(int cpu)
if (!x86_pmu.bts)
return 0;
 
-   buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+   buffer = dsalloc(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
if (unlikely(!buffer)) {
WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
return -ENOMEM;
@@ -365,19 +403,15 @@ static void release_bts_buffer(int cpu)
if (!ds || !x86_pmu.bts)
return;
 
-   kfree((void *)(unsigned long)ds->bts_buffer_base);
+   dsfree((void *)(unsigned long)ds->bts_buffer_base, BTS_BUFFER_SIZE);
ds->bts_buffer_base = 0;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-   int node = cpu_to_node(cpu);
-   struct debug_store *ds;
-
-