Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation

2016-09-23 Thread Alexander Shishkin
Peter Zijlstra  writes:

> On Fri, Sep 23, 2016 at 05:27:22PM +0300, Alexander Shishkin wrote:
>> > Afaict there's no actual need to hide the AUX buffer for this sampling
>> > stuff; the user knows about all this and can simply mmap() the AUX part.
>> 
>> Yes, you're right here. We could also re-use the AUX record, adding a
>> new flag for this. It may be even better if I can work out the
>> inheritance (the current code doesn't handle inheritance at the moment
>> in case we decide to scrap it).
>
> What is the exact problem with inheritance? You can inherit PT (and
> other) events just fine, and their output redirects to the original
> (AUX) buffer too.
>
> Is the problem untangling which part of the AUX buffer belongs to which
> task upon sample?

Tasks we can figure out from the ID samples on RECORD_AUX (assuming
we're using those), but working out which event (if you have multiple) a
given sample belongs to is trickier.

Cutting out samples also becomes more interesting: normally RECORD_AUX
records don't overlap. We can keep it that way, and then the samples will
naturally be non-overlapping, but they will all be of different sizes. And
there is the question of waking up the consumer often enough to copy out
all the samples before the buffer overwrites itself. Let me think a bit.

Regards,
--
Alex


Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation

2016-09-23 Thread Alexander Shishkin
Peter Zijlstra  writes:

> On Fri, Sep 23, 2016 at 02:27:21PM +0300, Alexander Shishkin wrote:
>> In order to be able to allocate perf ring buffers in non-mmap path, we
>> need to make sure we can still account the memory to the user and that
>> they don't exceed their mlock limit.
>> 
>> This patch moves ring buffer memory accounting down the rb_alloc() path
>> so that its callers won't have to worry about it. This also serves the
>> additional purpose of slightly cleaning up perf_mmap().
>
> While I like a cleanup of that code (it really can use one), I'm not a
> big fan of hidden buffers like this. Why is this needed?

What I wanted is an interface similar to call stack sampling, or pretty
much any other kind of sampling that we have at the moment. The user would
ask for AUX samples of, say, intel_pt, and would get a sample with PT
data right in the perf buffer every time their main event overflows.

They don't *need* to know that we have a kernel event with a ring buffer
under the hood. This was one of the use cases of 'hidden' ring
buffers. The other two are process core dump and system core dump ([1]
tried to do it without involving perf at all, for reference).

> A quick look through the patches also leaves me wondering on the design
> and interface of this thing. A few words explaining the overall design
> would be nice.

Right; here goes. PERF_SAMPLE_AUX is set in the attr.sample_type of the
event that you want to sample. Then, using that event's
attr.aux_sample_type as the PMU 'type' and attr.aux_sample_config as its
'config', we create a kernel event. For this kernel event, we then
allocate a ring buffer with zero data pages and as many AUX pages as are
needed to fit attr.aux_sample_size.
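
To make that concrete, the userspace side could look roughly like this (a
sketch against the proposed attribute fields above, so it only builds with
the patched uapi headers; the intel_pt type comes from sysfs and the config
value is just a placeholder):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* open a cycles event that carries intel_pt AUX data in its samples */
static int open_aux_sampled_event(unsigned int intel_pt_type)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size          = sizeof(attr);
        attr.type          = PERF_TYPE_HARDWARE;
        attr.config        = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;
        attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_AUX;

        /* the AUX-generating PMU and its configuration */
        attr.aux_sample_type   = intel_pt_type; /* /sys/bus/event_source/devices/intel_pt/type */
        attr.aux_sample_config = 0;             /* PT config bits, placeholder */
        attr.aux_sample_size   = 8192;          /* bytes of AUX data per sample */

        return syscall(__NR_perf_event_open, &attr, 0 /* pid */, -1 /* cpu */,
                       -1 /* group_fd */, 0 /* flags */);
}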

Then, we hook into the perf_prepare_sample()/perf_output_sample() path
so that when the original event goes off, we first stop the kernel
event, then memcpy the data from the 'hidden' AUX buffer into the
original event's perf buffer under PERF_SAMPLE_AUX, and then restart the
kernel event. This all happens on the local CPU. The 'hidden' AUX
buffer runs in overwrite mode, so we copy attr.aux_sample_size
bytes every time, which means there may be overlaps between samples, but
the tooling has logic to handle this.
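
In (very) rough pseudo-code, and leaving out all the checks, the output side
described above would be something along these lines. rb_output_aux() and
aux_copyfn come from 2/6 and perf_output_copy() already exists; the function
names and the wiring are otherwise made up for illustration:

/* adapt perf_output_copy() to the aux_copyfn signature (both return the
 * number of bytes left uncopied) */
static unsigned long aux_sample_copy(void *data, const void *src,
                                     unsigned long len)
{
        return perf_output_copy(data, src, len);
}

/* sketch: called from perf_output_sample() when PERF_SAMPLE_AUX is set */
static void perf_output_aux_sample(struct perf_output_handle *handle,
                                   struct perf_event *event)
{
        struct perf_event *sampler = event->aux_sampler;   /* from 4/6 */
        struct ring_buffer *rb = sampler->rb;               /* 'hidden' buffer */
        unsigned long size = event->attr.aux_sample_size;
        unsigned long head;

        /* stop the AUX PMU so the data stops moving underneath us */
        sampler->pmu->stop(sampler, PERF_EF_UPDATE);

        head = local_read(&rb->aux_head);
        /* copy the last aux_sample_size bytes; rb_output_aux() wraps offsets */
        rb_output_aux(rb, head - size, head, aux_sample_copy, handle);

        /* and let it continue filling the buffer in overwrite mode */
        sampler->pmu->start(sampler, PERF_EF_RELOAD);
}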

This is about it. Before creating a new counter we first look for an
existing one that fits the bill wrt filtering bits; if there is one, we
grab its reference and use it instead. This is so that one could do
things like

$ perf record -Aintel_pt -e 'cycles,instructions,branch-misses' ls

or

$ perf record -Aintel_pt -e 'sched:*' -a sleep 10

> Afaict there's no actual need to hide the AUX buffer for this sampling
> stuff; the user knows about all this and can simply mmap() the AUX part.

Yes, you're right here. We could also re-use the AUX record, adding a
new flag for this. It may be even better if I can work out the
inheritance (the current code doesn't handle inheritance at the moment
in case we decide to scrap it).

> The sample could either point to locations in the AUX buffer, or (as I
> think this code does) memcpy bits out.

Yes and yes, it does.

> Ideally we'd pass the AUX-event into the syscall, that way you avoid all
> the find_aux_event crud. I'm not sure we want to overload the group_fd
> thing more (its already very hard to create counter groups in a cgroup
> for example) ..

It could also be stuffed into the attribute, or passed in via an
ioctl(). The latter is probably the best option.

> Coredump was mentioned somewhere, but I'm not sure I've seen
> code/interfaces for that. How was that envisioned to work?

OK, so what I have is a new RLIMIT_PERF, which is set to the size of the
AUX data sample to be included in the [process] core dump. At
prlimit(RLIMIT_PERF) time, provided that RLIMIT_CORE is also nonzero, I
create a kernel event with a 'hidden' buffer. The PMU for this event is,
in this scenario, a system-wide setting, which is a tad iffy, seeing as
we now have two PMUs in the system that can be used for this, but which
are mutually exclusive.

Now, when the core dump is written, we check if there's such an event on
the task's perf context and, if there is, we dump_emit() data from the
hidden buffer into the file. The difference from sampling is that this
kernel event is also inheritable, so that when the task fork()s, a new
event is created. The memory is counted against
sysctl_perf_event_mlock plus the user's RLIMIT_MEMLOCK (just like the rest
of the perf buffers), so when the user runs out of it, no new events are
created.

The rlimit as an interface for enabling this seems weirder the more I
look at it, which is also the reason why I haven't sent it out yet. The
other idea I had for this was a prctl(), which would be more
straightforward and would also allow specifying the PMU, but which,
unlike prlimit(), would only work on the current process. Yet another way
would be to go through perf_event_open() and then somehow feed the event
into the e

[RFC PATCH 3/6] perf: Add a helper for looking up pmus by type

2016-09-23 Thread Alexander Shishkin
This patch adds a helper for looking up a registered pmu by its type.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 kernel/events/core.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2e8a0e389b..b64a5c611f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8780,6 +8780,18 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
return ret;
 }
 
+/* call under pmus_srcu */
+static struct pmu *__perf_find_pmu(u32 type)
+{
+   struct pmu *pmu;
+
+   rcu_read_lock();
+   pmu = idr_find(&pmu_idr, type);
+   rcu_read_unlock();
+
+   return pmu;
+}
+
 static struct pmu *perf_init_event(struct perf_event *event)
 {
struct pmu *pmu = NULL;
@@ -8788,9 +8800,7 @@ static struct pmu *perf_init_event(struct perf_event *event)
 
idx = srcu_read_lock(&pmus_srcu);
 
-   rcu_read_lock();
-   pmu = idr_find(&pmu_idr, event->attr.type);
-   rcu_read_unlock();
+   pmu = __perf_find_pmu(event->attr.type);
if (pmu) {
ret = perf_try_init_event(pmu, event);
if (ret)
-- 
2.9.3
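
For reference, a later user such as the AUX-sampling code could then look up
the PMU named in attr.aux_sample_type the same way perf_init_event() does
above; a sketch (not the actual 4/6 code):

        struct pmu *pmu;
        int idx;

        idx = srcu_read_lock(&pmus_srcu);
        pmu = __perf_find_pmu(event->attr.aux_sample_type);
        if (pmu) {
                /* ... create the kernel sampler event on this pmu ... */
        }
        srcu_read_unlock(&pmus_srcu, idx);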



[RFC PATCH 5/6] perf: Disable PMU around address filter adjustment

2016-09-23 Thread Alexander Shishkin
If the PMU is used for AUX data sampling, the hardware event that triggers
it may come in during a critical section of the address filter adjustment
code (if it's delivered as an NMI) and, in turn, call the address filter
sync code, causing a spinlock recursion.

To prevent this, disable the PMU around these critical sections, so that
AUX sampling won't take place there.
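
Spelled out, the recursion being guarded against is roughly the following
(a sketch based on the description above and on the ifh::lock taken in
perf_event_addr_filters_sync(), see 6/6; the exact call chain depends on
the PMU driver):

        perf_event_addr_filters_apply()
          raw_spin_lock_irqsave(&ifh->lock, ...)        /* critical section */
            <PMI from the AUX-sampling PMU>
              AUX sample is taken, sampler is stopped/restarted
                pmu->stop()/pmu->start()
                  perf_event_addr_filters_sync()
                    raw_spin_lock(&ifh->lock)           /* same lock: recursion */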

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 kernel/events/core.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fdb20fdeb1..f6582df1c9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6377,6 +6377,12 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
if (!has_addr_filter(event))
return;
 
+   /*
+* if sampling kicks in in the critical section,
+* we risk spinlock recursion on the ifh::lock
+*/
+   perf_pmu_disable(event->pmu);
+
raw_spin_lock_irqsave(&ifh->lock, flags);
list_for_each_entry(filter, &ifh->list, entry) {
if (filter->inode) {
@@ -6387,6 +6393,8 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
count++;
}
 
+   perf_pmu_enable(event->pmu);
+
if (restart)
event->addr_filters_gen++;
raw_spin_unlock_irqrestore(&ifh->lock, flags);
@@ -6942,6 +6950,8 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
if (!file)
return;
 
+   perf_pmu_disable(event->pmu);
+
raw_spin_lock_irqsave(&ifh->lock, flags);
list_for_each_entry(filter, &ifh->list, entry) {
if (perf_addr_filter_match(filter, file, off,
@@ -6953,6 +6963,8 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
count++;
}
 
+   perf_pmu_enable(event->pmu);
+
if (restart)
event->addr_filters_gen++;
raw_spin_unlock_irqrestore(&ifh->lock, flags);
@@ -8126,6 +8138,8 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
 
down_read(&mm->mmap_sem);
 
+   perf_pmu_disable(event->pmu);
+
raw_spin_lock_irqsave(&ifh->lock, flags);
list_for_each_entry(filter, &ifh->list, entry) {
event->addr_filters_offs[count] = 0;
@@ -8144,6 +8158,8 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
event->addr_filters_gen++;
raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
+   perf_pmu_enable(event->pmu);
+
up_read(&mm->mmap_sem);
 
mmput(mm);
-- 
2.9.3



[RFC PATCH 6/6] perf: Disable IRQs in address filter sync path

2016-09-23 Thread Alexander Shishkin
If PMU callbacks are executed in hardirq context, the address filter
sync code needs to disable interrupts when taking its spinlock to be
consistent with the rest of its users. This may happen if the PMU is
used in AUX sampling.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 kernel/events/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f6582df1c9..047c495c94 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2568,16 +2568,17 @@ static int perf_event_stop(struct perf_event *event, int restart)
 void perf_event_addr_filters_sync(struct perf_event *event)
 {
struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+   unsigned long flags;
 
if (!has_addr_filter(event))
return;
 
-   raw_spin_lock(&ifh->lock);
+   raw_spin_lock_irqsave(&ifh->lock, flags);
if (event->addr_filters_gen != event->hw.addr_filters_gen) {
event->pmu->addr_filters_sync(event);
event->hw.addr_filters_gen = event->addr_filters_gen;
}
-   raw_spin_unlock(&ifh->lock);
+   raw_spin_unlock_irqrestore(&ifh->lock, flags);
 }
 EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
 
-- 
2.9.3



[RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation

2016-09-23 Thread Alexander Shishkin
In order to be able to allocate perf ring buffers in a non-mmap path, we
need to make sure we can still account the memory to the user and that
they don't exceed their mlock limit.

This patch moves ring buffer memory accounting down the rb_alloc() path
so that its callers won't have to worry about it. This also serves the
additional purpose of slightly cleaning up perf_mmap().

Signed-off-by: Alexander Shishkin 
---
 kernel/events/core.c|  67 +++
 kernel/events/internal.h|   5 +-
 kernel/events/ring_buffer.c | 127 ++--
 3 files changed, 123 insertions(+), 76 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7c0d263f6b..2e8a0e389b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4933,6 +4933,8 @@ void ring_buffer_put(struct ring_buffer *rb)
if (!atomic_dec_and_test(&rb->refcount))
return;
 
+   ring_buffer_unaccount(rb, false);
+
WARN_ON_ONCE(!list_empty(&rb->event_list));
 
call_rcu(&rb->rcu_head, rb_free_rcu);
@@ -4967,9 +4969,6 @@ static void perf_mmap_close(struct vm_area_struct *vma)
struct perf_event *event = vma->vm_file->private_data;
 
struct ring_buffer *rb = ring_buffer_get(event);
-   struct user_struct *mmap_user = rb->mmap_user;
-   int mmap_locked = rb->mmap_locked;
-   unsigned long size = perf_data_size(rb);
 
if (event->pmu->event_unmapped)
event->pmu->event_unmapped(event);
@@ -4989,11 +4988,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 */
perf_pmu_output_stop(event);
 
-   /* now it's safe to free the pages */
-   atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
-   vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
-
-   /* this has to be the last one */
+   /* now it's safe to free the pages; ought to be the last one */
rb_free_aux(rb);
WARN_ON_ONCE(atomic_read(&rb->aux_refcount));
 
@@ -5054,19 +5049,6 @@ again:
}
rcu_read_unlock();
 
-   /*
-* It could be there's still a few 0-ref events on the list; they'll
-* get cleaned up by free_event() -- they'll also still have their
-* ref on the rb and will free it whenever they are done with it.
-*
-* Aside from that, this buffer is 'fully' detached and unmapped,
-* undo the VM accounting.
-*/
-
-   atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
-   vma->vm_mm->pinned_vm -= mmap_locked;
-   free_uid(mmap_user);
-
 out_put:
ring_buffer_put(rb); /* could be last */
 }
@@ -5081,13 +5063,9 @@ static const struct vm_operations_struct perf_mmap_vmops = {
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 {
struct perf_event *event = file->private_data;
-   unsigned long user_locked, user_lock_limit;
-   struct user_struct *user = current_user();
-   unsigned long locked, lock_limit;
struct ring_buffer *rb = NULL;
unsigned long vma_size;
unsigned long nr_pages;
-   long user_extra = 0, extra = 0;
int ret = 0, flags = 0;
 
/*
@@ -5158,7 +5136,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
}
 
atomic_set(&rb->aux_mmap_count, 1);
-   user_extra = nr_pages;
 
goto accounting;
}
@@ -5195,49 +5172,24 @@ again:
goto unlock;
}
 
-   user_extra = nr_pages + 1;
-
 accounting:
-   user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
-
-   /*
-* Increase the limit linearly with more CPUs:
-*/
-   user_lock_limit *= num_online_cpus();
-
-   user_locked = atomic_long_read(&user->locked_vm) + user_extra;
-
-   if (user_locked > user_lock_limit)
-   extra = user_locked - user_lock_limit;
-
-   lock_limit = rlimit(RLIMIT_MEMLOCK);
-   lock_limit >>= PAGE_SHIFT;
-   locked = vma->vm_mm->pinned_vm + extra;
-
-   if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
-   !capable(CAP_IPC_LOCK)) {
-   ret = -EPERM;
-   goto unlock;
-   }
-
WARN_ON(!rb && event->rb);
 
if (vma->vm_flags & VM_WRITE)
flags |= RING_BUFFER_WRITABLE;
 
if (!rb) {
-   rb = rb_alloc(nr_pages,
+   rb = rb_alloc(vma->vm_mm, nr_pages,
  event->attr.watermark ? 
event->attr.wakeup_watermark : 0,
  event->cpu, flags);
 
-   if (!rb) {
-   ret = -ENOMEM;
+   if (IS_ERR_OR_NULL(rb)) {
+   ret = PTR_ERR(rb);
+   
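
The accounting that disappears from perf_mmap() above presumably reappears
in ring_buffer.c, called from the rb_alloc() path; a sketch of what such a
helper could look like, reconstructed from the removed lines (the name and
exact signature are assumptions, not the actual patch code):

/* account nr_pages of buffer memory to the current user and, if given, mm */
static int ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
                               unsigned long nr_pages)
{
        unsigned long user_lock_limit, user_locked, extra = 0;
        unsigned long lock_limit, locked;
        struct user_struct *user = current_user();

        user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
        user_lock_limit *= num_online_cpus();   /* scale with CPU count */

        user_locked = atomic_long_read(&user->locked_vm) + nr_pages;
        if (user_locked > user_lock_limit)
                extra = user_locked - user_lock_limit;

        lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
        locked = (mm ? mm->pinned_vm : 0) + extra;

        if (locked > lock_limit && perf_paranoid_tracepoint_raw() &&
            !capable(CAP_IPC_LOCK))
                return -EPERM;

        rb->mmap_user = get_current_user();
        rb->mmap_locked = extra;
        atomic_long_add(nr_pages, &user->locked_vm);
        if (mm)
                mm->pinned_vm += extra;

        return 0;
}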

[RFC PATCH 4/6] perf: Add infrastructure for using AUX data in perf samples

2016-09-23 Thread Alexander Shishkin
AUX data can be used to annotate perf events such as performance counters
or tracepoints/breakpoints by including it in sample records when
PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
and profiling by providing, for example, a history of instruction flow
leading up to the event's overflow.

To facilitate this, this patch adds code to create a kernel counter with a
ring buffer to track and collect AUX data that is then copied out into the
sampled events' perf data stream as samples.

The user interface is extended to allow for this; new attribute fields are
added:

  * aux_sample_type: specifies the PMU on which the AUX data generating
    event is created;
  * aux_sample_config: event config (maps to the attribute's config field);
  * aux_sample_size: size of the sample to be written.

This kernel counter is configured similarly to the event that is being
annotated with regards to filtering (exclude_{hv,idle,user,kernel}) and
enabled state (disabled, enable_on_exec) to make sure that the sampler
is not tracking any out of context activity. One sampler can be used
for multiple events.
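
As a rough illustration of that last paragraph, the sampler's attribute is
presumably derived from the sampled event's along these lines (a sketch; the
helper name is made up):

/* sketch: build the kernel sampler's attr from the event being annotated */
static void perf_aux_sampler_init_attr(struct perf_event_attr *sattr,
                                       struct perf_event *event)
{
        memset(sattr, 0, sizeof(*sattr));
        sattr->size   = sizeof(*sattr);
        sattr->type   = event->attr.aux_sample_type;    /* pmu::type of AUX PMU */
        sattr->config = event->attr.aux_sample_config;

        /* mirror the filtering bits so the sampler tracks the same context */
        sattr->exclude_hv     = event->attr.exclude_hv;
        sattr->exclude_idle   = event->attr.exclude_idle;
        sattr->exclude_user   = event->attr.exclude_user;
        sattr->exclude_kernel = event->attr.exclude_kernel;

        /* and mirror the enabled state */
        sattr->disabled       = event->attr.disabled;
        sattr->enable_on_exec = event->attr.enable_on_exec;
}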

Signed-off-by: Alexander Shishkin 
---
 include/linux/perf_event.h  |  12 ++
 include/uapi/linux/perf_event.h |  16 +-
 kernel/events/core.c| 315 +++-
 3 files changed, 341 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5c5362584a..7121cf7b5c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -101,6 +101,12 @@ struct perf_branch_stack {
struct perf_branch_entry entries[0];
 };
 
+struct perf_aux_record {
+   u64 size;
+   unsigned long   from;
+   unsigned long   to;
+};
+
 struct task_struct;
 
 /*
@@ -532,6 +538,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_GROUP  0x02
 #define PERF_ATTACH_TASK   0x04
 #define PERF_ATTACH_TASK_DATA  0x08
+#define PERF_ATTACH_SAMPLING   0x10
 
 struct perf_cgroup;
 struct ring_buffer;
@@ -691,6 +698,9 @@ struct perf_event {
perf_overflow_handler_t overflow_handler;
void*overflow_handler_context;
 
+   struct perf_event   *aux_sampler;
+   atomic_long_t   aux_samplees_count;
+
 #ifdef CONFIG_EVENT_TRACING
struct trace_event_call *tp_event;
struct event_filter *filter;
@@ -888,6 +898,7 @@ struct perf_sample_data {
 */
u64 addr;
struct perf_raw_record  *raw;
+   struct perf_aux_record  aux;
struct perf_branch_stack*br_stack;
u64 period;
u64 weight;
@@ -937,6 +948,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
/* remaining struct members initialized in perf_prepare_sample() */
data->addr = addr;
data->raw  = NULL;
+   data->aux.from = data->aux.to = data->aux.size = 0;
data->br_stack = NULL;
data->period = period;
data->weight = 0;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c66a485a24..1bf3f2c358 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -139,8 +139,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_IDENTIFIER  = 1U << 16,
PERF_SAMPLE_TRANSACTION = 1U << 17,
PERF_SAMPLE_REGS_INTR   = 1U << 18,
+   PERF_SAMPLE_AUX = 1U << 19,
 
-   PERF_SAMPLE_MAX = 1U << 19, /* non-ABI */
+   PERF_SAMPLE_MAX = 1U << 20, /* non-ABI */
 };
 
 /*
@@ -273,6 +274,9 @@ enum perf_event_read_format {
/* add: sample_stack_user */
 #define PERF_ATTR_SIZE_VER4   104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5   112 /* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER6   136 /* add: aux_sample_type */
+   /* add: aux_sample_config */
+   /* add: aux_sample_size */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -390,6 +394,14 @@ struct perf_event_attr {
__u32   aux_watermark;
__u16   sample_max_stack;
__u16   __reserved_2;   /* align to __u64 */
+
+   /*
+* AUX area sampling configuration
+*/
+   __u64   aux_sample_config;  /* event config for AUX sampling */
+   __u64   aux_sample_size;/* desired sample size */
+   __u32   aux_sample_type;/* pmu::type of an AUX PMU */
+   __u32   __reserved_3;   /* align to __u64 */
 };
 
 #define perf_flags(attr)   (*(&(attr)->read_format + 1))
@@ -773,6 +785,8 @@ enum pe

[RFC PATCH 2/6] perf: Add api to (de-)allocate AUX buffers for kernel counters

2016-09-23 Thread Alexander Shishkin
Several use cases for AUX data, namely event sampling (including a piece of
AUX data in some perf event sample, so that the user can get, for example,
instruction traces leading up to a certain event like a breakpoint or a
hardware event), process core dumps (providing user with a history of a
process' instruction flow leading up to a crash), system crash dumps and
storing AUX data in pstore across reboot (to facilitate post-mortem
investigation of a system crash) require different parts of the kernel code
to be able to configure hardware to produce AUX data and collect it when it
is needed.

Luckily, there is already an api for in-kernel perf events, which has several
users. This proposal is to extend that api to allow in-kernel users to
allocate AUX buffers for kernel counters. Such users will call
rb_alloc_kernel() to allocate what they want and later copy the data out to
other backends, e.g. a sample in another event's ring buffer or a core dump
file. These buffers are never mapped to userspace.

There are no additional constraints or requirements on the pmu drivers.

A typical user of this interface will first create a kernel counter with a
call to perf_event_create_kernel_counter() and then allocate a ring buffer
for it with rb_alloc_kernel(). Data can then be copied out from the AUX
buffer using rb_output_aux(), which is passed a callback that will write
chunks of AUX data into the desired destination, such as perf_output_copy()
or dump_emit(). The caller needs to use perf_event_disable() to make sure
that the counter is not active while it copies the data out.
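
Put together, a typical in-kernel user could look roughly like this (a
sketch against the functions declared in this patch; dump_emit() is the
coredump writer mentioned above, and error handling and teardown via
rb_free_kernel() are elided):

/* adapt dump_emit() to the aux_copyfn signature: return bytes not written */
static unsigned long core_dump_copy(void *data, const void *src,
                                    unsigned long len)
{
        struct coredump_params *cprm = data;

        return dump_emit(cprm, src, len) ? 0 : len;
}

/* sketch: set up a sampler and later copy its AUX data into a core dump */
static long aux_dump_example(struct perf_event_attr *attr,
                             struct task_struct *task,
                             struct coredump_params *cprm,
                             unsigned long from, unsigned long to)
{
        struct perf_event *event;

        event = perf_event_create_kernel_counter(attr, -1, task, NULL, NULL);
        if (IS_ERR(event))
                return PTR_ERR(event);

        /* zero data pages, AUX pages only: this buffer is never mmap()ed */
        if (rb_alloc_kernel(event, 0, 8))
                return -ENOMEM;

        /* ... the event runs and fills its AUX buffer ... */

        /* make sure the counter isn't writing while we copy the data out */
        perf_event_disable(event);

        return rb_output_aux(event->rb, from, to, core_dump_copy, cprm);
}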

Signed-off-by: Alexander Shishkin 
---
 kernel/events/internal.h| 19 +++
 kernel/events/ring_buffer.c | 83 +
 2 files changed, 102 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index a7ce82b670..4ae300ee02 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -56,6 +56,9 @@ struct ring_buffer {
void*data_pages[0];
 };
 
+typedef unsigned long (*aux_copyfn)(void *data, const void *src,
+   unsigned long len);
+
 extern void rb_free(struct ring_buffer *rb);
 extern void ring_buffer_unaccount(struct ring_buffer *rb, bool aux);
 
@@ -82,6 +85,11 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb);
+extern long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+ unsigned long to, aux_copyfn copyfn, void *data);
+extern int rb_alloc_kernel(struct perf_event *event, int nr_pages,
+  int aux_nr_pages);
+extern void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
 
@@ -126,6 +134,17 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
return rb->aux_nr_pages << PAGE_SHIFT;
 }
 
+static inline bool kernel_rb_event(struct perf_event *event)
+{
+   /*
+* Having a ring buffer and not being on any ring buffers' wakeup
+* list means it was attached by rb_alloc_kernel() and not
+* ring_buffer_attach(). It's the only case when these two
+* conditions take place at the same time.
+*/
+   return event->rb && list_empty(&event->rb_entry);
+}
+
 #define __DEFINE_OUTPUT_COPY_BODY(advance_buf, memcpy_func, ...)   \
 {  \
unsigned long size, written;\
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 484ce09d96..9244a4fa9b 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -578,6 +578,40 @@ void ring_buffer_unaccount(struct ring_buffer *rb, bool aux)
free_uid(rb->mmap_user);
 }
 
+/*
+ * Copy out AUX data from a ring_buffer using a supplied callback.
+ */
+long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+  unsigned long to, aux_copyfn copyfn, void *data)
+{
+   unsigned long tocopy, remainder, len = 0;
+   void *addr;
+
+   from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+   to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+
+   do {
+   tocopy = PAGE_SIZE - offset_in_page(from);
+   if (to > from)
+   tocopy = min(tocopy, to - from);
+   if (!tocopy)
+   break;
+
+   addr = rb->aux_pages[from >> PAGE_SHIFT];
+   addr += offset_in_page(from);
+
+   remainder = copyfn(data, addr, tocopy);
+   if (remainder)
+  

[RFC PATCH 0/6] perf: Add AUX data sampling

2016-09-23 Thread Alexander Shishkin
Hi Peter,

This is an RFC, I'm not sending the tooling bits in this series,
although they can be found here [1].

This series introduces AUX data sampling for perf events, which in
case of our instruction/branch tracing PMUs like Intel PT, BTS, CS
ETM means execution flow history leading up to a perf event's
overflow.

The bulk of code is in 4/6, which adds attribute fields, creates
kernel events to generate the AUX data, takes samples and takes care
of all the tricky details. 1/6 and 6/6 may also be considered separately from
this series. In particular, I suspect that 6/6 applies today to the
architectures that deliver PMIs as IRQs. Mathieu?

[1] 
https://git.kernel.org/cgit/linux/kernel/git/ash/linux.git/log/?h=perf-aux-sampling

Alexander Shishkin (6):
  perf: Move mlock accounting to ring buffer allocation
  perf: Add api to (de-)allocate AUX buffers for kernel counters
  perf: Add a helper for looking up pmus by type
  perf: Add infrastructure for using AUX data in perf samples
  perf: Disable PMU around address filter adjustment
  perf: Disable IRQs in address filter sync path

 include/linux/perf_event.h  |  12 ++
 include/uapi/linux/perf_event.h |  16 +-
 kernel/events/core.c| 419 +---
 kernel/events/internal.h|  24 ++-
 kernel/events/ring_buffer.c | 210 ++--
 5 files changed, 598 insertions(+), 83 deletions(-)

-- 
2.9.3



[tip:perf/urgent] perf/core: Limit matching exclusive events to one PMU

2016-09-22 Thread tip-bot for Alexander Shishkin
Commit-ID:  3bf6215a1b30db7df6083c708caab3fe1a8e8abe
Gitweb: http://git.kernel.org/tip/3bf6215a1b30db7df6083c708caab3fe1a8e8abe
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Tue, 20 Sep 2016 18:48:11 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 22 Sep 2016 14:56:09 +0200

perf/core: Limit matching exclusive events to one PMU

An "exclusive" PMU is the one that can only have one event scheduled in
at any given time. There may be more than one of such PMUs in a system,
though, like Intel PT and BTS. It should be allowed to have one event
for either of those inside the same context (there may be other constraints
that may prevent this, but those would be hardware-specific). However,
the exclusivity code is written so that only one event from any of the
"exclusive" PMUs is allowed in a context.

Fix this by making the exclusive event filter explicitly match two events'
PMUs.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijls...@chello.nl>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Vince Weaver <vincent.wea...@maine.edu>
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160920154811.3255-3-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/events/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index a54f2c2..fc9bb22 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3929,7 +3929,7 @@ static void exclusive_event_destroy(struct perf_event *event)
 
 static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
 {
-   if ((e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) &&
+   if ((e1->pmu == e2->pmu) &&
(e1->cpu == e2->cpu ||
 e1->cpu == -1 ||
 e2->cpu == -1))


[tip:perf/urgent] perf/x86/intel/bts: Make it an exclusive PMU

2016-09-22 Thread tip-bot for Alexander Shishkin
Commit-ID:  08b90f0655258411a1b41d856331e20e7ec8d55c
Gitweb: http://git.kernel.org/tip/08b90f0655258411a1b41d856331e20e7ec8d55c
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Tue, 20 Sep 2016 18:48:10 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 22 Sep 2016 14:56:08 +0200

perf/x86/intel/bts: Make it an exclusive PMU

Just like intel_pt, intel_bts can only handle one event at a time,
which is the reason we introduced PERF_PMU_CAP_EXCLUSIVE in the first
place. However, at the moment one can have as many intel_bts events
within the same context at the same time as one pleases. Only one of
them, however, will get scheduled and receive the actual trace data.

Fix this by making intel_bts an "exclusive" PMU.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijls...@chello.nl>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Vince Weaver <vincent.wea...@maine.edu>
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160920154811.3255-2-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/bts.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 6ff66ef..982c9e3 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -584,7 +584,8 @@ static __init int bts_init(void)
if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
return -ENODEV;
 
-   bts_pmu.capabilities= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE;
+   bts_pmu.capabilities= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE |
+ PERF_PMU_CAP_EXCLUSIVE;
bts_pmu.task_ctx_nr = perf_sw_context;
bts_pmu.event_init  = bts_event_init;
bts_pmu.add = bts_event_add;


[PATCH 2/2] perf: Limit matching exclusive events to one PMU

2016-09-20 Thread Alexander Shishkin
An "exclusive" PMU is the one that can only have one event scheduled in
at any given time. There may be more than one of such PMUs in a system,
though, like Intel PT and BTS. It should be allowed to have one event
for either of those inside the same context (there may be other constraints
that may prevent this, but those would be hardware-specific). However,
the exclusivity code is written so that only one event from any of the
"exclusive" PMUs is allowed in a context.

Fix this by making the exclusive event filter explicitly match two events'
PMUs.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 kernel/events/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fedba316cc..7c0d263f6b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3958,7 +3958,7 @@ static void exclusive_event_destroy(struct perf_event *event)
 
 static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
 {
-   if ((e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) &&
+   if ((e1->pmu == e2->pmu) &&
(e1->cpu == e2->cpu ||
 e1->cpu == -1 ||
 e2->cpu == -1))
-- 
2.9.3
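
To make the intended behaviour concrete, here is a rough userspace sketch
(not part of this series; it assumes both PMUs are present and the caller
has sufficient privileges) that opens one intel_pt and one intel_bts event
for the same task, which the fixed matching now permits in a single
context, while a second event on the *same* exclusive PMU would still be
refused:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static int pmu_type(const char *pmu)
{
	char path[128];
	int type = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/bus/event_source/devices/%s/type", pmu);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%d", &type) != 1)
			type = -1;
		fclose(f);
	}
	return type;
}

static int open_ev(int type)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	/* current task, any CPU, no group, no flags */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	int pt  = open_ev(pmu_type("intel_pt"));
	int bts = open_ev(pmu_type("intel_bts"));

	printf("intel_pt fd=%d, intel_bts fd=%d\n", pt, bts);
	return 0;
}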



[PATCH 0/2] perf, bts: Make BTS exclusive again

2016-09-20 Thread Alexander Shishkin
Hi Peter,

While looking at something else I noticed that the exclusive event
filter only allows one such event per context, whereas it should allow
one such event from each PMU that has PERF_PMU_CAP_EXCLUSIVE. At the
same time, the intel_bts PMU doesn't even have this capability set (which
is why it still works on systems that allow coexistence of PT and BTS;
see ccbebba4c6 for more context).

Alexander Shishkin (2):
  perf/x86/intel/bts: Make it an exclusive PMU
  perf: Limit matching exclusive events to one PMU

 arch/x86/events/intel/bts.c | 3 ++-
 kernel/events/core.c| 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

-- 
2.9.3



[PATCH 1/2] perf/x86/intel/bts: Make it an exclusive PMU

2016-09-20 Thread Alexander Shishkin
Just like intel_pt, intel_bts can only handle one event at a time,
which is the reason we introduced PERF_PMU_CAP_EXCLUSIVE in the first
place. However, at the moment one can have as many intel_bts events
within the same context at the same time as one pleases. Only one of
them, however, will get scheduled and receive the actual trace data.

Fix this by making intel_bts an "exclusive" PMU.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/bts.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index bdcd651099..6112c3d538 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -584,7 +584,8 @@ static __init int bts_init(void)
if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
return -ENODEV;
 
-   bts_pmu.capabilities= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE;
+   bts_pmu.capabilities= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE |
+ PERF_PMU_CAP_EXCLUSIVE;
bts_pmu.task_ctx_nr = perf_sw_context;
bts_pmu.event_init  = bts_event_init;
bts_pmu.add = bts_event_add;
-- 
2.9.3



Re: [PATCH] perf/x86/intel/bts: don't dereference ds unconditionally

2016-09-20 Thread Alexander Shishkin
Alexander Shishkin <alexander.shish...@linux.intel.com> writes:

> Sebastian Andrzej Siewior <sebast...@breakpoint.cc> writes:
>
>> From: Sebastian Andrzej Siewior <bige...@linutronix.de>
>>
>> Since commit 4d4c47412464 ("perf/x86/intel/bts: Fix BTS PMI detection")
>> my box goes boom on boot:
>>
>> |  node  #0, CPUs:  #1 #2 #3 #4 #5 #6 #7
>> | BUG: unable to handle kernel NULL pointer dereference at 0018
>> | IP: [] intel_bts_interrupt+0x43/0x130
>> | Call Trace:
>> |   d [] intel_pmu_handle_irq+0x51/0x4b0
>> |  [] perf_event_nmi_handler+0x27/0x40
>>
>> I don't know what is going on here, but ds is dereferenced
>> unconditionally even though it is not always set up, hence the `ds'
>> check here to avoid the crash.
>
> Good catch! I'm going to guess you don't have the NMI watchdog enabled?

That is to say,

Reviewed-by: Alexander Shishkin <alexander.shish...@linux.intel.com>


Re: [PATCH] perf/x86/intel/bts: don't dereference ds unconditionally

2016-09-20 Thread Alexander Shishkin
Sebastian Andrzej Siewior  writes:

> From: Sebastian Andrzej Siewior 
>
> Since commit 4d4c47412464 ("perf/x86/intel/bts: Fix BTS PMI detection")
> my box goes boom on boot:
>
> |  node  #0, CPUs:  #1 #2 #3 #4 #5 #6 #7
> | BUG: unable to handle kernel NULL pointer dereference at 0018
> | IP: [] intel_bts_interrupt+0x43/0x130
> | Call Trace:
> |   d [] intel_pmu_handle_irq+0x51/0x4b0
> |  [] perf_event_nmi_handler+0x27/0x40
>
> I don't know what is going on here, but ds is dereferenced
> unconditionally even though it is not always set up, hence the `ds'
> check here to avoid the crash.

Good catch! I'm going to guess you don't have the NMI watchdog enabled?

Thanks,
--
Alex
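
Since the patch itself is not quoted in this thread, here is a rough
sketch of the kind of guard being discussed (illustrative only, not the
actual commit; the exact fix may differ):

/*
 * Sketch only: bail out of the BTS PMI handler early when the DS area
 * has not been set up, e.g. because nothing else (such as the NMI
 * watchdog) has caused it to be allocated, instead of dereferencing a
 * NULL pointer.
 */
static int intel_bts_interrupt(void)
{
	struct debug_store *ds = this_cpu_ptr(&cpu_hw_events)->ds;

	if (!ds)
		return 0;

	/* ... the rest of the handler may now dereference ds safely ... */
	return 1;
}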


[tip:perf/core] perf/x86/intel/pt: Add support for PTWRITE and power event tracing

2016-09-19 Thread tip-bot for Alexander Shishkin
Commit-ID:  8ee83b2ab3d1987cbd80c9f2c6f2b12fed87b51e
Gitweb: http://git.kernel.org/tip/8ee83b2ab3d1987cbd80c9f2c6f2b12fed87b51e
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Fri, 16 Sep 2016 16:48:19 +0300
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Tue, 20 Sep 2016 01:18:28 +0200

perf/x86/intel/pt: Add support for PTWRITE and power event tracing

The Intel PT facility grew some new functionality:

  * PTWRITE packet carries the payload of the new PTWRITE instruction
that can be used to instrument Intel PT traces with user-supplied
data. Packets of this type are only generated if 'ptwrite' capability
is set and PTWEn bit is set in the event attribute's config. Flow
update packets (FUP) can be generated on PTWRITE packets if FUPonPTW
config bit is set. Setting these bits is not allowed if 'ptwrite'
capability is not set.

  * PWRE, PWRX, MWAIT, EXSTOP packets communicate core power management
events. These depend on 'power_event_tracing' capability and are
enabled by setting PwrEvtEn bit in the event attribute.

Extend the driver capabilities and provide the proper sanity checks in the
event validation function.

[ tglx: Massaged changelog ]

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Peter Zijlstra <a.p.zijls...@chello.nl>
Cc: vi...@deater.net
Cc: eran...@google.com
Cc: Adrian Hunter <adrian.hun...@intel.com>
Link: 
http://lkml.kernel.org/r/20160916134819.1978-1-alexander.shish...@linux.intel.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>

---
 arch/x86/events/intel/pt.c | 24 +++-
 arch/x86/events/intel/pt.h |  5 +
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 04bb5fb..18d18fd 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -69,6 +69,8 @@ static struct pt_cap_desc {
PT_CAP(psb_cyc, 0, CR_EBX, BIT(1)),
PT_CAP(ip_filtering,0, CR_EBX, BIT(2)),
PT_CAP(mtc, 0, CR_EBX, BIT(3)),
+   PT_CAP(ptwrite, 0, CR_EBX, BIT(4)),
+   PT_CAP(power_event_trace,   0, CR_EBX, BIT(5)),
PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
PT_CAP(topa_multiple_entries,   0, CR_ECX, BIT(1)),
PT_CAP(single_range_output, 0, CR_ECX, BIT(2)),
@@ -259,10 +261,16 @@ fail:
 #define RTIT_CTL_MTC   (RTIT_CTL_MTC_EN| \
 RTIT_CTL_MTC_RANGE)
 
+#define RTIT_CTL_PTW   (RTIT_CTL_PTW_EN| \
+RTIT_CTL_FUP_ON_PTW)
+
 #define PT_CONFIG_MASK (RTIT_CTL_TSC_EN| \
RTIT_CTL_DISRETC| \
RTIT_CTL_CYC_PSB| \
-   RTIT_CTL_MTC)
+   RTIT_CTL_MTC| \
+   RTIT_CTL_PWR_EVT_EN | \
+   RTIT_CTL_FUP_ON_PTW | \
+   RTIT_CTL_PTW_EN)
 
 static bool pt_event_valid(struct perf_event *event)
 {
@@ -311,6 +319,20 @@ static bool pt_event_valid(struct perf_event *event)
return false;
}
 
+   if (config & RTIT_CTL_PWR_EVT_EN &&
+   !pt_cap_get(PT_CAP_power_event_trace))
+   return false;
+
+   if (config & RTIT_CTL_PTW) {
+   if (!pt_cap_get(PT_CAP_ptwrite))
+   return false;
+
+   /* FUPonPTW without PTW doesn't make sense */
+   if ((config & RTIT_CTL_FUP_ON_PTW) &&
+   !(config & RTIT_CTL_PTW_EN))
+   return false;
+   }
+
return true;
 }
 
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index efffa4a..53473c2 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -26,11 +26,14 @@
 #define RTIT_CTL_CYCLEACC      BIT(1)
 #define RTIT_CTL_OS            BIT(2)
 #define RTIT_CTL_USR           BIT(3)
+#define RTIT_CTL_PWR_EVT_EN    BIT(4)
+#define RTIT_CTL_FUP_ON_PTW    BIT(5)
 #define RTIT_CTL_CR3EN         BIT(7)
 #define RTIT_CTL_TOPA          BIT(8)
 #define RTIT_CTL_MTC_EN        BIT(9)
 #define RTIT_CTL_TSC_EN        BIT(10)
 #define RTIT_CTL_DISRETC       BIT(11)
+#define RTIT_CTL_PTW_EN        BIT(12)
 #define RTIT_CTL_BRANCH_EN     BIT(13)
 #define RTIT_CTL_MTC_RANGE_OFFSET  14
 #define RTIT_CTL_MTC_RANGE (0x0full << RTIT_CTL_MTC_RANGE_OFFSET)
@@ -91,6 +94,8 @@ enum pt_capabilities {
PT_CAP_psb_cyc,
PT_CAP_ip_filtering,
PT_CAP_mtc,
+   PT_CAP_ptwrite,
+   PT_CAP_power_event_trace,
PT_CAP_topa_output,
PT_CAP_topa_multiple_entries,
PT_CAP_single_range_output,
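
As a rough illustration (a userspace sketch, not from the patch) of how
the new config bits could be composed into a perf_event_attr: the
RTIT_CTL_* positions below are the ones added to pt.h above, and pt_type
would be read from /sys/bus/event_source/devices/intel_pt/type.

#include <linux/perf_event.h>
#include <string.h>
#include <stdint.h>

#define RTIT_CTL_PWR_EVT_EN	(1ULL << 4)	/* PwrEvtEn  */
#define RTIT_CTL_FUP_ON_PTW	(1ULL << 5)	/* FUPonPTW  */
#define RTIT_CTL_PTW_EN		(1ULL << 12)	/* PTWEn     */

static void pt_attr_init(struct perf_event_attr *attr, uint32_t pt_type)
{
	memset(attr, 0, sizeof(*attr));
	attr->size   = sizeof(*attr);
	attr->type   = pt_type;
	/* PTWRITE packets plus FUPs on them; pt_event_valid() rejects this
	 * combination unless the 'ptwrite' capability is enumerated. */
	attr->config = RTIT_CTL_PTW_EN | RTIT_CTL_FUP_ON_PTW;
	/* Power event packets; rejected unless 'power_event_trace' is set. */
	attr->config |= RTIT_CTL_PWR_EVT_EN;
}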


[PATCH] perf/x86/intel/pt: Add enables for PTWRITE and power event tracing

2016-09-16 Thread Alexander Shishkin
The Intel PT chapter in the new Intel Architecture SDM adds several packets
and corresponding enable bits that control packet generation. Also,
additional bits in the Intel PT CPUID leaf were added to enumerate
presence and parameters of these new packets and features.

The packets and enables are:

  * PTWRITE packet carries the payload of the new PTWRITE instruction
that can be used to instrument Intel PT traces with user-supplied
data. Packets of this type are only generated if 'ptwrite' capability
is set and PTWEn bit is set in the event attribute's config. Flow
update packets (FUP) can be generated on PTWRITE packets if FUPonPTW
config bit is set. Setting these bits is not allowed if 'ptwrite'
capability is not set.

  * PWRE, PWRX, MWAIT, EXSTOP packets communicate core power management
events. These depend on 'power_event_tracing' capability and are
enabled by setting PwrEvtEn bit in the event attribute.

This patch adds the corresponding bit and register definitions, PMU driver
capabilities based on CPUID enumeration and new attribute format bits for
the new features, and extends the event configuration validation function
to take these into account.

Cc: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 24 +++-
 arch/x86/events/intel/pt.h |  5 +
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 861a7d9cb6..c5047b8f77 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -69,6 +69,8 @@ static struct pt_cap_desc {
PT_CAP(psb_cyc, 0, CR_EBX, BIT(1)),
PT_CAP(ip_filtering,0, CR_EBX, BIT(2)),
PT_CAP(mtc, 0, CR_EBX, BIT(3)),
+   PT_CAP(ptwrite, 0, CR_EBX, BIT(4)),
+   PT_CAP(power_event_trace,   0, CR_EBX, BIT(5)),
PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
PT_CAP(topa_multiple_entries,   0, CR_ECX, BIT(1)),
PT_CAP(single_range_output, 0, CR_ECX, BIT(2)),
@@ -259,10 +261,16 @@ fail:
 #define RTIT_CTL_MTC   (RTIT_CTL_MTC_EN| \
 RTIT_CTL_MTC_RANGE)
 
+#define RTIT_CTL_PTW   (RTIT_CTL_PTW_EN| \
+RTIT_CTL_FUP_ON_PTW)
+
 #define PT_CONFIG_MASK (RTIT_CTL_TSC_EN| \
RTIT_CTL_DISRETC| \
RTIT_CTL_CYC_PSB| \
-   RTIT_CTL_MTC)
+   RTIT_CTL_MTC| \
+   RTIT_CTL_PWR_EVT_EN | \
+   RTIT_CTL_FUP_ON_PTW | \
+   RTIT_CTL_PTW_EN)
 
 static bool pt_event_valid(struct perf_event *event)
 {
@@ -311,6 +319,20 @@ static bool pt_event_valid(struct perf_event *event)
return false;
}
 
+   if (config & RTIT_CTL_PWR_EVT_EN &&
+   !pt_cap_get(PT_CAP_power_event_trace))
+   return false;
+
+   if (config & RTIT_CTL_PTW) {
+   if (!pt_cap_get(PT_CAP_ptwrite))
+   return false;
+
+   /* FUPonPTW without PTW doesn't make sense */
+   if ((config & RTIT_CTL_FUP_ON_PTW) &&
+   !(config & RTIT_CTL_PTW_EN))
+   return false;
+   }
+
return true;
 }
 
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index efffa4a09f..53473c21b5 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -26,11 +26,14 @@
 #define RTIT_CTL_CYCLEACC  BIT(1)
 #define RTIT_CTL_OS            BIT(2)
 #define RTIT_CTL_USR           BIT(3)
+#define RTIT_CTL_PWR_EVT_EN    BIT(4)
+#define RTIT_CTL_FUP_ON_PTW    BIT(5)
 #define RTIT_CTL_CR3EN         BIT(7)
 #define RTIT_CTL_TOPA          BIT(8)
 #define RTIT_CTL_MTC_EN        BIT(9)
 #define RTIT_CTL_TSC_EN        BIT(10)
 #define RTIT_CTL_DISRETC       BIT(11)
+#define RTIT_CTL_PTW_EN        BIT(12)
 #define RTIT_CTL_BRANCH_EN BIT(13)
 #define RTIT_CTL_MTC_RANGE_OFFSET  14
 #define RTIT_CTL_MTC_RANGE (0x0full << RTIT_CTL_MTC_RANGE_OFFSET)
@@ -91,6 +94,8 @@ enum pt_capabilities {
PT_CAP_psb_cyc,
PT_CAP_ip_filtering,
PT_CAP_mtc,
+   PT_CAP_ptwrite,
+   PT_CAP_power_event_trace,
PT_CAP_topa_output,
PT_CAP_topa_multiple_entries,
PT_CAP_single_range_output,
-- 
2.9.3



[tip:perf/urgent] perf/x86/intel/pt: Do validate the size of a kernel address filter

2016-09-16 Thread tip-bot for Alexander Shishkin
Commit-ID:  1155bafcb79208abc6ae234c6e135ac70607755c
Gitweb: http://git.kernel.org/tip/1155bafcb79208abc6ae234c6e135ac70607755c
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Thu, 15 Sep 2016 18:13:52 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 16 Sep 2016 11:14:16 +0200

perf/x86/intel/pt: Do validate the size of a kernel address filter

Right now, the kernel address filters in PT are prone to an integer overflow
that may happen when adding a filter's size to its offset to obtain the end
of the range. Such an overflow would also throw a #GP in the PT event
configuration path.

Fix this by explicitly validating the result of this calculation.

Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Acked-by: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijls...@chello.nl>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Vince Weaver <vincent.wea...@maine.edu>
Cc: sta...@vger.kernel.org # v4.7
Cc: sta...@vger.kernel.org#v4.7
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160915151352.21306-4-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/pt.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 1f94963..861a7d9 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1089,8 +1089,13 @@ static int pt_event_addr_filters_validate(struct list_head *filters)
if (!filter->range || !filter->size)
return -EOPNOTSUPP;
 
-   if (!filter->inode && !valid_kernel_ip(filter->offset))
-   return -EINVAL;
+   if (!filter->inode) {
+   if (!valid_kernel_ip(filter->offset))
+   return -EINVAL;
+
+   if (!valid_kernel_ip(filter->offset + filter->size))
+   return -EINVAL;
+   }
 
if (++range > pt_cap_get(PT_CAP_num_address_ranges))
return -EOPNOTSUPP;
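
To make the overflow concrete, a small userspace sketch with made-up
values showing the wraparound that the new check rejects:

#include <stdio.h>

int main(void)
{
	unsigned long offset = 0xffffffffffff0000UL;	/* filter->offset */
	unsigned long size   = 0x20000UL;		/* filter->size   */
	unsigned long end    = offset + size;		/* wraps around   */

	/* end is now 0x10000, a low address that is clearly not a kernel
	 * address; validating it catches the overflow and the filter is
	 * rejected instead of programming bogus range MSR values. */
	printf("end = %#lx\n", end);
	return 0;
}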


[tip:perf/urgent] perf/x86/intel/pt: Fix kernel address filter's offset validation

2016-09-16 Thread tip-bot for Alexander Shishkin
Commit-ID:  ddfdad991e55b65c1cc4ee29502f6dceee04455a
Gitweb: http://git.kernel.org/tip/ddfdad991e55b65c1cc4ee29502f6dceee04455a
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Thu, 15 Sep 2016 18:13:51 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 16 Sep 2016 11:14:16 +0200

perf/x86/intel/pt: Fix kernel address filter's offset validation

The kernel_ip() filter is used mostly by the DS/LBR code to look at the
branch addresses, but Intel PT also uses it to validate the address
filter offsets for kernel addresses, for which it is not sufficient:
supplying something in bits 63:48 that's not a sign extension of the lower
address bits (like 0xf00d) throws a #GP.

This patch adds address validation for the user supplied kernel filters.

Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Acked-by: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijls...@chello.nl>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Vince Weaver <vincent.wea...@maine.edu>
Cc: sta...@vger.kernel.org # v4.7
Cc: sta...@vger.kernel.org#v4.7
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160915151352.21306-3-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/pt.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 5ec0100..1f94963 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1074,6 +1074,11 @@ static void pt_addr_filters_fini(struct perf_event *event)
event->hw.addr_filters = NULL;
 }
 
+static inline bool valid_kernel_ip(unsigned long ip)
+{
+   return virt_addr_valid(ip) && kernel_ip(ip);
+}
+
 static int pt_event_addr_filters_validate(struct list_head *filters)
 {
struct perf_addr_filter *filter;
@@ -1084,7 +1089,7 @@ static int pt_event_addr_filters_validate(struct list_head *filters)
if (!filter->range || !filter->size)
return -EOPNOTSUPP;
 
-   if (!filter->inode && !kernel_ip(filter->offset))
+   if (!filter->inode && !valid_kernel_ip(filter->offset))
return -EINVAL;
 
if (++range > pt_cap_get(PT_CAP_num_address_ranges))
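
The point about sign extension can be illustrated with a small userspace
sketch (illustrative only, not kernel code): kernel_ip()'s 64-bit test
only looks at the sign bit, which a non-canonical value such as
0xf00d000000000000 also passes, while a canonical-form check rejects it.

#include <stdio.h>

/* kernel_ip()'s 64-bit test: top half of the address space. */
static int sign_bit_only(unsigned long ip)
{
	return (long)ip < 0;
}

/* Canonical on current x86_64: bits 63:48 are copies of bit 47. */
static int canonical(unsigned long ip)
{
	return (long)(ip << 16) >> 16 == (long)ip;
}

int main(void)
{
	unsigned long good = 0xffffffff81000000UL;	/* typical kernel text */
	unsigned long bad  = 0xf00d000000000000UL;	/* non-canonical       */

	printf("good: sign=%d canonical=%d\n", sign_bit_only(good), canonical(good));
	printf("bad:  sign=%d canonical=%d\n", sign_bit_only(bad),  canonical(bad));
	return 0;
}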


[tip:perf/urgent] perf/x86/intel/pt: Fix an off-by-one in address filter configuration

2016-09-16 Thread tip-bot for Alexander Shishkin
Commit-ID:  95f60084acbcee6c466256cf26eb52191fad9edc
Gitweb: http://git.kernel.org/tip/95f60084acbcee6c466256cf26eb52191fad9edc
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Thu, 15 Sep 2016 18:13:50 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 16 Sep 2016 11:14:16 +0200

perf/x86/intel/pt: Fix an off-by-one in address filter configuration

PT address filter configuration requires that a range is specified by
its first and last address, but at the moment we're obtaining the end
of the range by adding user specified size to its start, which is off
by one from what it actually needs to be.

Fix this and make sure that zero-sized filters don't pass the filter
validation.

Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Acked-by: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijls...@chello.nl>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Vince Weaver <vincent.wea...@maine.edu>
Cc: sta...@vger.kernel.org # v4.7
Cc: sta...@vger.kernel.org#v4.7
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160915151352.21306-2-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/pt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 04bb5fb..5ec0100 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1081,7 +1081,7 @@ static int pt_event_addr_filters_validate(struct list_head *filters)
 
list_for_each_entry(filter, filters, entry) {
/* PT doesn't support single address triggers */
-   if (!filter->range)
+   if (!filter->range || !filter->size)
return -EOPNOTSUPP;
 
if (!filter->inode && !kernel_ip(filter->offset))
@@ -,7 +,7 @@ static void pt_event_addr_filters_sync(struct perf_event *event)
} else {
/* apply the offset */
msr_a = filter->offset + offs[range];
-   msr_b = filter->size + msr_a;
+   msr_b = filter->size + msr_a - 1;
}
 
filters->filter[range].msr_a  = msr_a;
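
A quick worked example of the corrected range math (made-up addresses,
userspace sketch):

#include <stdio.h>

int main(void)
{
	unsigned long offset = 0xffffffff81000000UL;	/* filter->offset */
	unsigned long size   = 0x1000UL;		/* filter->size   */

	unsigned long msr_a  = offset;			/* first byte of the range */
	unsigned long msr_b  = offset + size - 1;	/* last byte: ...81000fff  */

	/* Before the fix msr_b would have been offset + size, i.e.
	 * ...81001000, one byte past the range the user asked for. */
	printf("msr_a = %#lx, msr_b = %#lx\n", msr_a, msr_b);
	return 0;
}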


[PATCH v1 2/3] perf/x86/intel/pt: Fix kernel address filter's offset validation

2016-09-15 Thread Alexander Shishkin
The kernel_ip() filter is used mostly by the DS/LBR code to look at the
branch addresses, but Intel PT also uses it to validate the address
filter offsets for kernel addresses, for which it is not sufficient:
supplying something in bits 63:48 that's not a sign extension of the lower
address bits (like 0xf00d) throws a #GP.

This patch adds address validation for the user supplied kernel filters.

Cc: sta...@vger.kernel.org # v4.7
Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 5ec0100e3f..1f94963a28 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1074,6 +1074,11 @@ static void pt_addr_filters_fini(struct perf_event *event)
event->hw.addr_filters = NULL;
 }
 
+static inline bool valid_kernel_ip(unsigned long ip)
+{
+   return virt_addr_valid(ip) && kernel_ip(ip);
+}
+
 static int pt_event_addr_filters_validate(struct list_head *filters)
 {
struct perf_addr_filter *filter;
@@ -1084,7 +1089,7 @@ static int pt_event_addr_filters_validate(struct list_head *filters)
if (!filter->range || !filter->size)
return -EOPNOTSUPP;
 
-   if (!filter->inode && !kernel_ip(filter->offset))
+   if (!filter->inode && !valid_kernel_ip(filter->offset))
return -EINVAL;
 
if (++range > pt_cap_get(PT_CAP_num_address_ranges))
-- 
2.9.3



[PATCH v1 1/3] perf/x86/intel/pt: Fix an off-by-one in address filter configuration

2016-09-15 Thread Alexander Shishkin
PT address filter configuration requires that a range is specified by
its first and last address, but at the moment we're obtaining the end
of the range by adding user specified size to its start, which is off
by one from what it actually needs to be.

Fix this and make sure that zero-sized filters don't pass the filter
validation.

Cc: sta...@vger.kernel.org # v4.7
Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 04bb5fb5a8..5ec0100e3f 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1081,7 +1081,7 @@ static int pt_event_addr_filters_validate(struct list_head *filters)
 
list_for_each_entry(filter, filters, entry) {
/* PT doesn't support single address triggers */
-   if (!filter->range)
+   if (!filter->range || !filter->size)
return -EOPNOTSUPP;
 
if (!filter->inode && !kernel_ip(filter->offset))
@@ -,7 +,7 @@ static void pt_event_addr_filters_sync(struct perf_event *event)
} else {
/* apply the offset */
msr_a = filter->offset + offs[range];
-   msr_b = filter->size + msr_a;
+   msr_b = filter->size + msr_a - 1;
}
 
filters->filter[range].msr_a  = msr_a;
-- 
2.9.3



[PATCH v1 0/3] perf/x86/intel/pt: Address filtering fixes for perf/urgent

2016-09-15 Thread Alexander Shishkin
Hi again,

Apologies for the botched recipient list in the previous one. So I
moved the virt_addr_valid() inside pt.c as well to save cycles in the
LBR code.

These are 3 bugs that Adrian found; all 3 seem like good -stable
candidates.

Alexander Shishkin (3):
  perf/x86/intel/pt: Fix an off-by-one in address filter configuration
  perf/x86/intel/pt: Fix kernel address filter's offset validation
  perf/x86/intel/pt: Do validate the size of a kernel address filter

 arch/x86/events/intel/pt.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

-- 
2.9.3



[PATCH v1 3/3] perf/x86/intel/pt: Do validate the size of a kernel address filter

2016-09-15 Thread Alexander Shishkin
Right now, the kernel address filters in PT are prone to an integer overflow
that may happen when adding a filter's size to its offset to obtain the end
of the range. Such an overflow would also throw a #GP in the PT event
configuration path.

Fix this by explicitly validating the result of this calculation.

Cc: sta...@vger.kernel.org # v4.7
Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 1f94963a28..861a7d9cb6 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1089,8 +1089,13 @@ static int pt_event_addr_filters_validate(struct list_head *filters)
if (!filter->range || !filter->size)
return -EOPNOTSUPP;
 
-   if (!filter->inode && !valid_kernel_ip(filter->offset))
-   return -EINVAL;
+   if (!filter->inode) {
+   if (!valid_kernel_ip(filter->offset))
+   return -EINVAL;
+
+   if (!valid_kernel_ip(filter->offset + filter->size))
+   return -EINVAL;
+   }
 
if (++range > pt_cap_get(PT_CAP_num_address_ranges))
return -EOPNOTSUPP;
-- 
2.9.3



[PATCH 2/3] perf/x86: Tighten up the kernel_ip() check

2016-09-15 Thread Alexander Shishkin
The kernel_ip() filter is used mostly by the DS/LBR code to look at the
branch addresses, but Intel PT also uses it to validate the address
filter offsets for kernel addresses, for which it is not sufficient:
supplying something in bits 63:48 that's not a sign extension of the lower
address bits (like 0xf00d) throws a #GP.

In the interest of improving everybody's kernel address checks, this
patch adds address validation to kernel_ip().

Cc: sta...@vger.kernel.org # 4.7
Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/perf_event.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5874d8de1f..88fb389356 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -783,7 +783,7 @@ static inline bool kernel_ip(unsigned long ip)
 #ifdef CONFIG_X86_32
return ip > PAGE_OFFSET;
 #else
-   return (long)ip < 0;
+   return (long)ip < 0 && virt_addr_valid(ip);
 #endif
 }
 
-- 
2.9.3



[PATCH 3/3] perf/x86/intel/pt: Do validate the size of a kernel address filter

2016-09-15 Thread Alexander Shishkin
Right now, the kernel address filters in PT are prone to an integer overflow
that may happen when adding a filter's size to its offset to obtain the end
of the range. Such an overflow would also throw a #GP in the PT event
configuration path.

Fix this by explicitly validating the result of this calculation.

Cc: sta...@vger.kernel.org # 4.7
Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 5ec0100e3f..834ce06b00 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1084,8 +1084,13 @@ static int pt_event_addr_filters_validate(struct list_head *filters)
if (!filter->range || !filter->size)
return -EOPNOTSUPP;
 
-   if (!filter->inode && !kernel_ip(filter->offset))
-   return -EINVAL;
+   if (!filter->inode) {
+   if (!kernel_ip(filter->offset))
+   return -EINVAL;
+
+   if (!kernel_ip(filter->offset + filter->size))
+   return -EINVAL;
+   }
 
if (++range > pt_cap_get(PT_CAP_num_address_ranges))
return -EOPNOTSUPP;
-- 
2.9.3



[PATCH 1/3] perf/x86/intel/pt: Fix an off-by-one in address filter configuration

2016-09-15 Thread Alexander Shishkin
PT address filter configuration requires that a range is specified by
its first and last address, but at the moment we're obtaining the end
of the range by adding user specified size to its start, which is off
by one from what it actually needs to be.

Fix this and make sure that zero-sized filters don't pass the filter
validation.

Cc: sta...@vger.kernel.org # 4.7
Reported-by: Adrian Hunter <adrian.hun...@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 04bb5fb5a8..5ec0100e3f 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1081,7 +1081,7 @@ static int pt_event_addr_filters_validate(struct list_head *filters)
 
list_for_each_entry(filter, filters, entry) {
/* PT doesn't support single address triggers */
-   if (!filter->range)
+   if (!filter->range || !filter->size)
return -EOPNOTSUPP;
 
if (!filter->inode && !kernel_ip(filter->offset))
@@ -,7 +,7 @@ static void pt_event_addr_filters_sync(struct perf_event *event)
} else {
/* apply the offset */
msr_a = filter->offset + offs[range];
-   msr_b = filter->size + msr_a;
+   msr_b = filter->size + msr_a - 1;
}
 
filters->filter[range].msr_a  = msr_a;
-- 
2.9.3



[PATCH 0/3] perf/x86/intel/pt: Address filtering fixes for perf/urgent

2016-09-15 Thread Alexander Shishkin
Hi,

Hoping that it's not too late, here are fixes for issues that Adrian
found. All three are good for -stable afaict. Two of the bugs result
in a #GP and one in a misconfigured filter.

2/3 can be moved into the PT driver as well, but I decided that others
may benefit from it at least theoretically.

Alexander Shishkin (3):
  perf/x86/intel/pt: Fix an off-by-one in address filter configuration
  perf/x86: Tighten up the kernel_ip() check
  perf/x86/intel/pt: Do validate the size of a kernel address filter

 arch/x86/events/intel/pt.c   | 13 +
 arch/x86/events/perf_event.h |  2 +-
 2 files changed, 10 insertions(+), 5 deletions(-)

-- 
2.9.3



[tip:perf/urgent] perf/x86/intel: Don't disable "intel_bts" around "intel" event batching

2016-09-15 Thread tip-bot for Alexander Shishkin
Commit-ID:  cecf62352aee2b4fe114aafd1b8c5f265a4243ce
Gitweb: http://git.kernel.org/tip/cecf62352aee2b4fe114aafd1b8c5f265a4243ce
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Thu, 15 Sep 2016 11:22:33 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 15 Sep 2016 11:25:26 +0200

perf/x86/intel: Don't disable "intel_bts" around "intel" event batching

At the moment, intel_bts events get disabled from intel PMU's disable
callback, which includes event scheduling transactions of said PMU,
which have nothing to do with intel_bts events.

We do want to keep intel_bts events off inside the PMI handler to
avoid filling up their buffer too soon.

This patch moves intel_bts enabling/disabling directly to the PMI
handler.

Reported-by: Vince Weaver <vincent.wea...@maine.edu>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijls...@chello.nl>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160915082233.11065-1-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
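(Condensed before/after of where the intel_bts_{disable,enable}_local() calls
end up; illustration only, not the full code:)

	/*
	 * before: called from __intel_pmu_disable_all()/__intel_pmu_enable_all(),
	 *         i.e. around every event scheduling transaction of the core PMU
	 *
	 * after this patch, only the PMI handler keeps BTS quiet:
	 *
	 *   intel_pmu_handle_irq()
	 *     intel_bts_disable_local();
	 *     __intel_pmu_disable_all();
	 *     ... drain BTS buffer, handle counter overflows ...
	 *     __intel_pmu_enable_all(0, true);
	 *     intel_bts_enable_local();
	 */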
 arch/x86/events/intel/core.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2cbde2f..4c9a79b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1730,9 +1730,11 @@ static __initconst const u64 knl_hw_cache_extra_regs
  * disabled state if called consecutively.
  *
  * During consecutive calls, the same disable value will be written to related
- * registers, so the PMU state remains unchanged. hw.state in
- * intel_bts_disable_local will remain PERF_HES_STOPPED too in consecutive
- * calls.
+ * registers, so the PMU state remains unchanged.
+ *
+ * intel_bts events don't coexist with intel PMU's BTS events because of
+ * x86_add_exclusive(x86_lbr_exclusive_lbr); there's no need to keep them
+ * disabled around intel PMU's event batching etc, only inside the PMI handler.
  */
 static void __intel_pmu_disable_all(void)
 {
@@ -1742,8 +1744,6 @@ static void __intel_pmu_disable_all(void)
 
if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
intel_pmu_disable_bts();
-   else
-   intel_bts_disable_local();
 
intel_pmu_pebs_disable_all();
 }
@@ -1771,8 +1771,7 @@ static void __intel_pmu_enable_all(int added, bool pmi)
return;
 
intel_pmu_enable_bts(event->hw.config);
-   } else
-   intel_bts_enable_local();
+   }
 }
 
 static void intel_pmu_enable_all(int added)
@@ -2073,6 +2072,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 */
if (!x86_pmu.late_ack)
apic_write(APIC_LVTPC, APIC_DM_NMI);
+   intel_bts_disable_local();
__intel_pmu_disable_all();
handled = intel_pmu_drain_bts_buffer();
handled += intel_bts_interrupt();
@@ -2172,6 +2172,7 @@ done:
/* Only restore PMU state when it's active. See x86_pmu_disable(). */
if (cpuc->enabled)
__intel_pmu_enable_all(0, true);
+   intel_bts_enable_local();
 
/*
 * Only unmask the NMI after the overflow counters


[PATCH] perf/x86/intel: Don't disable "intel_bts" around "intel" event batching

2016-09-15 Thread Alexander Shishkin
At the moment, intel_bts events get disabled from intel PMU's disable
callback, which includes event scheduling transactions of said PMU,
which have nothing to do with intel_bts events.

We do want to keep intel_bts events off inside the PMI handler to
avoid filling up their buffer too soon.

This patch moves intel_bts enabling/disabling directly to the PMI
handler.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/core.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 88792f846d..e2d71513c9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1730,9 +1730,11 @@ static __initconst const u64 knl_hw_cache_extra_regs
  * disabled state if called consecutively.
  *
  * During consecutive calls, the same disable value will be written to related
- * registers, so the PMU state remains unchanged. hw.state in
- * intel_bts_disable_local will remain PERF_HES_STOPPED too in consecutive
- * calls.
+ * registers, so the PMU state remains unchanged.
+ *
+ * intel_bts events don't coexist with intel pmu's BTS events because of
+ * x86_add_exclusive(x86_lbr_exclusive_lbr); there's no need to keep them
+ * disabled around intel pmu's event batching etc, only inside the PMI handler.
  */
 static void __intel_pmu_disable_all(void)
 {
@@ -1742,8 +1744,6 @@ static void __intel_pmu_disable_all(void)
 
if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
intel_pmu_disable_bts();
-   else
-   intel_bts_disable_local();
 
intel_pmu_pebs_disable_all();
 }
@@ -1771,8 +1771,7 @@ static void __intel_pmu_enable_all(int added, bool pmi)
return;
 
intel_pmu_enable_bts(event->hw.config);
-   } else
-   intel_bts_enable_local();
+   }
 }
 
 static void intel_pmu_enable_all(int added)
@@ -2076,6 +2075,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 */
if (!x86_pmu.late_ack)
apic_write(APIC_LVTPC, APIC_DM_NMI);
+   intel_bts_disable_local();
__intel_pmu_disable_all();
handled = intel_pmu_drain_bts_buffer();
handled += intel_bts_interrupt();
@@ -2175,6 +2175,7 @@ done:
/* Only restore PMU state when it's active. See x86_pmu_disable(). */
if (cpuc->enabled)
__intel_pmu_enable_all(0, true);
+   intel_bts_enable_local();
 
/*
 * Only unmask the NMI after the overflow counters
-- 
2.9.3



Re: perf: perf_fuzzer still triggers bts warning

2016-09-15 Thread Alexander Shishkin
Vince Weaver  writes:

> Hello
>
> I'm running 4.8-rc6 git from this morning (with the various perf fixes).
>
> Fuzzing on Skylake I still managed to trigger the following warning.

I was sure that I'd sent the fix for this one in the earlier series, but
somehow I missed it. Sending now.

Thanks,
--
Alex


[tip:perf/core] perf/core: Fix aux_mmap_count vs aux_refcount order

2016-09-10 Thread tip-bot for Alexander Shishkin
Commit-ID:  b79ccadd6bb10e72cf784a298ca6dc1398eb9a24
Gitweb: http://git.kernel.org/tip/b79ccadd6bb10e72cf784a298ca6dc1398eb9a24
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Tue, 6 Sep 2016 16:23:50 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Sat, 10 Sep 2016 11:15:36 +0200

perf/core: Fix aux_mmap_count vs aux_refcount order

The order of accesses to ring buffer's aux_mmap_count and aux_refcount
has to be preserved across the users, namely perf_mmap_close() and
perf_aux_output_begin(), otherwise the inversion can result in the latter
holding the last reference to the aux buffer and subsequently free'ing
it in atomic context, triggering a warning.

> [ cut here ]
> WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 
> __rb_free_aux+0x11a/0x130
> CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596
> Call Trace:
>  [] __warn+0xcb/0xf0
>  [] warn_slowpath_null+0x1d/0x20
>  [] __rb_free_aux+0x11a/0x130
>  [] rb_free_aux+0x18/0x20
>  [] perf_aux_output_begin+0x163/0x1e0
>  [] bts_event_start+0x3a/0xd0
>  [] bts_event_add+0x5d/0x80
>  [] event_sched_in.isra.104+0xf6/0x2f0
>  [] group_sched_in+0x6e/0x190
>  [] ctx_sched_in+0x2fe/0x5f0
>  [] perf_event_sched_in+0x60/0x80
>  [] ctx_resched+0x5b/0x90
>  [] __perf_event_enable+0x1e1/0x240
>  [] event_function+0xa9/0x180
>  [] ? perf_cgroup_attach+0x70/0x70
>  [] remote_function+0x3f/0x50
>  [] flush_smp_call_function_queue+0x83/0x150
>  [] generic_smp_call_function_single_interrupt+0x13/0x60
>  [] smp_call_function_single_interrupt+0x27/0x40
>  [] call_function_single_interrupt+0x89/0x90
>  [] finish_task_switch+0xa6/0x210
>  [] ? finish_task_switch+0x67/0x210
>  [] __schedule+0x3dd/0xb50
>  [] schedule+0x35/0x80
>  [] sys_sched_yield+0x61/0x70
>  [] entry_SYSCALL_64_fastpath+0x18/0xa8
> ---[ end trace 6235f556f5ea83a9 ]---

This patch puts the checks in perf_aux_output_begin() in the same order
as that of perf_mmap_close().

Reported-by: Vince Weaver <vincent.wea...@maine.edu>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160906132353.19887-3-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
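(Heavily simplified picture of the required ordering; illustration only, the
names are the ones used in kernel/events/ring_buffer.c:)

	/*
	 * perf_mmap_close()                    perf_aux_output_begin()
	 * -----------------                    -----------------------
	 * aux_mmap_count drops to zero         check aux_mmap_count first and
	 * rb_free_aux() drops aux_refcount     bail out if it is zero; only then
	 *                                      take a reference with
	 *                                      atomic_inc_not_zero(&aux_refcount)
	 *
	 * Taking the aux_refcount reference before looking at aux_mmap_count
	 * (the old order) can leave this atomic path holding the last reference,
	 * so it ends up freeing the AUX pages in_atomic(), hence the warning.
	 */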
 kernel/events/ring_buffer.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index ae9b90d..257fa46 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -330,15 +330,22 @@ void *perf_aux_output_begin(struct perf_output_handle 
*handle,
if (!rb)
return NULL;
 
-   if (!rb_has_aux(rb) || !atomic_inc_not_zero(&rb->aux_refcount))
+   if (!rb_has_aux(rb))
goto err;
 
/*
-* If rb::aux_mmap_count is zero (and rb_has_aux() above went through),
-* the aux buffer is in perf_mmap_close(), about to get freed.
+* If aux_mmap_count is zero, the aux buffer is in perf_mmap_close(),
+* about to get freed, so we leave immediately.
+*
+* Checking rb::aux_mmap_count and rb::refcount has to be done in
+* the same order, see perf_mmap_close. Otherwise we end up freeing
+* aux pages in this path, which is a bug, because in_atomic().
 */
if (!atomic_read(&rb->aux_mmap_count))
-   goto err_put;
+   goto err;
+
+   if (!atomic_inc_not_zero(&rb->aux_refcount))
+   goto err;
 
/*
 * Nesting is not supported for AUX area, make sure nested


[tip:perf/core] perf/x86/intel/bts: Fix confused ordering of PMU callbacks

2016-09-10 Thread tip-bot for Alexander Shishkin
Commit-ID:  a9a94401c2b5805c71e39427b1af1bf1b9f67cd0
Gitweb: http://git.kernel.org/tip/a9a94401c2b5805c71e39427b1af1bf1b9f67cd0
Author: Alexander Shishkin 
AuthorDate: Tue, 6 Sep 2016 16:23:51 +0300
Committer:  Ingo Molnar 
CommitDate: Sat, 10 Sep 2016 11:15:37 +0200

perf/x86/intel/bts: Fix confused ordering of PMU callbacks

The intel_bts driver is using a CPU-local 'started' variable to order
callbacks and PMIs and make sure that AUX transactions don't get messed
up. However, the ordering rules in regard to this variable are a complete
mess, which recently resulted in perf_fuzzer-triggered warnings and
panics.

The general ordering rule that this patch is enforcing is that this
cpu-local variable be set only when the cpu-local AUX transaction is
active; consequently, this variable is to be checked before the AUX
related bits can be touched.

Reported-by: Vince Weaver 
Signed-off-by: Alexander Shishkin 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160906132353.19887-4-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
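(Summary of the state machine this patch introduces; illustration only, the
transitions match the comments in the diff below:)

	/*
	 *  STOPPED/INACTIVE --start()--> ACTIVE    AUX transaction on, tracing on
	 *  ACTIVE           --PMI------> INACTIVE  AUX transaction on, tracing off
	 *  ACTIVE/INACTIVE  --stop()---> STOPPED   no AUX transaction
	 */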
 arch/x86/events/intel/bts.c | 104 ++--
 1 file changed, 80 insertions(+), 24 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 0a6e393..61e1d71 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -31,7 +31,17 @@
 struct bts_ctx {
struct perf_output_handle   handle;
struct debug_store  ds_back;
-   int started;
+   int state;
+};
+
+/* BTS context states: */
+enum {
+   /* no ongoing AUX transactions */
+   BTS_STATE_STOPPED = 0,
+   /* AUX transaction is on, BTS tracing is disabled */
+   BTS_STATE_INACTIVE,
+   /* AUX transaction is on, BTS tracing is running */
+   BTS_STATE_ACTIVE,
 };
 
 static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
@@ -204,6 +214,15 @@ static void bts_update(struct bts_ctx *bts)
 static int
 bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle);
 
+/*
+ * Ordering PMU callbacks wrt themselves and the PMI is done by means
+ * of bts::state, which:
+ *  - is set when bts::handle::event is valid, that is, between
+ *perf_aux_output_begin() and perf_aux_output_end();
+ *  - is zero otherwise;
+ *  - is ordered against bts::handle::event with a compiler barrier.
+ */
+
 static void __bts_event_start(struct perf_event *event)
 {
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
@@ -221,10 +240,13 @@ static void __bts_event_start(struct perf_event *event)
 
/*
 * local barrier to make sure that ds configuration made it
-* before we enable BTS
+* before we enable BTS and bts::state goes ACTIVE
 */
wmb();
 
+   /* INACTIVE/STOPPED -> ACTIVE */
+   WRITE_ONCE(bts->state, BTS_STATE_ACTIVE);
+
intel_pmu_enable_bts(config);
 
 }
@@ -251,9 +273,6 @@ static void bts_event_start(struct perf_event *event, int 
flags)
 
__bts_event_start(event);
 
-   /* PMI handler: this counter is running and likely generating PMIs */
-   ACCESS_ONCE(bts->started) = 1;
-
return;
 
 fail_end_stop:
@@ -263,30 +282,34 @@ fail_stop:
event->hw.state = PERF_HES_STOPPED;
 }
 
-static void __bts_event_stop(struct perf_event *event)
+static void __bts_event_stop(struct perf_event *event, int state)
 {
+   struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+   /* ACTIVE -> INACTIVE(PMI)/STOPPED(->stop()) */
+   WRITE_ONCE(bts->state, state);
+
/*
 * No extra synchronization is mandated by the documentation to have
 * BTS data stores globally visible.
 */
intel_pmu_disable_bts();
-
-   if (event->hw.state & PERF_HES_STOPPED)
-   return;
-
-   ACCESS_ONCE(event->hw.state) |= PERF_HES_STOPPED;
 }
 
 static void bts_event_stop(struct perf_event *event, int flags)
 {
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
-   struct bts_buffer *buf = perf_get_aux(&bts->handle);
+   struct bts_buffer *buf = NULL;
+   int state = READ_ONCE(bts->state);
+
+   if (state == BTS_STATE_ACTIVE)
+   __bts_event_stop(event, BTS_STATE_STOPPED);
 
-   /* PMI handler: don't restart this counter */
-   ACCESS_ONCE(bts->started) = 0;
+   if (state != BTS_STATE_STOPPED)
+   buf = perf_get_aux(&bts->handle);
 
-   __bts_event_stop(event);
+   event->hw.state |= PERF_HES_STOPPED;
 
if (flags & PERF_EF_UPDATE) {
bts_update(bts);
@@ -296,6 +319,7 @@ static void bts_event_stop(struct perf_event *event, int 
flags


[tip:perf/core] perf/x86/intel/bts: Fix BTS PMI detection

2016-09-10 Thread tip-bot for Alexander Shishkin
Commit-ID:  4d4c474124649198d9b0a065c06f9362cf18e14e
Gitweb: http://git.kernel.org/tip/4d4c474124649198d9b0a065c06f9362cf18e14e
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Tue, 6 Sep 2016 16:23:52 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Sat, 10 Sep 2016 11:15:38 +0200

perf/x86/intel/bts: Fix BTS PMI detection

Since BTS doesn't have a dedicated PMI status bit, the driver needs to
take extra care to check for the condition that triggers it to avoid
spurious NMI warnings.

Regardless of the local BTS context state, the only way of knowing that
the NMI is ours is to compare the write pointer against the interrupt
threshold.

Reported-by: Vince Weaver <vincent.wea...@maine.edu>
Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160906132353.19887-5-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/bts.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 61e1d71..9233edf 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -446,26 +446,37 @@ bts_buffer_reset(struct bts_buffer *buf, struct 
perf_output_handle *handle)
 
 int intel_bts_interrupt(void)
 {
+   struct debug_store *ds = this_cpu_ptr(&cpu_hw_events)->ds;
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
struct perf_event *event = bts->handle.event;
struct bts_buffer *buf;
s64 old_head;
-   int err = -ENOSPC;
+   int err = -ENOSPC, handled = 0;
+
+   /*
+* The only surefire way of knowing if this NMI is ours is by checking
+* the write ptr against the PMI threshold.
+*/
+   if (ds->bts_index >= ds->bts_interrupt_threshold)
+   handled = 1;
 
/*
 * this is wrapped in intel_bts_enable_local/intel_bts_disable_local,
 * so we can only be INACTIVE or STOPPED
 */
if (READ_ONCE(bts->state) == BTS_STATE_STOPPED)
-   return 0;
+   return handled;
 
buf = perf_get_aux(&bts->handle);
+   if (!buf)
+   return handled;
+
/*
 * Skip snapshot counters: they don't use the interrupt, but
 * there's no other way of telling, because the pointer will
 * keep moving
 */
-   if (!buf || buf->snapshot)
+   if (buf->snapshot)
return 0;
 
old_head = local_read(&buf->head);
@@ -473,7 +484,7 @@ int intel_bts_interrupt(void)
 
/* no new data */
if (old_head == local_read(&buf->head))
-   return 0;
+   return handled;
 
perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
!!local_xchg(&buf->lost, 0));


[tip:perf/core] perf/core: Fix a race between mmap_close() and set_output() of AUX events

2016-09-10 Thread tip-bot for Alexander Shishkin
Commit-ID:  767ae08678c2c796bcd7f582ee457aee20a28a1e
Gitweb: http://git.kernel.org/tip/767ae08678c2c796bcd7f582ee457aee20a28a1e
Author: Alexander Shishkin 
AuthorDate: Tue, 6 Sep 2016 16:23:49 +0300
Committer:  Ingo Molnar 
CommitDate: Sat, 10 Sep 2016 11:15:36 +0200

perf/core: Fix a race between mmap_close() and set_output() of AUX events

In the mmap_close() path we need to stop all the AUX events that are
writing data to the AUX area that we are unmapping, before we can
safely free the pages. To determine if an event needs to be stopped,
we're comparing its ->rb against the one that's getting unmapped.
However, a SET_OUTPUT ioctl may turn up inside an AUX transaction
and swizzle event::rb to some other ring buffer, but the transaction
will keep writing data to the old ring buffer until the event gets
scheduled out. At this point, mmap_close() will skip over such an
event and will proceed to free the AUX area, while it's still being
used by this event, which will set off a warning in the mmap_close()
path and cause a memory corruption.

To avoid this, always stop an AUX event before its ->rb is updated;
this will release the (potentially) last reference on the AUX area
of the buffer. If the event gets restarted, its new ring buffer will
be used. If another SET_OUTPUT comes and switches it back to the
old ring buffer that's getting unmapped, it's also fine: this
ring buffer's aux_mmap_count will be zero and AUX transactions won't
start any more.

Reported-by: Vince Weaver 
Signed-off-by: Alexander Shishkin 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160906132353.19887-2-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
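(Simplified interleaving of the race being closed; illustration only:)

	/*
	 * ioctl(SET_OUTPUT)                    munmap() -> perf_mmap_close()
	 * -----------------                    -----------------------------
	 * swizzle event->rb to the new buffer  walk the events and stop those
	 *   while an AUX transaction is still    whose ->rb matches the buffer
	 *   writing to the old buffer            being unmapped; this event no
	 *                                        longer matches, so it is skipped
	 *                                      free the old AUX area while the
	 *                                        transaction still writes to it
	 *
	 * Stopping the AUX event in ring_buffer_attach() before event->rb is
	 * re-assigned ends the transaction on the old buffer first, closing the
	 * window above.
	 */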
 kernel/events/core.c | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 07ac859..a54f2c2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2496,11 +2496,11 @@ static int __perf_event_stop(void *info)
return 0;
 }
 
-static int perf_event_restart(struct perf_event *event)
+static int perf_event_stop(struct perf_event *event, int restart)
 {
struct stop_event_data sd = {
.event  = event,
-   .restart= 1,
+   .restart= restart,
};
int ret = 0;
 
@@ -4845,6 +4845,19 @@ static void ring_buffer_attach(struct perf_event *event,
spin_unlock_irqrestore(&rb->event_lock, flags);
}
 
+   /*
+* Avoid racing with perf_mmap_close(AUX): stop the event
+* before swizzling the event::rb pointer; if it's getting
+* unmapped, its aux_mmap_count will be 0 and it won't
+* restart. See the comment in __perf_pmu_output_stop().
+*
+* Data will inevitably be lost when set_output is done in
+* mid-air, but then again, whoever does it like this is
+* not in for the data anyway.
+*/
+   if (has_aux(event))
+   perf_event_stop(event, 0);
+
rcu_assign_pointer(event->rb, rb);
 
if (old_rb) {
@@ -6120,7 +6133,7 @@ static void perf_event_addr_filters_exec(struct 
perf_event *event, void *data)
raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
if (restart)
-   perf_event_restart(event);
+   perf_event_stop(event, 1);
 }
 
 void perf_event_exec(void)
@@ -6164,7 +6177,13 @@ static void __perf_event_output_stop(struct perf_event 
*event, void *data)
 
/*
 * In case of inheritance, it will be the parent that links to the
-* ring-buffer, but it will be the child that's actually using it:
+* ring-buffer, but it will be the child that's actually using it.
+*
+* We are using event::rb to determine if the event should be stopped,
+* however this may race with ring_buffer_attach() (through set_output),
+* which will make us skip the event that actually needs to be stopped.
+* So ring_buffer_attach() has to stop an aux event before re-assigning
+* its rb pointer.
 */
if (rcu_dereference(parent->rb) == rb)
ro->err = __perf_event_stop();
@@ -6678,7 +6697,7 @@ static void __perf_addr_filters_adjust(struct perf_event 
*event, void *data)
raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
if (restart)
-   perf_event_restart(event);
+   perf_event_stop(event, 1);
 }
 
 /*
@@ -7867,7 +7886,7 @@ static void perf_event_addr_filters_apply(struct 
perf_event *event)
mmput(mm);
 
 restart:
-   perf_event_restart(event);
+   perf_event_stop(event, 1);
 }
 
 /*


[tip:perf/core] perf/x86/intel/bts: Kill a silly warning

2016-09-10 Thread tip-bot for Alexander Shishkin
Commit-ID:  ef9ef3befa0d76008e988a9ed9fe439e803351b9
Gitweb: http://git.kernel.org/tip/ef9ef3befa0d76008e988a9ed9fe439e803351b9
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Tue, 6 Sep 2016 16:23:53 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Sat, 10 Sep 2016 11:15:38 +0200

perf/x86/intel/bts: Kill a silly warning

At the moment, intel_bts will WARN() out if there is more than one
event writing to the same ring buffer, via SET_OUTPUT, and will only
send data from one event to a buffer.

There is no reason to have this warning in, so kill it.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: vi...@deater.net
Link: 
http://lkml.kernel.org/r/20160906132353.19887-6-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/bts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 9233edf..bdcd651 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -378,8 +378,6 @@ bts_buffer_reset(struct bts_buffer *buf, struct 
perf_output_handle *handle)
return 0;
 
head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
-   if (WARN_ON_ONCE(head != local_read(&buf->head)))
-   return -EINVAL;
 
phys = &buf->buf[buf->cur_buf];
space = phys->offset + phys->displacement + phys->size - head;


Re: [PATCH v2 0/5] perf, bts: Fallout from the fuzzer for perf/urgent

2016-09-08 Thread Alexander Shishkin
Vince Weaver  writes:

> On the skylake machine with the original 5 patches I got this after 
> continuing to fuzz.  Sorry about the lack of frame pointer, next
> compile will have it enabled.
>
> If it matters, prior to this I hit the unrelated
> [25510.278199] WARNING: CPU: 1 PID: 25405 at kernel/events/core.c:3554 
> perf_event_read+0x18f/0x1a0
>
>
> [28682.174684] WARNING: CPU: 7 PID: 31992 at kernel/events/core.c:4961 
> perf_mmap_close+0x2e1/0x2f0

It just keeps on giving, doesn't it. I've got the same thing here during
the night, looking at it again. Did you capture the seed by any chance?

Thanks,
--
Alex


Re: [PATCH v2 0/5] perf, bts: Fallout from the fuzzer for perf/urgent

2016-09-08 Thread Alexander Shishkin
Ingo Molnar <mi...@kernel.org> writes:

> * Alexander Shishkin <alexander.shish...@linux.intel.com> wrote:
>
>> Ingo Molnar <mi...@kernel.org> writes:
>> 
>> > * Alexander Shishkin <alexander.shish...@linux.intel.com> wrote:
>> >
>> >> Hi,
>> >> 
>> >> There were more bugs since the previous version, plus the BTS barriers 
>> >> got 
>> >> fixed. With these patches, my testcase keeps running and no spurious NMI 
>> >> warnings pop up any more.
>> >
>> > Could you please also run the fuzzer that Vince uses, does it now pass on 
>> > hardware 
>> > you have access to?
>> 
>> Sure. And yes, I did catch a warning, which calls for one more patch
>> (below). Also one unrelated thing in PEBS that Peter fixed.
>> 
>> > I'd like to make "passes the fuzzer" a standard requirement before new 
>> > changes are 
>> > accepted to perf core.
>> 
>> Let's make it so.
>> 
>> For the sake of consistency, this one needs to go before 3/5. I'll
>> re-send the whole series, though, if need be. I've got 2 perf_fuzzers
>> running on this meanwhile.
>
> Yeah, please re-send it - and please also Vince's Reported-by tag to all 
> commits 
> that would explain failures that Vince reported.
>
> Also, please document how much and what type of fuzzer testing the series 
> got: 
> fuzzer version, time it ran and (rough) hardware it ran on would be useful. 
> (That 
> way we can look back later on whether there was any fuzzer testing on AMD 
> systems 
> for example, which you might not be able to perform.)

Not sure if run time is useful with the fuzzer, but otherwise seems
reasonable. How do we put this stuff into a commit message, though?

Perf-Fuzzer-ID: d18d23aae6

?

> It would also be very, very nice to also add a Documentation/perf/testing.txt 
> step 
> by step ELI5 style document that explains how to set up and run the fuzzer!

Actually, I'd like that too as I'm not sure I'm doing it 100% right (had
to patch it some time ago to stop it from segfaulting).

>> + * intel_bts events don't coexist with intel pmu's BTS events because of
>> + * x86_add_exclusive(x86_lbr_exclusive_lbr); there's no need to keep them
>> + * disabled around intel pmu's event batching etc, only inside the PMI 
>> handler.
>
> Pet peeve nit: please capitalize 'PMU' correctly and consistently (upper 
> case). 
> This paragraph has both variants: "PMU" and "pmu" which is the worst variant 
> really.

Sure.

Regards,
--
Alex


Re: [PATCH v2 0/5] perf, bts: Fallout from the fuzzer for perf/urgent

2016-09-07 Thread Alexander Shishkin
Vince Weaver <vi...@deater.net> writes:

> On Wed, 7 Sep 2016, Alexander Shishkin wrote:
>
>> Sure. And yes, I did catch a warning, which calls for one more patch
>> (below). Also one unrelated thing in PEBS that Peter fixed.
>
> Does that fix this which I just got on my skylake machine (4.8-rc5 with 
> your other 5 patches applied)
>
> [ 5351.822559] WARNING: CPU: 3 PID: 19191 at arch/x86/events/intel/bts.c:344 
> event_function+0xa1/0x160

Yes, it fixes a problem that triggered this warning. Without a backtrace,
though, I can't tell exactly what's going on here.

Regards,
--
Alex


Re: [PATCH v2 0/5] perf, bts: Fallout from the fuzzer for perf/urgent

2016-09-07 Thread Alexander Shishkin
Ingo Molnar <mi...@kernel.org> writes:

> * Alexander Shishkin <alexander.shish...@linux.intel.com> wrote:
>
>> Hi,
>> 
>> There were more bugs since the previous version, plus the BTS barriers got 
>> fixed. With these patches, my testcase keeps running and no spurious NMI 
>> warnings pop up any more.
>
> Could you please also run the fuzzer that Vince uses, does it now pass on 
> hardware 
> you have access to?

Sure. And yes, I did catch a warning, which calls for one more patch
(below). Also one unrelated thing in PEBS that Peter fixed.

> I'd like to make "passes the fuzzer" a standard requirement before new 
> changes are 
> accepted to perf core.

Let's make it so.

For the sake of consistency, this one needs to go before 3/5. I'll
re-send the whole series, though, if need be. I've got 2 perf_fuzzers
running on this meanwhile.

>From c170c607bfdc3804578033faa43f342a5d95eb6c Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <alexander.shish...@linux.intel.com>
Date: Wed, 7 Sep 2016 18:05:08 +0300
Subject: [PATCH] perf/x86/intel: Don't disable "intel_bts" around "intel"
 event batching

At the moment, intel_bts events get disabled from intel PMU's disable
callback, which includes event scheduling transactions of said PMU,
which have nothing to do with intel_bts events.

We do want to keep intel_bts events off inside the PMI handler to
avoid filling up their buffer too soon.

This patch moves intel_bts enabling/disabling directly to the PMI
handler.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/core.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 88792f846d..e2d71513c9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1730,9 +1730,11 @@ static __initconst const u64 knl_hw_cache_extra_regs
  * disabled state if called consecutively.
  *
  * During consecutive calls, the same disable value will be written to related
- * registers, so the PMU state remains unchanged. hw.state in
- * intel_bts_disable_local will remain PERF_HES_STOPPED too in consecutive
- * calls.
+ * registers, so the PMU state remains unchanged.
+ *
+ * intel_bts events don't coexist with intel pmu's BTS events because of
+ * x86_add_exclusive(x86_lbr_exclusive_lbr); there's no need to keep them
+ * disabled around intel pmu's event batching etc, only inside the PMI handler.
  */
 static void __intel_pmu_disable_all(void)
 {
@@ -1742,8 +1744,6 @@ static void __intel_pmu_disable_all(void)
 
if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
intel_pmu_disable_bts();
-   else
-   intel_bts_disable_local();
 
intel_pmu_pebs_disable_all();
 }
@@ -1771,8 +1771,7 @@ static void __intel_pmu_enable_all(int added, bool pmi)
return;
 
intel_pmu_enable_bts(event->hw.config);
-   } else
-   intel_bts_enable_local();
+   }
 }
 
 static void intel_pmu_enable_all(int added)
@@ -2076,6 +2075,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 */
if (!x86_pmu.late_ack)
apic_write(APIC_LVTPC, APIC_DM_NMI);
+   intel_bts_disable_local();
__intel_pmu_disable_all();
handled = intel_pmu_drain_bts_buffer();
handled += intel_bts_interrupt();
@@ -2175,6 +2175,7 @@ done:
/* Only restore PMU state when it's active. See x86_pmu_disable(). */
if (cpuc->enabled)
__intel_pmu_enable_all(0, true);
+   intel_bts_enable_local();
 
/*
 * Only unmask the NMI after the overflow counters
-- 
2.9.3



[PATCH v2 5/5] perf/x86/intel/bts: Kill a silly warning

2016-09-06 Thread Alexander Shishkin
At the moment, intel_bts will WARN() out if there is more than one
event writing to the same ring buffer, via SET_OUTPUT, and will only
send data from one event to a buffer.

There is no reason to have this warning in, so kill it.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/bts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 9233edf993..bdcd651099 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -378,8 +378,6 @@ bts_buffer_reset(struct bts_buffer *buf, struct 
perf_output_handle *handle)
return 0;
 
head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
-   if (WARN_ON_ONCE(head != local_read(&buf->head)))
-   return -EINVAL;
 
phys = &buf->buf[buf->cur_buf];
space = phys->offset + phys->displacement + phys->size - head;
-- 
2.9.3



[PATCH v2 4/5] perf/x86/intel/bts: Fix BTS PMI detection

2016-09-06 Thread Alexander Shishkin
Since BTS doesn't have a dedicated PMI status bit, the driver needs to
take extra care to check for the condition that triggers it to avoid
spurious NMI warnings.

Regardless of the local BTS context state, the only way of knowing that
the NMI is ours is to compare the write pointer against the interrupt
threshold.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/bts.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 61e1d713b1..9233edf993 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -446,26 +446,37 @@ bts_buffer_reset(struct bts_buffer *buf, struct 
perf_output_handle *handle)
 
 int intel_bts_interrupt(void)
 {
+   struct debug_store *ds = this_cpu_ptr(&cpu_hw_events)->ds;
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
struct perf_event *event = bts->handle.event;
struct bts_buffer *buf;
s64 old_head;
-   int err = -ENOSPC;
+   int err = -ENOSPC, handled = 0;
+
+   /*
+* The only surefire way of knowing if this NMI is ours is by checking
+* the write ptr against the PMI threshold.
+*/
+   if (ds->bts_index >= ds->bts_interrupt_threshold)
+   handled = 1;
 
/*
 * this is wrapped in intel_bts_enable_local/intel_bts_disable_local,
 * so we can only be INACTIVE or STOPPED
 */
if (READ_ONCE(bts->state) == BTS_STATE_STOPPED)
-   return 0;
+   return handled;
 
buf = perf_get_aux(&bts->handle);
+   if (!buf)
+   return handled;
+
/*
 * Skip snapshot counters: they don't use the interrupt, but
 * there's no other way of telling, because the pointer will
 * keep moving
 */
-   if (!buf || buf->snapshot)
+   if (buf->snapshot)
return 0;
 
old_head = local_read(&buf->head);
@@ -473,7 +484,7 @@ int intel_bts_interrupt(void)
 
/* no new data */
if (old_head == local_read(&buf->head))
-   return 0;
+   return handled;
 
perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
!!local_xchg(&buf->lost, 0));
-- 
2.9.3



[PATCH v2 2/5] perf: Fix aux_mmap_count vs aux_refcount order

2016-09-06 Thread Alexander Shishkin
The order of accesses to the ring buffer's aux_mmap_count and aux_refcount
has to be preserved across the users, namely perf_mmap_close() and
perf_aux_output_begin(); otherwise the inversion can result in the latter
holding the last reference to the aux buffer and subsequently freeing
it in atomic context, triggering a warning.

> [ cut here ]
> WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 
> __rb_free_aux+0x11a/0x130
> CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596
> Call Trace:
>  [] __warn+0xcb/0xf0
>  [] warn_slowpath_null+0x1d/0x20
>  [] __rb_free_aux+0x11a/0x130
>  [] rb_free_aux+0x18/0x20
>  [] perf_aux_output_begin+0x163/0x1e0
>  [] bts_event_start+0x3a/0xd0
>  [] bts_event_add+0x5d/0x80
>  [] event_sched_in.isra.104+0xf6/0x2f0
>  [] group_sched_in+0x6e/0x190
>  [] ctx_sched_in+0x2fe/0x5f0
>  [] perf_event_sched_in+0x60/0x80
>  [] ctx_resched+0x5b/0x90
>  [] __perf_event_enable+0x1e1/0x240
>  [] event_function+0xa9/0x180
>  [] ? perf_cgroup_attach+0x70/0x70
>  [] remote_function+0x3f/0x50
>  [] flush_smp_call_function_queue+0x83/0x150
>  [] generic_smp_call_function_single_interrupt+0x13/0x60
>  [] smp_call_function_single_interrupt+0x27/0x40
>  [] call_function_single_interrupt+0x89/0x90
>  [] finish_task_switch+0xa6/0x210
>  [] ? finish_task_switch+0x67/0x210
>  [] __schedule+0x3dd/0xb50
>  [] schedule+0x35/0x80
>  [] sys_sched_yield+0x61/0x70
>  [] entry_SYSCALL_64_fastpath+0x18/0xa8
> ---[ end trace 6235f556f5ea83a9 ]---

This patch puts the checks in perf_aux_output_begin() in the same order
as that of perf_mmap_close().

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 kernel/events/ring_buffer.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index ae9b90dc9a..71c9194b1f 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -330,15 +330,19 @@ void *perf_aux_output_begin(struct perf_output_handle 
*handle,
if (!rb)
return NULL;
 
-   if (!rb_has_aux(rb) || !atomic_inc_not_zero(&rb->aux_refcount))
-   goto err;
-
/*
-* If rb::aux_mmap_count is zero (and rb_has_aux() above went through),
-* the aux buffer is in perf_mmap_close(), about to get freed.
+* If aux_mmap_count is zero, the aux buffer is in perf_mmap_close(),
+* about to get freed, so we leave immediately.
+*
+* Checking rb::aux_mmap_count and rb::refcount has to be done in
+* the same order, see perf_mmap_close. Otherwise we end up freeing
+* aux pages in this path, which is a bug, because in_atomic().
 */
if (!atomic_read(&rb->aux_mmap_count))
-   goto err_put;
+   goto err;
+
+   if (!rb_has_aux(rb) || !atomic_inc_not_zero(&rb->aux_refcount))
+   goto err;
 
/*
 * Nesting is not supported for AUX area, make sure nested
-- 
2.9.3



[PATCH v2 3/5] perf/x86/intel/bts: Fix confused ordering of PMU callbacks

2016-09-06 Thread Alexander Shishkin
The intel_bts driver is using a cpu-local 'started' variable to order
callbacks and PMIs and make sure that AUX transactions don't get messed
up. However, the ordering rules around this variable are a complete
mess, which recently resulted in perf_fuzzer-triggered warnings and
panics.

The general ordering rule that this patch enforces is that this
cpu-local variable be set only when the cpu-local AUX transaction is
active; consequently, this variable is to be checked before the AUX
related bits can be touched.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/bts.c | 104 ++--
 1 file changed, 80 insertions(+), 24 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 0a6e393a2e..61e1d713b1 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -31,7 +31,17 @@
 struct bts_ctx {
struct perf_output_handle   handle;
struct debug_store  ds_back;
-   int started;
+   int state;
+};
+
+/* BTS context states: */
+enum {
+   /* no ongoing AUX transactions */
+   BTS_STATE_STOPPED = 0,
+   /* AUX transaction is on, BTS tracing is disabled */
+   BTS_STATE_INACTIVE,
+   /* AUX transaction is on, BTS tracing is running */
+   BTS_STATE_ACTIVE,
 };
 
 static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
@@ -204,6 +214,15 @@ static void bts_update(struct bts_ctx *bts)
 static int
 bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle);
 
+/*
+ * Ordering PMU callbacks wrt themselves and the PMI is done by means
+ * of bts::state, which:
+ *  - is set when bts::handle::event is valid, that is, between
+ *perf_aux_output_begin() and perf_aux_output_end();
+ *  - is zero otherwise;
+ *  - is ordered against bts::handle::event with a compiler barrier.
+ */
+
 static void __bts_event_start(struct perf_event *event)
 {
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
@@ -221,10 +240,13 @@ static void __bts_event_start(struct perf_event *event)
 
/*
 * local barrier to make sure that ds configuration made it
-* before we enable BTS
+* before we enable BTS and bts::state goes ACTIVE
 */
wmb();
 
+   /* INACTIVE/STOPPED -> ACTIVE */
+   WRITE_ONCE(bts->state, BTS_STATE_ACTIVE);
+
intel_pmu_enable_bts(config);
 
 }
@@ -251,9 +273,6 @@ static void bts_event_start(struct perf_event *event, int 
flags)
 
__bts_event_start(event);
 
-   /* PMI handler: this counter is running and likely generating PMIs */
-   ACCESS_ONCE(bts->started) = 1;
-
return;
 
 fail_end_stop:
@@ -263,30 +282,34 @@ fail_stop:
event->hw.state = PERF_HES_STOPPED;
 }
 
-static void __bts_event_stop(struct perf_event *event)
+static void __bts_event_stop(struct perf_event *event, int state)
 {
+   struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+   /* ACTIVE -> INACTIVE(PMI)/STOPPED(->stop()) */
+   WRITE_ONCE(bts->state, state);
+
/*
 * No extra synchronization is mandated by the documentation to have
 * BTS data stores globally visible.
 */
intel_pmu_disable_bts();
-
-   if (event->hw.state & PERF_HES_STOPPED)
-   return;
-
-   ACCESS_ONCE(event->hw.state) |= PERF_HES_STOPPED;
 }
 
 static void bts_event_stop(struct perf_event *event, int flags)
 {
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
-   struct bts_buffer *buf = perf_get_aux(&bts->handle);
+   struct bts_buffer *buf = NULL;
+   int state = READ_ONCE(bts->state);
+
+   if (state == BTS_STATE_ACTIVE)
+   __bts_event_stop(event, BTS_STATE_STOPPED);
 
-   /* PMI handler: don't restart this counter */
-   ACCESS_ONCE(bts->started) = 0;
+   if (state != BTS_STATE_STOPPED)
+   buf = perf_get_aux(&bts->handle);
 
-   __bts_event_stop(event);
+   event->hw.state |= PERF_HES_STOPPED;
 
if (flags & PERF_EF_UPDATE) {
bts_update(bts);
@@ -296,6 +319,7 @@ static void bts_event_stop(struct perf_event *event, int 
flags)
bts->handle.head =
local_xchg(&buf->data_size,
   buf->nr_pages << PAGE_SHIFT);
+
perf_aux_output_end(&bts->handle,
local_xchg(&buf->data_size, 0),
!!local_xchg(&buf->lost, 0));
}
@@ -310,8 +334,20 @@ static void bts_event_stop(struct perf_event *event, int 
flags)
 void intel_bts_enable_local(void)
 {
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+   int state = READ_ONCE(bts->state);
+
+   /*
+* Here we transition from 

[PATCH v2 1/5] perf: Fix a race between mmap_close and set_output of AUX events

2016-09-06 Thread Alexander Shishkin
In the mmap_close path we need to stop all the AUX events that are
writing data to the AUX area that we are unmapping, before we can
safely free the pages. To determine if an event needs to be stopped,
we're comparing its ->rb against the one that's getting unmapped.
However, a SET_OUTPUT ioctl may turn up inside an AUX transaction
and swizzle event::rb to some other ring buffer, but the transaction
will keep writing data to the old ring buffer until the event gets
scheduled out. At this point, mmap_close will skip over such an
event and will proceed to free the AUX area, while it's still being
used by this event, which will set off a warning in the mmap_close
path and cause a memory corruption.

To avoid this, always stop an AUX event before its ->rb is updated;
this will release the (potentially) last reference on the AUX area
of the buffer. If the event gets restarted, its new ring buffer will
be used. If another SET_OUTPUT comes and switches it back to the
old ring buffer that's getting unmapped, it's also fine: this
ring buffer's aux_mmap_count will be zero and AUX transactions won't
start any more.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 kernel/events/core.c | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 803481cb6c..71df77e234 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2464,11 +2464,11 @@ static int __perf_event_stop(void *info)
return 0;
 }
 
-static int perf_event_restart(struct perf_event *event)
+static int perf_event_stop(struct perf_event *event, int restart)
 {
struct stop_event_data sd = {
.event  = event,
-   .restart= 1,
+   .restart= restart,
};
int ret = 0;
 
@@ -4811,6 +4811,19 @@ static void ring_buffer_attach(struct perf_event *event,
spin_unlock_irqrestore(&rb->event_lock, flags);
}
 
+   /*
+* Avoid racing with perf_mmap_close(AUX): stop the event
+* before swizzling the event::rb pointer; if it's getting
+* unmapped, its aux_mmap_count will be 0 and it won't
+* restart. See the comment in __perf_pmu_output_stop().
+*
+* Data will inevitably be lost when set_output is done in
+* mid-air, but then again, whoever does it like this is
+* not in for the data anyway.
+*/
+   if (has_aux(event))
+   perf_event_stop(event, 0);
+
rcu_assign_pointer(event->rb, rb);
 
if (old_rb) {
@@ -6086,7 +6099,7 @@ static void perf_event_addr_filters_exec(struct 
perf_event *event, void *data)
raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
if (restart)
-   perf_event_restart(event);
+   perf_event_stop(event, 1);
 }
 
 void perf_event_exec(void)
@@ -6130,7 +6143,13 @@ static void __perf_event_output_stop(struct perf_event 
*event, void *data)
 
/*
 * In case of inheritance, it will be the parent that links to the
-* ring-buffer, but it will be the child that's actually using it:
+* ring-buffer, but it will be the child that's actually using it.
+*
+* We are using event::rb to determine if the event should be stopped,
+* however this may race with ring_buffer_attach() (through set_output),
+* which will make us skip the event that actually needs to be stopped.
+* So ring_buffer_attach() has to stop an aux event before re-assigning
+* its rb pointer.
 */
if (rcu_dereference(parent->rb) == rb)
ro->err = __perf_event_stop(&sd);
@@ -6653,7 +6672,7 @@ static void __perf_addr_filters_adjust(struct perf_event 
*event, void *data)
raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
if (restart)
-   perf_event_restart(event);
+   perf_event_stop(event, 1);
 }
 
 /*
@@ -7831,7 +7850,7 @@ static void perf_event_addr_filters_apply(struct 
perf_event *event)
mmput(mm);
 
 restart:
-   perf_event_restart(event);
+   perf_event_stop(event, 1);
 }
 
 /*
-- 
2.9.3



[PATCH v2 0/5] perf, bts: Fallout from the fuzzer for perf/urgent

2016-09-06 Thread Alexander Shishkin
Hi,

There were more bugs since the previous version, plus the BTS barriers
got fixed. With these patches, my testcase keeps running and no
spurious NMI warnings pop up any more.

Original story:

Recently Vince has reported warnings and panics coming from the
general direction of AUX tracing. I found two bugs which manifest
similarly, one in intel_bts driver and one in AUX unmapping path.

Both are triggered by racing SET_OUTPUT against mmap_close while
running AUX tracing. I have a test case that set fire to the kernel
within a few seconds by doing this, which I can share if anyone
cares.

These are all good candidates for 4.7-stable and the BTS ones can be
theoretically backported further.

Alexander Shishkin (5):
  perf: Fix a race between mmap_close and set_output of AUX events
  perf: Fix aux_mmap_count vs aux_refcount order
  perf/x86/intel/bts: Fix confused ordering of PMU callbacks
  perf/x86/intel/bts: Fix BTS PMI detection
  perf/x86/intel/bts: Kill a silly warning

 arch/x86/events/intel/bts.c | 123 +---
 kernel/events/core.c|  31 ---
 kernel/events/ring_buffer.c |  16 +++---
 3 files changed, 129 insertions(+), 41 deletions(-)

-- 
2.9.3



Re: [git pull] stm class/intel_th: Updates for char-misc-next

2016-08-31 Thread Alexander Shishkin
Greg KH  writes:

>>   git://git.kernel.org/pub/scm/linux/kernel/git/ash/stm.git 
>> tags/stm-for-greg-20160701
>
> I'm already all caught up with stm patches, right?  If not, can you
> resend the pull requests?

Yes, you are all caught up with stm/intel_th indeed. Thank you for
confirming though!

Regards,
--
Alex


Re: [PATCH 1/3] perf/x86/intel/bts: Fix confused ordering of PMU callbacks

2016-08-29 Thread Alexander Shishkin
Peter Zijlstra <pet...@infradead.org> writes:

> On Wed, Aug 24, 2016 at 05:15:54PM +0300, Alexander Shishkin wrote:
>> @@ -221,10 +245,13 @@ static void __bts_event_start(struct perf_event *event)
>>  
>>  /*
>>   * local barrier to make sure that ds configuration made it
>> - * before we enable BTS
>> + * before we enable BTS and bts::state goes ACTIVE
>>   */
>>  wmb();
>>  
>> +/* INACTIVE/STOPPED -> ACTIVE */
>> +WRITE_ONCE(bts->state, BTS_STATE_ACTIVE);
>> +
>>  intel_pmu_enable_bts(config);
>>  
>>  }
>
> Alexander, were you going to post a new version of this patch without
> the barrier confusion?

Yes, but the testcase exploded again over the weekend, so I'm looking at
it again.

Regards,
--
Alex


Re: [PATCH 1/3] perf/x86/intel/bts: Fix confused ordering of PMU callbacks

2016-08-27 Thread Alexander Shishkin
Yes, Will's patch notwithstanding, these patches fix the
warnings/oopses that you observed.

On 26 August 2016 at 23:49, Vince Weaver <vi...@deater.net> wrote:
> On Wed, 24 Aug 2016, Alexander Shishkin wrote:
>
>> Alexander Shishkin <alexander.shish...@linux.intel.com> writes:
>>
>> > Alexander Shishkin <alexander.shish...@linux.intel.com> writes:
>> >
>> >> Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
>> >
>> > Ok, this one is broken, please disregard.
>>
>> Vince, can you try the following (with the other two in this series)?
>>
>
> Sorry for the delay.  Not sure if testing was still needed (or if it was
> superceded by Will Deacon's patch) but I ran the fuzzer for a few hours
> with this patch and the other two applied and didn't see any of the NMI
> errors that I saw with the original patch.
>
> Vince


Re: [PATCH 1/3] perf/x86/intel/bts: Fix confused ordering of PMU callbacks

2016-08-24 Thread Alexander Shishkin
Alexander Shishkin <alexander.shish...@linux.intel.com> writes:

> Alexander Shishkin <alexander.shish...@linux.intel.com> writes:
>
>> Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
>
> Ok, this one is broken, please disregard.

Vince, can you try the following (with the other two in this series)?

---

>From 68713194b3df8e565c4d319a80e9e7338fa1ec13 Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <alexander.shish...@linux.intel.com>
Date: Tue, 23 Aug 2016 10:50:57 +0300
Subject: [PATCH] perf/x86/intel/bts: Fix confused ordering of PMU callbacks

The intel_bts driver is using a cpu-local 'started' variable to order
callbacks and PMIs and make sure that AUX transactions don't get messed
up. However, the ordering rules around this variable are a complete
mess, which recently resulted in perf_fuzzer-triggered warnings and
panics.

The general ordering rule that this patch enforces is that this
cpu-local variable be set only when the cpu-local AUX transaction is
active; consequently, this variable is to be checked before the AUX
related bits can be touched.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
---
 arch/x86/events/intel/bts.c | 119 +++-
 1 file changed, 96 insertions(+), 23 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 0a6e393a2e..13ce76e895 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -31,7 +31,18 @@
 struct bts_ctx {
struct perf_output_handle   handle;
struct debug_store  ds_back;
-   int started;
+   local_t pmi_pending;
+   int state;
+};
+
+/* BTS context states: */
+enum {
+   /* no ongoing AUX transactions */
+   BTS_STATE_STOPPED = 0,
+   /* AUX transaction is on, BTS tracing is disabled */
+   BTS_STATE_INACTIVE,
+   /* AUX transaction is on, BTS tracing is running */
+   BTS_STATE_ACTIVE,
 };
 
 static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
@@ -191,6 +202,9 @@ static void bts_update(struct bts_ctx *bts)
if (ds->bts_index >= ds->bts_absolute_maximum)
local_inc(&buf->lost);
 
+   if (ds->bts_index >= ds->bts_interrupt_threshold)
+   local_inc(&bts->pmi_pending);
+
/*
 * old and head are always in the same physical buffer, so we
 * can subtract them to get the data size.
@@ -204,6 +218,16 @@ static void bts_update(struct bts_ctx *bts)
 static int
 bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle);
 
+/*
+ * Ordering PMU callbacks wrt themselves and the PMI is done by means
+ * of bts::state, which:
+ *  - is set when bts::handle::event is valid, that is, between
+ *perf_aux_output_begin() and perf_aux_output_end();
+ *  - is zero otherwise;
+ *  - is ordered against bts::handle::event with a local barrier to
+ *enforce CPU ordering.
+ */
+
 static void __bts_event_start(struct perf_event *event)
 {
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
@@ -221,10 +245,13 @@ static void __bts_event_start(struct perf_event *event)
 
/*
 * local barrier to make sure that ds configuration made it
-* before we enable BTS
+* before we enable BTS and bts::state goes ACTIVE
 */
wmb();
 
+   /* INACTIVE/STOPPED -> ACTIVE */
+   WRITE_ONCE(bts->state, BTS_STATE_ACTIVE);
+
intel_pmu_enable_bts(config);
 
 }
@@ -251,9 +278,6 @@ static void bts_event_start(struct perf_event *event, int 
flags)
 
__bts_event_start(event);
 
-   /* PMI handler: this counter is running and likely generating PMIs */
-   ACCESS_ONCE(bts->started) = 1;
-
return;
 
 fail_end_stop:
@@ -265,28 +289,36 @@ fail_stop:
 
 static void __bts_event_stop(struct perf_event *event)
 {
+   struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+   /* ACTIVE -> INACTIVE */
+   WRITE_ONCE(bts->state, BTS_STATE_INACTIVE);
+
/*
 * No extra synchronization is mandated by the documentation to have
 * BTS data stores globally visible.
 */
intel_pmu_disable_bts();
-
-   if (event->hw.state & PERF_HES_STOPPED)
-   return;
-
-   ACCESS_ONCE(event->hw.state) |= PERF_HES_STOPPED;
 }
 
 static void bts_event_stop(struct perf_event *event, int flags)
 {
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
-   struct bts_buffer *buf = perf_get_aux(&bts->handle);
+   struct bts_buffer *buf;
+
+   if (READ_ONCE(bts->state) == BTS_STATE_ACTIVE)
+   __bts_event_stop(event);
+
+   /*
+* order buf (handle::event load) against bts::state store;
+* matches wmb

<    5   6   7   8   9   10   11   12   13   14   >