Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 17, 2015 at 08:20:37PM +0200, Andi Kleen wrote:
> On Fri, Apr 17, 2015 at 04:44:07PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:
> > 
> > > > But that brings us to patch 1 of this series, how is that correct in 
> > > > the face of
> > > > this? There is an arbitrary delay (A->B) added to the period.
> > > > And the Changelog of course never did bother to make that clear.
> 
> That's how perf and other profilers always behaved. The PMI
> is not part of the period. The automatic PEBS reload is not in any way
> different. It's much faster than a PMI, but it's also not zero cost.
> 
> This is not a gap in measurement though -- there is no other code
> running during that time on that CPU. It's simply overhead from the
> measurement mechanism.
> 
> > > 
> > > OK. I will update the changelog for patch 1 as below.
> > > ---
> > > When a fixed period is specified, this patch make perf use the PEBS
> > > auto reload mechanism. This makes normal profiling faster, because
> > > it avoids one costly MSR write in the PMI handler.
> > 
> > > However, the reset value will be loaded by hardware assist. There is 
> > > a little bit delay compared to previous non-auto-reload mechanism.
> > > The delay is arbitrary but very small.
> > 
> > What is very small? And doesn't that mean its significant at exactly the
> > point this patch series is aimed at, namely very short period.
> 
> The assist cost is 400-800 cycles, assuming common cases with everything
> cached. The minimum period the patch currently uses is 10000. In that
> extreme case it can be ~10% if cycles are used.

Thanks, please include all this information.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-17 Thread Andi Kleen
On Fri, Apr 17, 2015 at 04:44:07PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:
> 
> > > But that brings us to patch 1 of this series, how is that correct in the 
> > > face of
> > > this? There is an arbitrary delay (A->B) added to the period.
> > > And the Changelog of course never did bother to make that clear.

That's how perf and other profilers always behaved. The PMI
is not part of the period. The automatic PEBS reload is not in any way
different. It's much faster than a PMI, but it's also not zero cost.

This is not a gap in measurement though -- there is no other code
running during that time on that CPU. It's simply overhead from the
measurement mechanism.

> > 
> > OK. I will update the changelog for patch 1 as below.
> > ---
> > When a fixed period is specified, this patch make perf use the PEBS
> > auto reload mechanism. This makes normal profiling faster, because
> > it avoids one costly MSR write in the PMI handler.
> 
> > However, the reset value will be loaded by hardware assist. There is 
> > a little bit delay compared to previous non-auto-reload mechanism.
> > The delay is arbitrary but very small.
> 
> What is very small? And doesn't that mean its significant at exactly the
> point this patch series is aimed at, namely very short period.

The assist cost is 400-800 cycles, assuming common cases with everything
cached. The minimum period the patch currently uses is 10000. In that
extreme case it can be ~10% if cycles are used.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
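
To put rough numbers on the estimate above, here is a minimal sketch that
expresses the assist cost as a share of the sampling period. The function
name is hypothetical, and the period of 10000 cycles used in the note below
is an assumption chosen for illustration, not a value read from the patch.

```c
#include <stdint.h>

/*
 * Overhead of the hardware assist relative to the sampling period,
 * in per mille (1/1000). Both arguments are in cycles.
 * Hypothetical helper, for illustration only.
 */
static uint64_t overhead_permille(uint64_t assist_cycles, uint64_t period)
{
	return assist_cycles * 1000u / period;
}
```

With an assumed period of 10000 cycles, overhead_permille(400, 10000) gives
40 and overhead_permille(800, 10000) gives 80, that is roughly 4% to 8% of
the period, the same order as the ~10% worst case quoted above.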


Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:

> > But that brings us to patch 1 of this series, how is that correct in the 
> > face of
> > this? There is an arbitrary delay (A->B) added to the period.
> > And the Changelog of course never did bother to make that clear.
> 
> OK. I will update the changelog for patch 1 as below.
> ---
> When a fixed period is specified, this patch make perf use the PEBS
> auto reload mechanism. This makes normal profiling faster, because
> it avoids one costly MSR write in the PMI handler.

> However, the reset value will be loaded by hardware assist. There is 
> a little bit delay compared to previous non-auto-reload mechanism.
> The delay is arbitrary but very small.

What is very small? And doesn't that mean it's significant at exactly the
point this patch series is aimed at, namely very short periods?


RE: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-17 Thread Liang, Kan


> -----Original Message-----
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Friday, April 17, 2015 9:13 AM
> To: Liang, Kan
> Cc: linux-kernel@vger.kernel.org; mi...@kernel.org;
> a...@infradead.org; eran...@google.com; a...@firstfloor.org
> Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS
> buffer
> 
> On Fri, Apr 17, 2015 at 12:50:33PM +0000, Liang, Kan wrote:
> >
> >
> > > >
> > > >  A) the CTRn value reaches 0:
> > > > - the corresponding bit in GLOBAL_STATUS gets set
> > > > - we start arming the hardware assist
> > > >
> > > > < some unspecified amount of time later --
> > > > this could cover multiple events of interest >
> > > >
> > > >  B) the hardware assist is armed, any next event will trigger it
> > > >
> > > >  C) a matching event happens:
> > > > - the hardware assist triggers and generates a PEBS record
> > > >   this includes a copy of GLOBAL_STATUS at this moment
> > > > - if we auto-reload we (re)set CTRn
> > >
> > > Is this actually true? Do we reload here or on A ?
> > >
> >
> > Yes, on C.
> > According to SDM Volume 3, 18.7.1.1, the reset value will be loaded
> > after each PEBS record is written, which is done by hw assist.
> 
> OK, then I did indeed remember that right.
> 
> But that brings us to patch 1 of this series, how is that correct in the face 
> of
> this? There is an arbitrary delay (A->B) added to the period.
> And the Changelog of course never did bother to make that clear.

OK. I will update the changelog for patch 1 as below.
---
When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
However, the reset value will be loaded by hardware assist. There is
a small delay compared to the previous non-auto-reload mechanism.
The delay is arbitrary but very small.

Thanks,
Kan




Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 17, 2015 at 12:50:33PM +0000, Liang, Kan wrote:
> 
> 
> > >
> > >  A) the CTRn value reaches 0:
> > > - the corresponding bit in GLOBAL_STATUS gets set
> > > - we start arming the hardware assist
> > >
> > > < some unspecified amount of time later --
> > > this could cover multiple events of interest >
> > >
> > >  B) the hardware assist is armed, any next event will trigger it
> > >
> > >  C) a matching event happens:
> > > - the hardware assist triggers and generates a PEBS record
> > >   this includes a copy of GLOBAL_STATUS at this moment
> > > - if we auto-reload we (re)set CTRn
> > 
> > Is this actually true? Do we reload here or on A ?
> > 
> 
> Yes, on C.
> According to SDM Volume 3, 18.7.1.1, the reset value will be
> loaded after each PEBS record is written, which is done
> by hw assist.

OK, then I did indeed remember that right.

But that brings us to patch 1 of this series, how is that correct in the
face of this? There is an arbitrary delay (A->B) added to the period.
And the Changelog of course never did bother to make that clear.


RE: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-17 Thread Liang, Kan


> >
> >  A) the CTRn value reaches 0:
> > - the corresponding bit in GLOBAL_STATUS gets set
> > - we start arming the hardware assist
> >
> > < some unspecified amount of time later --
> > this could cover multiple events of interest >
> >
> >  B) the hardware assist is armed, any next event will trigger it
> >
> >  C) a matching event happens:
> > - the hardware assist triggers and generates a PEBS record
> >   this includes a copy of GLOBAL_STATUS at this moment
> > - if we auto-reload we (re)set CTRn
> 
> Is this actually true? Do we reload here or on A ?
> 

Yes, on C.
According to SDM Volume 3, 18.7.1.1, the reset value will be
loaded after each PEBS record is written, which is done
by hw assist.


> > - we clear the relevant bit in GLOBAL_STATUS


Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-17 Thread Peter Zijlstra
On Thu, Apr 16, 2015 at 02:53:42PM +0200, Peter Zijlstra wrote:
> When the PEBS interrupt threshold is larger than one record and the
> machine supports multiple PEBS events, the records of these events are
> mixed up and we need to demultiplex them.
> 
> Demuxing the records is hard because the hardware is deficient. The
> hardware has two issues that, when combined, create impossible scenarios
> to demux.
> 
> The first issue is that the 'status' field of the PEBS record is a copy
> of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
> problem let us first describe the regular PEBS cycle:
> 
>  A) the CTRn value reaches 0:
> - the corresponding bit in GLOBAL_STATUS gets set
> - we start arming the hardware assist
> 
> < some unspecified amount of time later --
> this could cover multiple events of interest >
> 
>  B) the hardware assist is armed, any next event will trigger it
> 
>  C) a matching event happens:
> - the hardware assist triggers and generates a PEBS record
>   this includes a copy of GLOBAL_STATUS at this moment
> - if we auto-reload we (re)set CTRn

Is this actually true? Do we reload here or on A ?

> - we clear the relevant bit in GLOBAL_STATUS


Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-16 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:
> From: Yan, Zheng 
> 
> When PEBS interrupt threshold is larger than one, the PEBS buffer
> may include multiple records for each PEBS event. This patch makes
> the code first count how many records each PEBS event has, then
> output the samples in batch.
> 
> One corner case needs to mention is that the PEBS hardware doesn't
> deal well with collisions. The records for the events can be collapsed
> into a single one, and it's not possible to reconstruct all events that
> caused the PEBS record.
> Here are some cases which can be called collisions.
>  - PEBS events happen near to each other, so the hardware merges them.
>  - PEBS events happen near to each other, but they are not merged.
>    The GLOBAL_STATUS for first counter is clear before generating event
>    for next counter. Only the first record can be treated as collisions.
>  - Same as case2, but the first counter isn't clear before generating
>    event for next counter. All the records are treated as collision
>    until a record with only one bit set for PEBS event.
> 
> GLOBAL_STATUS could be set by both PEBS and non-PEBS events. Multiple
> non-PEBS bit set doesn't count as collisions.
> 
> In practice collisions are extremely rare, as long as different PEBS
> events are used. The periods are typically very large, so any collision
> is unlikely. When collision happens, we drop the PEBS record.
> The only way you can get a lot of collision is when you count the same
> thing multiple times. But it is not a useful configuration.
> 
> Here are some numbers about collisions.
> Four frequently occurring events
> (cycles:p,instructions:p,branches:p,mem-stores:p) are tested
> 
> Test events which are sampled together            collision rate
> cycles:p,instructions:p                           0.25%
> cycles:p,instructions:p,branches:p                0.30%
> cycles:p,instructions:p,branches:p,mem-stores:p   0.35%
> 
> cycles:p,cycles:p                                 98.52%

*sigh* you're really going to make me write this :-(

The sad part is that I had already written a large part of it for you in
a previous email (
lkml.kernel.org/r/20150330200710.go27...@worktop.programming.kicks-ass.net
).

And yes, writing a good Changelog takes a lot of time and effort,
sometimes more than the actual patch, and that is OK.

There's a *PLEASE CLARIFY* in the below, please do that.

Also the below talks about a PERF_RECORD_SAMPLES_LOST, please also
implement that.

---

When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.

Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible scenarios
to demux.

The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:

 A) the CTRn value reaches 0:
    - the corresponding bit in GLOBAL_STATUS gets set
    - we start arming the hardware assist

    < some unspecified amount of time later --
      this could cover multiple events of interest >

 B) the hardware assist is armed, any next event will trigger it

 C) a matching event happens:
    - the hardware assist triggers and generates a PEBS record
      this includes a copy of GLOBAL_STATUS at this moment
    - if we auto-reload we (re)set CTRn
    - we clear the relevant bit in GLOBAL_STATUS

Now consider the following chain of events:

 A0, B0, A1, C0

The event generated for counter 0 will include a status with counter 1
set, even though it's not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close' --
*PLEASE CLARIFY* either the very same instruction or retired in the same
cycle?

For instance, consider this chain of events:

 A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists (the
instruction matches both criteria), we will generate but a single
record, but again with both counters listed in the status field.

This time the record pertains to both events.

Note that these two cases are different but indistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.
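
The two chains can be replayed against a toy model of the A/B/C cycle to see
why they cannot be told apart. This is only a sketch of the description in
this mail, not of real hardware; the struct, the step_* helpers and the
chain_* functions are all hypothetical names.

```c
#include <stdint.h>

/* Toy model of the PEBS cycle described above; not real hardware. */
struct pebs_model {
	uint64_t global_status; /* overflow bits, one per counter */
	uint64_t armed;         /* counters whose assist is armed */
};

/* A) counter n reaches 0: its GLOBAL_STATUS bit gets set. */
static void step_a(struct pebs_model *m, int n)
{
	m->global_status |= 1ULL << n;
}

/* B) the hardware assist for counter n is armed. */
static void step_b(struct pebs_model *m, int n)
{
	m->armed |= 1ULL << n;
}

/*
 * C) an event matching the counters in 'mask' happens. One record is
 * generated; its status is a copy of GLOBAL_STATUS at this moment.
 * Only the armed counters that matched get their bit cleared.
 */
static uint64_t step_c(struct pebs_model *m, uint64_t mask)
{
	uint64_t status = m->global_status;
	uint64_t fired = mask & m->armed;

	m->armed &= ~fired;
	m->global_status &= ~fired;
	return status;
}

/* A0, B0, A1, C0: the record belongs to counter 0 only. */
static uint64_t chain_unrelated(void)
{
	struct pebs_model m = { 0, 0 };

	step_a(&m, 0);
	step_b(&m, 0);
	step_a(&m, 1);
	return step_c(&m, 1ULL << 0);
}

/* A0, B0, A1, B1, C01: the record belongs to both counters. */
static uint64_t chain_merged(void)
{
	struct pebs_model m = { 0, 0 };

	step_a(&m, 0);
	step_b(&m, 0);
	step_a(&m, 1);
	step_b(&m, 1);
	return step_c(&m, (1ULL << 0) | (1ULL << 1));
}
```

In this model both chains return a status with bits 0 and 1 set, so the
record alone does not say which of the two situations produced it.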

Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.
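
A minimal sketch of that discard policy, assuming the record status words
have already been pulled out of the PEBS buffer into an array. The struct
and function names are hypothetical, and counting a record with no PEBS bit
at all as discarded is an assumption of this sketch, not something the
patch is known to do.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_COUNTERS 64

/* Hypothetical tally of how a batch of PEBS records was attributed. */
struct demux_tally {
	unsigned int samples[MAX_COUNTERS]; /* records attributed per counter */
	unsigned int discarded;             /* ambiguous records dropped */
};

/*
 * Attribute a record to a counter only when exactly one PEBS bit is set
 * in its status; everything else is counted as discarded.
 * The caller is expected to zero *tally first.
 */
static void demux_records(const uint64_t *status, size_t nr,
			  uint64_t pebs_enabled, struct demux_tally *tally)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t bits = status[i] & pebs_enabled; /* ignore !PEBS bits */
		int counter = 0;

		/* zero bits or more than one bit: cannot be attributed */
		if (bits == 0 || (bits & (bits - 1)) != 0) {
			tally->discarded++;
			continue;
		}

		/* exactly one bit set: find its index */
		while (!(bits & 1)) {
			bits >>= 1;
			counter++;
		}
		tally->samples[counter]++;
	}
}
```

The discarded count is the number a lost-samples notification to the user
would carry.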

The assumption/hope is that such discards will be rare, and to make sure
the user is not left in the dark about this we'll emit a
PERF_RECORD_SAMPLES_LOST 

Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-16 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:
 From: Yan, Zheng zheng.z@intel.com
 
 When PEBS interrupt threshold is larger than one, the PEBS buffer
 may include multiple records for each PEBS event. This patch makes
 the code first count how many records each PEBS event has, then
 output the samples in batch.
 
 One corner case needs to mention is that the PEBS hardware doesn't
 deal well with collisions. The records for the events can be collapsed
 into a single one, and it's not possible to reconstruct all events that
 caused the PEBS record.
 Here are some cases which can be called collisions.
  - PEBS events happen near to each other, so the hardware merges them.
  - PEBS events happen near to each other, but they are not merged.
The GLOBAL_STATUS for first counter is clear before generating event
for next counter. Only the first record can be treated as collisions.
  - Same as case2, but the first counter isn't clear before generating
event for next counter. All the records are treated as collision
until a record with only one bit set for PEBS event.
 
 GLOBAL_STATUS could be set by both PEBS and non-PEBS events. Multiple
 non-PEBS bit set doesn't count as collisions.
 
 In practice collisions are extremely rare, as long as different PEBS
 events are used. The periods are typically very large, so any collision
 is unlikely. When collision happens, we drop the PEBS record.
 The only way you can get a lot of collision is when you count the same
 thing multiple times. But it is not a useful configuration.
 
 Here are some numbers about collisions.
 Four frequently occurring events
 (cycles:p,instructions:p,branches:p,mem-stores:p) are tested
 
 Test events which are sampled together   collision rate
 cycles:p,instructions:p  0.25%
 cycles:p,instructions:p,branches:p   0.30%
 cycles:p,instructions:p,branches:p,mem-stores:p  0.35%
 
 cycles:p,cycles:p98.52%

*sigh* you're really going to make me write this :-(

The sad part is that I had already written a large part of it for you in
a previous email (
lkml.kernel.org/r/20150330200710.go27...@worktop.programming.kicks-ass.net
).

And yes, writing a good Changelog takes a lot of time and effort,
sometimes more than the actual patch, and that is OK.

There's a *PLEASE CLARIFY* in the below, please do that.

Also the below talks about a PERF_RECORD_SAMPLES_LOST, please also
implement that.

---

When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.

Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible scenarios
to demux.

The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:

 A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist

 some unspecified amount of time later --
this could cover multiple events of interest 

 B) the hardware assist is armed, any next event will trigger it

 C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
  this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS

Now consider the following chain of events:

 A0, B0, A1, C0

The event generated for counter 0 will include a status with counter 1
set, even though its not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close' --
*PLEASE CLARIFY* either the very same instruction or retired in the same
cycle?

For instance, consider this chain of events:

 A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists (the
instruction matches both criteria), we will generate but a single
record, but again with both counters listed in the status field.

This time the record pertains to both events.

Note that these two cases are different but undistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.

Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.
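
A minimal sketch of such a discard policy, assuming the PEBS counters are
described by an enable mask. record_owner() and struct demux_stats are
hypothetical names, not taken from the patch; the lost counter only stands
in for whatever number would be reported to the user.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for discarded records. */
struct demux_stats {
	unsigned long lost;
};

/*
 * Return the index of the counter that owns a record, or -1 when the
 * record cannot be attributed. Bits outside 'pebs_enabled' are ignored,
 * since status bits of !PEBS counters carry no ownership information.
 */
static int record_owner(uint64_t status, uint64_t pebs_enabled,
			struct demux_stats *stats)
{
	uint64_t pebs = status & pebs_enabled;
	int bit = 0;

	if (!pebs)
		return -1;		/* no PEBS counter involved */

	if (pebs & (pebs - 1)) {	/* more than one PEBS bit set */
		stats->lost++;		/* ambiguous: discard and count */
		return -1;
	}

	while (!(pebs & 1)) {		/* find the single set bit */
		pebs >>= 1;
		bit++;
	}
	return bit;
}
```

With four PEBS counters enabled (mask 0xf), a status of 0x1 is attributed
to counter 0, while a status of 0x3 is dropped and counted as lost.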

The assumption/hope is that such discards will be rare, and to make sure
the user is not left in the dark about this we'll emit a
PERF_RECORD_SAMPLES_LOST record with the number 

Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-15 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:
> +/* Clear all non-PEBS bits */
> +static u64
> +nonpebs_bit_clear(u64 pebs_status)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	struct perf_event *event;
> +	int bit;
> +
> +	for_each_set_bit(bit, (unsigned long *)&pebs_status, 64) {
> +
> +		if (bit >= x86_pmu.max_pebs_events)
> +			clear_bit(bit, (unsigned long *)&pebs_status);
> +		else {
> +			event = cpuc->events[bit];
> +			WARN_ON_ONCE(!event);
> +
> +			if (!event->attr.precise_ip)
> +				clear_bit(bit, (unsigned long *)&pebs_status);
> +		}
> +	}
> +
> +	return pebs_status;
> +}

What was wrong with:

status = p->status & cpuc->pebs_enabled;

?

We use the same index bits in the PEBS_ENABLE MSR as in the status reg,
right? If you're really paranoid you can mask out the high (>31) bits
too I suppose.
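
A userspace sketch comparing the two approaches, under the assumption that
bit n of the enable mask is set exactly when counter n runs a precise
event. MAX_PEBS_EVENTS, clear_by_loop() and clear_by_mask() are
hypothetical names; the precise[] array stands in for the per-event
attr.precise_ip check. As far as I know the high half of PEBS_ENABLE can
carry load-latency enable bits, which is one reason masking the bits above
31 is not pure paranoia.

```c
#include <stdint.h>

#define MAX_PEBS_EVENTS 4	/* assumption: stands in for x86_pmu.max_pebs_events */

/* Loop version, mirroring the structure of nonpebs_bit_clear(). */
static uint64_t clear_by_loop(uint64_t status,
			      const int precise[MAX_PEBS_EVENTS])
{
	for (int bit = 0; bit < 64; bit++) {
		if (!(status & (1ULL << bit)))
			continue;
		/* drop bits beyond the PEBS counters and non-precise events */
		if (bit >= MAX_PEBS_EVENTS || !precise[bit])
			status &= ~(1ULL << bit);
	}
	return status;
}

/* Mask version: a single AND, plus masking of the high 32 bits. */
static uint64_t clear_by_mask(uint64_t status, uint64_t pebs_enabled)
{
	return status & pebs_enabled & 0xffffffffULL;
}
```

For counters 0 and 2 precise, an enable mask of 0x5 with bit 32 also set,
and a status of 0x7 with bits 32 and 62 also set, both functions reduce
the status to 0x5.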


Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-15 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:

> The only way you can get a lot of collision is when you count the same
> thing multiple times. But it is not a useful configuration.

Not entirely true; I _think_ you can be unfortunate if you measure with
a userspace only PEBS event along with either a kernel or unrestricted
PEBS event. Imagine the event triggering and setting the overflow flag
right before entering the kernel. Then all kernel side events will end
up with multiple bits set.
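
The scenario can be sketched with a small model. It assumes, as suggested
above, that the userspace-only counter cannot complete its cycle while the
CPU is in the kernel, so its overflow bit stays set for the whole kernel
stretch. USER_CTR, KERNEL_CTR and ambiguous_kernel_records() are
hypothetical names.

```c
#include <stdint.h>

#define USER_CTR	0	/* hypothetical: userspace-only PEBS counter */
#define KERNEL_CTR	1	/* hypothetical: kernel or unrestricted PEBS counter */

/*
 * Count how many of 'nr' kernel-side records carry more than one PEBS
 * bit, given whether the user counter overflowed right before kernel
 * entry.
 */
static int ambiguous_kernel_records(int nr, int user_overflow_pending)
{
	uint64_t global_status = 0;
	int ambiguous = 0;

	if (user_overflow_pending)
		global_status |= 1ULL << USER_CTR;	/* stale user bit */

	for (int i = 0; i < nr; i++) {
		/* the kernel counter overflows and its assist triggers */
		global_status |= 1ULL << KERNEL_CTR;
		uint64_t status = global_status;	/* copied into the record */
		global_status &= ~(1ULL << KERNEL_CTR);	/* only this bit is cleared */

		if (status & (status - 1))		/* more than one bit set */
			ambiguous++;
	}
	return ambiguous;
}
```

In this model every kernel-side record is ambiguous when the user bit is
pending, and none is otherwise.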




Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer

2015-04-09 Thread Andi Kleen
> +	for_each_set_bit(bit, (unsigned long *)&pebs_status, 64) {
> +
> +		if (bit >= x86_pmu.max_pebs_events)
> +			clear_bit(bit, (unsigned long *)&pebs_status);
> +		else {
> +			event = cpuc->events[bit];
> +			WARN_ON_ONCE(!event);
> +
> +			if (!event->attr.precise_ip)
> +				clear_bit(bit, (unsigned long *)&pebs_status);

Precompute the mask of non-PEBS events first in the caller.
Then this function would be just a & ~mask.

BTW clear_bit is atomic, if you're local you should always use
__clear_bit.


> +}
> +
> +static inline void *
> +get_next_pebs_record_by_bit(void *base, void *top, int bit)
> +{
> +	void *at;
> +	u64 pebs_status;
> +
> +	if (base == NULL)
> +		return NULL;
> +
> +	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> +		struct pebs_record_nhm *p = at;
> +
> +		if (p->status & (1 << bit)) {

Use test_bit.

> +
> +			if (p->status == (1 << bit))
> +				return at;
> +
> +			/* clear non-PEBS bit and re-check */
> +			pebs_status = nonpebs_bit_clear(p->status);
> +			if (pebs_status == (1 << bit))
> +				return at;
> +		}
> +	}
> +	return NULL;
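
A userspace sketch of the same scan with the non-PEBS mask precomputed by
the caller. struct toy_record and next_record_by_bit() are hypothetical
stand-ins, not the kernel's types. The 1ULL shift is only a defensive
choice for a 64-bit status; with bit below max_pebs_events the int shift
in the patch does not overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PEBS record; only status matters here. */
struct toy_record {
	uint64_t status;
	uint64_t payload;
};

/*
 * Return the first record in [base, top) that belongs to counter 'bit'
 * alone once the bits in 'nonpebs_mask' are ignored, or NULL if there is
 * none. The mask is computed once by the caller, not once per record.
 */
static struct toy_record *
next_record_by_bit(struct toy_record *base, struct toy_record *top,
		   int bit, uint64_t nonpebs_mask)
{
	uint64_t want = 1ULL << bit;	/* 64-bit shift */
	struct toy_record *at;

	if (base == NULL)
		return NULL;

	for (at = base; at < top; at++) {
		/* a single compare covers both checks of the original */
		if ((at->status & ~nonpebs_mask) == want)
			return at;
	}
	return NULL;
}
```

Records whose masked status has more than one bit set never match any
counter, so they are skipped by every caller.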

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.

