Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Just one clarification ...

On 8/10/2018 9:25 AM, Reinette Chatre wrote:
> static inline int x86_perf_event_error_state(struct perf_event *event)
> {
> 	int ret = 0;
> 	u64 tmp;
>
> 	ret = perf_event_read_local(event, &tmp, NULL, NULL);
> 	if (ret < 0)
> 		return ret;
>
> 	if (event->attr.pinned && event->oncpu != smp_processor_id())
> 		return -EBUSY;

I am preparing a patch series in which the above extra test will be included as the last sanity check in perf_event_read_local(), as you suggested. This inline function will thus go away and the only error state check would be a call to perf_event_read_local().

Reinette
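For illustration, the behavior of that folded-in sanity check can be modeled in plain C. This is a userspace sketch with simplified stand-in types, not kernel code: the struct names and fields mirror the ones discussed in the thread, smp_processor_id() is replaced by an explicit cpu argument, and -EBUSY by its numeric value.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel structures discussed above. */
struct perf_event_attr_stub {
	int pinned;
};

struct perf_event_stub {
	struct perf_event_attr_stub attr;
	int oncpu;	/* CPU the event is scheduled on, or -1 if none */
};

/*
 * Model of the extra sanity check proposed for the tail of
 * perf_event_read_local(): a pinned event that is not running on the
 * current CPU failed to be scheduled, so the read must fail with
 * -EBUSY (-16) instead of silently returning a stale count.
 */
static int pinned_event_error_state(const struct perf_event_stub *event,
				    int current_cpu)
{
	if (event->attr.pinned && event->oncpu != current_cpu)
		return -16;	/* -EBUSY */
	return 0;
}
```

The point of placing the check after a successful read is that it catches exactly the pinned-but-unscheduled case without disturbing any other error path.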
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/8/2018 10:33 AM, Reinette Chatre wrote:
> On 8/8/2018 12:51 AM, Peter Zijlstra wrote:
>> On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>>>> - I don't much fancy people accessing the guts of events like that;
>>>>   would not an inline function like:
>>>>
>>>>   static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>>>   {
>>>>   	u64 val;
>>>>
>>>>   	lockdep_assert_irqs_disabled();
>>>>
>>>>   	rdpmcl(event->hw.event_base_rdpmc, val);
>>>>   	return val;
>>>>   }
>>>>
>>>>   Work for you?
>>>
>>> No. This does not provide accurate results. Implementing the above produces:
>>>     pseudo_lock_mea-366   [002]    34.950740: pseudo_lock_l2: hits=4096 miss=4
>>
>> But it being an inline function should allow the compiler to optimize
>> and lift the event->hw.event_base_rdpmc load like you now do manually.
>> Also, like Tony already suggested, you can prime that load just fine by
>> doing an extra invocation.
>>
>> (and note that the above function is _much_ simpler than
>> perf_event_read_local())
>
> Unfortunately I do not find this to be the case. When I implement
> x86_perf_rdpmc() _exactly_ as you suggest above and do the measurement like:
>
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> /* read memory */
> l2_hits_after = x86_perf_rdpmc(l2_hit_event);
> l2_miss_after = x86_perf_rdpmc(l2_miss_event);
>
> then the results are not accurate, nor are they consistently
> inaccurate enough to treat as a constant adjustment:
>
> pseudo_lock_mea-409   [002]   194.322611: pseudo_lock_l2: hits=4100 miss=0
> pseudo_lock_mea-412   [002]   195.520203: pseudo_lock_l2: hits=4096 miss=3
> pseudo_lock_mea-415   [002]   196.571114: pseudo_lock_l2: hits=4097 miss=3
> pseudo_lock_mea-422   [002]   197.629118: pseudo_lock_l2: hits=4097 miss=3
> pseudo_lock_mea-425   [002]   198.687160: pseudo_lock_l2: hits=4096 miss=3
> pseudo_lock_mea-428   [002]   199.744156: pseudo_lock_l2: hits=4096 miss=2
> pseudo_lock_mea-431   [002]   200.801131: pseudo_lock_l2: hits=4097 miss=2
> pseudo_lock_mea-434   [002]   201.858141: pseudo_lock_l2: hits=4097 miss=2
> pseudo_lock_mea-437   [002]   202.917168: pseudo_lock_l2: hits=4096 miss=2
>
> I was able to test Tony's theory, and replacing the reading of the
> "after" counts with a direct rdpmcl() improves the results. What I mean
> is this:
>
> l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
> l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> /* read memory */
> rdpmcl(l2_hit_pmcnum, l2_hits_after);
> rdpmcl(l2_miss_pmcnum, l2_miss_after);
>
> I did not run my full tests with the above, but a simple read of 256KB of
> pseudo-locked memory gives:
>
> pseudo_lock_mea-492   [002]   372.001385: pseudo_lock_l2: hits=4096 miss=0
> pseudo_lock_mea-495   [002]   373.059748: pseudo_lock_l2: hits=4096 miss=0
> pseudo_lock_mea-498   [002]   374.117027: pseudo_lock_l2: hits=4096 miss=0
> pseudo_lock_mea-501   [002]   375.182864: pseudo_lock_l2: hits=4096 miss=0
> pseudo_lock_mea-504   [002]   376.243958: pseudo_lock_l2: hits=4096 miss=0
>
> We thus seem to be encountering the issue Tony predicted where the
> memory being tested is evicting the earlier measurement code and data.

I thoroughly reviewed this email thread to ensure that all your feedback is being addressed. At this time I believe the current solution does so, since it addresses all the requirements I was able to capture:
- Use the in-kernel interface to perf.
- Do not write directly to PMU registers.
- Do not introduce another PMU owner; perf maintains its role of performing resource arbitration for the PMU.
- User space is able to use perf and resctrl at the same time.
- event_base_rdpmc is accessed and used only within an interrupts-disabled section.
- Internals of events are never accessed directly; an inline function is used.
- Due to "pinned" usage the scheduling of the event may have failed. Error state is checked in the recommended way with credible error handling.
- Use X86_CONFIG().

The pseudocode of the current solution is presented below. With this solution I am able to address our customer requirement to measure a pseudo-locked region accurately while also addressing your requirements to use perf correctly. Is this solution acceptable to you?

#include "../../events/perf_event.h" /* For X86_CONFIG() */

/*
 * The X86_CONFIG() macro cannot be used in a designated initializer
 * as below - the initialization of the .config attribute is thus
 * deferred to later in order to use X86_CONFIG().
 */
static struct perf_event_attr
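A rough sketch of the deferred initialization just described could look like the following. This is kernel-context code that only builds inside the kernel tree, and the event/umask encodings are illustrative placeholders, not the values any real patch uses:

```c
#include "../../events/perf_event.h" /* For X86_CONFIG() */

static struct perf_event_attr perf_miss_attr = {
	.type		= PERF_TYPE_RAW,
	.size		= sizeof(struct perf_event_attr),
	.pinned		= 1,
	.disabled	= 0,
	.exclude_user	= 1,
};

static void pseudo_lock_attr_init(void)
{
	/*
	 * X86_CONFIG() expands to a compound literal and so cannot
	 * appear in the static designated initializer above; fill in
	 * .config at init time instead. 0xd1/0x10 are placeholder
	 * event/umask values for illustration only.
	 */
	perf_miss_attr.config = X86_CONFIG(.event = 0xd1, .umask = 0x10);
}
```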
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter and Tony,

On 8/8/2018 12:51 AM, Peter Zijlstra wrote:
> On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>>> FWIW, how long is that IRQ disabled section? It looks like something
>>> that could be taking a bit of time. We have these people that care about
>>> IRQ latency.
>>
>> We work closely with customers needing low latency as well as customers
>> needing deterministic behavior.
>>
>> This measurement is triggered by the user as a validation mechanism of
>> the pseudo-locked memory region after it has been created as part of
>> system setup, as well as during runtime if there are any concerns with
>> the performance of an application that uses it.
>>
>> This measurement would thus be triggered before the sensitive workloads
>> start - during system setup, or if an issue is already present. In
>> either case the measurement is triggered by the administrator via debugfs.
>
> That does not in fact include the answer to the question. Also, it
> assumes a competent operator (something I've found is not always true).

My apologies, I did not intend to avoid your question. The main goal of this measurement is for an operator to test if a pseudo-locked region has been set up correctly. It is thus targeted for system setup time, before the IRQ sensitive workloads - some of which would use these memory regions - start.

>>> - I don't much fancy people accessing the guts of events like that;
>>>   would not an inline function like:
>>>
>>>   static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>>   {
>>>   	u64 val;
>>>
>>>   	lockdep_assert_irqs_disabled();
>>>
>>>   	rdpmcl(event->hw.event_base_rdpmc, val);
>>>   	return val;
>>>   }
>>>
>>>   Work for you?
>>
>> No. This does not provide accurate results. Implementing the above produces:
>>     pseudo_lock_mea-366   [002]    34.950740: pseudo_lock_l2: hits=4096 miss=4
>
> But it being an inline function should allow the compiler to optimize
> and lift the event->hw.event_base_rdpmc load like you now do manually.
> Also, like Tony already suggested, you can prime that load just fine by
> doing an extra invocation.
>
> (and note that the above function is _much_ simpler than
> perf_event_read_local())

Unfortunately I do not find this to be the case. When I implement x86_perf_rdpmc() _exactly_ as you suggest above and do the measurement like:

l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
l2_hits_after = x86_perf_rdpmc(l2_hit_event);
l2_miss_after = x86_perf_rdpmc(l2_miss_event);

then the results are not accurate, nor are they consistently inaccurate enough to treat as a constant adjustment:

pseudo_lock_mea-409   [002]   194.322611: pseudo_lock_l2: hits=4100 miss=0
pseudo_lock_mea-412   [002]   195.520203: pseudo_lock_l2: hits=4096 miss=3
pseudo_lock_mea-415   [002]   196.571114: pseudo_lock_l2: hits=4097 miss=3
pseudo_lock_mea-422   [002]   197.629118: pseudo_lock_l2: hits=4097 miss=3
pseudo_lock_mea-425   [002]   198.687160: pseudo_lock_l2: hits=4096 miss=3
pseudo_lock_mea-428   [002]   199.744156: pseudo_lock_l2: hits=4096 miss=2
pseudo_lock_mea-431   [002]   200.801131: pseudo_lock_l2: hits=4097 miss=2
pseudo_lock_mea-434   [002]   201.858141: pseudo_lock_l2: hits=4097 miss=2
pseudo_lock_mea-437   [002]   202.917168: pseudo_lock_l2: hits=4096 miss=2

I was able to test Tony's theory, and replacing the reading of the "after" counts with a direct rdpmcl() improves the results. What I mean is this:

l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
rdpmcl(l2_hit_pmcnum, l2_hits_after);
rdpmcl(l2_miss_pmcnum, l2_miss_after);

I did not run my full tests with the above, but a simple read of 256KB of pseudo-locked memory gives:

pseudo_lock_mea-492   [002]   372.001385: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-495   [002]   373.059748: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-498   [002]   374.117027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-501   [002]   375.182864: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-504   [002]   376.243958: pseudo_lock_l2: hits=4096 miss=0

We thus seem to be encountering the issue Tony predicted where the memory being tested is evicting the earlier measurement code and data.

>>> - native_read_pmc(); are you 100% sure this code only ever runs on
>>>   native and not in some dodgy virt environment?
>>
>> My understanding is that a virtual environment would be a customer of a
>> RDT allocation (cache or memory bandwidth). I do not see if/where this
>> is restricted though -
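The eviction effect described in this message can be illustrated with a small userspace model: a read path that costs extra misses only on its first (cold) invocation, so a discarded priming read keeps the real before/after delta clean. Every name here is a hypothetical stand-in, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the priming pattern: the first trip through a counter-read
 * path faults its own code/data into the cache and perturbs the miss
 * count; later trips through the same (now cached) path do not.
 */
static uint64_t misses;		/* the modeled miss counter */
static int read_path_cached;	/* has the read path been touched yet? */

static uint64_t read_event_misses(void)
{
	if (!read_path_cached) {
		misses += 4;		/* cold read path costs extra misses */
		read_path_cached = 1;
	}
	return misses;
}

static uint64_t measure_with_priming(uint64_t workload_misses)
{
	read_event_misses();			/* priming read, result discarded */
	uint64_t before = read_event_misses();	/* real "before" read */
	misses += workload_misses;		/* stands in for the memory access */
	uint64_t after = read_event_misses();
	return after - before;
}
```

In this model the priming read absorbs the cold-path cost, so the reported delta equals only what the workload itself caused; the thread's observation is that on real hardware the workload can re-evict the primed read path, which is why the direct rdpmcl() "after" read helped.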
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On 8/8/2018 9:47 AM, Peter Zijlstra wrote:
> On Wed, Aug 08, 2018 at 03:55:54PM +, Luck, Tony wrote:
>>> So _why_ doesn't this work? As said by Tony, that first call should
>>> prime the caches, so the second and third calls should not generate any
>>> misses.
>>
>> How much code/data is involved? If there is a lot, then you may be unlucky
>> with cache coloring and the later parts of the "prime the caches" code path
>> may evict some lines loaded in the early parts.
>
> Well, Reinette used perf_event_read_local() which is unfortunately quite
> a bit. But the inline I proposed is a single load and depending on
> rdpmcl() or native_read_pmc() a call to or just a single inline asm
> rdpmc instruction.
>
> That should certainly work I think.

I am in the process of testing this variation.

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Wed, Aug 08, 2018 at 03:55:54PM +, Luck, Tony wrote:
>> So _why_ doesn't this work? As said by Tony, that first call should
>> prime the caches, so the second and third calls should not generate any
>> misses.
>
> How much code/data is involved? If there is a lot, then you may be unlucky
> with cache coloring and the later parts of the "prime the caches" code path
> may evict some lines loaded in the early parts.

Well, Reinette used perf_event_read_local() which is unfortunately quite a bit. But the inline I proposed is a single load and depending on rdpmcl() or native_read_pmc() a call to or just a single inline asm rdpmc instruction.

That should certainly work I think.
RE: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
> So _why_ doesn't this work? As said by Tony, that first call should
> prime the caches, so the second and third calls should not generate any
> misses.

How much code/data is involved? If there is a lot, then you may be unlucky with cache coloring and the later parts of the "prime the caches" code path may evict some lines loaded in the early parts.

-Tony
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>> FWIW, how long is that IRQ disabled section? It looks like something
>> that could be taking a bit of time. We have these people that care about
>> IRQ latency.
>
> We work closely with customers needing low latency as well as customers
> needing deterministic behavior.
>
> This measurement is triggered by the user as a validation mechanism of
> the pseudo-locked memory region after it has been created as part of
> system setup as well as during runtime if there are any concerns with
> the performance of an application that uses it.
>
> This measurement would thus be triggered before the sensitive workloads
> start - during system setup, or if an issue is already present. In
> either case the measurement is triggered by the administrator via debugfs.

That does not in fact include the answer to the question. Also, it assumes a competent operator (something I've found is not always true).

>> - I don't much fancy people accessing the guts of events like that;
>>   would not an inline function like:
>>
>>   static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>   {
>>   	u64 val;
>>
>>   	lockdep_assert_irqs_disabled();
>>
>>   	rdpmcl(event->hw.event_base_rdpmc, val);
>>   	return val;
>>   }
>>
>>   Work for you?
>
> No. This does not provide accurate results. Implementing the above produces:
>     pseudo_lock_mea-366   [002]    34.950740: pseudo_lock_l2: hits=4096 miss=4

But it being an inline function should allow the compiler to optimize and lift the event->hw.event_base_rdpmc load like you now do manually. Also, like Tony already suggested, you can prime that load just fine by doing an extra invocation.

(and note that the above function is _much_ simpler than perf_event_read_local())

>> - native_read_pmc(); are you 100% sure this code only ever runs on
>>   native and not in some dodgy virt environment?
>
> My understanding is that a virtual environment would be a customer of a
> RDT allocation (cache or memory bandwidth). I do not see if/where this
> is restricted though - I'll move to rdpmcl() but the usage of a cache
> allocation feature like this from a virtual machine needs more
> investigation.

I can imagine that hypervisors that allow physical partitioning could allow delegating the rdt crud to their guests when they 'own' a full socket or whatever the domain is for this.

> Will do. I created the following helper function that can be used after
> interrupts are disabled:
>
> static inline int perf_event_error_state(struct perf_event *event)
> {
> 	int ret = 0;
> 	u64 tmp;
>
> 	ret = perf_event_read_local(event, &tmp, NULL, NULL);
> 	if (ret < 0)
> 		return ret;
>
> 	if (event->attr.pinned && event->oncpu != smp_processor_id())
> 		return -EBUSY;
>
> 	return ret;
> }

Nah, stick the test in perf_event_read_local(), that actually needs it.

>> Also, while you disable IRQs, your fancy pants loop is still subject to
>> NMIs that can/will perturb your measurements, how do you deal with
>> those?
>
> Customers interested in this feature are familiar with dealing with them
> (and also SMIs). The user space counterpart is able to detect such an
> occurrence.

You're very optimistic about your customers' capabilities. And this might be true for the current people you're talking to, but once this is available and public, joe monkey will have access and he _will_ screw it up.

> Please note that if an NMI arrives it would be handled with the
> currently active cache capacity bitmask so none of the pseudo-locked
> memory will be evicted since no capacity bitmask overlaps with the
> pseudo-locked region.

So exceptions change / have their own bitmask?
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Tue, Aug 07, 2018 at 10:44:44PM -0700, Reinette Chatre wrote: > Hi Tony, > > On 8/7/2018 6:28 PM, Luck, Tony wrote: > > Would it help to call routines to read the "before" values of the counter > > twice. The first time to preload the cache with anything needed to execute > > the perf code path. Indeed, that is the 'common' pattern for this. > First, reading data using perf_event_read_local() called twice. > When testing as follows: > /* create perf events */ > /* disable irq */ > /* disable hw prefetchers */ > /* init local vars */ > /* read before data twice as follows: */ > perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); > perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); > /* read through pseudo-locked memory */ > perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL); > /* re enable hw prefetchers */ > /* enable irq */ > /* write data to tracepoint */ > > With the above I am not able to obtain accurate data: > pseudo_lock_mea-354 [002] 63.045734: pseudo_lock_l2: hits=4103 > miss=6 So _why_ doesn't this work? As said by Tony, that first call should prime the caches, so the second and third calls should not generate any misses. They might cause extra hits though, but that should be a constant amount that is also measurable with a no-op loop and can easily be subtracted.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Tony, On 8/7/2018 6:28 PM, Luck, Tony wrote: > Would it help to call routines to read the "before" values of the counter > twice. The first time to preload the cache with anything needed to execute > the perf code path. >>> In an attempt to improve the accuracy of the above I modified it to the >>> following: >>> >>> /* create the two events as before in "enabled" state */ >>> l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc; >>> l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc; >>> local_irq_disable(); >>> /* disable hw prefetchers */ >>> /* init local vars to loop through pseudo-locked mem > * may take some misses in the perf code > */ > l2_hits_before = native_read_pmc(l2_hit_pmcnum); > l2_miss_before = native_read_pmc(l2_miss_pmcnum); > /* Read counters again, hope no new misses here */ >>> l2_hits_before = native_read_pmc(l2_hit_pmcnum); >>> l2_miss_before = native_read_pmc(l2_miss_pmcnum); >>> /* loop through pseudo-locked mem */ >>> l2_hits_after = native_read_pmc(l2_hit_pmcnum); >>> l2_miss_after = native_read_pmc(l2_miss_pmcnum); >>> /* enable hw prefetchers */ >>> local_irq_enable(); > The end of my previous email to Peter contains a solution that does address all the feedback received up to this point while also being able to obtain (what I thought to be ... more below) accurate results. The code you comment on below is not this latest version, but your suggestion is valuable and I tried it out with two ways of reading the perf data that differ from the code you quote. So, instead of reading data with native_read_pmc() as in the code you quoted, I first test reading the data twice using the original recommendation of "perf_event_read_local()", and second reading the data twice using "rdpmcl()" in place of native_read_pmc(). First, reading data using perf_event_read_local() called twice. 
When testing as follows: /* create perf events */ /* disable irq */ /* disable hw prefetchers */ /* init local vars */ /* read before data twice as follows: */ perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); /* read through pseudo-locked memory */ perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL); /* re enable hw prefetchers */ /* enable irq */ /* write data to tracepoint */ With the above I am not able to obtain accurate data: pseudo_lock_mea-351 [002] 61.859147: pseudo_lock_l2: hits=4109 miss=0 pseudo_lock_mea-354 [002] 63.045734: pseudo_lock_l2: hits=4103 miss=6 pseudo_lock_mea-357 [002] 64.104673: pseudo_lock_l2: hits=4106 miss=3 pseudo_lock_mea-360 [002] 65.174775: pseudo_lock_l2: hits=4105 miss=5 pseudo_lock_mea-367 [002] 66.232308: pseudo_lock_l2: hits=4104 miss=5 pseudo_lock_mea-370 [002] 67.291844: pseudo_lock_l2: hits=4103 miss=6 pseudo_lock_mea-373 [002] 68.348725: pseudo_lock_l2: hits=4105 miss=5 pseudo_lock_mea-376 [002] 69.409738: pseudo_lock_l2: hits=4105 miss=5 pseudo_lock_mea-379 [002] 70.466763: pseudo_lock_l2: hits=4105 miss=5 Second, reading data using rdpmcl() called twice. This is the same solution as documented in my previous email, with the two extra rdpmcl() calls added. 
An overview of the flow: /* create perf events */ /* disable irq */ /* check perf event error state */ /* disable hw prefetchers */ /* init local vars */ /* read before data twice as follows: */ rdpmcl(l2_hit_pmcnum, l2_hits_before); rdpmcl(l2_miss_pmcnum, l2_miss_before); rdpmcl(l2_hit_pmcnum, l2_hits_before); rdpmcl(l2_miss_pmcnum, l2_miss_before); /* read through pseudo-locked memory */ rdpmcl(l2_hit_pmcnum, l2_hits_after); rdpmcl(l2_miss_pmcnum, l2_miss_after); /* re enable hw prefetchers */ /* enable irq */ /* write data to tracepoint */ Here as expected a simple test showed that the data was accurate (hits=4096 miss=0) so I repeated the creation and measurement of pseudo-locked regions at different sizes under different loads. Each possible pseudo-lock region size is created and measured 100 times on an idle system and 100 times on a system with a noisy neighbor - this resulted in a total of 2800 pseudo-lock region creations each followed by a measurement of that region. The results of these tests are the best I have yet seen. In this total of 2800 measurements the number of cache hits was miscounted in only eight measurements - each miscount was an undercount of one. Specifically, a memory region consisting of 8192 cache lines was measured as "hits=8191 miss=0", three memory regions with 12288 cache lines were measured as "hits=12287 miss=0", two memory regions with 10240 cache lines were measured as "hits=10239 miss=0", and two memory regions with 14336 cache lines were measured as "hits=14335 miss=0". I
RE: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Would it help to call routines to read the "before" values of the counter twice. The first time to preload the cache with anything needed to execute the perf code path. >> In an attempt to improve the accuracy of the above I modified it to the >> following: >> >> /* create the two events as before in "enabled" state */ >> l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc; >> l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc; >> local_irq_disable(); >> /* disable hw prefetchers */ >> /* init local vars to loop through pseudo-locked mem * may take some misses in the perf code */ l2_hits_before = native_read_pmc(l2_hit_pmcnum); l2_miss_before = native_read_pmc(l2_miss_pmcnum); /* Read counters again, hope no new misses here */ >> l2_hits_before = native_read_pmc(l2_hit_pmcnum); >> l2_miss_before = native_read_pmc(l2_miss_pmcnum); >> /* loop through pseudo-locked mem */ >> l2_hits_after = native_read_pmc(l2_hit_pmcnum); >> l2_miss_after = native_read_pmc(l2_miss_pmcnum); >> /* enable hw prefetchers */ >> local_irq_enable();
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Mon, Aug 06, 2018 at 04:07:09PM -0700, Reinette Chatre wrote: > I've modified your suggestion slightly in an attempt to gain accuracy. > Now it looks like: > > local_irq_disable(); > /* disable hw prefetchers */ > /* init local vars to loop through pseudo-locked mem */ > perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); > /* loop through pseudo-locked mem */ > perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL); > /* enable hw prefetchers */ > local_irq_enable(); > > With the above I do not see the impact of an interference workload > anymore but the results are not yet accurate: > > pseudo_lock_mea-538 [002] 113.296084: pseudo_lock_l2: hits=4103 > miss=2 an error of ~0.2% sounds pretty good to me ;-) FWIW, how long is that IRQ disabled section? It looks like something that could be taking a bit of time. We have these people that care about IRQ latency. > While the results do seem close, reporting a cache miss on memory that > is set up to be locked in cache is significant. If it is 'locked' then how does perf cause misses? I suppose I should go read that earlier email. 
> In an attempt to improve the accuracy of the above I modified it to the > following: > > /* create the two events as before in "enabled" state */ > l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc; > l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc; > local_irq_disable(); > /* disable hw prefetchers */ > /* init local vars to loop through pseudo-locked mem */ > l2_hits_before = native_read_pmc(l2_hit_pmcnum); > l2_miss_before = native_read_pmc(l2_miss_pmcnum); > /* loop through pseudo-locked mem */ > l2_hits_after = native_read_pmc(l2_hit_pmcnum); > l2_miss_after = native_read_pmc(l2_miss_pmcnum); > /* enable hw prefetchers */ > local_irq_enable(); So the above has a number of practical problems: - event_base_rdpmc is subject to change as long as interrupts are enabled. - I don't much fancy people accessing the guts of events like that; would not an inline function like: static inline u64 x86_perf_rdpmc(struct perf_event *event) { u64 val; lockdep_assert_irqs_disabled(); rdpmcl(event->hw.event_base_rdpmc, val); return val; } Work for you? - native_read_pmc(); are you 100% sure this code only ever runs on native and not in some dodgy virt environment? (also, I suppose all hardware supporting this resctrl locking muck actually has rdpmc ;-)) - You used perf_event_attr::pinned=1, this means we could have failed to schedule the event, you need to check error state somewhere and have a credible error handling. For checking error state; you could use perf_event_read_local() early after disabling IRQs, it needs a little extra test though: if (event->attr.pinned && event->oncpu != smp_processor_id()) return -EBUSY; to correctly deal with pinned failure. Place it such that it becomes the last sanity check. Also, while you disable IRQs, your fancy pants loop is still subject to NMIs that can/will perturb your measurements, how do you deal with those?
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter, On 8/6/2018 3:12 PM, Peter Zijlstra wrote: > On Mon, Aug 06, 2018 at 12:50:50PM -0700, Reinette Chatre wrote: >> In my previous email I provided the details of the Cache Pseudo-Locking >> feature implemented on top of resctrl. Please let me know if you would >> like any more details about that. I can send you more materials. > > I've not yet had time to read.. > >> BUG: sleeping function called from invalid context at >> kernel/locking/mutex.c:748 >> >> I thus continued to use the API with interrupts enabled and did the following: >> >> Two new event attributes: >> static struct perf_event_attr l2_miss_attr = { >> .type = PERF_TYPE_RAW, >> .config = (0x10ULL << 8) | 0xd1, > > Please use something like: > > X86_CONFIG(.event=0xd1, .umask=0x10), > > that's ever so much more readable. > >> .size = sizeof(struct perf_event_attr), >> .pinned = 1, >> .disabled = 1, >> .exclude_user = 1 >> }; >> >> static struct perf_event_attr l2_hit_attr = { >> .type = PERF_TYPE_RAW, >> .config = (0x2ULL << 8) | 0xd1, >> .size = sizeof(struct perf_event_attr), >> .pinned = 1, >> .disabled = 1, >> .exclude_user = 1 >> }; >> >> Create the two new events using these attributes: >> l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu, >> NULL, NULL, NULL); >> l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu, NULL, >> NULL, NULL); >> >> Take measurements: >> perf_event_enable(l2_miss_event); >> perf_event_enable(l2_hit_event); >> local_irq_disable(); >> /* Disable hardware prefetchers */ >> /* Loop through pseudo-locked memory */ >> /* Enable hardware prefetchers */ >> local_irq_enable(); >> perf_event_disable(l2_hit_event); >> perf_event_disable(l2_miss_event); >> >> Read results: >> l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running); >> l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running); >> /* Make results available in tracepoints */ > > switch to .disabled=0 and try this for measurement: > > local_irq_disable(); > perf_event_read_local(l2_miss_event, &miss_val1, 
NULL, NULL); > perf_event_read_local(l2_hit_event, &hit_val1, NULL, NULL); > /* do your thing */ > perf_event_read_local(l2_miss_event, &miss_val2, NULL, NULL); > perf_event_read_local(l2_hit_event, &hit_val2, NULL, NULL); > local_irq_enable(); Thank you very much for taking a look and providing your guidance. > > You're running this on the CPU you created the event for, right? Yes. I've modified your suggestion slightly in an attempt to gain accuracy. Now it looks like: local_irq_disable(); /* disable hw prefetchers */ /* init local vars to loop through pseudo-locked mem */ perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); /* loop through pseudo-locked mem */ perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL); /* enable hw prefetchers */ local_irq_enable(); With the above I do not see the impact of an interference workload anymore but the results are not yet accurate: pseudo_lock_mea-538 [002] 113.296084: pseudo_lock_l2: hits=4103 miss=2 pseudo_lock_mea-541 [002] 114.349343: pseudo_lock_l2: hits=4102 miss=3 pseudo_lock_mea-544 [002] 115.410206: pseudo_lock_l2: hits=4101 miss=4 pseudo_lock_mea-551 [002] 116.473912: pseudo_lock_l2: hits=4102 miss=3 pseudo_lock_mea-554 [002] 117.532446: pseudo_lock_l2: hits=4100 miss=5 pseudo_lock_mea-557 [002] 118.591121: pseudo_lock_l2: hits=4103 miss=2 pseudo_lock_mea-560 [002] 119.642467: pseudo_lock_l2: hits=4102 miss=3 pseudo_lock_mea-563 [002] 120.698562: pseudo_lock_l2: hits=4102 miss=3 pseudo_lock_mea-566 [002] 121.769348: pseudo_lock_l2: hits=4105 miss=4 In an attempt to improve the accuracy of the above I modified it to the following: /* create the two events as before in "enabled" state */ l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc; l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc; local_irq_disable(); /* disable hw prefetchers */ /* init local vars to loop through 
pseudo-locked mem */ l2_hits_before = native_read_pmc(l2_hit_pmcnum); l2_miss_before = native_read_pmc(l2_miss_pmcnum); /* loop through pseudo-locked mem */ l2_hits_after = native_read_pmc(l2_hit_pmcnum); l2_miss_after = native_read_pmc(l2_miss_pmcnum); /* enable hw prefetchers */ local_irq_enable(); With the above I seem to get the same accuracy as before: pseudo_lock_mea-557 [002] 155.402566: pseudo_lock_l2: hits=4096 miss=0 pseudo_lock_mea-564 [002] 156.441299: pseudo_lock_l2: hits=4096 miss=0 pseudo_lock_mea-567 [002] 157.478605: pseudo_lock_l2: hits=4096 miss=0 pseudo_lock_mea-570 [002] 158.524054: pseudo_lock_l2: hits=4096
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Mon, Aug 06, 2018 at 12:50:50PM -0700, Reinette Chatre wrote: > In my previous email I provided the details of the Cache Pseudo-Locking > feature implemented on top of resctrl. Please let me know if you would > like any more details about that. I can send you more materials. I've not yet had time to read.. > BUG: sleeping function called from invalid context at > kernel/locking/mutex.c:748 > > I thus continued to use the API with interrupts enabled and did the following: > > Two new event attributes: > static struct perf_event_attr l2_miss_attr = { > .type = PERF_TYPE_RAW, > .config = (0x10ULL << 8) | 0xd1, Please use something like: X86_CONFIG(.event=0xd1, .umask=0x10), that's ever so much more readable. > .size = sizeof(struct perf_event_attr), > .pinned = 1, > .disabled = 1, > .exclude_user = 1 > }; > > static struct perf_event_attr l2_hit_attr = { > .type = PERF_TYPE_RAW, > .config = (0x2ULL << 8) | 0xd1, > .size = sizeof(struct perf_event_attr), > .pinned = 1, > .disabled = 1, > .exclude_user = 1 > }; > > Create the two new events using these attributes: > l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu, > NULL, NULL, NULL); > l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu, NULL, > NULL, NULL); > > Take measurements: > perf_event_enable(l2_miss_event); > perf_event_enable(l2_hit_event); > local_irq_disable(); > /* Disable hardware prefetchers */ > /* Loop through pseudo-locked memory */ > /* Enable hardware prefetchers */ > local_irq_enable(); > perf_event_disable(l2_hit_event); > perf_event_disable(l2_miss_event); > > Read results: > l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running); > l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running); > /* Make results available in tracepoints */ switch to .disabled=0 and try this for measurement: local_irq_disable(); perf_event_read_local(l2_miss_event, &miss_val1, NULL, NULL); perf_event_read_local(l2_hit_event, &hit_val1, NULL, NULL); /* do your thing */ perf_event_read_local(l2_miss_event, &miss_val2, 
NULL, NULL); perf_event_read_local(l2_hit_event, _val2, NULL, NULL); local_irq_enable(); You're running this on the CPU you created the event for, right? > With the above implementation and a 256KB pseudo-locked memory region I > obtain the following results: > pseudo_lock_mea-755 [002] 396.946953: pseudo_lock_l2: hits=4140 > The above results are not accurate since it does not reflect the success > of the pseudo-locked region. Expected results are as we can currently > obtain (copying results from previous email): > pseudo_lock_mea-26090 [002] 61838.488027: pseudo_lock_l2: hits=4096 Still fairly close.. only like 44 extra hits or 1% error.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/3/2018 11:37 AM, Reinette Chatre wrote:
> On 8/3/2018 8:25 AM, Peter Zijlstra wrote:
>> On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
>>> You state that you understand what we are trying to do and I hope that I
>>> convinced you that we are not able to accomplish the same by following
>>> your guidance.
>>
>> No, I said I understood your pmc reserve patch and its implications.
>>
>> I have no clue what you're trying to do with resctrl, nor why you think
>> this is not feasible with perf. And if it really is not feasible, you'll
>> have to live without it.

In my previous email I provided the details of the Cache Pseudo-Locking
feature implemented on top of resctrl. Please let me know if you would
like any more details about that. I can send you more materials.

In my previous message I also provided my thoughts on why I believe the
same is not feasible with perf, as commented below ...

> Looking at whether we could build on top of the kernel perf event API
> (perf_event_create_kernel_counter(), perf_event_enable(),
> perf_event_disable(), ...). Just looking at perf_event_enable() -
> ideally this would be as lean as possible so that it only enables the
> event and does not itself contribute to the measurement. First, the
> enabling of the event is not lean enough to fulfill this requirement
> since it executes more code after the event was actually enabled. Also,
> the code relies on a mutex so we cannot use it with interrupts disabled.

I proceeded to modify the implemented debugfs measurements to build on
top of the perf APIs mentioned above. As anticipated, the events could
not be enabled in interrupt context. I get a clear message in this regard:

BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:748

I thus continued to use the API with interrupts enabled and did the
following:

Two new event attributes:

static struct perf_event_attr l2_miss_attr = {
        .type = PERF_TYPE_RAW,
        .config = (0x10ULL << 8) | 0xd1,
        .size = sizeof(struct perf_event_attr),
        .pinned = 1,
        .disabled = 1,
        .exclude_user = 1
};

static struct perf_event_attr l2_hit_attr = {
        .type = PERF_TYPE_RAW,
        .config = (0x2ULL << 8) | 0xd1,
        .size = sizeof(struct perf_event_attr),
        .pinned = 1,
        .disabled = 1,
        .exclude_user = 1
};

Create the two new events using these attributes:

l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu,
                                                 NULL, NULL, NULL);
l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu,
                                                NULL, NULL, NULL);

Take measurements:

perf_event_enable(l2_miss_event);
perf_event_enable(l2_hit_event);
local_irq_disable();
/* Disable hardware prefetchers */
/* Loop through pseudo-locked memory */
/* Enable hardware prefetchers */
local_irq_enable();
perf_event_disable(l2_hit_event);
perf_event_disable(l2_miss_event);

Read results:

l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running);
l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running);
/* Make results available in tracepoints */

With the above implementation and a 256KB pseudo-locked memory region I
obtain the following results:

pseudo_lock_mea-755   [002]   396.946953: pseudo_lock_l2: hits=4140 miss=5
pseudo_lock_mea-762   [002]   397.998864: pseudo_lock_l2: hits=4138 miss=8
pseudo_lock_mea-765   [002]   399.041868: pseudo_lock_l2: hits=4142 miss=5
pseudo_lock_mea-768   [002]   400.086871: pseudo_lock_l2: hits=4141 miss=7
pseudo_lock_mea-771   [002]   401.132921: pseudo_lock_l2: hits=4138 miss=10
pseudo_lock_mea-774   [002]   402.216700: pseudo_lock_l2: hits=4238 miss=46
pseudo_lock_mea-777   [002]   403.312148: pseudo_lock_l2: hits=4142 miss=5
pseudo_lock_mea-780   [002]   404.381674: pseudo_lock_l2: hits=4139 miss=8
pseudo_lock_mea-783   [002]   405.422820: pseudo_lock_l2: hits=4472 miss=79
pseudo_lock_mea-786   [002]   406.495065: pseudo_lock_l2: hits=4140 miss=8
pseudo_lock_mea-793   [002]   407.561383: pseudo_lock_l2: hits=4143 miss=4

The above results are not accurate since they do not reflect the success
of the pseudo-locked region. Expected results are as we can currently
obtain (copying results from previous email):

pseudo_lock_mea-26090 [002] 61838.488027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26097 [002] 61843.689381: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26100 [002] 61848.751411: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26108 [002] 61853.820361: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26111 [002] 61858.880364: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26118 [002] 61863.937343: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26121 [002] 61869.008341: pseudo_lock_l2: hits=4096 miss=0

Could you please guide me on how you would prefer us to use perf in
order to obtain the same accurate results we can currently obtain?
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/3/2018 8:25 AM, Peter Zijlstra wrote:
> On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
>> You state that you understand what we are trying to do and I hope that I
>> convinced you that we are not able to accomplish the same by following
>> your guidance.
>
> No, I said I understood your pmc reserve patch and its implications.
>
> I have no clue what you're trying to do with resctrl, nor why you think
> this is not feasible with perf. And if it really is not feasible, you'll
> have to live without it.

I can surely provide the details on what we are doing with resctrl and
elaborate more on why this is not feasible with the full perf kernel API.

In summary: Building on top of Cache Allocation Technology (CAT) we load
a portion of memory into a specified region of cache. After the cache
region has obtained its data it is configured (via CAT) to only serve
cache hits - this pre-loaded memory cannot be evicted from the cache. We
call this "cache pseudo-locking" - the memory has been "pseudo-locked"
to the cache.

To measure how successful the pseudo-locking of the memory is we can use
the precision capable performance events on our platforms: start the
cache hit and miss counters, read the pseudo-locked memory, stop the
counters. This measurement is done with interrupts and hardware
prefetchers disabled to ensure that _only_ access to the pseudo-locked
memory is measured. Any additional code or data accessed, whether by the
counter management or even by the loops reading the memory itself, can
contribute cache hits/misses to the measurement instead of the memory we
are trying to access. Even within the current measurement code we had to
take a lot of care to not use, for example, pointers to obtain
information about the memory to be measured. The information had to be
in local variables.

Considering a build on top of the kernel perf event API
(perf_event_create_kernel_counter(), perf_event_enable(),
perf_event_disable(), ...): looking just at perf_event_enable(), ideally
this would be as lean as possible so that it only enables the event and
does not itself contribute to the measurement. First, the enabling of
the event is not lean enough to fulfill this requirement since it
executes more code after the event was actually enabled. Also, the code
relies on a mutex so we cannot use it with interrupts disabled.

We have two types of customers of this feature: those who require very
low latency and those who require high determinism. In either case a
measured cache miss is of concern to them and our goal is to provide a
memory region for which the number of cache misses can be demonstrated.

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
> You state that you understand what we are trying to do and I hope that I
> convinced you that we are not able to accomplish the same by following
> your guidance.

No, I said I understood your pmc reserve patch and its implications.

I have no clue what you're trying to do with resctrl, nor why you think
this is not feasible with perf. And if it really is not feasible, you'll
have to live without it.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/3/2018 3:49 AM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 01:43:42PM -0700, Reinette Chatre wrote:
>
>> The goal of this work is to use the existing PMU hardware coordination
>> mechanism to ensure that perf and resctrl will not interfere with each
>> other.
>
> I understand what it does.. but if I'd seen you frobbing at the PMU
> earlier your patches would've never gotten merged.
>
> It's simply not going to happen.

Your two-fold guidance is clear to me: (1) do not use PMU registers
directly, (2) use the perf API instead.

You state that you understand what we are trying to do and I hope that I
convinced you that we are not able to accomplish the same by following
your guidance. Could you please guide us on how to obtain the accurate
results we obtain at this time while satisfying your requirements?

Earlier your response to a similar question was "That's the best you're
going to get" - unfortunately I cannot respectfully say that to our
customers when they have indeed become used to working with accurate
data for more than a year. I really want to find a solution that is
acceptable to mainline instead of having to maintain a solution needed
by customers out of tree.

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 01:43:42PM -0700, Reinette Chatre wrote:
> The goal of this work is to use the existing PMU hardware coordination
> mechanism to ensure that perf and resctrl will not interfere with each
> other.

I understand what it does.. but if I'd seen you frobbing at the PMU
earlier your patches would've never gotten merged.

It's simply not going to happen.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/2/2018 1:13 PM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 01:06:19PM -0700, Dave Hansen wrote:
>> On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
>>>> I totally understand not wanting to fill the tree with code hijacking
>>>> the raw PMU. Is your reaction to this really around not wanting to
>>>> start down the slippery slope that ends up with lots of raw PMU "owners"?
>>> That and the fact that multiple owners directly contradicts what perf set
>>> out to do: provide resource arbitration for the PMU.
>>>
>>> Not being able to use both perf and this resctrl thing at the same time
>>> is utter crap. You will not get special dispensation.

The goal of this work is to use the existing PMU hardware coordination
mechanism to ensure that perf and resctrl will not interfere with each
other. The resctrl debugging is short-lived - it reserves the PMU
hardware while interrupts are disabled and releases the PMU hardware
before re-enabling interrupts. If a perf session is running at that time
then this reservation will ensure it is not interfered with. If a perf
session is not running at the time then it will not even get the
opportunity to attempt to start one.

>> Is there something we could do in the middle, like have perf itself be
>> in charge of doing all the programming, but the pseudo-locking could
>> still _read_ the counters?
>
> perf has all of that.

At the time the counters are read the damage has already been done by
all the extra instructions and data accessed during the enable and
disable of the events.

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 01:06:19PM -0700, Dave Hansen wrote:
> On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
>>> I totally understand not wanting to fill the tree with code hijacking
>>> the raw PMU. Is your reaction to this really around not wanting to
>>> start down the slippery slope that ends up with lots of raw PMU "owners"?
>> That and the fact that multiple owners directly contradicts what perf set
>> out to do: provide resource arbitration for the PMU.
>>
>> Not being able to use both perf and this resctrl thing at the same time
>> is utter crap. You will not get special dispensation.
>
> Is there something we could do in the middle, like have perf itself be
> in charge of doing all the programming, but the pseudo-locking could
> still _read_ the counters?

perf has all of that.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
>> I totally understand not wanting to fill the tree with code hijacking
>> the raw PMU. Is your reaction to this really around not wanting to
>> start down the slippery slope that ends up with lots of raw PMU "owners"?
> That and the fact that multiple owners directly contradicts what perf set
> out to do: provide resource arbitration for the PMU.
>
> Not being able to use both perf and this resctrl thing at the same time
> is utter crap. You will not get special dispensation.

Is there something we could do in the middle, like have perf itself be
in charge of doing all the programming, but the pseudo-locking could
still _read_ the counters?

I guess we'd still end up needing to have perf expose which counter got
assigned to which event, and make sure things like multiplexing are not
in play. But, either way, I think we can live with this measurement
being fragile (like it already is). If it only works when all the
planets are aligned, I think it's still usable enough.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 11:18:01AM -0700, Dave Hansen wrote:
> On 08/02/2018 10:37 AM, Peter Zijlstra wrote:
>>> I do not see how I can do so without incurring the cache hits and misses
>>> from the data needed and instructions run by this interface. Could you
>>> please share how I can do so and still obtain the accurate measurement
>>> of cache residency of a specific memory region?
>> That's the best you're going to get. You do _NOT_ get to use raw PMU.
>
> Hi Peter,
>
> This code is really fidgety. It's really easily perturbed even by tiny
> things like implicit cache accesses from the page walker filling the TLB
> from the page tables.
>
> Adding a bunch more code in the way is surely going to make it more
> fragile and imprecise.
>
> I totally understand not wanting to fill the tree with code hijacking
> the raw PMU. Is your reaction to this really around not wanting to
> start down the slippery slope that ends up with lots of raw PMU "owners"?

That and the fact that multiple owners directly contradicts what perf set
out to do: provide resource arbitration for the PMU.

Not being able to use both perf and this resctrl thing at the same time
is utter crap. You will not get special dispensation.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On 08/02/2018 10:37 AM, Peter Zijlstra wrote:
>> I do not see how I can do so without incurring the cache hits and misses
>> from the data needed and instructions run by this interface. Could you
>> please share how I can do so and still obtain the accurate measurement
>> of cache residency of a specific memory region?
> That's the best you're going to get. You do _NOT_ get to use raw PMU.

Hi Peter,

This code is really fidgety. It's really easily perturbed even by tiny
things like implicit cache accesses from the page walker filling the TLB
from the page tables.

Adding a bunch more code in the way is surely going to make it more
fragile and imprecise.

I totally understand not wanting to fill the tree with code hijacking
the raw PMU. Is your reaction to this really around not wanting to
start down the slippery slope that ends up with lots of raw PMU "owners"?
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 09:44:13AM -0700, Reinette Chatre wrote:
> On 8/2/2018 9:18 AM, Peter Zijlstra wrote:
> > On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:
> >
> >> The current implementation does not coordinate with perf and this is
> >> what I am trying to fix in this series.
> >>
> >> I do respect your NAK but it is not clear to me how to proceed after
> >> obtaining it. Could you please elaborate on what you would prefer as a
> >> solution to ensure accurate measurement of cache-locked data that is
> >> better integrated?
> >
> > We have an in-kernel interface to perf, use that if you want access to
> > the PMU. You will not directly stomp on PMU registers.
>
> I do not see how I can do so without incurring the cache hits and misses
> from the data needed and instructions run by this interface. Could you
> please share how I can do so and still obtain the accurate measurement
> of cache residency of a specific memory region?

That's the best you're going to get. You do _NOT_ get to use raw PMU.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On 8/2/2018 9:18 AM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:
>
>> The current implementation does not coordinate with perf and this is
>> what I am trying to fix in this series.
>>
>> I do respect your NAK but it is not clear to me how to proceed after
>> obtaining it. Could you please elaborate on what you would prefer as a
>> solution to ensure accurate measurement of cache-locked data that is
>> better integrated?
>
> We have an in-kernel interface to perf, use that if you want access to
> the PMU. You will not directly stomp on PMU registers.

I do not see how I can do so without incurring the cache hits and misses
from the data needed and instructions run by this interface. Could you
please share how I can do so and still obtain the accurate measurement
of cache residency of a specific memory region?

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:
> The current implementation does not coordinate with perf and this is
> what I am trying to fix in this series.
>
> I do respect your NAK but it is not clear to me how to proceed after
> obtaining it. Could you please elaborate on what you would prefer as a
> solution to ensure accurate measurement of cache-locked data that is
> better integrated?

We have an in-kernel interface to perf; use that if you want access to
the PMU. You will not directly stomp on PMU registers.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/2/2018 5:39 AM, Peter Zijlstra wrote:
> On Tue, Jul 31, 2018 at 12:38:27PM -0700, Reinette Chatre wrote:
>> Dear Maintainers,
>>
>> The success of Cache Pseudo-Locking can be measured via the use of
>> performance events. Specifically, the number of cache hits and misses
>> reading a memory region after it has been pseudo-locked to cache. This
>> measurement is triggered via the resctrl debugfs interface.
>>
>> To ensure most accurate results the performance counters and their
>> configuration registers are accessed directly.
>
> NAK on that.

After data is locked to cache we need to measure the success of that.
There is no instruction that we can use to query if a memory address has
been cached, but we can use performance monitoring events, which are
especially valuable on the platforms where they are precise event
capable.

To ensure that we are only measuring the presence of data that should be
locked to cache we need to tightly control how this measurement is done.
For example, on my test system I locked 256KB to the cache and with the
current implementation (tip.git on branch x86/cache) I am able to
accurately measure that this was successful, as seen below (each cache
line within the 256KB is accessed while the performance monitoring
events are active):

pseudo_lock_mea-26090 [002] 61838.488027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26097 [002] 61843.689381: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26100 [002] 61848.751411: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26108 [002] 61853.820361: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26111 [002] 61858.880364: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26118 [002] 61863.937343: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26121 [002] 61869.008341: pseudo_lock_l2: hits=4096 miss=0

The current implementation does not coordinate with perf and this is
what I am trying to fix in this series.

I do respect your NAK but it is not clear to me how to proceed after
obtaining it. Could you please elaborate on what you would prefer as a
solution to ensure accurate measurement of cache-locked data that is
better integrated?

Thank you very much

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Tue, Jul 31, 2018 at 12:38:27PM -0700, Reinette Chatre wrote:
> Dear Maintainers,
>
> The success of Cache Pseudo-Locking can be measured via the use of
> performance events. Specifically, the number of cache hits and misses
> reading a memory region after it has been pseudo-locked to cache. This
> measurement is triggered via the resctrl debugfs interface.
>
> To ensure most accurate results the performance counters and their
> configuration registers are accessed directly.

NAK on that.