Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Just one clarification ...

On 8/10/2018 9:25 AM, Reinette Chatre wrote:
> static inline int x86_perf_event_error_state(struct perf_event *event)
> {
> 	int ret = 0;
> 	u64 tmp;
>
> 	ret = perf_event_read_local(event, &tmp, NULL, NULL);
> 	if (ret < 0)
> 		return ret;
>
> 	if (event->attr.pinned && event->oncpu != smp_processor_id())
> 		return -EBUSY;

I am preparing a patch series in which the above extra test will be included as the last sanity check in perf_event_read_local(), as you suggested. This inline function will thus go away and the only error state check would be a call to perf_event_read_local().

Reinette
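For illustration, the behavior of that folded-in sanity check can be modeled in plain C. This is a userspace sketch with simplified stand-in types, not kernel code: the struct names and fields mirror the ones discussed in the thread, smp_processor_id() is replaced by an explicit cpu argument, and -EBUSY by its numeric value.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel structures discussed above. */
struct perf_event_attr_stub {
	int pinned;
};

struct perf_event_stub {
	struct perf_event_attr_stub attr;
	int oncpu;	/* CPU the event is scheduled on, or -1 if none */
};

/*
 * Model of the extra sanity check proposed for the tail of
 * perf_event_read_local(): a pinned event that is not running on the
 * current CPU failed to be scheduled, so the read must fail with
 * -EBUSY (-16) instead of silently returning a stale count.
 */
static int pinned_event_error_state(const struct perf_event_stub *event,
				    int current_cpu)
{
	if (event->attr.pinned && event->oncpu != current_cpu)
		return -16;	/* -EBUSY */
	return 0;
}
```

The point of placing the check after a successful read is that it catches exactly the pinned-but-unscheduled case without disturbing any other error path.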
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/8/2018 10:33 AM, Reinette Chatre wrote:
> On 8/8/2018 12:51 AM, Peter Zijlstra wrote:
>> On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>>>> - I don't much fancy people accessing the guts of events like that;
>>>>   would not an inline function like:
>>>>
>>>>   static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>>>   {
>>>>   	u64 val;
>>>>
>>>>   	lockdep_assert_irqs_disabled();
>>>>
>>>>   	rdpmcl(event->hw.event_base_rdpmc, val);
>>>>   	return val;
>>>>   }
>>>>
>>>>   Work for you?
>>>
>>> No. This does not provide accurate results. Implementing the above produces:
>>>     pseudo_lock_mea-366   [002]    34.950740: pseudo_lock_l2: hits=4096 miss=4
>>
>> But it being an inline function should allow the compiler to optimize
>> and lift the event->hw.event_base_rdpmc load like you now do manually.
>> Also, like Tony already suggested, you can prime that load just fine by
>> doing an extra invocation.
>>
>> (and note that the above function is _much_ simpler than
>> perf_event_read_local())
>
> Unfortunately I do not find this to be the case. When I implement
> x86_perf_rdpmc() _exactly_ as you suggest above and do the measurement like:
>
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> /* read memory */
> l2_hits_after = x86_perf_rdpmc(l2_hit_event);
> l2_miss_after = x86_perf_rdpmc(l2_miss_event);
>
> then the results are not accurate, nor are they consistently
> inaccurate enough to treat as a constant adjustment:
>
> pseudo_lock_mea-409   [002]   194.322611: pseudo_lock_l2: hits=4100 miss=0
> pseudo_lock_mea-412   [002]   195.520203: pseudo_lock_l2: hits=4096 miss=3
> pseudo_lock_mea-415   [002]   196.571114: pseudo_lock_l2: hits=4097 miss=3
> pseudo_lock_mea-422   [002]   197.629118: pseudo_lock_l2: hits=4097 miss=3
> pseudo_lock_mea-425   [002]   198.687160: pseudo_lock_l2: hits=4096 miss=3
> pseudo_lock_mea-428   [002]   199.744156: pseudo_lock_l2: hits=4096 miss=2
> pseudo_lock_mea-431   [002]   200.801131: pseudo_lock_l2: hits=4097 miss=2
> pseudo_lock_mea-434   [002]   201.858141: pseudo_lock_l2: hits=4097 miss=2
> pseudo_lock_mea-437   [002]   202.917168: pseudo_lock_l2: hits=4096 miss=2
>
> I was able to test Tony's theory, and replacing the reading of the
> "after" counts with a direct rdpmcl() improves the results. What I mean
> is this:
>
> l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
> l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> /* read memory */
> rdpmcl(l2_hit_pmcnum, l2_hits_after);
> rdpmcl(l2_miss_pmcnum, l2_miss_after);
>
> I did not run my full tests with the above, but a simple read of 256KB of
> pseudo-locked memory gives:
>
> pseudo_lock_mea-492   [002]   372.001385: pseudo_lock_l2: hits=4096 miss=0
> pseudo_lock_mea-495   [002]   373.059748: pseudo_lock_l2: hits=4096 miss=0
> pseudo_lock_mea-498   [002]   374.117027: pseudo_lock_l2: hits=4096 miss=0
> pseudo_lock_mea-501   [002]   375.182864: pseudo_lock_l2: hits=4096 miss=0
> pseudo_lock_mea-504   [002]   376.243958: pseudo_lock_l2: hits=4096 miss=0
>
> We thus seem to be encountering the issue Tony predicted where the
> memory being tested is evicting the earlier measurement code and data.

I thoroughly reviewed this email thread to ensure that all your feedback is being addressed. At this time I believe the current solution does so, since it addresses all the requirements I was able to capture:
- Use the in-kernel interface to perf.
- Do not write directly to PMU registers.
- Do not introduce another PMU owner; perf maintains its role of performing resource arbitration for the PMU.
- User space is able to use perf and resctrl at the same time.
- event_base_rdpmc is accessed and used only within an interrupts-disabled section.
- Internals of events are never accessed directly; an inline function is used.
- Due to "pinned" usage the scheduling of the event may have failed. Error state is checked in the recommended way with credible error handling.
- Use X86_CONFIG().

The pseudocode of the current solution is presented below. With this solution I am able to address our customer requirement to measure a pseudo-locked region accurately while also addressing your requirements to use perf correctly. Is this solution acceptable to you?

#include "../../events/perf_event.h" /* For X86_CONFIG() */

/*
 * The X86_CONFIG() macro cannot be used in a designated initializer
 * as below - the initialization of the .config attribute is thus
 * deferred to later in order to use X86_CONFIG().
 */
static struct perf_event_attr
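A rough sketch of the deferred initialization just described could look like the following. This is kernel-context code that only builds inside the kernel tree, and the event/umask encodings are illustrative placeholders, not the values any real patch uses:

```c
#include "../../events/perf_event.h" /* For X86_CONFIG() */

static struct perf_event_attr perf_miss_attr = {
	.type		= PERF_TYPE_RAW,
	.size		= sizeof(struct perf_event_attr),
	.pinned		= 1,
	.disabled	= 0,
	.exclude_user	= 1,
};

static void pseudo_lock_attr_init(void)
{
	/*
	 * X86_CONFIG() expands to a compound literal and so cannot
	 * appear in the static designated initializer above; fill in
	 * .config at init time instead. 0xd1/0x10 are placeholder
	 * event/umask values for illustration only.
	 */
	perf_miss_attr.config = X86_CONFIG(.event = 0xd1, .umask = 0x10);
}
```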
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter and Tony,

On 8/8/2018 12:51 AM, Peter Zijlstra wrote:
> On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>>> FWIW, how long is that IRQ disabled section? It looks like something
>>> that could be taking a bit of time. We have these people that care about
>>> IRQ latency.
>>
>> We work closely with customers needing low latency as well as customers
>> needing deterministic behavior.
>>
>> This measurement is triggered by the user as a validation mechanism of
>> the pseudo-locked memory region after it has been created as part of
>> system setup, as well as during runtime if there are any concerns with
>> the performance of an application that uses it.
>>
>> This measurement would thus be triggered before the sensitive workloads
>> start - during system setup, or if an issue is already present. In
>> either case the measurement is triggered by the administrator via debugfs.
>
> That does not in fact include the answer to the question. Also, it
> assumes a competent operator (something I've found is not always true).

My apologies, I did not intend to avoid your question. The main goal of this measurement is for an operator to test if a pseudo-locked region has been set up correctly. It is thus targeted for system setup time, before the IRQ sensitive workloads - some of which would use these memory regions - start.

>>> - I don't much fancy people accessing the guts of events like that;
>>>   would not an inline function like:
>>>
>>>   static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>>   {
>>>   	u64 val;
>>>
>>>   	lockdep_assert_irqs_disabled();
>>>
>>>   	rdpmcl(event->hw.event_base_rdpmc, val);
>>>   	return val;
>>>   }
>>>
>>>   Work for you?
>>
>> No. This does not provide accurate results. Implementing the above produces:
>>     pseudo_lock_mea-366   [002]    34.950740: pseudo_lock_l2: hits=4096 miss=4
>
> But it being an inline function should allow the compiler to optimize
> and lift the event->hw.event_base_rdpmc load like you now do manually.
> Also, like Tony already suggested, you can prime that load just fine by
> doing an extra invocation.
>
> (and note that the above function is _much_ simpler than
> perf_event_read_local())

Unfortunately I do not find this to be the case. When I implement x86_perf_rdpmc() _exactly_ as you suggest above and do the measurement like:

l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
l2_hits_after = x86_perf_rdpmc(l2_hit_event);
l2_miss_after = x86_perf_rdpmc(l2_miss_event);

then the results are not accurate, nor are they consistently inaccurate enough to treat as a constant adjustment:

pseudo_lock_mea-409   [002]   194.322611: pseudo_lock_l2: hits=4100 miss=0
pseudo_lock_mea-412   [002]   195.520203: pseudo_lock_l2: hits=4096 miss=3
pseudo_lock_mea-415   [002]   196.571114: pseudo_lock_l2: hits=4097 miss=3
pseudo_lock_mea-422   [002]   197.629118: pseudo_lock_l2: hits=4097 miss=3
pseudo_lock_mea-425   [002]   198.687160: pseudo_lock_l2: hits=4096 miss=3
pseudo_lock_mea-428   [002]   199.744156: pseudo_lock_l2: hits=4096 miss=2
pseudo_lock_mea-431   [002]   200.801131: pseudo_lock_l2: hits=4097 miss=2
pseudo_lock_mea-434   [002]   201.858141: pseudo_lock_l2: hits=4097 miss=2
pseudo_lock_mea-437   [002]   202.917168: pseudo_lock_l2: hits=4096 miss=2

I was able to test Tony's theory, and replacing the reading of the "after" counts with a direct rdpmcl() improves the results. What I mean is this:

l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
rdpmcl(l2_hit_pmcnum, l2_hits_after);
rdpmcl(l2_miss_pmcnum, l2_miss_after);

I did not run my full tests with the above, but a simple read of 256KB of pseudo-locked memory gives:

pseudo_lock_mea-492   [002]   372.001385: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-495   [002]   373.059748: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-498   [002]   374.117027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-501   [002]   375.182864: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-504   [002]   376.243958: pseudo_lock_l2: hits=4096 miss=0

We thus seem to be encountering the issue Tony predicted where the memory being tested is evicting the earlier measurement code and data.

>>> - native_read_pmc(); are you 100% sure this code only ever runs on
>>>   native and not in some dodgy virt environment?
>>
>> My understanding is that a virtual environment would be a customer of a
>> RDT allocation (cache or memory bandwidth). I do not see if/where this
>> is restricted though -
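The eviction effect described in this message can be illustrated with a small userspace model: a read path that costs extra misses only on its first (cold) invocation, so a discarded priming read keeps the real before/after delta clean. Every name here is a hypothetical stand-in, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the priming pattern: the first trip through a counter-read
 * path faults its own code/data into the cache and perturbs the miss
 * count; later trips through the same (now cached) path do not.
 */
static uint64_t misses;		/* the modeled miss counter */
static int read_path_cached;	/* has the read path been touched yet? */

static uint64_t read_event_misses(void)
{
	if (!read_path_cached) {
		misses += 4;		/* cold read path costs extra misses */
		read_path_cached = 1;
	}
	return misses;
}

static uint64_t measure_with_priming(uint64_t workload_misses)
{
	read_event_misses();			/* priming read, result discarded */
	uint64_t before = read_event_misses();	/* real "before" read */
	misses += workload_misses;		/* stands in for the memory access */
	uint64_t after = read_event_misses();
	return after - before;
}
```

In this model the priming read absorbs the cold-path cost, so the reported delta equals only what the workload itself caused; the thread's observation is that on real hardware the workload can re-evict the primed read path, which is why the direct rdpmcl() "after" read helped.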
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On 8/8/2018 9:47 AM, Peter Zijlstra wrote:
> On Wed, Aug 08, 2018 at 03:55:54PM +, Luck, Tony wrote:
>>> So _why_ doesn't this work? As said by Tony, that first call should
>>> prime the caches, so the second and third calls should not generate any
>>> misses.
>>
>> How much code/data is involved? If there is a lot, then you may be unlucky
>> with cache coloring and the later parts of the "prime the caches" code path
>> may evict some lines loaded in the early parts.
>
> Well, Reinette used perf_event_read_local() which is unfortunately quite
> a bit. But the inline I proposed is a single load and depending on
> rdpmcl() or native_read_pmc() a call to or just a single inline asm
> rdpmc instruction.
>
> That should certainly work I think.

I am in the process of testing this variation.

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Wed, Aug 08, 2018 at 03:55:54PM +, Luck, Tony wrote:
>> So _why_ doesn't this work? As said by Tony, that first call should
>> prime the caches, so the second and third calls should not generate any
>> misses.
>
> How much code/data is involved? If there is a lot, then you may be unlucky
> with cache coloring and the later parts of the "prime the caches" code path
> may evict some lines loaded in the early parts.

Well, Reinette used perf_event_read_local() which is unfortunately quite a bit. But the inline I proposed is a single load and depending on rdpmcl() or native_read_pmc() a call to or just a single inline asm rdpmc instruction.

That should certainly work I think.
RE: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
> So _why_ doesn't this work? As said by Tony, that first call should
> prime the caches, so the second and third calls should not generate any
> misses.

How much code/data is involved? If there is a lot, then you may be unlucky with cache coloring and the later parts of the "prime the caches" code path may evict some lines loaded in the early parts.

-Tony
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>> FWIW, how long is that IRQ disabled section? It looks like something
>> that could be taking a bit of time. We have these people that care about
>> IRQ latency.
>
> We work closely with customers needing low latency as well as customers
> needing deterministic behavior.
>
> This measurement is triggered by the user as a validation mechanism of
> the pseudo-locked memory region after it has been created as part of
> system setup as well as during runtime if there are any concerns with
> the performance of an application that uses it.
>
> This measurement would thus be triggered before the sensitive workloads
> start - during system setup, or if an issue is already present. In
> either case the measurement is triggered by the administrator via debugfs.

That does not in fact include the answer to the question. Also, it assumes a competent operator (something I've found is not always true).

>> - I don't much fancy people accessing the guts of events like that;
>>   would not an inline function like:
>>
>>   static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>   {
>>   	u64 val;
>>
>>   	lockdep_assert_irqs_disabled();
>>
>>   	rdpmcl(event->hw.event_base_rdpmc, val);
>>   	return val;
>>   }
>>
>>   Work for you?
>
> No. This does not provide accurate results. Implementing the above produces:
>     pseudo_lock_mea-366   [002]    34.950740: pseudo_lock_l2: hits=4096 miss=4

But it being an inline function should allow the compiler to optimize and lift the event->hw.event_base_rdpmc load like you now do manually. Also, like Tony already suggested, you can prime that load just fine by doing an extra invocation.

(and note that the above function is _much_ simpler than perf_event_read_local())

>> - native_read_pmc(); are you 100% sure this code only ever runs on
>>   native and not in some dodgy virt environment?
>
> My understanding is that a virtual environment would be a customer of a
> RDT allocation (cache or memory bandwidth). I do not see if/where this
> is restricted though - I'll move to rdpmcl() but the usage of a cache
> allocation feature like this from a virtual machine needs more
> investigation.

I can imagine that hypervisors that allow physical partitioning could allow delegating the rdt crud to their guests when they 'own' a full socket or whatever the domain is for this.

> Will do. I created the following helper function that can be used after
> interrupts are disabled:
>
> static inline int perf_event_error_state(struct perf_event *event)
> {
> 	int ret = 0;
> 	u64 tmp;
>
> 	ret = perf_event_read_local(event, &tmp, NULL, NULL);
> 	if (ret < 0)
> 		return ret;
>
> 	if (event->attr.pinned && event->oncpu != smp_processor_id())
> 		return -EBUSY;
>
> 	return ret;
> }

Nah, stick the test in perf_event_read_local(), that actually needs it.

>> Also, while you disable IRQs, your fancy pants loop is still subject to
>> NMIs that can/will perturb your measurements, how do you deal with
>> those?
>
> Customers interested in this feature are familiar with dealing with them
> (and also SMIs). The user space counterpart is able to detect such an
> occurrence.

You're very optimistic about your customers' capabilities. And this might be true for the current people you're talking to, but once this is available and public, joe monkey will have access and he _will_ screw it up.

> Please note that if an NMI arrives it would be handled with the
> currently active cache capacity bitmask so none of the pseudo-locked
> memory will be evicted since no capacity bitmask overlaps with the
> pseudo-locked region.

So exceptions change / have their own bitmask?
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Tue, Aug 07, 2018 at 10:44:44PM -0700, Reinette Chatre wrote: > Hi Tony, > > On 8/7/2018 6:28 PM, Luck, Tony wrote: > > Would it help to call routines to read the "before" values of the counter > > twice. The first time to preload the cache with anything needed to execute > > the perf code path. Indeed, that is the 'common' pattern for this. > First, reading data using perf_event_read_local() called twice. > When testing as follows: > /* create perf events */ > /* disable irq */ > /* disable hw prefetchers */ > /* init local vars */ > /* read before data twice as follows: */ > perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); > perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); > /* read through pseudo-locked memory */ > perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL); > /* re enable hw prefetchers */ > /* enable irq */ > /* write data to tracepoint */ > > With the above I am not able to obtain accurate data: > pseudo_lock_mea-354 [002] 63.045734: pseudo_lock_l2: hits=4103 > miss=6 So _why_ doesn't this work? As said by Tony, that first call should prime the caches, so the second and third calls should not generate any misses. They might cause extra hits though, but that should be a constant amount that is also measurable with a no-op loop and can easily be subtracted.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Tony, On 8/7/2018 6:28 PM, Luck, Tony wrote: > Would it help to call routines to read the "before" values of the counter > twice. The first time to preload the cache with anything needed to execute > the perf code path. >>> In an attempt to improve the accuracy of the above I modified it to the >>> following: >>> >>> /* create the two events as before in "enabled" state */ >>> l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc; >>> l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc; >>> local_irq_disable(); >>> /* disable hw prefetchers */ >>> /* init local vars to loop through pseudo-locked mem > * may take some misses in the perf code > */ > l2_hits_before = native_read_pmc(l2_hit_pmcnum); > l2_miss_before = native_read_pmc(l2_miss_pmcnum); > /* Read counters again, hope no new misses here */ >>> l2_hits_before = native_read_pmc(l2_hit_pmcnum); >>> l2_miss_before = native_read_pmc(l2_miss_pmcnum); >>> /* loop through pseudo-locked mem */ >>> l2_hits_after = native_read_pmc(l2_hit_pmcnum); >>> l2_miss_after = native_read_pmc(l2_miss_pmcnum); >>> /* enable hw prefetchers */ >>> local_irq_enable(); > The end of my previous email to Peter contains a solution that does address all the feedback received up to this point while also being able to obtain (what I thought to be ... more below) accurate results. The code you comment on below is not this latest version, but your suggestion is valuable and I tried it out with two ways of reading the perf data that differ from the code you quote. So, instead of reading data with native_read_pmc() as in the code you quoted, I first test reading the data twice using the original recommendation of "perf_event_read_local()", and second reading the data twice using "rdpmcl()" in place of native_read_pmc(). First, reading data using perf_event_read_local() called twice. 
When testing as follows: /* create perf events */ /* disable irq */ /* disable hw prefetchers */ /* init local vars */ /* read before data twice as follows: */ perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); /* read through pseudo-locked memory */ perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL); /* re enable hw prefetchers */ /* enable irq */ /* write data to tracepoint */ With the above I am not able to obtain accurate data: pseudo_lock_mea-351 [002] 61.859147: pseudo_lock_l2: hits=4109 miss=0 pseudo_lock_mea-354 [002] 63.045734: pseudo_lock_l2: hits=4103 miss=6 pseudo_lock_mea-357 [002] 64.104673: pseudo_lock_l2: hits=4106 miss=3 pseudo_lock_mea-360 [002] 65.174775: pseudo_lock_l2: hits=4105 miss=5 pseudo_lock_mea-367 [002] 66.232308: pseudo_lock_l2: hits=4104 miss=5 pseudo_lock_mea-370 [002] 67.291844: pseudo_lock_l2: hits=4103 miss=6 pseudo_lock_mea-373 [002] 68.348725: pseudo_lock_l2: hits=4105 miss=5 pseudo_lock_mea-376 [002] 69.409738: pseudo_lock_l2: hits=4105 miss=5 pseudo_lock_mea-379 [002] 70.466763: pseudo_lock_l2: hits=4105 miss=5 Second, reading data using rdpmcl() called twice. This is the same solution as documented in my previous email, with the two extra rdpmcl() calls added. 
An overview of the flow: /* create perf events */ /* disable irq */ /* check perf event error state */ /* disable hw prefetchers */ /* init local vars */ /* read before data twice as follows: */ rdpmcl(l2_hit_pmcnum, l2_hits_before); rdpmcl(l2_miss_pmcnum, l2_miss_before); rdpmcl(l2_hit_pmcnum, l2_hits_before); rdpmcl(l2_miss_pmcnum, l2_miss_before); /* read through pseudo-locked memory */ rdpmcl(l2_hit_pmcnum, l2_hits_after); rdpmcl(l2_miss_pmcnum, l2_miss_after); /* re enable hw prefetchers */ /* enable irq */ /* write data to tracepoint */ Here as expected a simple test showed that the data was accurate (hits=4096 miss=0) so I repeated the creation and measurement of pseudo-locked regions at different sizes under different loads. Each possible pseudo-lock region size is created and measured 100 times on an idle system and 100 times on a system with a noisy neighbor - this resulted in a total of 2800 pseudo-lock region creations each followed by a measurement of that region. The results of these tests are the best I have yet seen. In this total of 2800 measurements the number of cache hits was miscounted in only eight measurements - each miscount was an undercount of one. Specifically, a memory region consisting of 8192 cache lines was measured as "hits=8191 miss=0", three memory regions with 12288 cache lines were measured as "hits=12287 miss=0", two memory regions with 10240 cache lines were measured as "hits=10239 miss=0", and two memory regions with 14336 cache lines were measured as "hits=14335 miss=0". I
RE: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Would it help to call routines to read the "before" values of the counter twice. The first time to preload the cache with anything needed to execute the perf code path. >> In an attempt to improve the accuracy of the above I modified it to the >> following: >> >> /* create the two events as before in "enabled" state */ >> l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc; >> l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc; >> local_irq_disable(); >> /* disable hw prefetchers */ >> /* init local vars to loop through pseudo-locked mem * may take some misses in the perf code */ l2_hits_before = native_read_pmc(l2_hit_pmcnum); l2_miss_before = native_read_pmc(l2_miss_pmcnum); /* Read counters again, hope no new misses here */ >> l2_hits_before = native_read_pmc(l2_hit_pmcnum); >> l2_miss_before = native_read_pmc(l2_miss_pmcnum); >> /* loop through pseudo-locked mem */ >> l2_hits_after = native_read_pmc(l2_hit_pmcnum); >> l2_miss_after = native_read_pmc(l2_miss_pmcnum); >> /* enable hw prefetchers */ >> local_irq_enable();
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Mon, Aug 06, 2018 at 04:07:09PM -0700, Reinette Chatre wrote: > I've modified your suggestion slightly in an attempt to gain accuracy. > Now it looks like: > > local_irq_disable(); > /* disable hw prefetchers */ > /* init local vars to loop through pseudo-locked mem */ > perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); > /* loop through pseudo-locked mem */ > perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL); > perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL); > /* enable hw prefetchers */ > local_irq_enable(); > > With the above I do not see the impact of an interference workload > anymore but the results are not yet accurate: > > pseudo_lock_mea-538 [002] 113.296084: pseudo_lock_l2: hits=4103 > miss=2 an error of ~0.2% sounds pretty good to me ;-) FWIW, how long is that IRQ disabled section? It looks like something that could be taking a bit of time. We have these people that care about IRQ latency. > While the results do seem close, reporting a cache miss on memory that > is set up to be locked in cache is significant. If it is 'locked' then how does perf cause misses? I suppose I should go read that earlier email. 
> In an attempt to improve the accuracy of the above I modified it to the > following: > > /* create the two events as before in "enabled" state */ > l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc; > l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc; > local_irq_disable(); > /* disable hw prefetchers */ > /* init local vars to loop through pseudo-locked mem */ > l2_hits_before = native_read_pmc(l2_hit_pmcnum); > l2_miss_before = native_read_pmc(l2_miss_pmcnum); > /* loop through pseudo-locked mem */ > l2_hits_after = native_read_pmc(l2_hit_pmcnum); > l2_miss_after = native_read_pmc(l2_miss_pmcnum); > /* enable hw prefetchers */ > local_irq_enable(); So the above has a number of practical problems: - event_base_rdpmc is subject to change as long as interrupts are enabled. - I don't much fancy people accessing the guts of events like that; would not an inline function like: static inline u64 x86_perf_rdpmc(struct perf_event *event) { u64 val; lockdep_assert_irqs_disabled(); rdpmcl(event->hw.event_base_rdpmc, val); return val; } Work for you? - native_read_pmc(); are you 100% sure this code only ever runs on native and not in some dodgy virt environment? (also, I suppose all hardware supporting this resctrl locking muck actually has rdpmc ;-)) - You used perf_event_attr::pinned=1, this means we could have failed to schedule the event, you need to check error state somewhere and have a credible error handling. For checking error state; you could use perf_event_read_local() early after disabling IRQs, it needs a little extra test though: if (event->attr.pinned && event->oncpu != smp_processor_id()) return -EBUSY; to correctly deal with pinned failure. Place it such that it becomes the last sanity check. Also, while you disable IRQs, your fancy pants loop is still subject to NMIs that can/will perturb your measurements, how do you deal with those?
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter, On 8/6/2018 3:12 PM, Peter Zijlstra wrote: > On Mon, Aug 06, 2018 at 12:50:50PM -0700, Reinette Chatre wrote: >> In my previous email I provided the details of the Cache Pseudo-Locking >> feature implemented on top of resctrl. Please let me know if you would >> like any more details about that. I can send you more materials. > > I've not yet had time to read.. > >> BUG: sleeping function called from invalid context at >> kernel/locking/mutex.c:748 >> >> I thus continued to use the API with interrupts enabled and did the following: >> >> Two new event attributes: >> static struct perf_event_attr l2_miss_attr = { >> .type = PERF_TYPE_RAW, >> .config = (0x10ULL << 8) | 0xd1, > > Please use something like: > > X86_CONFIG(.event=0xd1, .umask=0x10), > > that's ever so much more readable. > >> .size = sizeof(struct perf_event_attr), >> .pinned = 1, >> .disabled = 1, >> .exclude_user = 1 >> }; >> >> static struct perf_event_attr l2_hit_attr = { >> .type = PERF_TYPE_RAW, >> .config = (0x2ULL << 8) | 0xd1, >> .size = sizeof(struct perf_event_attr), >> .pinned = 1, >> .disabled = 1, >> .exclude_user = 1 >> }; >> >> Create the two new events using these attributes: >> l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu, >> NULL, NULL, NULL); >> l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu, NULL, >> NULL, NULL); >> >> Take measurements: >> perf_event_enable(l2_miss_event); >> perf_event_enable(l2_hit_event); >> local_irq_disable(); >> /* Disable hardware prefetchers */ >> /* Loop through pseudo-locked memory */ >> /* Enable hardware prefetchers */ >> local_irq_enable(); >> perf_event_disable(l2_hit_event); >> perf_event_disable(l2_miss_event); >> >> Read results: >> l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running); >> l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running); >> /* Make results available in tracepoints */ > > switch to .disabled=0 and try this for measurement: > > local_irq_disable(); > perf_event_read_local(l2_miss_event, &miss_val1, 
NULL, NULL); > perf_event_read_local(l2_hit_event, &hit_val1, NULL, NULL); > /* do your thing */ > perf_event_read_local(l2_miss_event, &miss_val2, NULL, NULL); > perf_event_read_local(l2_hit_event, &hit_val2, NULL, NULL); > local_irq_enable(); Thank you very much for taking a look and providing your guidance. > > You're running this on the CPU you created the event for, right? Yes. I've modified your suggestion slightly in an attempt to gain accuracy. Now it looks like: local_irq_disable(); /* disable hw prefetchers */ /* init local vars to loop through pseudo-locked mem */ perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL); /* loop through pseudo-locked mem */ perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL); perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL); /* enable hw prefetchers */ local_irq_enable(); With the above I do not see the impact of an interference workload anymore but the results are not yet accurate: pseudo_lock_mea-538 [002] 113.296084: pseudo_lock_l2: hits=4103 miss=2 pseudo_lock_mea-541 [002] 114.349343: pseudo_lock_l2: hits=4102 miss=3 pseudo_lock_mea-544 [002] 115.410206: pseudo_lock_l2: hits=4101 miss=4 pseudo_lock_mea-551 [002] 116.473912: pseudo_lock_l2: hits=4102 miss=3 pseudo_lock_mea-554 [002] 117.532446: pseudo_lock_l2: hits=4100 miss=5 pseudo_lock_mea-557 [002] 118.591121: pseudo_lock_l2: hits=4103 miss=2 pseudo_lock_mea-560 [002] 119.642467: pseudo_lock_l2: hits=4102 miss=3 pseudo_lock_mea-563 [002] 120.698562: pseudo_lock_l2: hits=4102 miss=3 pseudo_lock_mea-566 [002] 121.769348: pseudo_lock_l2: hits=4105 miss=4 In an attempt to improve the accuracy of the above I modified it to the following: /* create the two events as before in "enabled" state */ l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc; l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc; local_irq_disable(); /* disable hw prefetchers */ /* init local vars to loop through 
pseudo-locked mem */ l2_hits_before = native_read_pmc(l2_hit_pmcnum); l2_miss_before = native_read_pmc(l2_miss_pmcnum); /* loop through pseudo-locked mem */ l2_hits_after = native_read_pmc(l2_hit_pmcnum); l2_miss_after = native_read_pmc(l2_miss_pmcnum); /* enable hw prefetchers */ local_irq_enable(); With the above I seem to get the same accuracy as before: pseudo_lock_mea-557 [002] 155.402566: pseudo_lock_l2: hits=4096 miss=0 pseudo_lock_mea-564 [002] 156.441299: pseudo_lock_l2: hits=4096 miss=0 pseudo_lock_mea-567 [002] 157.478605: pseudo_lock_l2: hits=4096 miss=0 pseudo_lock_mea-570 [002] 158.524054: pseudo_lock_l2: hits=4096
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Mon, Aug 06, 2018 at 12:50:50PM -0700, Reinette Chatre wrote: > In my previous email I provided the details of the Cache Pseudo-Locking > feature implemented on top of resctrl. Please let me know if you would > like any more details about that. I can send you more materials. I've not yet had time to read.. > BUG: sleeping function called from invalid context at > kernel/locking/mutex.c:748 > > I thus continued to use the API with interrupts enabled and did the following: > > Two new event attributes: > static struct perf_event_attr l2_miss_attr = { > .type = PERF_TYPE_RAW, > .config = (0x10ULL << 8) | 0xd1, Please use something like: X86_CONFIG(.event=0xd1, .umask=0x10), that's ever so much more readable. > .size = sizeof(struct perf_event_attr), > .pinned = 1, > .disabled = 1, > .exclude_user = 1 > }; > > static struct perf_event_attr l2_hit_attr = { > .type = PERF_TYPE_RAW, > .config = (0x2ULL << 8) | 0xd1, > .size = sizeof(struct perf_event_attr), > .pinned = 1, > .disabled = 1, > .exclude_user = 1 > }; > > Create the two new events using these attributes: > l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu, > NULL, NULL, NULL); > l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu, NULL, > NULL, NULL); > > Take measurements: > perf_event_enable(l2_miss_event); > perf_event_enable(l2_hit_event); > local_irq_disable(); > /* Disable hardware prefetchers */ > /* Loop through pseudo-locked memory */ > /* Enable hardware prefetchers */ > local_irq_enable(); > perf_event_disable(l2_hit_event); > perf_event_disable(l2_miss_event); > > Read results: > l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running); > l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running); > /* Make results available in tracepoints */ switch to .disabled=0 and try this for measurement: local_irq_disable(); perf_event_read_local(l2_miss_event, &miss_val1, NULL, NULL); perf_event_read_local(l2_hit_event, &hit_val1, NULL, NULL); /* do your thing */ perf_event_read_local(l2_miss_event, &miss_val2, 
NULL, NULL); perf_event_read_local(l2_hit_event, _val2, NULL, NULL); local_irq_enable(); You're running this on the CPU you created the event for, right? > With the above implementation and a 256KB pseudo-locked memory region I > obtain the following results: > pseudo_lock_mea-755 [002] 396.946953: pseudo_lock_l2: hits=4140 > The above results are not accurate since it does not reflect the success > of the pseudo-locked region. Expected results are as we can currently > obtain (copying results from previous email): > pseudo_lock_mea-26090 [002] 61838.488027: pseudo_lock_l2: hits=4096 Still fairly close.. only like 44 extra hits or 1% error.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/3/2018 11:37 AM, Reinette Chatre wrote:
> On 8/3/2018 8:25 AM, Peter Zijlstra wrote:
>> On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
>>> You state that you understand what we are trying to do and I hope that I
>>> convinced you that we are not able to accomplish the same by following
>>> your guidance.
>>
>> No, I said I understood your pmc reserve patch and its implications.
>>
>> I have no clue what you're trying to do with resctrl, nor why you think
>> this is not feasible with perf. And if it really is not feasible, you'll
>> have to live without it.

In my previous email I provided the details of the Cache Pseudo-Locking
feature implemented on top of resctrl. Please let me know if you would
like any more details about that. I can send you more materials.

In my previous message I also provided my thoughts on why I believe the
same is not feasible with perf, as commented below ...

> Looking at whether we could build on top of the kernel perf event API
> (perf_event_create_kernel_counter(), perf_event_enable(),
> perf_event_disable(), ...). Just looking at perf_event_enable() -
> ideally this would be as lean as possible so that it only enables the
> event and does not itself contribute to the measurement. First, the
> enabling of the event is not lean enough to fulfill this requirement
> since it executes more code after the event was actually enabled. Also,
> the code relies on a mutex so we cannot use it with interrupts disabled.

I proceeded to modify the implemented debugfs measurements to build on
top of the perf APIs mentioned above. As anticipated, the events could
not be enabled in interrupt context. I get a clear message in this regard:

BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:748

I thus continued to use the API with interrupts enabled and did the
following:

Two new event attributes:

static struct perf_event_attr l2_miss_attr = {
        .type = PERF_TYPE_RAW,
        .config = (0x10ULL << 8) | 0xd1,
        .size = sizeof(struct perf_event_attr),
        .pinned = 1,
        .disabled = 1,
        .exclude_user = 1
};

static struct perf_event_attr l2_hit_attr = {
        .type = PERF_TYPE_RAW,
        .config = (0x2ULL << 8) | 0xd1,
        .size = sizeof(struct perf_event_attr),
        .pinned = 1,
        .disabled = 1,
        .exclude_user = 1
};

Create the two new events using these attributes:

l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu,
                                                 NULL, NULL, NULL);
l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu,
                                                NULL, NULL, NULL);

Take measurements:

perf_event_enable(l2_miss_event);
perf_event_enable(l2_hit_event);
local_irq_disable();
/* Disable hardware prefetchers */
/* Loop through pseudo-locked memory */
/* Enable hardware prefetchers */
local_irq_enable();
perf_event_disable(l2_hit_event);
perf_event_disable(l2_miss_event);

Read results:

l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running);
l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running);
/* Make results available in tracepoints */

With the above implementation and a 256KB pseudo-locked memory region I
obtain the following results:

pseudo_lock_mea-755   [002]   396.946953: pseudo_lock_l2: hits=4140 miss=5
pseudo_lock_mea-762   [002]   397.998864: pseudo_lock_l2: hits=4138 miss=8
pseudo_lock_mea-765   [002]   399.041868: pseudo_lock_l2: hits=4142 miss=5
pseudo_lock_mea-768   [002]   400.086871: pseudo_lock_l2: hits=4141 miss=7
pseudo_lock_mea-771   [002]   401.132921: pseudo_lock_l2: hits=4138 miss=10
pseudo_lock_mea-774   [002]   402.216700: pseudo_lock_l2: hits=4238 miss=46
pseudo_lock_mea-777   [002]   403.312148: pseudo_lock_l2: hits=4142 miss=5
pseudo_lock_mea-780   [002]   404.381674: pseudo_lock_l2: hits=4139 miss=8
pseudo_lock_mea-783   [002]   405.422820: pseudo_lock_l2: hits=4472 miss=79
pseudo_lock_mea-786   [002]   406.495065: pseudo_lock_l2: hits=4140 miss=8
pseudo_lock_mea-793   [002]   407.561383: pseudo_lock_l2: hits=4143 miss=4

The above results are not accurate since they do not reflect the success
of the pseudo-locked region. Expected results are as we can currently
obtain (copying results from previous email):

pseudo_lock_mea-26090 [002] 61838.488027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26097 [002] 61843.689381: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26100 [002] 61848.751411: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26108 [002] 61853.820361: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26111 [002] 61858.880364: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26118 [002] 61863.937343: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26121 [002] 61869.008341: pseudo_lock_l2: hits=4096 miss=0

Could you please guide me on how you would prefer us to use perf in
order to obtain the same accurate results we can currently obtain?
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/3/2018 8:25 AM, Peter Zijlstra wrote:
> On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
>> You state that you understand what we are trying to do and I hope that I
>> convinced you that we are not able to accomplish the same by following
>> your guidance.
>
> No, I said I understood your pmc reserve patch and its implications.
>
> I have no clue what you're trying to do with resctrl, nor why you think
> this is not feasible with perf. And if it really is not feasible, you'll
> have to live without it.

I can surely provide the details on what we are doing with resctrl and
elaborate more on why this is not feasible with the full perf kernel API.

In summary: Building on top of Cache Allocation Technology (CAT) we load
a portion of memory into a specified region of cache. After the cache
region has obtained its data it is configured (via CAT) to only serve
cache hits - this pre-loaded memory cannot be evicted from the cache. We
call this "cache pseudo-locking" - the memory has been "pseudo-locked"
to the cache.

To measure how successful the pseudo-locking of the memory is we can use
the precision capable performance events on our platforms: start the
cache hit and miss counters, read the pseudo-locked memory, stop the
counters. This measurement is done with interrupts and hardware
prefetchers disabled to ensure that _only_ access to the pseudo-locked
memory is measured. Any additional code or data accessed, whether by the
counter management or even by the loops reading the memory itself, can
contribute cache hits/misses to the measurement instead of the memory we
are trying to access. Even within the current measurement code we had to
take a lot of care to not use, for example, pointers to obtain
information about the memory to be measured. The information had to be
in local variables.

Considering a build on top of the kernel perf event API
(perf_event_create_kernel_counter(), perf_event_enable(),
perf_event_disable(), ...): looking just at perf_event_enable(), ideally
this would be as lean as possible so that it only enables the event and
does not itself contribute to the measurement. First, the enabling of
the event is not lean enough to fulfill this requirement since it
executes more code after the event was actually enabled. Also, the code
relies on a mutex so we cannot use it with interrupts disabled.

We have two types of customers of this feature: those who require very
low latency and those who require high determinism. In either case a
measured cache miss is of concern to them and our goal is to provide a
memory region for which the number of cache misses can be demonstrated.

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
> You state that you understand what we are trying to do and I hope that I
> convinced you that we are not able to accomplish the same by following
> your guidance.

No, I said I understood your pmc reserve patch and its implications.

I have no clue what you're trying to do with resctrl, nor why you think
this is not feasible with perf. And if it really is not feasible, you'll
have to live without it.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/3/2018 3:49 AM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 01:43:42PM -0700, Reinette Chatre wrote:
>
>> The goal of this work is to use the existing PMU hardware coordination
>> mechanism to ensure that perf and resctrl will not interfere with each
>> other.
>
> I understand what it does.. but if I'd seen you frobbing at the PMU
> earlier your patches would've never gotten merged.
>
> It's simply not going to happen.

Your two-fold guidance is clear to me: (1) do not use PMU registers
directly, (2) use the perf API instead.

You state that you understand what we are trying to do and I hope that I
convinced you that we are not able to accomplish the same by following
your guidance. Could you please guide us on how to obtain the accurate
results we obtain at this time while satisfying your requirements?

Earlier your response to a similar question was "That's the best you're
going to get" - unfortunately I cannot respectfully say that to our
customers when they have indeed become used to working with accurate
data for more than a year. I really want to find a solution that is
acceptable to mainline instead of having to maintain a solution needed
by customers out of tree.

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 01:43:42PM -0700, Reinette Chatre wrote:
> The goal of this work is to use the existing PMU hardware coordination
> mechanism to ensure that perf and resctrl will not interfere with each
> other.

I understand what it does.. but if I'd seen you frobbing at the PMU
earlier your patches would've never gotten merged.

It's simply not going to happen.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/2/2018 1:13 PM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 01:06:19PM -0700, Dave Hansen wrote:
>> On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
>>>> I totally understand not wanting to fill the tree with code hijacking
>>>> the raw PMU. Is your reaction to this really around not wanting to
>>>> start down the slippery slope that ends up with lots of raw PMU "owners"?
>>> That and the fact that multiple owners directly contradicts what perf set
>>> out to do: provide resource arbitration for the PMU.
>>>
>>> Not being able to use both perf and this resctrl thing at the same time
>>> is utter crap. You will not get special dispensation.

The goal of this work is to use the existing PMU hardware coordination
mechanism to ensure that perf and resctrl will not interfere with each
other. The resctrl debugging is short-lived - it reserves the PMU
hardware while interrupts are disabled and releases the PMU hardware
before re-enabling interrupts. If a perf session is running at that time
then this reservation will ensure it is not interfered with. If a perf
session is not running at the time then it will not even get the
opportunity to attempt to start one.

>> Is there something we could do in the middle, like have perf itself be
>> in charge of doing all the programming, but the pseudo-locking could
>> still _read_ the counters?
>
> perf has all of that.

At the time the counters are read the damage has already been done by
all the extra instructions and data accessed during the enable and
disable of the events.

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 01:06:19PM -0700, Dave Hansen wrote:
> On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
>>> I totally understand not wanting to fill the tree with code hijacking
>>> the raw PMU. Is your reaction to this really around not wanting to
>>> start down the slippery slope that ends up with lots of raw PMU "owners"?
>> That and the fact that multiple owners directly contradicts what perf set
>> out to do: provide resource arbitration for the PMU.
>>
>> Not being able to use both perf and this resctrl thing at the same time
>> is utter crap. You will not get special dispensation.
>
> Is there something we could do in the middle, like have perf itself be
> in charge of doing all the programming, but the pseudo-locking could
> still _read_ the counters?

perf has all of that.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
>> I totally understand not wanting to fill the tree with code hijacking
>> the raw PMU. Is your reaction to this really around not wanting to
>> start down the slippery slope that ends up with lots of raw PMU "owners"?
> That and the fact that multiple owners directly contradicts what perf set
> out to do: provide resource arbitration for the PMU.
>
> Not being able to use both perf and this resctrl thing at the same time
> is utter crap. You will not get special dispensation.

Is there something we could do in the middle, like have perf itself be
in charge of doing all the programming, but the pseudo-locking could
still _read_ the counters?

I guess we'd still end up needing to have perf expose which counter got
assigned to which event, and make sure things like multiplexing are not
in play. But, either way, I think we can live with this measurement
being fragile (like it already is). If it only works when all the
planets are aligned, I think it's still usable enough.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 11:18:01AM -0700, Dave Hansen wrote:
> On 08/02/2018 10:37 AM, Peter Zijlstra wrote:
>>> I do not see how I can do so without incurring the cache hits and misses
>>> from the data needed and instructions run by this interface. Could you
>>> please share how I can do so and still obtain the accurate measurement
>>> of cache residency of a specific memory region?
>> That's the best you're going to get. You do _NOT_ get to use raw PMU.
>
> Hi Peter,
>
> This code is really fidgety. It's really easily perturbed even by tiny
> things like implicit cache accesses from the page walker filling the TLB
> from the page tables.
>
> Adding a bunch more code in the way is surely going to make it more
> fragile and imprecise.
>
> I totally understand not wanting to fill the tree with code hijacking
> the raw PMU. Is your reaction to this really around not wanting to
> start down the slippery slope that ends up with lots of raw PMU "owners"?

That and the fact that multiple owners directly contradicts what perf set
out to do: provide resource arbitration for the PMU.

Not being able to use both perf and this resctrl thing at the same time
is utter crap. You will not get special dispensation.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On 08/02/2018 10:37 AM, Peter Zijlstra wrote:
>> I do not see how I can do so without incurring the cache hits and misses
>> from the data needed and instructions run by this interface. Could you
>> please share how I can do so and still obtain the accurate measurement
>> of cache residency of a specific memory region?
> That's the best you're going to get. You do _NOT_ get to use raw PMU.

Hi Peter,

This code is really fidgety. It's really easily perturbed even by tiny
things like implicit cache accesses from the page walker filling the TLB
from the page tables.

Adding a bunch more code in the way is surely going to make it more
fragile and imprecise.

I totally understand not wanting to fill the tree with code hijacking
the raw PMU. Is your reaction to this really around not wanting to
start down the slippery slope that ends up with lots of raw PMU "owners"?
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 09:44:13AM -0700, Reinette Chatre wrote:
> On 8/2/2018 9:18 AM, Peter Zijlstra wrote:
> > On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:
> >
> >> The current implementation does not coordinate with perf and this is
> >> what I am trying to fix in this series.
> >>
> >> I do respect your NAK but it is not clear to me how to proceed after
> >> obtaining it. Could you please elaborate on what you would prefer as a
> >> solution to ensure accurate measurement of cache-locked data that is
> >> better integrated?
> >
> > We have an in-kernel interface to perf, use that if you want access to
> > the PMU. You will not directly stomp on PMU registers.
>
> I do not see how I can do so without incurring the cache hits and misses
> from the data needed and instructions run by this interface. Could you
> please share how I can do so and still obtain the accurate measurement
> of cache residency of a specific memory region?

That's the best you're going to get. You do _NOT_ get to use raw PMU.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On 8/2/2018 9:18 AM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:
>
>> The current implementation does not coordinate with perf and this is
>> what I am trying to fix in this series.
>>
>> I do respect your NAK but it is not clear to me how to proceed after
>> obtaining it. Could you please elaborate on what you would prefer as a
>> solution to ensure accurate measurement of cache-locked data that is
>> better integrated?
>
> We have an in-kernel interface to perf, use that if you want access to
> the PMU. You will not directly stomp on PMU registers.

I do not see how I can do so without incurring the cache hits and misses
from the data needed and instructions run by this interface. Could you
please share how I can do so and still obtain the accurate measurement
of cache residency of a specific memory region?

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:
> The current implementation does not coordinate with perf and this is
> what I am trying to fix in this series.
>
> I do respect your NAK but it is not clear to me how to proceed after
> obtaining it. Could you please elaborate on what you would prefer as a
> solution to ensure accurate measurement of cache-locked data that is
> better integrated?

We have an in-kernel interface to perf; use that if you want access to
the PMU. You will not directly stomp on PMU registers.
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
Hi Peter,

On 8/2/2018 5:39 AM, Peter Zijlstra wrote:
> On Tue, Jul 31, 2018 at 12:38:27PM -0700, Reinette Chatre wrote:
>> Dear Maintainers,
>>
>> The success of Cache Pseudo-Locking can be measured via the use of
>> performance events. Specifically, the number of cache hits and misses
>> reading a memory region after it has been pseudo-locked to cache. This
>> measurement is triggered via the resctrl debugfs interface.
>>
>> To ensure most accurate results the performance counters and their
>> configuration registers are accessed directly.
>
> NAK on that.

After data is locked to cache we need to measure the success of that.
There is no instruction that we can use to query if a memory address has
been cached, but we can use performance monitoring events, which are
especially valuable on the platforms where they are precise event
capable.

To ensure that we are only measuring the presence of data that should be
locked to cache we need to tightly control how this measurement is done.
For example, on my test system I locked 256KB to the cache and with the
current implementation (tip.git on branch x86/cache) I am able to
accurately measure that this was successful, as seen below (each cache
line within the 256KB is accessed while the performance monitoring
events are active):

pseudo_lock_mea-26090 [002] 61838.488027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26097 [002] 61843.689381: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26100 [002] 61848.751411: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26108 [002] 61853.820361: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26111 [002] 61858.880364: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26118 [002] 61863.937343: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26121 [002] 61869.008341: pseudo_lock_l2: hits=4096 miss=0

The current implementation does not coordinate with perf and this is
what I am trying to fix in this series.

I do respect your NAK but it is not clear to me how to proceed after
obtaining it. Could you please elaborate on what you would prefer as a
solution to ensure accurate measurement of cache-locked data that is
better integrated?

Thank you very much

Reinette
Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
On Tue, Jul 31, 2018 at 12:38:27PM -0700, Reinette Chatre wrote:
> Dear Maintainers,
>
> The success of Cache Pseudo-Locking can be measured via the use of
> performance events. Specifically, the number of cache hits and misses
> reading a memory region after it has been pseudo-locked to cache. This
> measurement is triggered via the resctrl debugfs interface.
>
> To ensure most accurate results the performance counters and their
> configuration registers are accessed directly.

NAK on that.