Re: [PATCH v3 7/8] perf: Define PMU_TXN_READ interface
On Tue, Jul 14, 2015 at 08:01:54PM -0700, Sukadev Bhattiprolu wrote:
> +/*
> + * Use the transaction interface to read the group of events in @leader.
> + * PMUs like the 24x7 counters in Power, can use this to queue the events
> + * in the ->read() operation and perform the actual read in ->commit_txn.
> + *
> + * Other PMUs can ignore the ->start_txn and ->commit_txn and read each
> + * PMU directly in the ->read() operation.
> + */
> +static int perf_event_read_group(struct perf_event *leader)
> +{
> +	int ret;
> +	struct perf_event *sub;
> +	struct pmu *pmu;
> +
> +	pmu = leader->pmu;
> +
> +	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> +
> +	perf_event_read(leader);

There should be a lockdep assert with that list iteration.

> +	list_for_each_entry(sub, &leader->sibling_list, group_entry)
> +		perf_event_read(sub);
> +
> +	ret = pmu->commit_txn(pmu);
> +
> +	return ret;
> +}

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 7/8] perf: Define PMU_TXN_READ interface
Peter Zijlstra [pet...@infradead.org] wrote:
| On Tue, Jul 14, 2015 at 08:01:54PM -0700, Sukadev Bhattiprolu wrote:
| > +/*
| > + * Use the transaction interface to read the group of events in @leader.
| > + * PMUs like the 24x7 counters in Power, can use this to queue the events
| > + * in the ->read() operation and perform the actual read in ->commit_txn.
| > + *
| > + * Other PMUs can ignore the ->start_txn and ->commit_txn and read each
| > + * PMU directly in the ->read() operation.
| > + */
| > +static int perf_event_read_group(struct perf_event *leader)
| > +{
| > +	int ret;
| > +	struct perf_event *sub;
| > +	struct pmu *pmu;
| > +
| > +	pmu = leader->pmu;
| > +
| > +	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
| > +
| > +	perf_event_read(leader);
|
| There should be a lockdep assert with that list iteration.
|
| > +	list_for_each_entry(sub, &leader->sibling_list, group_entry)
| > +		perf_event_read(sub);
| > +
| > +	ret = pmu->commit_txn(pmu);

Peter,

I have a situation :-)

We are trying to use the following interface:

	start_txn(pmu, PERF_PMU_TXN_READ);

	perf_event_read(leader);
	list_for_each(sibling, &leader->sibling_list, group_entry)
		perf_event_read(sibling);

	pmu->commit_txn(pmu);

with the idea that the PMU driver would save the type of transaction in
->start_txn() and use it in ->read() and ->commit_txn().

But since the ->start_txn() and ->read() operations could happen on
different CPUs (perf_event_read() uses event->oncpu to schedule a call),
the PMU driver cannot use a per-cpu variable to save the state in
->start_txn().

I tried using a pmu-wide global, but that would also require us to hold
a mutex to serialize access to that global. The problem is that
->start_txn() can be called from interrupt context for TXN_ADD
transactions (I got the following backtrace during testing):

	mutex_lock_nested+0x504/0x520 (unreliable)
	h_24x7_event_start_txn+0x3c/0xd0
	group_sched_in+0x70/0x230
	ctx_sched_in.isra.63+0x150/0x230
	__perf_install_in_context+0x1c8/0x1e0
	remote_function+0x7c/0xa0
	flush_smp_call_function_queue+0xb0/0x1d0
	smp_ipi_demux+0x88/0xf0
	icp_hv_ipi_action+0x54/0xc0
	handle_irq_event_percpu+0x98/0x2b0
	handle_percpu_irq+0x7c/0xc0
	generic_handle_irq+0x4c/0x80
	__do_irq+0x7c/0x190
	call_do_irq+0x14/0x24
	do_IRQ+0x8c/0x100
	hardware_interrupt_common+0x168/0x180
	--- interrupt: 501 at .plpar_hcall_norets+0x14/0x20

Basically, I am stuck trying to save the txn type in ->start_txn() and
retrieve it in ->read(). A couple of options I can think of are:

 - have ->start_txn() return a handle that is then passed in to
   ->read() (yuck) and ->commit_txn(), or

 - serialize the READ transaction for the PMU in perf_event_read_group()
   with a new pmu->txn_mutex:

	mutex_lock(&pmu->txn_mutex);

	pmu->start_txn(pmu, PERF_PMU_TXN_READ);

	list_for_each_entry(sub, &leader->sibling_list, group_entry)
		perf_event_read(sub);

	ret = pmu->commit_txn(pmu);

	mutex_unlock(&pmu->txn_mutex);

   Such serialization would be OK with the 24x7 counters (they are
   system-wide counters anyway), and we could maybe skip the mutex for
   PMUs that don't implement the TXN_READ interface.

Or is there a better way?

Sukadev
Re: [PATCH v3 7/8] perf: Define PMU_TXN_READ interface
On Tue, Jul 21, 2015 at 06:50:45PM -0700, Sukadev Bhattiprolu wrote:
> We are trying to use the following interface:
>
> 	start_txn(pmu, PERF_PMU_TXN_READ);
>
> 	perf_event_read(leader);
> 	list_for_each(sibling, &leader->sibling_list, group_entry)
> 		perf_event_read(sibling);
>
> 	pmu->commit_txn(pmu);
>
> with the idea that the PMU driver would save the type of transaction in
> ->start_txn() and use in ->read() and ->commit_txn().
>
> But since ->start_txn() and the ->read() operations could happen on different
> CPUs (perf_event_read() uses the event->oncpu to schedule a call), the PMU
> driver cannot use a per-cpu variable to save the state in ->start_txn().

> or is there better way?

I've not woken up yet, and not actually fully read the email, but can
you stuff the entire above chunk inside the IPI?

I think you could then actually optimize __perf_event_read() as well,
because all these events should be on the same context, so no point in
calling update_*time*() for every event or so.
Re: [PATCH v3 7/8] perf: Define PMU_TXN_READ interface
Peter Zijlstra [pet...@infradead.org] wrote:
| I've not woken up yet, and not actually fully read the email, but can
| you stuff the entire above chunk inside the IPI?
|
| I think you could then actually optimize __perf_event_read() as well,
| because all these events should be on the same context, so no point in
| calling update_*time*() for every event or so.
|

Do you mean something like this (will move the rename to a separate
patch before posting):

--

From e8eddb5d3877ebdb3b71213a00aaa980f4010dd0 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu
Date: Tue, 7 Jul 2015 21:45:23 -0400
Subject: [PATCH 1/1] perf: Define PMU_TXN_READ interface

Define a new PERF_PMU_TXN_READ interface to read a group of counters
at once.

Note that we use this interface with all PMUs. PMUs that implement this
interface use the ->read() operation to _queue_ the counters to be read
and use ->commit_txn() to actually read all the queued counters at once.

PMUs that don't implement PERF_PMU_TXN_READ ignore ->start_txn() and
->commit_txn() and continue to read counters one at a time.

Thanks to input from Peter Zijlstra.

Signed-off-by: Sukadev Bhattiprolu
---
Changelog[v5]
	[Peter Zijlstra] Ensure the entire transaction happens on the
	same CPU.

Changelog[v4]
	[Peter Zijlstra] Add lockdep_assert_held() in
	perf_event_read_group()
---
 include/linux/perf_event.h |    1 +
 kernel/events/core.c       |   72 +---
 2 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 44bf05f..da307ad 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -169,6 +169,7 @@ struct perf_event;
 #define PERF_EVENT_TXN 0x1

 #define PERF_PMU_TXN_ADD	0x1 /* txn to add/schedule event on PMU */
+#define PERF_PMU_TXN_READ	0x2 /* txn to read event group from PMU */

 /**
  * pmu::capabilities flags
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a6bd09d..7177dd8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3174,12 +3174,8 @@ void perf_event_exec(void)
 	rcu_read_unlock();
 }

-/*
- * Cross CPU call to read the hardware event
- */
-static void __perf_event_read(void *info)
+static void __perf_event_read(struct perf_event *event, int update_ctx)
 {
-	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);

@@ -3194,7 +3190,7 @@ static void __perf_event_read(void *info)
 		return;

 	raw_spin_lock(&ctx->lock);
-	if (ctx->is_active) {
+	if (ctx->is_active && update_ctx) {
 		update_context_time(ctx);
 		update_cgrp_time_from_event(event);
 	}
@@ -3204,6 +3200,16 @@ static void __perf_event_read(void *info)
 	raw_spin_unlock(&ctx->lock);
 }

+/*
+ * Cross CPU call to read the hardware event
+ */
+static void __perf_event_read_ipi(void *info)
+{
+	struct perf_event *event = info;
+
+	__perf_event_read(event, 1);
+}
+
 static inline u64 perf_event_count(struct perf_event *event)
 {
 	if (event->pmu->count)
@@ -3220,7 +3226,7 @@ static void perf_event_read(struct perf_event *event)
 	 */
 	if (event->state == PERF_EVENT_STATE_ACTIVE) {
 		smp_call_function_single(event->oncpu,
-					 __perf_event_read, event, 1);
+					 __perf_event_read_ipi, event, 1);
 	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
 		struct perf_event_context *ctx = event->ctx;
 		unsigned long flags;
@@ -3765,6 +3771,36 @@ static void orphans_remove_work(struct work_struct *work)
 	put_ctx(ctx);
 }

+/*
+ * Use the transaction interface to read the group of events in @leader.
+ * PMUs like the 24x7 counters in Power, can use this to queue the events
+ * in the ->read() operation and perform the actual read in ->commit_txn.
+ *
+ * Other PMUs can ignore the ->start_txn and ->commit_txn and read each
+ * PMU directly in the ->read() operation.
+ */
+static int perf_event_read_group(struct perf_event *leader)
+{
+	int ret;
+	struct perf_event *sub;
+	struct pmu *pmu;
+	struct perf_event_context *ctx = leader->ctx;
+
+	lockdep_assert_held(&ctx->mutex);
+
+	pmu = leader->pmu;
+
+	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
+
+	__perf_event_read(leader, 1);
+	list_for_each_entry(sub, &leader->sibling_list, group_entry)
+		__perf_event_read(sub, 0);
+
+	ret = pmu->commit_txn(pmu);
+
+	return ret;
+}
+
 u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 {
 	u64 total = 0;
@@ -3794,7 +3830,17 @@ static int perf_read_group(struct perf_event *event,

 	lockdep_assert_held(&ctx->mutex);

-	count = perf_event_read_value(leader, &enabled, &running
Re: [PATCH v3 7/8] perf: Define PMU_TXN_READ interface
On Wed, Jul 22, 2015 at 04:19:16PM -0700, Sukadev Bhattiprolu wrote:
> Peter Zijlstra [pet...@infradead.org] wrote:
> | I've not woken up yet, and not actually fully read the email, but can
> | you stuff the entire above chunk inside the IPI?
> |
> | I think you could then actually optimize __perf_event_read() as well,
> | because all these events should be on the same context, so no point in
> | calling update_*time*() for every event or so.
> |
>
> Do you mean something like this (will move the rename to a separate
> patch before posting):

More like so.. please double check, I've not even had tea yet.

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3174,14 +3174,22 @@ void perf_event_exec(void)
 	rcu_read_unlock();
 }

+struct perf_read_data {
+	struct perf_event *event;
+	bool group;
+	int ret;
+};
+
 /*
  * Cross CPU call to read the hardware event
  */
 static void __perf_event_read(void *info)
 {
-	struct perf_event *event = info;
+	struct perf_read_data *data = info;
+	struct perf_event *sub, *event = data->event;
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct pmu *pmu = event->pmu;

 	/*
 	 * If this is a task context, we need to check whether it is
@@ -3199,8 +3207,23 @@ static void __perf_event_read(void *info
 		update_cgrp_time_from_event(event);
 	}
 	update_event_times(event);
-	if (event->state == PERF_EVENT_STATE_ACTIVE)
-		event->pmu->read(event);
+	if (event->state != PERF_EVENT_STATE_ACTIVE)
+		goto unlock;
+
+	if (!data->group) {
+		pmu->read(event);
+		goto unlock;
+	}
+
+	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
+	pmu->read(event);
+	list_for_each_entry(sub, &event->sibling_list, group_entry) {
+		if (sub->state == PERF_EVENT_STATE_ACTIVE)
+			pmu->read(sub);
+	}
+	data->ret = pmu->commit_txn(pmu);
+
+unlock:
 	raw_spin_unlock(&ctx->lock);
 }

@@ -3212,15 +3235,23 @@ static inline u64 perf_event_count(struc
 	return __perf_event_count(event);
 }

-static void perf_event_read(struct perf_event *event)
+static int perf_event_read(struct perf_event *event, bool group)
 {
+	int ret = 0;
+
 	/*
 	 * If event is enabled and currently active on a CPU, update the
 	 * value in the event structure:
 	 */
 	if (event->state == PERF_EVENT_STATE_ACTIVE) {
+		struct perf_read_data data = {
+			.event = event,
+			.group = group,
+			.ret = 0,
+		};
 		smp_call_function_single(event->oncpu,
-					 __perf_event_read, event, 1);
+					 __perf_event_read, &data, 1);
+		ret = data.ret;
 	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
 		struct perf_event_context *ctx = event->ctx;
 		unsigned long flags;
@@ -3235,9 +3266,14 @@ static void perf_event_read(struct perf_
 			update_context_time(ctx);
 			update_cgrp_time_from_event(event);
 		}
-		update_event_times(event);
+		if (group)
+			update_group_times(event);
+		else
+			update_event_times(event);
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
+
+	return ret;
 }

 /*
@@ -3718,7 +3754,6 @@ static u64 perf_event_compute(struct per
 		atomic64_read(&event->child_total_time_running);

 	list_for_each_entry(child, &event->child_list, child_list) {
-		perf_event_read(child);
 		total += perf_event_count(child);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
@@ -3772,7 +3807,7 @@ u64 perf_event_read_value(struct perf_ev

 	mutex_lock(&event->child_mutex);

-	perf_event_read(event);
+	perf_event_read(event, false);
 	total = perf_event_compute(event, enabled, running);
 	mutex_unlock(&event->child_mutex);
@@ -3792,7 +3827,11 @@ static int perf_read_group(struct perf_e

 	lockdep_assert_held(&ctx->mutex);

-	count = perf_event_read_value(leader, &enabled, &running);
+	ret = perf_event_read(leader, true);
+	if (ret)
+		return ret;
+
+	count = perf_event_compute(leader, &enabled, &running);

 	values[n++] = 1 + leader->nr_siblings;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
@@ -3813,7 +3852,7 @@ static int perf_read_group(struct perf_e
 	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
 		n = 0;

-		values[n++] = perf_event_read_value(sub, &enabled, &runnin
Re: [PATCH v3 7/8] perf: Define PMU_TXN_READ interface
Peter Zijlstra [pet...@infradead.org] wrote:
| On Wed, Jul 22, 2015 at 04:19:16PM -0700, Sukadev Bhattiprolu wrote:
| > Peter Zijlstra [pet...@infradead.org] wrote:
| > | I've not woken up yet, and not actually fully read the email, but can
| > | you stuff the entire above chunk inside the IPI?
| > |
| > | I think you could then actually optimize __perf_event_read() as well,
| > | because all these events should be on the same context, so no point in
| > | calling update_*time*() for every event or so.
| > |
| >
| > Do you mean something like this (will move the rename to a separate
| > patch before posting):
|
| More like so.. please double check, I've not even had tea yet.

Yeah, I realized I had ignored the 'event->cpu' spec. Will try this out.

Thanks,

Sukadev