Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-25 Thread Yonghong Song



On 9/21/17 8:15 AM, Alexei Starovoitov wrote:
> On 9/20/17 4:07 PM, David Miller wrote:
>> From: Peter Zijlstra 
>> Date: Wed, 20 Sep 2017 19:26:51 +0200
>>
>>> Dave, could we have this in a topic tree of sorts, because I have a
>>> pending series to rework all the timekeeping and it might be nice to not
>>> have sfr run into all sorts of conflicts.
>>
>> If you want to merge it into your tree that's fine.
>>
>> But it means that any further development done on top of this
>> work will need to go via you as well.
>
> can we merge this set of patches into both net-next and tip ?
> We did such tricks in the past to avoid merge conflicts and
> everything went fine during merge window.


Hi, Peter,

Any suggestion for merging this patch? Does Alexei's idea sound
reasonable to you?

Thanks,

Yonghong


Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-21 Thread Alexei Starovoitov

On 9/20/17 4:07 PM, David Miller wrote:
> From: Peter Zijlstra 
> Date: Wed, 20 Sep 2017 19:26:51 +0200
>
>> Dave, could we have this in a topic tree of sorts, because I have a
>> pending series to rework all the timekeeping and it might be nice to not
>> have sfr run into all sorts of conflicts.
>
> If you want to merge it into your tree that's fine.
>
> But it means that any further development done on top of this
> work will need to go via you as well.

can we merge this set of patches into both net-next and tip ?
We did such tricks in the past to avoid merge conflicts and
everything went fine during merge window.



Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-20 Thread David Miller
From: Peter Zijlstra 
Date: Wed, 20 Sep 2017 19:26:51 +0200

> Dave, could we have this in a topic tree of sorts, because I have a
> pending series to rework all the timekeeping and it might be nice to not
> have sfr run into all sorts of conflicts.

If you want to merge it into your tree that's fine.

But it means that any further development done on top of this
work will need to go via you as well.


Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-20 Thread Peter Zijlstra
On Tue, Sep 19, 2017 at 11:09:32PM -0700, Yonghong Song wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3e691b7..2d5bbe5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3684,10 +3684,12 @@ static inline u64 perf_event_count(struct perf_event *event)
>   * will not be local and we cannot read them atomically
>   *   - must not have a pmu::count method
>   */
> -int perf_event_read_local(struct perf_event *event, u64 *value)
> +int perf_event_read_local(struct perf_event *event, u64 *value,
> +   u64 *enabled, u64 *running)
>  {
>   unsigned long flags;
>   int ret = 0;
> + u64 now;
>  
>   /*
>* Disabling interrupts avoids all counter scheduling (context
> @@ -3718,14 +3720,21 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>   goto out;
>   }
>  
> + now = event->shadow_ctx_time + perf_clock();
> + if (enabled)
> + *enabled = now - event->tstamp_enabled;
>   /*
>* If the event is currently on this CPU, its either a per-task event,
>* or local to this CPU. Furthermore it means its ACTIVE (otherwise
>* oncpu == -1).
>*/
> - if (event->oncpu == smp_processor_id())
> + if (event->oncpu == smp_processor_id()) {
>   event->pmu->read(event);
> -
> + if (running)
> + *running = now - event->tstamp_running;
> + } else if (running) {
> + *running = event->total_time_running;
> + }
>   *value = local64_read(&event->count);
>  out:
>   local_irq_restore(flags);

Yeah, this looks about right.

Dave, could we have this in a topic tree of sorts, because I have a
pending series to rework all the timekeeping and it might be nice to not
have sfr run into all sorts of conflicts.


[PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-20 Thread Yonghong Song
Hardware pmu counters are a limited resource. When more pmu-based
perf events are opened than there are available counters, the kernel
will multiplex these events so that each event gets a certain
percentage (but not 100%) of the pmu time. When multiplexing happens,
the number of samples or the counter value will not reflect what it
would be without multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or the counter value should be
normalized before comparing against other experiments. The typical
normalization is done as follows:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time the event has been enabled and
time_running is the time the event has been running since the last
normalization.

This patch adds the helper bpf_perf_event_read_value for kprobe-based
perf event array maps, to read the perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To compute the scaling factor between two bpf invocations, users
can use cpu_id as the key (which is typical for the perf array usage model)
to remember the previous value and do the calculation inside the
bpf program.

Signed-off-by: Yonghong Song 
---
 include/linux/perf_event.h |  6 ++++--
 include/uapi/linux/bpf.h   | 19 ++++++++++++++++++-
 kernel/bpf/arraymap.c      |  2 +-
 kernel/bpf/verifier.c      |  4 +++-
 kernel/events/core.c       | 15 ++++++++++++---
 kernel/trace/bpf_trace.c   | 46 +++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 79 insertions(+), 13 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24..21d8c12 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -884,7 +884,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
void *context);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
-int perf_event_read_local(struct perf_event *event, u64 *value);
+int perf_event_read_local(struct perf_event *event, u64 *value,
+ u64 *enabled, u64 *running);
 extern u64 perf_event_read_value(struct perf_event *event,
 u64 *enabled, u64 *running);
 
@@ -1286,7 +1287,8 @@ static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *
 {
return ERR_PTR(-EINVAL);
 }
-static inline int perf_event_read_local(struct perf_event *event, u64 *value)
+static inline int perf_event_read_local(struct perf_event *event, u64 *value,
+   u64 *enabled, u64 *running)
 {
return -EINVAL;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 43ab5c4..ccfe1b1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -582,6 +582,14 @@ union bpf_attr {
  * @map: pointer to sockmap to update
  * @key: key to insert/update sock in map
  * @flags: same flags as map update elem
+ *
+ * int bpf_perf_event_read_value(map, flags, buf, buf_size)
+ * read perf event counter value and perf event enabled/running time
+ * @map: pointer to perf_event_array map
+ * @flags: index of event in the map or bitmask flags
+ * @buf: buf to fill
+ * @buf_size: size of the buf
+ * Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -638,6 +646,7 @@ union bpf_attr {
FN(redirect_map),   \
FN(sk_redirect_map),\
FN(sock_map_update),\
+   FN(perf_event_read_value),  \
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -681,7 +690,9 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
 
-/* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */
+/* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
+ * BPF_FUNC_perf_event_read_value flags.
+ */
 #define BPF_F_INDEX_MASK   0xULL
 #define BPF_F_CURRENT_CPU  BPF_F_INDEX_MASK
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
@@ -864,4 +875,10 @@ enum {
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
 #define TCP_BPF_SNDCWND_CLAMP  1002/* Set sndcwnd_clamp */
 
+struct bpf_perf_event_value {
+   __u64 counter;
+   __u64 enabled;
+   __u64 running;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 98c0f00..68d8666 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -492,7 +492,7 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map,
 
ee = ERR_PTR(-EOPNOTSUPP);
event = perf_file->private_data;
-   if