Re: [PATCH net-next] openvswitch: Fix conntrack compilation without mark.

2015-08-28 Thread David Miller
From: Joe Stringer 
Date: Fri, 28 Aug 2015 19:22:11 -0700

> Fix build with !CONFIG_NF_CONNTRACK_MARK && CONFIG_OPENVSWITCH_CONNTRACK
> 
> Fixes: 182e304 ("openvswitch: Allow matching on conntrack mark")
> Reported-by: Simon Horman 
> Signed-off-by: Joe Stringer 

Applied, thanks.


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread David Miller
From: Raghavendra K T 
Date: Sat, 29 Aug 2015 08:27:15 +0530

> resending the patch with memset. Please let me know if you want to
> resend all the patches.

Do not post patches as replies to existing discussion threads.

Instead, make a new, fresh, patch posting, updating the Subject line
as needed.


Re: [PATCH net-next 7/8] net: thunderx: Support for upto 96 queues for a VF

2015-08-28 Thread David Miller
From: Alexey Klimov 
Date: Sat, 29 Aug 2015 04:45:03 +0300

>> @@ -717,9 +833,24 @@ static void nic_unregister_interrupts(struct nicpf *nic)
>> nic_disable_msix(nic);
>>  }
>>
>> +static int nic_num_sqs_en(struct nicpf *nic, int vf_en)
>> +{
>> +   int pos = 0, sqs_per_vf = MAX_SQS_PER_VF_SINGLE_NODE;
> 
> Please check if you really need to initialize 'pos' by zero here.

Please do _NOT_ quote hundreds of lines of code only to give feedback
on one particular hunk.

Quote _ONLY_ the exact context required, nothing more.

Everyone has to scroll through all of this unrelated crap you quoted,
and that makes more work for everyone.

Think particularly of _ME_ who has to be aware of what's going on in
every discussion thread for every patch that gets posted to this list.


[PATCH 13/31] perf tools: Attach eBPF program to perf event

2015-08-28 Thread Wang Nan
This is the final patch which makes basic BPF filter work. After
applying this patch, users are allowed to use BPF filter like:

 # perf record --event ./hello_world.c ls

In this patch PERF_EVENT_IOC_SET_BPF ioctl is used to attach eBPF
program to a newly created perf event. The file descriptor of the
eBPF program is passed to perf record using previous patches, and
stored into evsel->bpf_fd.

It is possible that different perf events are created for one kprobe
event on different CPUs. In this case, the ioctl returns EEXIST; this
patch doesn't treat that as an error.
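
The attach step itself is small; a minimal standalone sketch (the helper
name is made up, but the ioctl and the EEXIST handling mirror the hunk
below):

 #include <errno.h>
 #include <stdio.h>
 #include <string.h>
 #include <sys/ioctl.h>
 #include <linux/perf_event.h>

 /* Attach a loaded eBPF program (bpf_fd) to an opened perf event
  * (evt_fd).  EEXIST means another per-CPU event already attached
  * the program to the same kprobe; that is not treated as an error. */
 static int attach_bpf(int evt_fd, int bpf_fd)
 {
         if (ioctl(evt_fd, PERF_EVENT_IOC_SET_BPF, bpf_fd) &&
             errno != EEXIST) {
                 fprintf(stderr, "failed to attach bpf fd %d: %s\n",
                         bpf_fd, strerror(errno));
                 return -EINVAL;
         }
         return 0;
 }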

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-26-git-send-email-wangn...@huawei.com
---
 tools/perf/util/evsel.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 6fff961..5f59841 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1365,6 +1365,22 @@ retry_open:
  err);
goto try_fallback;
}
+
+   if (evsel->bpf_fd >= 0) {
+   int evt_fd = FD(evsel, cpu, thread);
+   int bpf_fd = evsel->bpf_fd;
+
+   err = ioctl(evt_fd,
+   PERF_EVENT_IOC_SET_BPF,
+   bpf_fd);
+   if (err && errno != EEXIST) {
+   pr_err("failed to attach bpf fd %d: %s\n",
+  bpf_fd, strerror(errno));
+   err = -EINVAL;
+   goto out_close;
+   }
+   }
+
set_rlimit = NO_CHANGE;
 
/*
-- 
2.1.0



[PATCH 12/31] perf tools: Allow filter option to be applied to bpf object

2015-08-28 Thread Wang Nan
Before this patch, --filter options can't be applied to BPF object
'events'. For example, the following command:

 # perf record -e cycles -e test_bpf.o --exclude-perf -a sleep 1

doesn't apply '--exclude-perf' to the events in test_bpf.o. Instead, the
filter will be applied to the 'cycles' event. This is caused by the
delayed manner of adding real BPF events: because all BPF probing points
are probed in one call, we can't add real events until all BPF objects
are collected. In the previous patch ('perf tools: Enable passing bpf
object file to --event'), nothing is appended to the evlist.

This patch fixes this by utilizing the dummy event linked during
parse_events(). Filter settings go to the dummy event and are synced
with the real events in add_bpf_event().
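
The idea can be sketched with made-up types (this is not perf's API,
just an illustration of copying the filter from the placeholder to the
real events once they exist):

 #include <stdio.h>

 /* Illustrative structures only; perf's real code walks the evlist and
  * matches the dummy evsel by object name before copying the filter. */
 struct fake_evsel {
         const char *name;
         const char *filter;
 };

 static void sync_filter(const struct fake_evsel *dummy,
                         struct fake_evsel *real, int nr_real)
 {
         int i;

         for (i = 0; i < nr_real; i++)
                 if (!real[i].filter)
                         real[i].filter = dummy->filter;
 }

 int main(void)
 {
         struct fake_evsel dummy = { "test_bpf.o", "common_pid != 1" };
         struct fake_evsel real[] = { { "lock_page", NULL } };

         sync_filter(&dummy, real, 1);
         printf("%s filter: %s\n", real[0].name, real[0].filter);
         return 0;
 }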

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/r/1440742821-44548-5-git-send-email-wangn...@huawei.com
---
 tools/perf/builtin-record.c  |  6 -
 tools/perf/util/bpf-loader.c |  8 ++-
 tools/perf/util/bpf-loader.h |  2 ++
 tools/perf/util/evlist.c | 53 +---
 4 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 5051d3b..fd56a5b 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1113,7 +1113,6 @@ int cmd_record(int argc, const char **argv, const char 
*prefix __maybe_unused)
 
argc = parse_options(argc, argv, record_options, record_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
-   perf_evlist__purge_dummy(rec->evlist);
 
if (!argc && target__none(&rec->opts.target))
usage_with_options(record_usage, record_options);
@@ -1178,6 +1177,11 @@ int cmd_record(int argc, const char **argv, const char 
*prefix __maybe_unused)
pr_err("Failed to add events from BPF object(s)\n");
goto out_symbol_exit;
}
+   /*
+* Until now let's purge dummy event. Filter options should
+* have been attached to real events by perf_evlist__add_bpf().
+*/
+   perf_evlist__purge_dummy(rec->evlist);
 
symbol__init(NULL);
 
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 126aa71..c3bc0a8 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -293,6 +293,12 @@ int bpf__foreach_tev(bpf_prog_iter_callback_t func, void 
*arg)
int err;
 
bpf_object__for_each_safe(obj, tmp) {
+   const char *obj_name;
+
+   obj_name = bpf_object__get_name(obj);
+   if (!obj_name)
+   obj_name = "[unknown]";
+
bpf_object__for_each_program(prog, obj) {
struct probe_trace_event *tev;
struct perf_probe_event *pev;
@@ -316,7 +322,7 @@ int bpf__foreach_tev(bpf_prog_iter_callback_t func, void 
*arg)
return fd;
}
 
-   err = func(tev, fd, arg);
+   err = func(tev, obj_name, fd, arg);
if (err) {
pr_debug("bpf: call back failed, stop iterate\n");
return err;
diff --git a/tools/perf/util/bpf-loader.h b/tools/perf/util/bpf-loader.h
index 34656f8..323e664 100644
--- a/tools/perf/util/bpf-loader.h
+++ b/tools/perf/util/bpf-loader.h
@@ -6,6 +6,7 @@
 #define __BPF_LOADER_H
 
 #include 
+#include 
 #include 
 #include "probe-event.h"
 #include "debug.h"
@@ -13,6 +14,7 @@
 #define PERF_BPF_PROBE_GROUP "perf_bpf_probe"
 
 typedef int (*bpf_prog_iter_callback_t)(struct probe_trace_event *tev,
+   const char *obj_name,
int fd, void *arg);
 
 #ifdef HAVE_LIBBPF_SUPPORT
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index f79bbf8..c00e939 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -197,7 +197,45 @@ error:
return -ENOMEM;
 }
 
-static int add_bpf_event(struct probe_trace_event *tev, int fd,
+static void
+sync_with_dummy(struct perf_evlist *evlist, const char *obj_name,
+   struct list_head *list)
+{
+   struct perf_evsel *dummy_evsel, *pos;
+   const char *filter;
+   bool found = false;
+   int err;
+
+   evlist__for_each(evlist, dummy_evsel) {
+   if (!perf_evsel__is_dummy(dummy_evsel))
+   continue;
+
+   if (strcmp(dummy_evsel->name, obj_name) == 0) {
+   found = true;
+   break;
+   }
+   }
+
+   if (!found) {
+ 

[PATCH 30/31] perf tools: Fix cross compiling error

2015-08-28 Thread Wang Nan
Cross compiling perf for another platform fails due to missing symbols:

  ...
  AR   /pathofperf/libperf.a
  LD   /pathofperf/tests/perf-in.o
  LD   /pathofperf/perf-in.o
  LINK /pathofperf/perf
/pathofperf/libperf.a(libperf-in.o): In function `intel_pt_synth_branch_sample':
/usr/src/kernel/tools/perf/util/intel-pt.c:899: undefined reference to 
`tsc_to_perf_time'
/pathofperf/libperf.a(libperf-in.o): In function 
`intel_pt_synth_transaction_sample':
/usr/src/kernel/tools/perf/util/intel-pt.c:992: undefined reference to 
`tsc_to_perf_time'
/pathofperf/libperf.a(libperf-in.o): In function 
`intel_pt_synth_instruction_sample':
/usr/src/kernel/tools/perf/util/intel-pt.c:943: undefined reference to 
`tsc_to_perf_time'
  ...

This is because the newly introduced intel-pt-decoder is allowed to be
compiled for platforms other than x86, while tsc.c, which it requires,
is compiled for x86 only.

This patch fixes the compile error by allowing tsc.c to be compiled
whenever CONFIG_AUXTRACE is set, regardless of the target platform.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1440766442-48116-1-git-send-email-wangn...@huawei.com
---
 tools/perf/util/Build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index fd2f084..c8d9c7e 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -74,7 +74,7 @@ libperf-y += stat-shadow.o
 libperf-y += record.o
 libperf-y += srcline.o
 libperf-y += data.o
-libperf-$(CONFIG_X86) += tsc.o
+libperf-$(CONFIG_AUXTRACE) += tsc.o
 libperf-y += cloexec.o
 libperf-y += thread-stack.o
 libperf-$(CONFIG_AUXTRACE) += auxtrace.o
-- 
2.1.0



[PATCH 05/31] perf ebpf: Add the libbpf glue

2015-08-28 Thread Wang Nan
This patch introduces the 'bpf-loader.[ch]' files, which will be the
interface between perf and libbpf. bpf__prepare_load() resides in
bpf-loader.c. Dummy inline functions are provided because bpf-loader.c
is built only when CONFIG_LIBBPF is on.

Functions in bpf-loader.c should not report errors explicitly. Instead,
strerror-style error reporting should be used.
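
A sketch of the resulting calling convention for a bpf-loader user (a
fragment, not standalone; it only uses the two declarations visible in
bpf-loader.h below, and "sample.o" is a made-up path):

 	/* Errors are turned into strings by the caller via the
 	 * bpf__strerror_*() helpers; bpf-loader.c itself stays quiet. */
 	char errbuf[BUFSIZ];
 	int err;

 	err = bpf__prepare_load("sample.o");
 	if (err) {
 		bpf__strerror_prepare_load("sample.o", err,
 					   errbuf, sizeof(errbuf));
 		fprintf(stderr, "%s\n", errbuf);
 	}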

Signed-off-by: Wang Nan 
Acked-by: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/n/1436445342-1402-19-git-send-email-wangn...@huawei.com
[ split from a larger patch ]
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/bpf-loader.c | 92 
 tools/perf/util/bpf-loader.h | 47 ++
 2 files changed, 139 insertions(+)
 create mode 100644 tools/perf/util/bpf-loader.c
 create mode 100644 tools/perf/util/bpf-loader.h

diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
new file mode 100644
index 000..88531ea
--- /dev/null
+++ b/tools/perf/util/bpf-loader.c
@@ -0,0 +1,92 @@
+/*
+ * bpf-loader.c
+ *
+ * Copyright (C) 2015 Wang Nan 
+ * Copyright (C) 2015 Huawei Inc.
+ */
+
+#include 
+#include "perf.h"
+#include "debug.h"
+#include "bpf-loader.h"
+
+#define DEFINE_PRINT_FN(name, level) \
+static int libbpf_##name(const char *fmt, ...) \
+{  \
+   va_list args;   \
+   int ret;\
+   \
+   va_start(args, fmt);\
+   ret = veprintf(level, verbose, pr_fmt(fmt), args);\
+   va_end(args);   \
+   return ret; \
+}
+
+DEFINE_PRINT_FN(warning, 0)
+DEFINE_PRINT_FN(info, 0)
+DEFINE_PRINT_FN(debug, 1)
+
+static bool libbpf_initialized;
+
+int bpf__prepare_load(const char *filename)
+{
+   struct bpf_object *obj;
+
+   if (!libbpf_initialized)
+   libbpf_set_print(libbpf_warning,
+libbpf_info,
+libbpf_debug);
+
+   obj = bpf_object__open(filename);
+   if (!obj) {
+   pr_debug("bpf: failed to load %s\n", filename);
+   return -EINVAL;
+   }
+
+   /*
+* Throw the object pointer away: it will be retrieved using
+* the bpf_object iterator.
+*/
+
+   return 0;
+}
+
+void bpf__clear(void)
+{
+   struct bpf_object *obj, *tmp;
+
+   bpf_object__for_each_safe(obj, tmp)
+   bpf_object__close(obj);
+}
+
+#define bpf__strerror_head(err, buf, size) \
+   char sbuf[STRERR_BUFSIZE], *emsg;\
+   if (!size)\
+   return 0;\
+   if (err < 0)\
+   err = -err;\
+   emsg = strerror_r(err, sbuf, sizeof(sbuf));\
+   switch (err) {\
+   default:\
+   scnprintf(buf, size, "%s", emsg);\
+   break;
+
+#define bpf__strerror_entry(val, fmt...)\
+   case val: {\
+   scnprintf(buf, size, fmt);\
+   break;\
+   }
+
+#define bpf__strerror_end(buf, size)\
+   }\
+   buf[size - 1] = '\0';
+
+int bpf__strerror_prepare_load(const char *filename, int err,
+  char *buf, size_t size)
+{
+   bpf__strerror_head(err, buf, size);
+   bpf__strerror_entry(EINVAL, "%s: BPF object file '%s' is invalid",
+   emsg, filename)
+   bpf__strerror_end(buf, size);
+   return 0;
+}
diff --git a/tools/perf/util/bpf-loader.h b/tools/perf/util/bpf-loader.h
new file mode 100644
index 000..12be630
--- /dev/null
+++ b/tools/perf/util/bpf-loader.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright (C) 2015, Wang Nan 
+ * Copyright (C) 2015, Huawei Inc.
+ */
+#ifndef __BPF_LOADER_H
+#define __BPF_LOADER_H
+
+#include 
+#include 
+#include "debug.h"
+
+#ifdef HAVE_LIBBPF_SUPPORT
+int bpf__prepare_load(const char *filename);
+int bpf__strerror_prepare_load(const char *filename, int err,
+  char *buf, size_t size);
+
+void bpf__clear(void);
+#else
+static inline int bpf__prepare_load(const char *filename __maybe_unused)
+{
+   pr_debug("ERROR: eBPF object loading is disabled during compiling.\n");
+   return -1;
+}
+
+static inline void bpf__clear(void) { }
+
+static inline int
+__bpf_strerror(char *buf, size_t size)
+{
+   if (!size)
+   return 0;
+   strncpy(buf,
+   "ERROR: eBPF object loading is disabled during compiling.\n",
+   size);
+   buf[size - 1] = '\0';
+   return 0;
+}
+
+static inline int
+bpf__strerror_prepare_load(const char *filename __maybe_unused,
+  int err __maybe_unused,
+  char *buf, size_t size)
+{
+   return 

[PATCH 03/31] perf tools: Introduce dummy evsel

2015-08-28 Thread Wang Nan
This patch allows linking a dummy evsel onto the evlist as a
placeholder. It is for the following patch, which allows passing a BPF
object using '--event object.o'.

Unlike other event selectors, passing a BPF object file to '--event'
links nothing onto the evlist. Instead, events described in the BPF
object file are probed and linked in a delayed manner, because we want
to do all the probing work together. Therefore, evsels for events in a
BPF object would be linked at the end of the evlist, which causes a
small problem: if a '--filter' setting is passed after the object file,
the filter option won't be correctly applied to those events.

This patch links a dummy evsel onto the evlist, so a following --filter
can be collected by the dummy evsel. For this reason dummy evsels are
set to PERF_TYPE_TRACEPOINT.

Because dummy evsels may exist, perf_evlist__purge_dummy() must be
called right after parse_options(). This patch adds it to the record,
top, trace and stat builtin commands. A further patch moves it down to
after the real BPF events have been processed.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/r/1440742821-44548-4-git-send-email-wangn...@huawei.com
---
 tools/perf/builtin-record.c|  2 ++
 tools/perf/builtin-stat.c  |  1 +
 tools/perf/builtin-top.c   |  1 +
 tools/perf/builtin-trace.c |  1 +
 tools/perf/util/evlist.c   | 19 +++
 tools/perf/util/evlist.h   |  1 +
 tools/perf/util/evsel.c| 32 
 tools/perf/util/evsel.h|  6 ++
 tools/perf/util/parse-events.c | 25 +
 9 files changed, 84 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a660022..81829de 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1112,6 +1112,8 @@ int cmd_record(int argc, const char **argv, const char 
*prefix __maybe_unused)
 
argc = parse_options(argc, argv, record_options, record_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
+   perf_evlist__purge_dummy(rec->evlist);
+
if (!argc && target__none(&rec->opts.target))
usage_with_options(record_usage, record_options);
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7aa039b..99b62f1 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1208,6 +1208,7 @@ int cmd_stat(int argc, const char **argv, const char 
*prefix __maybe_unused)
 
argc = parse_options(argc, argv, options, stat_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
+   perf_evlist__purge_dummy(evsel_list);
 
interval = stat_config.interval;
 
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 8c465c8..246203b 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1198,6 +1198,7 @@ int cmd_top(int argc, const char **argv, const char 
*prefix __maybe_unused)
perf_config(perf_top_config, &top);
 
argc = parse_options(argc, argv, options, top_usage, 0);
+   perf_evlist__purge_dummy(top.evlist);
if (argc)
usage_with_options(top_usage, options);
 
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 4e3abba..57712b9 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -3099,6 +3099,7 @@ int cmd_trace(int argc, const char **argv, const char 
*prefix __maybe_unused)
 
argc = parse_options_subcommand(argc, argv, trace_options, 
trace_subcommands,
 trace_usage, PARSE_OPT_STOP_AT_NON_OPTION);
+   perf_evlist__purge_dummy(trace.evlist);
 
if (trace.trace_pgfaults) {
trace.opts.sample_address = true;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 8d00039..8a4e64d 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1696,3 +1696,22 @@ void perf_evlist__set_tracking_event(struct perf_evlist 
*evlist,
 
tracking_evsel->tracking = true;
 }
+
+void perf_evlist__purge_dummy(struct perf_evlist *evlist)
+{
+   struct perf_evsel *pos, *n;
+
+   /*
+* Remove all dummy events.
+* During linking, we don't touch anything except link
+* it into evlist. As a result, we don't
+* need to adjust evlist->nr_entries during removal.
+*/
+
+   evlist__for_each_safe(evlist, n, pos) {
+   if (perf_evsel__is_dummy(pos)) {
+   list_del_init(&pos->node);
+   perf_evsel__delete(pos);
+   }
+   }
+}
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index b39a619..7f15727 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -181,6 

[PATCH 18/31] perf test: Add 'perf test BPF'

2015-08-28 Thread Wang Nan
This patch adds a BPF testcase for testing BPF event filtering.

By utilizing the result of 'perf test LLVM', this patch compiles the
eBPF sample program and then tests its ability. The BPF script in 'perf
test LLVM' collects samples from half of the executions of
epoll_pwait(). This patch calls epoll_pwait() 111 times, so the result
should contain 56 samples (every other call, rounding up).

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/n/1440151770-129878-16-git-send-email-wangn...@huawei.com
---
 tools/perf/tests/Build  |   1 +
 tools/perf/tests/bpf.c  | 170 
 tools/perf/tests/builtin-test.c |   4 +
 tools/perf/tests/llvm.c |  19 +
 tools/perf/tests/llvm.h |   1 +
 tools/perf/tests/tests.h|   1 +
 tools/perf/util/bpf-loader.c|  14 
 tools/perf/util/bpf-loader.h|   8 ++
 8 files changed, 218 insertions(+)
 create mode 100644 tools/perf/tests/bpf.c

diff --git a/tools/perf/tests/Build b/tools/perf/tests/Build
index 8c98409..7ceb448 100644
--- a/tools/perf/tests/Build
+++ b/tools/perf/tests/Build
@@ -33,6 +33,7 @@ perf-y += parse-no-sample-id-all.o
 perf-y += kmod-path.o
 perf-y += thread-map.o
 perf-y += llvm.o llvm-src.o
+perf-y += bpf.o
 
 $(OUTPUT)tests/llvm-src.c: tests/bpf-script-example.c
$(Q)echo '#include ' > $@
diff --git a/tools/perf/tests/bpf.c b/tools/perf/tests/bpf.c
new file mode 100644
index 000..6c238ca
--- /dev/null
+++ b/tools/perf/tests/bpf.c
@@ -0,0 +1,170 @@
+#include 
+#include 
+#include 
+#include 
+#include "tests.h"
+#include "llvm.h"
+#include "debug.h"
+#define NR_ITERS   111
+
+#ifdef HAVE_LIBBPF_SUPPORT
+
+static int epoll_pwait_loop(void)
+{
+   int i;
+
+   /* Should fail NR_ITERS times */
+   for (i = 0; i < NR_ITERS; i++)
+   epoll_pwait(-(i + 1), NULL, 0, 0, NULL);
+   return 0;
+}
+
+static int prepare_bpf(void *obj_buf, size_t obj_buf_sz)
+{
+   int err;
+   char errbuf[BUFSIZ];
+
+   err = bpf__prepare_load_buffer(obj_buf, obj_buf_sz, NULL);
+   if (err) {
+   bpf__strerror_prepare_load("[buffer]", false, err, errbuf,
+  sizeof(errbuf));
+   fprintf(stderr, " (%s)", errbuf);
+   return TEST_FAIL;
+   }
+
+   err = bpf__probe();
+   if (err) {
+   bpf__strerror_load(err, errbuf, sizeof(errbuf));
+   fprintf(stderr, " (%s)", errbuf);
+   if (getuid() != 0)
+   fprintf(stderr, " (try run as root)");
+   return TEST_FAIL;
+   }
+
+   err = bpf__load();
+   if (err) {
+   bpf__strerror_load(err, errbuf, sizeof(errbuf));
+   fprintf(stderr, " (%s)", errbuf);
+   return TEST_FAIL;
+   }
+
+   return 0;
+}
+
+static int do_test(void)
+{
+   struct record_opts opts = {
+   .target = {
+   .uid = UINT_MAX,
+   .uses_mmap = true,
+   },
+   .freq = 0,
+   .mmap_pages   = 256,
+   .default_interval = 1,
+   };
+
+   int err, i, count = 0;
+   char pid[16];
+   char sbuf[STRERR_BUFSIZE];
+   struct perf_evlist *evlist;
+
+   snprintf(pid, sizeof(pid), "%d", getpid());
+   pid[sizeof(pid) - 1] = '\0';
+   opts.target.tid = opts.target.pid = pid;
+
+   /* Instead of perf_evlist__new_default, don't add default events */
+   evlist = perf_evlist__new();
+   if (!evlist) {
+   pr_debug("Not enough memory to create evlist\n");
+   return -ENOMEM;
+   }
+
+   err = perf_evlist__create_maps(evlist, &opts.target);
+   if (err < 0) {
+   pr_debug("Not enough memory to create thread/cpu maps\n");
+   goto out_delete_evlist;
+   }
+
+   err = perf_evlist__add_bpf(evlist);
+   if (err) {
+   fprintf(stderr, " (Failed to add events selected by BPF)");
+   goto out_delete_evlist;
+   }
+
+   perf_evlist__config(evlist, &opts);
+
+   err = perf_evlist__open(evlist);
+   if (err < 0) {
+   pr_debug("perf_evlist__open: %s\n",
+strerror_r(errno, sbuf, sizeof(sbuf)));
+   goto out_delete_evlist;
+   }
+
+   err = perf_evlist__mmap(evlist, opts.mmap_pages, false);
+   if (err < 0) {
+   pr_debug("perf_evlist__mmap: %s\n",
+strerror_r(errno, sbuf, sizeof(sbuf)));
+   goto out_delete_evlist;
+   }
+
+   perf_evlist__enable(evlist);
+   epoll_pwait_loop();
+   perf_evlist__disable(evlist);
+
+   for (i = 0; i < evlist->nr_mmaps; i++) {
+   union perf_event *event;
+
+

[PATCH 01/31] bpf tools: New API to get name from a BPF object

2015-08-28 Thread Wang Nan
Before this patch there's no way to connect a loaded bpf object to its
source file. However, when applying perf's '--filter' to a BPF object,
the lack of this connection makes things harder, because perf loads all
programs together while a '--filter' setting is per object.

The API of bpf_object__open_buffer() is changed to allow passing a
name. Fortunately, at this time there's only one user of it (perf test
LLVM), so we update it as well.
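
A short usage sketch of the changed API (the buffer and its size come
from the caller; the include path and the explicit name are assumptions
for illustration):

 #include <stdio.h>
 #include <bpf/libbpf.h>   /* include path assumed */

 static void open_and_show_name(void *obj_buf, size_t obj_buf_sz)
 {
         struct bpf_object *obj;

         /* Passing NULL as the name would make libbpf derive one from
          * the buffer address and size, as in the hunk below. */
         obj = bpf_object__open_buffer(obj_buf, obj_buf_sz, "test_llvm.o");
         if (!obj)
                 return;

         printf("opened object '%s'\n", bpf_object__get_name(obj));
         bpf_object__close(obj);
 }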

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/r/1440742821-44548-2-git-send-email-wangn...@huawei.com
---
 tools/lib/bpf/libbpf.c  | 25 ++---
 tools/lib/bpf/libbpf.h  |  4 +++-
 tools/perf/tests/llvm.c |  2 +-
 3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4fa4bc4..4252fc2 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -880,15 +880,26 @@ struct bpf_object *bpf_object__open(const char *path)
 }
 
 struct bpf_object *bpf_object__open_buffer(void *obj_buf,
-  size_t obj_buf_sz)
+  size_t obj_buf_sz,
+  const char *name)
 {
+   char tmp_name[64];
+
/* param validation */
if (!obj_buf || obj_buf_sz <= 0)
return NULL;
 
-   pr_debug("loading object from buffer\n");
+   if (!name) {
+   snprintf(tmp_name, sizeof(tmp_name), "%lx-%lx",
+(unsigned long)obj_buf,
+(unsigned long)obj_buf_sz);
+   tmp_name[sizeof(tmp_name) - 1] = '\0';
+   name = tmp_name;
+   }
+   pr_debug("loading object '%s' from buffer\n",
+name);
 
-   return __bpf_object__open("[buffer]", obj_buf, obj_buf_sz);
+   return __bpf_object__open(name, obj_buf, obj_buf_sz);
 }
 
 int bpf_object__unload(struct bpf_object *obj)
@@ -975,6 +986,14 @@ bpf_object__next(struct bpf_object *prev)
return next;
 }
 
+const char *
+bpf_object__get_name(struct bpf_object *obj)
+{
+   if (!obj)
+   return NULL;
+   return obj->path;
+}
+
 struct bpf_program *
 bpf_program__next(struct bpf_program *prev, struct bpf_object *obj)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index ea8adc2..f16170c 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -28,12 +28,14 @@ struct bpf_object;
 
 struct bpf_object *bpf_object__open(const char *path);
 struct bpf_object *bpf_object__open_buffer(void *obj_buf,
-  size_t obj_buf_sz);
+  size_t obj_buf_sz,
+  const char *name);
 void bpf_object__close(struct bpf_object *object);
 
 /* Load/unload object into/from kernel */
 int bpf_object__load(struct bpf_object *obj);
 int bpf_object__unload(struct bpf_object *obj);
+const char *bpf_object__get_name(struct bpf_object *obj);
 
 struct bpf_object *bpf_object__next(struct bpf_object *prev);
 #define bpf_object__for_each_safe(pos, tmp)\
diff --git a/tools/perf/tests/llvm.c b/tools/perf/tests/llvm.c
index a337356..52d5597 100644
--- a/tools/perf/tests/llvm.c
+++ b/tools/perf/tests/llvm.c
@@ -26,7 +26,7 @@ static int test__bpf_parsing(void *obj_buf, size_t obj_buf_sz)
 {
struct bpf_object *obj;
 
-   obj = bpf_object__open_buffer(obj_buf, obj_buf_sz);
+   obj = bpf_object__open_buffer(obj_buf, obj_buf_sz, NULL);
if (!obj)
return -1;
bpf_object__close(obj);
-- 
2.1.0



[PATCH 02/31] perf tools: Don't set cmdline_group_boundary if no evsel is collected

2015-08-28 Thread Wang Nan
If parse_events__scanner() collects no entry, perf_evlist__last(evlist)
is invalid, and setting cmdline_group_boundary touches invalid memory.

This can happen in the current BPF implementation, see [1]. Although
that can be fixed, for safety it is better to introduce this check.

Instead of checking the number of entries, check data.list, so we can
add a dummy evsel here.

[1]: 
http://lkml.kernel.org/n/1436445342-1402-19-git-send-email-wangn...@huawei.com

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/r/1440742821-44548-3-git-send-email-wangn...@huawei.com
---
 tools/perf/util/parse-events.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index d826e6f..14cd7e3 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1143,10 +1143,14 @@ int parse_events(struct perf_evlist *evlist, const char 
*str,
int entries = data.idx - evlist->nr_entries;
struct perf_evsel *last;
 
+   if (!list_empty(&data.list)) {
+   last = list_entry(data.list.prev,
+ struct perf_evsel, node);
+   last->cmdline_group_boundary = true;
+   }
+
perf_evlist__splice_list_tail(evlist, &data.list, entries);
evlist->nr_groups += data.nr_groups;
-   last = perf_evlist__last(evlist);
-   last->cmdline_group_boundary = true;
 
return 0;
}
-- 
2.1.0



[PATCH 17/31] perf tests: Enforce LLVM test for BPF test

2015-08-28 Thread Wang Nan
This patch replaces the original toy BPF program with the previously
introduced bpf-script-example.c, dynamically embedding it into
'llvm-src.c'.

The newly introduced BPF program attaches at 'sys_epoll_pwait()' and
collects samples from half of its executions. perf itself never uses
that syscall, so further tests can verify their results with it.

Since BPF programs require the LINUX_VERSION_CODE of the running
kernel, this patch computes that code from uname.

Since the resulting BPF object is useful for further testcases, this
patch introduces 'prepare' and 'cleanup' methods for tests, and makes
test__llvm() create a MAP_SHARED memory area to hold the resulting
object.
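
The version-code computation is the usual KERNEL_VERSION() arithmetic;
a standalone sketch of what compose_source() does in the hunk below:

 #include <stdio.h>
 #include <sys/utsname.h>

 /* Build LINUX_VERSION_CODE for the running kernel from uname(),
  * i.e. (version << 16) + (patchlevel << 8) + sublevel, so that
  * e.g. a 4.2.0 kernel yields 0x00040200. */
 int main(void)
 {
         struct utsname utsname;
         int version, patchlevel, sublevel;

         if (uname(&utsname))
                 return 1;
         if (sscanf(utsname.release, "%d.%d.%d",
                    &version, &patchlevel, &sublevel) != 3)
                 return 1;

         printf("#define LINUX_VERSION_CODE 0x%08lx\n",
                (unsigned long)((version << 16) + (patchlevel << 8) + sublevel));
         return 0;
 }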

Signed-off-by: He Kuang 
Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/n/1440151770-129878-15-git-send-email-wangn...@huawei.com
---
 tools/perf/tests/Build  |   9 +++-
 tools/perf/tests/builtin-test.c |   8 
 tools/perf/tests/llvm.c | 104 +++-
 tools/perf/tests/llvm.h |  14 ++
 tools/perf/tests/tests.h|   2 +
 5 files changed, 123 insertions(+), 14 deletions(-)
 create mode 100644 tools/perf/tests/llvm.h

diff --git a/tools/perf/tests/Build b/tools/perf/tests/Build
index c1518bd..8c98409 100644
--- a/tools/perf/tests/Build
+++ b/tools/perf/tests/Build
@@ -32,7 +32,14 @@ perf-y += sample-parsing.o
 perf-y += parse-no-sample-id-all.o
 perf-y += kmod-path.o
 perf-y += thread-map.o
-perf-y += llvm.o
+perf-y += llvm.o llvm-src.o
+
+$(OUTPUT)tests/llvm-src.c: tests/bpf-script-example.c
+   $(Q)echo '#include ' > $@
+   $(Q)echo 'const char test_llvm__bpf_prog[] =' >> $@
+   $(Q)sed -e 's/"/\\"/g' -e 's/\(.*\)/"\1\\n"/g' $< >> $@
+   $(Q)echo ';' >> $@
+
 
 perf-$(CONFIG_X86) += perf-time-to-tsc.o
 
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 136cd93..1a349e8 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -17,6 +17,8 @@
 static struct test {
const char *desc;
int (*func)(void);
+   void (*prepare)(void);
+   void (*cleanup)(void);
 } tests[] = {
{
.desc = "vmlinux symtab matches kallsyms",
@@ -177,6 +179,8 @@ static struct test {
{
.desc = "Test LLVM searching and compiling",
.func = test__llvm,
+   .prepare = test__llvm_prepare,
+   .cleanup = test__llvm_cleanup,
},
{
.func = NULL,
@@ -265,7 +269,11 @@ static int __cmd_test(int argc, const char *argv[], struct 
intlist *skiplist)
}
 
pr_debug("\n--- start ---\n");
+   if (tests[curr].prepare)
+   tests[curr].prepare();
err = run_test(&tests[curr]);
+   if (tests[curr].cleanup)
+   tests[curr].cleanup();
pr_debug(" end \n%s:", tests[curr].desc);
 
switch (err) {
diff --git a/tools/perf/tests/llvm.c b/tools/perf/tests/llvm.c
index 52d5597..236bf39 100644
--- a/tools/perf/tests/llvm.c
+++ b/tools/perf/tests/llvm.c
@@ -1,9 +1,13 @@
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "tests.h"
 #include "debug.h"
+#include "llvm.h"
 
 static int perf_config_cb(const char *var, const char *val,
  void *arg __maybe_unused)
@@ -11,16 +15,6 @@ static int perf_config_cb(const char *var, const char *val,
return perf_default_config(var, val, arg);
 }
 
-/*
- * Randomly give it a "version" section since we don't really load it
- * into kernel
- */
-static const char test_bpf_prog[] =
-   "__attribute__((section(\"do_fork\"), used)) "
-   "int fork(void *ctx) {return 0;} "
-   "char _license[] __attribute__((section(\"license\"), used)) = \"GPL\";"
-   "int _version __attribute__((section(\"version\"), used)) = 0x40100;";
-
 #ifdef HAVE_LIBBPF_SUPPORT
 static int test__bpf_parsing(void *obj_buf, size_t obj_buf_sz)
 {
@@ -41,12 +35,44 @@ static int test__bpf_parsing(void *obj_buf __maybe_unused,
 }
 #endif
 
+static char *
+compose_source(void)
+{
+   struct utsname utsname;
+   int version, patchlevel, sublevel, err;
+   unsigned long version_code;
+   char *code;
+
+   if (uname(&utsname))
+   return NULL;
+
+   err = sscanf(utsname.release, "%d.%d.%d",
+   &version, &patchlevel, &sublevel);
+   if (err != 3) {
+   fprintf(stderr, " (Can't get kernel version from uname '%s')",
+   utsname.release);
+   return NULL;
+   }
+
+   version_code = (version << 16) + (patchlevel << 8) + sublevel;
+   err = asprintf(&code, "#define LINUX_VERSION_CODE 0x%08lx;\n%s",
+  

[PATCH 07/31] perf probe: Attach trace_probe_event with perf_probe_event

2015-08-28 Thread Wang Nan
This patch drops the struct __event_package structure. Instead, it adds
trace_probe_event to 'struct perf_probe_event'.

The trace_probe_event information gives further patches a chance to
access the actual probe points and arguments. Using them, bpf_loader
will be able to attach one bpf program to the different probing points
of an inline function (which has multiple probing points) and of glob
functions. Moreover, by reading the argument information, bpf code for
fetching those arguments can be generated.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-22-git-send-email-wangn...@huawei.com
---
 tools/perf/builtin-probe.c|  4 ++-
 tools/perf/util/probe-event.c | 60 +--
 tools/perf/util/probe-event.h |  6 -
 3 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index b81cec3..826d452 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -496,7 +496,9 @@ __cmd_probe(int argc, const char **argv, const char *prefix 
__maybe_unused)
usage_with_options(probe_usage, options);
}
 
-   ret = add_perf_probe_events(params.events, params.nevents);
+   ret = add_perf_probe_events(params.events,
+   params.nevents,
+   true);
if (ret < 0) {
pr_err_with_code("  Error: Failed to add events.", ret);
return ret;
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index eb5f18b..57a7bae 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -1985,6 +1985,9 @@ void clear_perf_probe_event(struct perf_probe_event *pev)
struct perf_probe_arg_field *field, *next;
int i;
 
+   if (pev->ntevs)
+   cleanup_perf_probe_event(pev);
+
free(pev->event);
free(pev->group);
free(pev->target);
@@ -2759,61 +2762,58 @@ static int convert_to_probe_trace_events(struct 
perf_probe_event *pev,
return find_probe_trace_events_from_map(pev, tevs);
 }
 
-struct __event_package {
-   struct perf_probe_event *pev;
-   struct probe_trace_event*tevs;
-   int ntevs;
-};
-
-int add_perf_probe_events(struct perf_probe_event *pevs, int npevs)
+int cleanup_perf_probe_event(struct perf_probe_event *pev)
 {
-   int i, j, ret;
-   struct __event_package *pkgs;
+   int i;
 
-   ret = 0;
-   pkgs = zalloc(sizeof(struct __event_package) * npevs);
+   if (!pev || !pev->ntevs)
+   return 0;
 
-   if (pkgs == NULL)
-   return -ENOMEM;
+   for (i = 0; i < pev->ntevs; i++)
+   clear_probe_trace_event(&pev->tevs[i]);
+
+   zfree(&pev->tevs);
+   pev->ntevs = 0;
+   return 0;
+}
+
+int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
+ bool cleanup)
+{
+   int i, ret;
 
ret = init_symbol_maps(pevs->uprobes);
-   if (ret < 0) {
-   free(pkgs);
+   if (ret < 0)
return ret;
-   }
 
/* Loop 1: convert all events */
for (i = 0; i < npevs; i++) {
-   pkgs[i].pev = &pevs[i];
/* Init kprobe blacklist if needed */
-   if (!pkgs[i].pev->uprobes)
+   if (pevs[i].uprobes)
kprobe_blacklist__init();
/* Convert with or without debuginfo */
-   ret  = convert_to_probe_trace_events(pkgs[i].pev,
-    &pkgs[i].tevs);
-   if (ret < 0)
+   ret  = convert_to_probe_trace_events(&pevs[i], &pevs[i].tevs);
+   if (ret < 0) {
+   cleanup = true;
goto end;
-   pkgs[i].ntevs = ret;
+   }
+   pevs[i].ntevs = ret;
}
/* This just release blacklist only if allocated */
kprobe_blacklist__release();
 
/* Loop 2: add all events */
for (i = 0; i < npevs; i++) {
-   ret = __add_probe_trace_events(pkgs[i].pev, pkgs[i].tevs,
-  pkgs[i].ntevs,
+   ret = __add_probe_trace_events(&pevs[i], pevs[i].tevs,
+  pevs[i].ntevs,
   probe_conf.force_add);
if (ret < 0)
break;
}
 end:
/* Loop 3: cleanup and free trace events  */
-   for (i = 0; i < npevs; i++) {
-   

[PATCH 06/31] perf tools: Enable passing bpf object file to --event

2015-08-28 Thread Wang Nan
By introducing new rules in tools/perf/util/parse-events.[ly], this
patch enables 'perf record --event bpf_file.o' to select events by an
eBPF object file. It calls parse_events_load_bpf() to load that file,
which uses bpf__prepare_load() and finally calls bpf_object__open() for
the object files.

Instead of introducing evsels to the evlist during parsing, events
selected by eBPF object files are appended separately. The reasons are:

 1. During parsing, the probing points have not been initialized.

 2. Currently we are unable to call add_perf_probe_events() twice,
therefore we have to wait until all such events are collected,
then probe all points by one call.

The real probing and selecting resides in the following patches.

To collect '--filter' events, add a dummy evsel during parsing.

Since bpf__prepare_load() may be called during cmdline parsing, all
builtin commands which may call parse_events_option() should release
bpf resources during cleanup. Add bpf__clear() to the stat, record, top
and trace commands, although currently we are going to support 'perf
record' only.

Committer note:

Testing if the event parsing changes indeed call the BPF loading
routines:

  [root@felicio ~]# ls -la foo.o
  ls: cannot access foo.o: No such file or directory
  [root@felicio ~]# perf record --event foo.o sleep
  libbpf: failed to open foo.o: No such file or directory
  bpf: failed to load foo.o
  invalid or unsupported event: 'foo.o'
  Run 'perf list' for a list of valid events

   usage: perf record [] []
  or: perf record [] --  []

  -e, --event <event>   event selector. use 'perf list' to list available events
  [root@felicio ~]#

Yes, it does this time around.

Signed-off-by: Wang Nan 
Acked-by: Alexei Starovoitov 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/n/1436445342-1402-19-git-send-email-wangn...@huawei.com
[ The veprintf() and bpf loader parts were split from this one;
  Add bpf__clear() into stat, record, top and trace commands.
  Add dummy evsel when parsing.
]
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/builtin-record.c|  7 +--
 tools/perf/builtin-stat.c  |  8 ++--
 tools/perf/builtin-top.c   | 10 +++---
 tools/perf/builtin-trace.c |  6 +-
 tools/perf/util/Build  |  1 +
 tools/perf/util/parse-events.c | 40 
 tools/perf/util/parse-events.h |  3 +++
 tools/perf/util/parse-events.l |  3 +++
 tools/perf/util/parse-events.y | 18 +-
 9 files changed, 87 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 81829de..31934b1 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -29,6 +29,7 @@
 #include "util/data.h"
 #include "util/auxtrace.h"
 #include "util/parse-branch-options.h"
+#include "util/bpf-loader.h"
 
 #include 
 #include 
@@ -1131,13 +1132,13 @@ int cmd_record(int argc, const char **argv, const char 
*prefix __maybe_unused)
if (!rec->itr) {
rec->itr = auxtrace_record__init(rec->evlist, &err);
if (err)
-   return err;
+   goto out_bpf_clear;
}
 
err = auxtrace_parse_snapshot_options(rec->itr, &rec->opts,
  rec->opts.auxtrace_snapshot_opts);
if (err)
-   return err;
+   goto out_bpf_clear;
 
err = -ENOMEM;
 
@@ -1200,6 +1201,8 @@ out_symbol_exit:
perf_evlist__delete(rec->evlist);
symbol__exit();
auxtrace_record__free(rec->itr);
+out_bpf_clear:
+   bpf__clear();
return err;
 }
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 99b62f1..d50a19a 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -59,6 +59,7 @@
 #include "util/thread.h"
 #include "util/thread_map.h"
 #include "util/counts.h"
+#include "util/bpf-loader.h"
 
 #include 
 #include 
@@ -1235,7 +1236,8 @@ int cmd_stat(int argc, const char **argv, const char 
*prefix __maybe_unused)
output = fopen(output_name, mode);
if (!output) {
perror("failed to create output file");
-   return -1;
+   status = -1;
+   goto out;
}
clock_gettime(CLOCK_REALTIME, &tv);
fprintf(output, "# started on %s\n", ctime(&tv.tv_sec));
@@ -1244,7 +1246,8 @@ int cmd_stat(int argc, const char **argv, const char 
*prefix __maybe_unused)
output = fdopen(output_fd, mode);
if (!output) {
perror("Failed opening logfd");
-   return -errno;
+   status = 

[PATCH 09/31] perf bpf: Collect 'struct perf_probe_event' for bpf_program

2015-08-28 Thread Wang Nan
This patch utilizes bpf_program__set_private(), binding a
perf_probe_event to a bpf program through the private field.

Saving that information lets 'perf record' know which kprobe point a
program should be attached to.

Since the data in 'struct perf_probe_event' is built in two stages,
pev_ready is used to mark whether the information (especially tevs) in
'struct perf_probe_event' is valid or not. It is false at first, and is
set to true by sync_bpf_program_pev(), which copies all pointers in the
original pev into a program-specific memory region.
sync_bpf_program_pev() is called after add_perf_probe_events() to make
sure the data is ready.

Remove the code which cleans 'struct perf_probe_event' after
bpf__probe(): because the pointers in pevs are copied to the programs'
private fields, calling clear_perf_probe_event() there becomes unsafe.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/n/1436445342-1402-21-git-send-email-wangn...@huawei.com
[Splitted from a larger patch]
---
 tools/perf/util/bpf-loader.c | 90 +++-
 1 file changed, 88 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 435f52e..ae23f6f 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -30,9 +30,35 @@ DEFINE_PRINT_FN(debug, 1)
 
 static bool libbpf_initialized;
 
+struct bpf_prog_priv {
+   /*
+* If pev_ready is false, ppev points to local memory which
+* is only valid inside bpf__probe().
+* pev is valid only when pev_ready.
+*/
+   bool pev_ready;
+   union {
+   struct perf_probe_event *ppev;
+   struct perf_probe_event pev;
+   };
+};
+
+static void
+bpf_prog_priv__clear(struct bpf_program *prog __maybe_unused,
+ void *_priv)
+{
+   struct bpf_prog_priv *priv = _priv;
+
+   /* check if pev is initialized */
+   if (priv && priv->pev_ready)
+   clear_perf_probe_event(&priv->pev);
+   free(priv);
+}
+
 static int
 config_bpf_program(struct bpf_program *prog, struct perf_probe_event *pev)
 {
+   struct bpf_prog_priv *priv = NULL;
const char *config_str;
int err;
 
@@ -74,14 +100,58 @@ config_bpf_program(struct bpf_program *prog, struct 
perf_probe_event *pev)
 
pr_debug("bpf: config '%s' is ok\n", config_str);
 
+   priv = calloc(1, sizeof(*priv));
+   if (!priv) {
+   pr_debug("bpf: failed to alloc memory\n");
+   err = -ENOMEM;
+   goto errout;
+   }
+
+   /*
+* At this very early stage, tevs inside pev are not ready.
+* It becomes usable after add_perf_probe_events() is called.
+* set pev_ready to false so further access read priv->ppev
+* only.
+*/
+   priv->pev_ready = false;
+   priv->ppev = pev;
+
+   err = bpf_program__set_private(prog, priv,
+  bpf_prog_priv__clear);
+   if (err) {
+   pr_debug("bpf: set program private failed\n");
+   err = -ENOMEM;
+   goto errout;
+   }
return 0;
 
 errout:
if (pev)
clear_perf_probe_event(pev);
+   if (priv)
+   free(priv);
return err;
 }
 
+static int
+sync_bpf_program_pev(struct bpf_program *prog)
+{
+   int err;
+   struct bpf_prog_priv *priv;
+   struct perf_probe_event *ppev;
+
+   err = bpf_program__get_private(prog, (void **)&priv);
+   if (err || !priv || priv->pev_ready) {
+   pr_debug("Internal error: sync_bpf_program_pev\n");
+   return -EINVAL;
+   }
+
+   ppev = priv->ppev;
+   memcpy(&priv->pev, ppev, sizeof(*ppev));
+   priv->pev_ready = true;
+   return 0;
+}
+
 int bpf__prepare_load(const char *filename)
 {
struct bpf_object *obj;
@@ -172,11 +242,27 @@ int bpf__probe(void)
/* add_perf_probe_events return negative when fail */
if (err < 0) {
pr_debug("bpf probe: failed to probe events\n");
+   goto out;
} else
is_probed = true;
+
+   /*
+* After add_perf_probe_events, 'struct perf_probe_event' is ready.
+* Only now do copying the program's priv->pev field and freeing
+* the big array allocated before become safe.
+*/
+   bpf_object__for_each_safe(obj, tmp) {
+   bpf_object__for_each_program(prog, obj) {
+   err = sync_bpf_program_pev(prog);
+   if (err)
+   goto out;
+   }
+   }
 out:
-   while (nr_events > 0)
-   clear_perf_probe_event([--nr_events]);
+   /*
+* 

[PATCH 31/31] tools lib traceevent: Support function __get_dynamic_array_len

2015-08-28 Thread Wang Nan
From: He Kuang 

Support the helper function __get_dynamic_array_len() in libtraceevent.
This function is used together with __print_array() or __print_hex(),
but currently it is not an available function in the function list of
process_function().

The total allocated length of the dynamic array is embedded in the top
half of the __data_loc_##item field. This patch adds a new arg type,
PRINT_DYNAMIC_ARRAY_LEN, so that eval_num_arg() can return that length.
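
The encoding the patch relies on is one 32-bit word split into two
halves; a small standalone sketch of the decoding (the field name and
values are illustrative):

 #include <stdio.h>
 #include <stdint.h>

 /* A __data_loc_<item> word packs the dynamic array's total allocated
  * length in the top 16 bits and its offset within the record in the
  * bottom 16 bits, matching the comments in the eval_num_arg() hunk. */
 static void decode_data_loc(uint32_t data_loc)
 {
         unsigned int len    = data_loc >> 16;
         unsigned int offset = data_loc & 0xffff;

         printf("offset=%u len=%u\n", offset, len);
 }

 int main(void)
 {
         decode_data_loc(0x00100040);    /* 16 bytes at offset 0x40 */
         return 0;
 }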

Signed-off-by: He Kuang 
Acked-by: Namhyung Kim 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: pi3or...@163.com
Signed-off-by: Wang Nan 
Link: 
http://lkml.kernel.org/n/1437448130-134621-2-git-send-email-heku...@huawei.com
---
 tools/lib/traceevent/event-parse.c | 56 +-
 tools/lib/traceevent/event-parse.h |  1 +
 .../perf/util/scripting-engines/trace-event-perl.c |  1 +
 .../util/scripting-engines/trace-event-python.c|  1 +
 4 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.c 
b/tools/lib/traceevent/event-parse.c
index 4d88593..1244797 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -848,6 +848,7 @@ static void free_arg(struct print_arg *arg)
free(arg->bitmask.bitmask);
break;
case PRINT_DYNAMIC_ARRAY:
+   case PRINT_DYNAMIC_ARRAY_LEN:
free(arg->dynarray.index);
break;
case PRINT_OP:
@@ -2729,6 +2730,42 @@ process_dynamic_array(struct event_format *event, struct 
print_arg *arg, char **
 }
 
 static enum event_type
+process_dynamic_array_len(struct event_format *event, struct print_arg *arg,
+ char **tok)
+{
+   struct format_field *field;
+   enum event_type type;
+   char *token;
+
+   if (read_expect_type(EVENT_ITEM, &token) < 0)
+   goto out_free;
+
+   arg->type = PRINT_DYNAMIC_ARRAY_LEN;
+
+   /* Find the field */
+   field = pevent_find_field(event, token);
+   if (!field)
+   goto out_free;
+
+   arg->dynarray.field = field;
+   arg->dynarray.index = 0;
+
+   if (read_expected(EVENT_DELIM, ")") < 0)
+   goto out_err;
+
+   type = read_token();
+   *tok = token;
+
+   return type;
+
+ out_free:
+   free_token(token);
+ out_err:
+   *tok = NULL;
+   return EVENT_ERROR;
+}
+
+static enum event_type
 process_paren(struct event_format *event, struct print_arg *arg, char **tok)
 {
struct print_arg *item_arg;
@@ -2975,6 +3012,10 @@ process_function(struct event_format *event, struct 
print_arg *arg,
free_token(token);
return process_dynamic_array(event, arg, tok);
}
+   if (strcmp(token, "__get_dynamic_array_len") == 0) {
+   free_token(token);
+   return process_dynamic_array_len(event, arg, tok);
+   }
 
func = find_func_handler(event->pevent, token);
if (func) {
@@ -3655,14 +3696,25 @@ eval_num_arg(void *data, int size, struct event_format 
*event, struct print_arg
goto out_warning_op;
}
break;
+   case PRINT_DYNAMIC_ARRAY_LEN:
+   offset = pevent_read_number(pevent,
+   data + arg->dynarray.field->offset,
+   arg->dynarray.field->size);
+   /*
+* The total allocated length of the dynamic array is
+* stored in the top half of the field, and the offset
+* is in the bottom half of the 32 bit field.
+*/
+   val = (unsigned long long)(offset >> 16);
+   break;
case PRINT_DYNAMIC_ARRAY:
/* Without [], we pass the address to the dynamic data */
offset = pevent_read_number(pevent,
data + arg->dynarray.field->offset,
arg->dynarray.field->size);
/*
-* The actual length of the dynamic array is stored
-* in the top half of the field, and the offset
+* The total allocated length of the dynamic array is
+* stored in the top half of the field, and the offset
 * is in the bottom half of the 32 bit field.
 */
offset &= 0x;
diff --git a/tools/lib/traceevent/event-parse.h 
b/tools/lib/traceevent/event-parse.h
index 204befb..6fc83c7 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -294,6 +294,7 @@ enum print_arg_type {
PRINT_OP,
PRINT_FUNC,
PRINT_BITMASK,
+   PRINT_DYNAMIC_ARRAY_LEN,
 };
 
 struct print_arg {
diff --git 

[PATCH 24/31] perf tools: Add prologue for BPF programs for fetching arguments

2015-08-28 Thread Wang Nan
This patch generates a prologue for a BPF program which fetches its
arguments. With this patch, the program can have arguments as follows:

 SEC("lock_page=__lock_page page->flags")
 int lock_page(struct pt_regs *ctx, int err, unsigned long flags)
 {
 return 1;
 }

This patch passes at most 3 arguments through r3, r4 and r5. r1 is
still the ctx pointer. r2 is used to indicate whether dereferencing
succeeded.

This patch uses r6 to hold ctx (struct pt_regs) and r7 to hold a stack
pointer for the results. The result of each argument is first stored on
the stack:

 low address
 BPF_REG_FP - 24  ARG3
 BPF_REG_FP - 16  ARG2
 BPF_REG_FP - 8   ARG1
 BPF_REG_FP
 high address

Then loaded into r3, r4 and r5.

The output prologue for offn(...off2(off1(reg))...) should be:

 r6 <- r1   // save ctx into a callee saved register
 r7 <- fp
 r7 <- r7 - stack_offset// pointer to result slot
 /* load r3 with the offset in pt_regs of 'reg' */
 (r7) <- r3 // make slot valid
 r3 <- r3 + off1// prepare to read unsafe pointer
 r2 <- 8
 r1 <- r7   // result put onto stack
 call probe_read// read unsafe pointer
 jnei r0, 0, err// error checking
 r3 <- (r7) // read result
 r3 <- r3 + off2// prepare to read unsafe pointer
 r2 <- 8
 r1 <- r7
 call probe_read
 jnei r0, 0, err
 ...
 /* load r2, r3, r4 from stack */
 goto success
err:
 r2 <- 1
 /* load r3, r4, r5 with 0 */
 goto usercode
success:
 r2 <- 0
usercode:
 r1 <- r6   // restore ctx
 // original user code

If all of the arguments reside in registers (dereferencing is not
required), gen_prologue_fastpath() will be used to create a fast
prologue:

 r3 <- (r1 + offset of reg1)
 r4 <- (r1 + offset of reg2)
 r5 <- (r1 + offset of reg3)
 r2 <- 0

P.S.

eBPF calling convention is defined as:

* r0- return value from in-kernel function, and exit value
  for eBPF program
* r1 - r5   - arguments from eBPF program to in-kernel function
* r6 - r9   - callee saved registers that in-kernel function will
  preserve
* r10   - read-only frame pointer to access stack

Signed-off-by: He Kuang 
Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-35-git-send-email-wangn...@huawei.com
---
 tools/perf/util/Build  |   1 +
 tools/perf/util/bpf-prologue.c | 442 +
 tools/perf/util/bpf-prologue.h |  34 
 3 files changed, 477 insertions(+)
 create mode 100644 tools/perf/util/bpf-prologue.c
 create mode 100644 tools/perf/util/bpf-prologue.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index c0ca4a1..fd2f084 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -84,6 +84,7 @@ libperf-$(CONFIG_AUXTRACE) += intel-bts.o
 libperf-y += parse-branch-options.o
 
 libperf-$(CONFIG_LIBBPF) += bpf-loader.o
+libperf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
 libperf-$(CONFIG_LIBELF) += symbol-elf.o
 libperf-$(CONFIG_LIBELF) += probe-file.o
 libperf-$(CONFIG_LIBELF) += probe-event.o
diff --git a/tools/perf/util/bpf-prologue.c b/tools/perf/util/bpf-prologue.c
new file mode 100644
index 000..2a5f4c7
--- /dev/null
+++ b/tools/perf/util/bpf-prologue.c
@@ -0,0 +1,442 @@
+/*
+ * bpf-prologue.c
+ *
+ * Copyright (C) 2015 He Kuang 
+ * Copyright (C) 2015 Huawei Inc.
+ */
+
+#include 
+#include "perf.h"
+#include "debug.h"
+#include "bpf-prologue.h"
+#include "probe-finder.h"
+#include 
+#include 
+
+#define BPF_REG_SIZE   8
+
+#define JMP_TO_ERROR_CODE  -1
+#define JMP_TO_SUCCESS_CODE-2
+#define JMP_TO_USER_CODE   -3
+
+struct bpf_insn_pos {
+   struct bpf_insn *begin;
+   struct bpf_insn *end;
+   struct bpf_insn *pos;
+};
+
+static inline int
+pos_get_cnt(struct bpf_insn_pos *pos)
+{
+   return pos->pos - pos->begin;
+}
+
+static int
+append_insn(struct bpf_insn new_insn, struct bpf_insn_pos *pos)
+{
+   if (!pos->pos)
+   return -ERANGE;
+
+   if (pos->pos + 1 >= pos->end) {
+   pr_err("bpf prologue: prologue too long\n");
+   pos->pos = NULL;
+   return -ERANGE;
+   }
+
+   *(pos->pos)++ = new_insn;
+   return 0;
+}
+
+static int
+check_pos(struct bpf_insn_pos *pos)
+{
+   if (!pos->pos || pos->pos >= pos->end)
+   return -ERANGE;
+   return 0;
+}
+
+/* Give it a shorter name */
+#define ins(i, p) append_insn((i), (p))
+
+/*
+ * Give a register name (in 'reg'), generate instruction to
+ * load register into an eBPF register rd:
+ *   'ldd target_reg, offset(ctx_reg)', 

[PATCH 28/31] perf probe: Init symbol as kprobe

2015-08-28 Thread Wang Nan
Before this patch, add_perf_probe_events() inits symbol maps only for
uprobes if the first 'struct perf_probe_event' passed to it is a uprobe
event. This is a trick that relies on 'perf probe''s command line
syntax, which constrains the first element of the probe_event array to
be a kprobe if there is any kprobe among them.

However, with the incoming BPF uprobe support, that constraint no
longer holds, since 'perf record' will also probe k/u probes through a
BPF object, and it is possible to pass an array which contains kprobes
but whose first element is a uprobe.

This patch inits symbol maps for kprobes even if all of the events are
uprobes, because the extra cost should be small enough.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-39-git-send-email-wangn...@huawei.com
---
 tools/perf/util/probe-event.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index e720913..b94a8d7 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -2789,7 +2789,7 @@ int add_perf_probe_events(struct perf_probe_event *pevs, 
int npevs,
 {
int i, ret;
 
-   ret = init_symbol_maps(pevs->uprobes);
+   ret = init_symbol_maps(false);
if (ret < 0)
return ret;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 10/31] perf record: Load all eBPF object into kernel

2015-08-28 Thread Wang Nan
This patch utilizes bpf_object__load(), provided by libbpf, to load all
objects into the kernel.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-24-git-send-email-wangn...@huawei.com
---
 tools/perf/builtin-record.c  | 15 +++
 tools/perf/util/bpf-loader.c | 28 
 tools/perf/util/bpf-loader.h | 10 ++
 3 files changed, 53 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 8833186..c335ac5 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1158,6 +1158,21 @@ int cmd_record(int argc, const char **argv, const char 
*prefix __maybe_unused)
goto out_symbol_exit;
}
 
+   /*
+* bpf__probe() also calls symbol__init() if there are probe
+* events in bpf objects, so calling symbol_exit when failure
+* is safe. If there is no probe event, bpf__load() always
+* success.
+*/
+   err = bpf__load();
+   if (err) {
+   pr_err("Loading BPF programs failed:\n");
+
+   bpf__strerror_load(err, errbuf, sizeof(errbuf));
+   pr_err("\t%s\n", errbuf);
+   goto out_symbol_exit;
+   }
+
symbol__init(NULL);
 
if (symbol_conf.kptr_restrict)
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index ae23f6f..d63a594 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -267,6 +267,25 @@ out:
return err < 0 ? err : 0;
 }
 
+int bpf__load(void)
+{
+   struct bpf_object *obj, *tmp;
+   int err = 0;
+
+   bpf_object__for_each_safe(obj, tmp) {
+   err = bpf_object__load(obj);
+   if (err) {
+   pr_debug("bpf: load objects failed\n");
+   goto errout;
+   }
+   }
+   return 0;
+errout:
+   bpf_object__for_each_safe(obj, tmp)
+   bpf_object__unload(obj);
+   return err;
+}
+
 #define bpf__strerror_head(err, buf, size) \
char sbuf[STRERR_BUFSIZE], *emsg;\
if (!size)\
@@ -309,3 +328,12 @@ int bpf__strerror_probe(int err, char *buf, size_t size)
bpf__strerror_end(buf, size);
return 0;
 }
+
+int bpf__strerror_load(int err, char *buf, size_t size)
+{
+   bpf__strerror_head(err, buf, size);
+   bpf__strerror_entry(EINVAL, "%s: add -v to see detail. Run a 
CONFIG_BPF_SYSCALL kernel?",
+   emsg)
+   bpf__strerror_end(buf, size);
+   return 0;
+}
diff --git a/tools/perf/util/bpf-loader.h b/tools/perf/util/bpf-loader.h
index 6b09a85..4d7552e 100644
--- a/tools/perf/util/bpf-loader.h
+++ b/tools/perf/util/bpf-loader.h
@@ -19,6 +19,9 @@ int bpf__probe(void);
 int bpf__unprobe(void);
 int bpf__strerror_probe(int err, char *buf, size_t size);
 
+int bpf__load(void);
+int bpf__strerror_load(int err, char *buf, size_t size);
+
 void bpf__clear(void);
 #else
 static inline int bpf__prepare_load(const char *filename __maybe_unused)
@@ -29,6 +32,7 @@ static inline int bpf__prepare_load(const char *filename 
__maybe_unused)
 
 static inline int bpf__probe(void) { return 0; }
 static inline int bpf__unprobe(void) { return 0; }
+static inline int bpf__load(void) { return 0; }
 static inline void bpf__clear(void) { }
 
 static inline int
@@ -56,5 +60,11 @@ static inline int bpf__strerror_probe(int err __maybe_unused,
 {
return __bpf_strerror(buf, size);
 }
+
+static inline int bpf__strerror_load(int err __maybe_unused,
+char *buf, size_t size)
+{
+   return __bpf_strerror(buf, size);
+}
 #endif
 #endif
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 27/31] perf record: Support custom vmlinux path

2015-08-28 Thread Wang Nan
From: He Kuang 

Make the 'perf record' command support the --vmlinux option if
BPF_PROLOGUE is on.

'perf record' needs vmlinux as the source of DWARF info to generate
prologues for BPF programs, so the path of vmlinux must be specifiable.

The short name 'k' has already been taken by 'clockid', so this patch
skips the short option name and uses '--vmlinux' for the vmlinux path.
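
For illustration (the BPF object name below is made up), a session could
look like:

 # perf record --vmlinux /path/to/vmlinux --event ./test_prologue.c -a sleep 1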

Signed-off-by: He Kuang 
Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-38-git-send-email-wangn...@huawei.com
---
 tools/perf/builtin-record.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 212718c..8eb39d5 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1100,6 +1100,10 @@ struct option __record_options[] = {
   "clang binary to use for compiling BPF scriptlets"),
OPT_STRING(0, "clang-opt", _param.clang_opt, "clang options",
   "options passed to clang when compiling BPF scriptlets"),
+#ifdef HAVE_BPF_PROLOGUE
+   OPT_STRING(0, "vmlinux", _conf.vmlinux_name,
+  "file", "vmlinux pathname"),
+#endif
 #endif
OPT_END()
 };
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 22/31] perf tools: Add BPF_PROLOGUE config options for further patches

2015-08-28 Thread Wang Nan
If both LIBBPF and DWARF are detected, it is possible to create prologues
for eBPF programs to help them access kernel data. HAVE_BPF_PROLOGUE
and CONFIG_BPF_PROLOGUE are added as flags for this feature.

PERF_HAVE_ARCH_GET_REG_INFO indicates that an architecture supports
converting the name of a register to its offset in 'struct pt_regs'.
Without this support, BPF_PROLOGUE should be turned off.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-33-git-send-email-wangn...@huawei.com
---
 tools/perf/config/Makefile   | 12 
 tools/perf/util/include/dwarf-regs.h |  7 +++
 2 files changed, 19 insertions(+)

diff --git a/tools/perf/config/Makefile b/tools/perf/config/Makefile
index 38a4144..d46765b7 100644
--- a/tools/perf/config/Makefile
+++ b/tools/perf/config/Makefile
@@ -314,6 +314,18 @@ ifndef NO_LIBELF
   CFLAGS += -DHAVE_LIBBPF_SUPPORT
   $(call detected,CONFIG_LIBBPF)
 endif
+
+ifndef NO_DWARF
+  ifneq ($(origin PERF_HAVE_ARCH_GET_REG_INFO), undefined)
+CFLAGS += -DHAVE_BPF_PROLOGUE
+$(call detected,CONFIG_BPF_PROLOGUE)
+  else
+msg := $(warning BPF prologue is not supported by architecture 
$(ARCH));
+  endif
+else
+  msg := $(warning DWARF support is off, BPF prologue is disabled);
+endif
+
   endif # NO_LIBBPF
 endif # NO_LIBELF
 
diff --git a/tools/perf/util/include/dwarf-regs.h 
b/tools/perf/util/include/dwarf-regs.h
index 8f14965..3dda083 100644
--- a/tools/perf/util/include/dwarf-regs.h
+++ b/tools/perf/util/include/dwarf-regs.h
@@ -5,4 +5,11 @@
 const char *get_arch_regstr(unsigned int n);
 #endif
 
+#ifdef HAVE_BPF_PROLOGUE
+/*
+ * Arch should support fetching the offset of a register in pt_regs
+ * by its name.
+ */
+int arch_get_reg_info(const char *name, int *offset);
+#endif
 #endif
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 25/31] perf tools: Generate prologue for BPF programs

2015-08-28 Thread Wang Nan
This patch generates a prologue for each 'struct probe_trace_event' to
fetch arguments for BPF programs.

After bpf__probe(), iterate over each program to check whether a
prologue is required: if none of the 'struct probe_trace_event's a
program will attach to has at least one argument, simply skip the
preprocessor hooking. For those that do require a prologue, call
bpf__gen_prologue() and paste the original instructions after the
prologue.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-36-git-send-email-wangn...@huawei.com
---
 tools/perf/util/bpf-loader.c | 120 ++-
 1 file changed, 119 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 95e529b..66d9bea 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -5,10 +5,13 @@
  * Copyright (C) 2015 Huawei Inc.
  */
 
+#include 
 #include 
 #include "perf.h"
 #include "debug.h"
 #include "bpf-loader.h"
+#include "bpf-prologue.h"
+#include "llvm-utils.h"
 #include "probe-event.h"
 #include "probe-finder.h"
 #include "llvm-utils.h"
@@ -42,6 +45,8 @@ struct bpf_prog_priv {
struct perf_probe_event *ppev;
struct perf_probe_event pev;
};
+   bool need_prologue;
+   struct bpf_insn *insns_buf;
 };
 
 static void
@@ -53,6 +58,7 @@ bpf_prog_priv__clear(struct bpf_program *prog __maybe_unused,
/* check if pev is initialized */
if (priv && priv->pev_ready)
clear_perf_probe_event(>pev);
+   zfree(>insns_buf);
free(priv);
 }
 
@@ -239,6 +245,103 @@ int bpf__unprobe(void)
return ret < 0 ? ret : 0;
 }
 
+static int
+preproc_gen_prologue(struct bpf_program *prog, int n,
+struct bpf_insn *orig_insns, int orig_insns_cnt,
+struct bpf_prog_prep_result *res)
+{
+   struct probe_trace_event *tev;
+   struct perf_probe_event *pev;
+   struct bpf_prog_priv *priv;
+   struct bpf_insn *buf;
+   size_t prologue_cnt = 0;
+   int err;
+
+   err = bpf_program__get_private(prog, (void **));
+   if (err || !priv || !priv->pev_ready)
+   goto errout;
+
+   pev = >pev;
+
+   if (n < 0 || n >= pev->ntevs)
+   goto errout;
+
+   tev = >tevs[n];
+
+   buf = priv->insns_buf;
+   err = bpf__gen_prologue(tev->args, tev->nargs,
+   buf, _cnt,
+   BPF_MAXINSNS - orig_insns_cnt);
+   if (err) {
+   const char *title;
+
+   title = bpf_program__title(prog, false);
+   if (!title)
+   title = "??";
+
+   pr_debug("Failed to generate prologue for program %s\n",
+title);
+   return err;
+   }
+
+   memcpy([prologue_cnt], orig_insns,
+  sizeof(struct bpf_insn) * orig_insns_cnt);
+
+   res->new_insn_ptr = buf;
+   res->new_insn_cnt = prologue_cnt + orig_insns_cnt;
+   res->pfd = NULL;
+   return 0;
+
+errout:
+   pr_debug("Internal error in preproc_gen_prologue\n");
+   return -EINVAL;
+}
+
+static int hook_load_preprocessor(struct bpf_program *prog)
+{
+   struct perf_probe_event *pev;
+   struct bpf_prog_priv *priv;
+   bool need_prologue = false;
+   int err, i;
+
+   err = bpf_program__get_private(prog, (void **));
+   if (err || !priv) {
+   pr_debug("Internal error when hook preprocessor\n");
+   return -EINVAL;
+   }
+
+   pev = >pev;
+   for (i = 0; i < pev->ntevs; i++) {
+   struct probe_trace_event *tev = >tevs[i];
+
+   if (tev->nargs > 0) {
+   need_prologue = true;
+   break;
+   }
+   }
+
+   /*
+* Since all tev doesn't have argument, we don't need generate
+* prologue.
+*/
+   if (!need_prologue) {
+   priv->need_prologue = false;
+   return 0;
+   }
+
+   priv->need_prologue = true;
+   priv->insns_buf = malloc(sizeof(struct bpf_insn) *
+   BPF_MAXINSNS);
+   if (!priv->insns_buf) {
+   pr_debug("No enough memory: alloc insns_buf failed\n");
+   return -ENOMEM;
+   }
+
+   err = bpf_program__set_prep(prog, pev->ntevs,
+   preproc_gen_prologue);
+   return err;
+}
+
 int bpf__probe(void)
 {
int err, nr_events = 0;
@@ -289,6 +392,17 @@ int bpf__probe(void)
err = sync_bpf_program_pev(prog);
if (err)
  

[PATCH 26/31] perf tools: Use same BPF program if arguments are identical

2015-08-28 Thread Wang Nan
This patch allows creating only one BPF program instance for different
'probe_trace_event's (tev) generated by one 'perf_probe_event' (pev), if
their prologues are identical.

This is done by comparing the argument lists of the different tevs and
mapping tevs to prologue types through a mapping array. qsort is used to
sort the tevs so that, after sorting, tevs with identical argument lists
are grouped together. For example, three tevs whose argument lists are
(a), (a) and (b) need only two prologue types, mapped as 0, 0 and 1.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-37-git-send-email-wangn...@huawei.com
---
 tools/perf/util/bpf-loader.c | 133 ---
 1 file changed, 126 insertions(+), 7 deletions(-)

diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 66d9bea..a23aaf0 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -47,6 +47,8 @@ struct bpf_prog_priv {
};
bool need_prologue;
struct bpf_insn *insns_buf;
+   int nr_types;
+   int *type_mapping;
 };
 
 static void
@@ -59,6 +61,7 @@ bpf_prog_priv__clear(struct bpf_program *prog __maybe_unused,
if (priv && priv->pev_ready)
clear_perf_probe_event(>pev);
zfree(>insns_buf);
+   zfree(>type_mapping);
free(priv);
 }
 
@@ -255,7 +258,7 @@ preproc_gen_prologue(struct bpf_program *prog, int n,
struct bpf_prog_priv *priv;
struct bpf_insn *buf;
size_t prologue_cnt = 0;
-   int err;
+   int i, err;
 
err = bpf_program__get_private(prog, (void **));
if (err || !priv || !priv->pev_ready)
@@ -263,10 +266,20 @@ preproc_gen_prologue(struct bpf_program *prog, int n,
 
pev = >pev;
 
-   if (n < 0 || n >= pev->ntevs)
+   if (n < 0 || n >= priv->nr_types)
goto errout;
 
-   tev = >tevs[n];
+   /* Find a tev belongs to that type */
+   for (i = 0; i < pev->ntevs; i++)
+   if (priv->type_mapping[i] == n)
+   break;
+
+   if (i >= pev->ntevs) {
+   pr_debug("Internal error: prologue type %d not found\n", n);
+   return -ENOENT;
+   }
+
+   tev = >tevs[i];
 
buf = priv->insns_buf;
err = bpf__gen_prologue(tev->args, tev->nargs,
@@ -297,6 +310,98 @@ errout:
return -EINVAL;
 }
 
+/*
+ * compare_tev_args is reflexive, transitive and antisymmetric.
+ * I can show that but this margin is too narrow to contain.
+ */
+static int compare_tev_args(const void *ptev1, const void *ptev2)
+{
+   int i, ret;
+   const struct probe_trace_event *tev1 =
+   *(const struct probe_trace_event **)ptev1;
+   const struct probe_trace_event *tev2 =
+   *(const struct probe_trace_event **)ptev2;
+
+   ret = tev2->nargs - tev1->nargs;
+   if (ret)
+   return ret;
+
+   for (i = 0; i < tev1->nargs; i++) {
+   struct probe_trace_arg *arg1, *arg2;
+   struct probe_trace_arg_ref *ref1, *ref2;
+
+   arg1 = >args[i];
+   arg2 = >args[i];
+
+   ret = strcmp(arg1->value, arg2->value);
+   if (ret)
+   return ret;
+
+   ref1 = arg1->ref;
+   ref2 = arg2->ref;
+
+   while (ref1 && ref2) {
+   ret = ref2->offset - ref1->offset;
+   if (ret)
+   return ret;
+
+   ref1 = ref1->next;
+   ref2 = ref2->next;
+   }
+
+   if (ref1 || ref2)
+   return ref2 ? 1 : -1;
+   }
+
+   return 0;
+}
+
+static int map_prologue(struct perf_probe_event *pev, int *mapping,
+   int *nr_types)
+{
+   int i, type = 0;
+   struct {
+   struct probe_trace_event *tev;
+   int idx;
+   } *stevs;
+   size_t array_sz = sizeof(*stevs) * pev->ntevs;
+
+   stevs = malloc(array_sz);
+   if (!stevs) {
+   pr_debug("No ehough memory: alloc stevs failed\n");
+   return -ENOMEM;
+   }
+
+   pr_debug("In map_prologue, ntevs=%d\n", pev->ntevs);
+   for (i = 0; i < pev->ntevs; i++) {
+   stevs[i].tev = >tevs[i];
+   stevs[i].idx = i;
+   }
+   qsort(stevs, pev->ntevs, sizeof(*stevs),
+ compare_tev_args);
+
+   for (i = 0; i < pev->ntevs; i++) {
+   if (i == 0) {
+   mapping[stevs[i].idx] = type;
+   pr_debug("mapping[%d]=%d\n", stevs[i].idx,
+type);
+   continue;
+   }
+
+   if (compare_tev_args(stevs + i, stevs + i - 1) == 0)
+

[PATCH 23/31] perf tools: Introduce arch_get_reg_info() for x86

2015-08-28 Thread Wang Nan
From: He Kuang 

arch_get_reg_info() is a helper function which converts a register name
like "%rax" to the offset of that register in 'struct pt_regs', which is
required by the BPF prologue generator.

This patch replaces the original string table with a 'struct reg_info'
table, which records the offset of each register alongside its name.

For x86, since there are two sub-archs (x86_32 and x86_64) but we can
only get pt_regs for the arch we are currently building on, this patch
fills the offset with '-1' for the other sub-arch. This introduces a
limitation to the perf prologue: we are unable to generate prologues on
an x86_32-compiled perf for BPF programs targeting an x86_64 kernel. This
limitation is acceptable, because this is a very rare use case.
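
The hunk below is cut off before the lookup itself; as a rough sketch of
what the new table enables (an assumption based on the table layout and
the arch_regs_table/ARCH_MAX_REGS macros, not the exact code):

 int arch_get_reg_info(const char *name, int *offset)
 {
 	int i;

 	for (i = 0; i < ARCH_MAX_REGS; i++) {
 		if (!strcmp(arch_regs_table[i].name, name)) {
 			/* -1 marks the other sub-arch: no pt_regs info */
 			if (arch_regs_table[i].offset < 0)
 				return -1;
 			*offset = arch_regs_table[i].offset;
 			return 0;
 		}
 	}
 	return -1;	/* unknown register name */
 }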

Signed-off-by: Wang Nan 
Signed-off-by: He Kuang 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-34-git-send-email-wangn...@huawei.com
---
 tools/perf/arch/x86/Makefile  |   1 +
 tools/perf/arch/x86/util/Build|   2 +
 tools/perf/arch/x86/util/dwarf-regs.c | 104 --
 3 files changed, 78 insertions(+), 29 deletions(-)

diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
index 21322e0..a84a6f6f 100644
--- a/tools/perf/arch/x86/Makefile
+++ b/tools/perf/arch/x86/Makefile
@@ -2,3 +2,4 @@ ifndef NO_DWARF
 PERF_HAVE_DWARF_REGS := 1
 endif
 HAVE_KVM_STAT_SUPPORT := 1
+PERF_HAVE_ARCH_GET_REG_INFO := 1
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 2c55e1b..09429f6 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -3,6 +3,8 @@ libperf-y += tsc.o
 libperf-y += pmu.o
 libperf-y += kvm-stat.o
 
+# BPF_PROLOGUE also need dwarf-regs.o. However, if CONFIG_BPF_PROLOGUE
+# is true, CONFIG_DWARF must true.
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 
 libperf-$(CONFIG_LIBUNWIND)  += unwind-libunwind.o
diff --git a/tools/perf/arch/x86/util/dwarf-regs.c 
b/tools/perf/arch/x86/util/dwarf-regs.c
index be22dd4..9928caf 100644
--- a/tools/perf/arch/x86/util/dwarf-regs.c
+++ b/tools/perf/arch/x86/util/dwarf-regs.c
@@ -22,44 +22,67 @@
 
 #include 
 #include 
+#include 
+#include 
+#include  /* for offsetof */
+#include 
+
+struct reg_info {
+   const char  *name;  /* Reg string in debuginfo  */
+   int offset; /* Reg offset in struct pt_regs */
+};
 
 /*
  * Generic dwarf analysis helpers
  */
-
+/*
+ * x86_64 compiling can't access pt_regs for x86_32, so fill offset
+ * with -1.
+ */
+#ifdef __x86_64__
+# define REG_INFO(n, f) { .name = n, .offset = -1, }
+#else
+# define REG_INFO(n, f) { .name = n, .offset = offsetof(struct pt_regs, f), }
+#endif
 #define X86_32_MAX_REGS 8
-const char *x86_32_regs_table[X86_32_MAX_REGS] = {
-   "%ax",
-   "%cx",
-   "%dx",
-   "%bx",
-   "$stack",   /* Stack address instead of %sp */
-   "%bp",
-   "%si",
-   "%di",
+
+struct reg_info x86_32_regs_table[X86_32_MAX_REGS] = {
+   REG_INFO("%ax", eax),
+   REG_INFO("%cx", ecx),
+   REG_INFO("%dx", edx),
+   REG_INFO("%bx", ebx),
+   REG_INFO("$stack", esp),/* Stack address instead of %sp */
+   REG_INFO("%bp", ebp),
+   REG_INFO("%si", esi),
+   REG_INFO("%di", edi),
 };
 
+#undef REG_INFO
+#ifdef __x86_64__
+# define REG_INFO(n, f) { .name = n, .offset = offsetof(struct pt_regs, f), }
+#else
+# define REG_INFO(n, f) { .name = n, .offset = -1, }
+#endif
 #define X86_64_MAX_REGS 16
-const char *x86_64_regs_table[X86_64_MAX_REGS] = {
-   "%ax",
-   "%dx",
-   "%cx",
-   "%bx",
-   "%si",
-   "%di",
-   "%bp",
-   "%sp",
-   "%r8",
-   "%r9",
-   "%r10",
-   "%r11",
-   "%r12",
-   "%r13",
-   "%r14",
-   "%r15",
+struct reg_info x86_64_regs_table[X86_64_MAX_REGS] = {
+   REG_INFO("%ax", rax),
+   REG_INFO("%dx", rdx),
+   REG_INFO("%cx", rcx),
+   REG_INFO("%bx", rbx),
+   REG_INFO("%si", rsi),
+   REG_INFO("%di", rdi),
+   REG_INFO("%bp", rbp),
+   REG_INFO("%sp", rsp),
+   REG_INFO("%r8", r8),
+   REG_INFO("%r9", r9),
+   REG_INFO("%r10",r10),
+   REG_INFO("%r11",r11),
+   REG_INFO("%r12",r12),
+   REG_INFO("%r13",r13),
+   REG_INFO("%r14",r14),
+   REG_INFO("%r15",r15),
 };
 
-/* TODO: switching by dwarf address size */
 #ifdef __x86_64__
 #define ARCH_MAX_REGS X86_64_MAX_REGS
 #define arch_regs_table x86_64_regs_table
@@ -71,5 +94,28 @@ const char *x86_64_regs_table[X86_64_MAX_REGS] = {
 /* Return architecture dependent register string (for kprobe-tracer) */
 

[PATCH 11/31] perf tools: Add bpf_fd field to evsel and config it

2015-08-28 Thread Wang Nan
This patch adds a bpf_fd field to 'struct evsel' and introduces a method
to configure it. In bpf-loader, a bpf__foreach_tev() function is added,
which calls a callback function for each 'struct probe_trace_event' of
each BPF program, together with the program's file descriptor. In
evlist.c, perf_evlist__add_bpf() is introduced to add all BPF events into
the evlist; the event names are taken from the probe_trace_event
structures. 'perf record' calls perf_evlist__add_bpf().

Since bpf-loader.c will not be built if libbpf is turned off, an empty
bpf__foreach_tev() is defined in bpf-loader.h to avoid a compile
error.

This patch iterates over 'struct probe_trace_event' instead of
'struct perf_probe_event' during the loop, in preparation for further
patches, which will generate multiple instances from one BPF program and
install them onto different 'struct probe_trace_event's.
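
As a hedged illustration of the iterator's contract (the callback below
is made up for this example and is not part of the patch):

 /* Illustration only: print each probe event and the fd of its program. */
 static int show_tev_fd(struct probe_trace_event *tev, int fd,
 			void *arg __maybe_unused)
 {
 	pr_debug("event %s:%s -> program fd %d\n",
 		 tev->group, tev->event, fd);
 	return 0;	/* returning non-zero stops the iteration */
 }

 	...
 	err = bpf__foreach_tev(show_tev_fd, NULL);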

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-25-git-send-email-wangn...@huawei.com
---
 tools/perf/builtin-record.c  |  6 ++
 tools/perf/util/bpf-loader.c | 41 +
 tools/perf/util/bpf-loader.h | 13 +
 tools/perf/util/evlist.c | 41 +
 tools/perf/util/evlist.h |  1 +
 tools/perf/util/evsel.c  |  1 +
 tools/perf/util/evsel.h  |  1 +
 7 files changed, 104 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index c335ac5..5051d3b 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1173,6 +1173,12 @@ int cmd_record(int argc, const char **argv, const char 
*prefix __maybe_unused)
goto out_symbol_exit;
}
 
+   err = perf_evlist__add_bpf(rec->evlist);
+   if (err < 0) {
+   pr_err("Failed to add events from BPF object(s)\n");
+   goto out_symbol_exit;
+   }
+
symbol__init(NULL);
 
if (symbol_conf.kptr_restrict)
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index d63a594..126aa71 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -286,6 +286,47 @@ errout:
return err;
 }
 
+int bpf__foreach_tev(bpf_prog_iter_callback_t func, void *arg)
+{
+   struct bpf_object *obj, *tmp;
+   struct bpf_program *prog;
+   int err;
+
+   bpf_object__for_each_safe(obj, tmp) {
+   bpf_object__for_each_program(prog, obj) {
+   struct probe_trace_event *tev;
+   struct perf_probe_event *pev;
+   struct bpf_prog_priv *priv;
+   int i, fd;
+
+   err = bpf_program__get_private(prog,
+  (void **));
+   if (err || !priv) {
+   pr_debug("bpf: failed to get private field\n");
+   return -EINVAL;
+   }
+
+   pev = >pev;
+   for (i = 0; i < pev->ntevs; i++) {
+   tev = >tevs[i];
+
+   fd = bpf_program__fd(prog);
+   if (fd < 0) {
+   pr_debug("bpf: failed to get file 
descriptor\n");
+   return fd;
+   }
+
+   err = func(tev, fd, arg);
+   if (err) {
+   pr_debug("bpf: call back failed, stop 
iterate\n");
+   return err;
+   }
+   }
+   }
+   }
+   return 0;
+}
+
 #define bpf__strerror_head(err, buf, size) \
char sbuf[STRERR_BUFSIZE], *emsg;\
if (!size)\
diff --git a/tools/perf/util/bpf-loader.h b/tools/perf/util/bpf-loader.h
index 4d7552e..34656f8 100644
--- a/tools/perf/util/bpf-loader.h
+++ b/tools/perf/util/bpf-loader.h
@@ -7,10 +7,14 @@
 
 #include 
 #include 
+#include "probe-event.h"
 #include "debug.h"
 
 #define PERF_BPF_PROBE_GROUP "perf_bpf_probe"
 
+typedef int (*bpf_prog_iter_callback_t)(struct probe_trace_event *tev,
+   int fd, void *arg);
+
 #ifdef HAVE_LIBBPF_SUPPORT
 int bpf__prepare_load(const char *filename);
 int bpf__strerror_prepare_load(const char *filename, int err,
@@ -23,6 +27,8 @@ int bpf__load(void);
 int bpf__strerror_load(int err, char *buf, size_t size);
 
 void bpf__clear(void);
+
+int bpf__foreach_tev(bpf_prog_iter_callback_t func, void *arg);
 #else
 static inline int bpf__prepare_load(const char *filename __maybe_unused)
 {
@@ -36,6 +42,13 @@ 

[PATCH 19/31] bpf tools: Load a program with different instances using preprocessor

2015-08-28 Thread Wang Nan
With this patch, a caller of libbpf is able to control the loaded
programs by installing a preprocessor callback for a BPF program. With a
preprocessor, different instances can be created from one BPF program.

This will be used by perf to generate different prologues for different
'struct probe_trace_event' instances matched by one
'struct perf_probe_event'.

bpf_program__set_prep() is added to support this feature. The caller
passes libbpf the number of instances to be created and a preprocessor
function which will be called when doing the real loading. The callback
should return an instruction array for each instance.

The fd field in 'struct bpf_program' is replaced by 'instance', which has
an nr field and an fds array. bpf_program__nth_fd() is introduced to read
the fd of an instance. The old interface bpf_program__fd() is
reimplemented to return the first fd.
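
A minimal usage sketch of the new interface (the preprocessor below is a
trivial placeholder, not perf's real one; only bpf_program__set_prep(),
bpf_program_prep_t and 'struct bpf_prog_prep_result' come from this
patch):

 /* Placeholder preprocessor: every instance reuses the original insns. */
 static int identity_prep(struct bpf_program *prog, int n,
 			 struct bpf_insn *orig_insns, int orig_insns_cnt,
 			 struct bpf_prog_prep_result *res)
 {
 	res->new_insn_ptr = orig_insns;
 	res->new_insn_cnt = orig_insns_cnt;
 	res->pfd = NULL;	/* let libbpf keep the resulting fd itself */
 	return 0;
 }

 	...
 	/* ask libbpf to create 2 instances of 'prog' at load time */
 	err = bpf_program__set_prep(prog, 2, identity_prep);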

Signed-off-by: Wang Nan 
Signed-off-by: He Kuang 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-29-git-send-email-wangn...@huawei.com
[wangnan: Add missing '!',
  allows bpf_program__unload() when prog->instance.nr == -1
]
---
 tools/lib/bpf/libbpf.c | 143 +
 tools/lib/bpf/libbpf.h |  22 
 2 files changed, 156 insertions(+), 9 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4252fc2..6a07b26 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -98,7 +98,11 @@ struct bpf_program {
} *reloc_desc;
int nr_reloc;
 
-   int fd;
+   struct {
+   int nr;
+   int *fds;
+   } instance;
+   bpf_program_prep_t preprocessor;
 
struct bpf_object *obj;
void *priv;
@@ -152,10 +156,24 @@ struct bpf_object {
 
 static void bpf_program__unload(struct bpf_program *prog)
 {
+   int i;
+
if (!prog)
return;
 
-   zclose(prog->fd);
+   /*
+* If the object is opened but the program is never loaded,
+* it is possible that prog->instance.nr == -1.
+*/
+   if (prog->instance.nr > 0) {
+   for (i = 0; i < prog->instance.nr; i++)
+   zclose(prog->instance.fds[i]);
+   } else if (prog->instance.nr != -1)
+   pr_warning("Internal error: instance.nr is %d\n",
+  prog->instance.nr);
+
+   prog->instance.nr = -1;
+   zfree(>instance.fds);
 }
 
 static void bpf_program__exit(struct bpf_program *prog)
@@ -206,7 +224,8 @@ bpf_program__init(void *data, size_t size, char *name, int 
idx,
memcpy(prog->insns, data,
   prog->insns_cnt * sizeof(struct bpf_insn));
prog->idx = idx;
-   prog->fd = -1;
+   prog->instance.fds = NULL;
+   prog->instance.nr = -1;
 
return 0;
 errout:
@@ -795,13 +814,71 @@ static int
 bpf_program__load(struct bpf_program *prog,
  char *license, u32 kern_version)
 {
-   int err, fd;
+   int err = 0, fd, i;
+
+   if (prog->instance.nr < 0 || !prog->instance.fds) {
+   if (prog->preprocessor) {
+   pr_warning("Internal error: can't load program '%s'\n",
+  prog->section_name);
+   return -EINVAL;
+   }
+
+   prog->instance.fds = malloc(sizeof(int));
+   if (!prog->instance.fds) {
+   pr_warning("No enough memory for fds\n");
+   return -ENOMEM;
+   }
+   prog->instance.nr = 1;
+   prog->instance.fds[0] = -1;
+   }
+
+   if (!prog->preprocessor) {
+   if (prog->instance.nr != 1)
+   pr_warning("Program '%s' inconsistent: nr(%d) not 1\n",
+  prog->section_name, prog->instance.nr);
 
-   err = load_program(prog->insns, prog->insns_cnt,
-  license, kern_version, );
-   if (!err)
-   prog->fd = fd;
+   err = load_program(prog->insns, prog->insns_cnt,
+  license, kern_version, );
+   if (!err)
+   prog->instance.fds[0] = fd;
+   goto out;
+   }
+
+   for (i = 0; i < prog->instance.nr; i++) {
+   struct bpf_prog_prep_result result;
+   bpf_program_prep_t preprocessor = prog->preprocessor;
+
+   bzero(, sizeof(result));
+   err = preprocessor(prog, i, prog->insns,
+  prog->insns_cnt, );
+   if (err) {
+   pr_warning("Preprocessing %dth instance of program '%s' 
failed\n",
+   i, 

[PATCH 21/31] perf tools: Move linux/filter.h to tools/include

2015-08-28 Thread Wang Nan
From: He Kuang 

This patch moves filter.h from include/linux/filter.h to
tools/include/linux/filter.h to enable other libraries to use the macros
in it, like the libbpf which will be introduced by further patches.
Currently, the moved filter.h only contains the macros needed by libbpf,
so as not to introduce too many dependencies.

MANIFEST is also updated for 'make perf-*-src-pkg'.

One change:
  the imm field of BPF_EMIT_CALL becomes ((FUNC) - BPF_FUNC_unspec) to
  suit the user space code generator.
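
For reference, the adjusted helper-call macro would look roughly like
this (reconstructed from the description above, not quoted from the
hunk; only the .imm line differs from the kernel copy):

 #define BPF_EMIT_CALL(FUNC)					\
 	((struct bpf_insn) {					\
 		.code  = BPF_JMP | BPF_CALL,			\
 		.dst_reg = 0,					\
 		.src_reg = 0,					\
 		.off   = 0,					\
 		.imm   = ((FUNC) - BPF_FUNC_unspec) })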

Signed-off-by: He Kuang 
Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-32-git-send-email-wangn...@huawei.com
---
 tools/include/linux/filter.h | 237 +++
 tools/perf/MANIFEST  |   1 +
 2 files changed, 238 insertions(+)
 create mode 100644 tools/include/linux/filter.h

diff --git a/tools/include/linux/filter.h b/tools/include/linux/filter.h
new file mode 100644
index 000..11d2b1c
--- /dev/null
+++ b/tools/include/linux/filter.h
@@ -0,0 +1,237 @@
+/*
+ * Linux Socket Filter Data Structures
+ */
+#ifndef __TOOLS_LINUX_FILTER_H
+#define __TOOLS_LINUX_FILTER_H
+
+#include 
+
+/* ArgX, context and stack frame pointer register positions. Note,
+ * Arg1, Arg2, Arg3, etc are used as argument mappings of function
+ * calls in BPF_CALL instruction.
+ */
+#define BPF_REG_ARG1   BPF_REG_1
+#define BPF_REG_ARG2   BPF_REG_2
+#define BPF_REG_ARG3   BPF_REG_3
+#define BPF_REG_ARG4   BPF_REG_4
+#define BPF_REG_ARG5   BPF_REG_5
+#define BPF_REG_CTXBPF_REG_6
+#define BPF_REG_FP BPF_REG_10
+
+/* Additional register mappings for converted user programs. */
+#define BPF_REG_A  BPF_REG_0
+#define BPF_REG_X  BPF_REG_7
+#define BPF_REG_TMPBPF_REG_8
+
+/* BPF program can access up to 512 bytes of stack space. */
+#define MAX_BPF_STACK  512
+
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,\
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU | BPF_OP(OP) | BPF_X,  \
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,\
+   .dst_reg = DST, \
+   .src_reg = 0,   \
+   .off   = 0, \
+   .imm   = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU | BPF_OP(OP) | BPF_K,  \
+   .dst_reg = DST, \
+   .src_reg = 0,   \
+   .off   = 0, \
+   .imm   = IMM })
+
+/* Endianness conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN) \
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU | BPF_END | BPF_SRC(TYPE), \
+   .dst_reg = DST, \
+   .src_reg = 0,   \
+   .off   = 0, \
+   .imm   = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_MOV | BPF_X,   \
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+
+#define 

[PATCH 29/31] perf tools: Support attach BPF program on uprobe events

2015-08-28 Thread Wang Nan
This patch appends new syntax to the BPF object section name to support
probing at uprobe events. Now we can use a BPF program like this:

 SEC(
 "target=/lib64/libc.so.6\n"
 "libcwrite=__write"
 )
 int libcwrite(void *ctx)
 {
 return 1;
 }

In the section name of a program, before the main config string, we can
use 'key=value' style options. Currently the only option key is "target",
which is used for uprobe probing.
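
Assuming the program above is saved as libcwrite.c, a hedged example of
using it would be:

 # perf record --event ./libcwrite.c -a sleep 1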

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-40-git-send-email-wangn...@huawei.com
---
 tools/perf/util/bpf-loader.c | 88 
 1 file changed, 81 insertions(+), 7 deletions(-)

diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index a23aaf0..2735389 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -66,6 +66,84 @@ bpf_prog_priv__clear(struct bpf_program *prog __maybe_unused,
 }
 
 static int
+do_config(const char *key, const char *value,
+ struct perf_probe_event *pev)
+{
+   pr_debug("config bpf program: %s=%s\n", key, value);
+   if (strcmp(key, "target") == 0) {
+   pev->uprobes = true;
+   pev->target = strdup(value);
+   return 0;
+   }
+
+   pr_warning("BPF: WARNING: invalid config option in object: %s=%s\n",
+  key, value);
+   pr_warning("\tHint: Currently only valid option is 'target='\n");
+   return 0;
+}
+
+static const char *
+parse_config_kvpair(const char *config_str, struct perf_probe_event *pev)
+{
+   char *text = strdup(config_str);
+   char *sep, *line;
+   const char *main_str = NULL;
+   int err = 0;
+
+   if (!text) {
+   pr_debug("No enough memory: dup config_str failed\n");
+   return NULL;
+   }
+
+   line = text;
+   while ((sep = strchr(line, '\n'))) {
+   char *equ;
+
+   *sep = '\0';
+   equ = strchr(line, '=');
+   if (!equ) {
+   pr_warning("WARNING: invalid config in BPF object: 
%s\n",
+  line);
+   pr_warning("\tShould be 'key=value'.\n");
+   goto nextline;
+   }
+   *equ = '\0';
+
+   err = do_config(line, equ + 1, pev);
+   if (err)
+   break;
+nextline:
+   line = sep + 1;
+   }
+
+   if (!err)
+   main_str = config_str + (line - text);
+   free(text);
+
+   return main_str;
+}
+
+static int
+parse_config(const char *config_str, struct perf_probe_event *pev)
+{
+   const char *main_str;
+   int err;
+
+   main_str = parse_config_kvpair(config_str, pev);
+   if (!main_str)
+   return -EINVAL;
+
+   err = parse_perf_probe_command(main_str, pev);
+   if (err < 0) {
+   pr_debug("bpf: '%s' is not a valid config string\n",
+config_str);
+   /* parse failed, don't need clear pev. */
+   return -EINVAL;
+   }
+   return 0;
+}
+
+static int
 config_bpf_program(struct bpf_program *prog, struct perf_probe_event *pev)
 {
struct bpf_prog_priv *priv = NULL;
@@ -79,13 +157,9 @@ config_bpf_program(struct bpf_program *prog, struct 
perf_probe_event *pev)
}
 
pr_debug("bpf: config program '%s'\n", config_str);
-   err = parse_perf_probe_command(config_str, pev);
-   if (err < 0) {
-   pr_debug("bpf: '%s' is not a valid config string\n",
-config_str);
-   /* parse failed, don't need clear pev. */
-   return -EINVAL;
-   }
+   err = parse_config(config_str, pev);
+   if (err)
+   return err;
 
if (pev->group && strcmp(pev->group, PERF_BPF_PROBE_GROUP)) {
pr_debug("bpf: '%s': group for event is set and not '%s'.\n",
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 20/31] perf probe: Reset args and nargs for probe_trace_event when failure

2015-08-28 Thread Wang Nan
When a failure occurs in add_probe_trace_event(), the args array in
probe_trace_event is incomplete. Since the information in it may be used
later, this patch frees the allocated memory and sets the pointer to NULL
to avoid a dangling pointer.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-31-git-send-email-wangn...@huawei.com
---
 tools/perf/util/probe-finder.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c
index 29c43c068..5ab9cd6 100644
--- a/tools/perf/util/probe-finder.c
+++ b/tools/perf/util/probe-finder.c
@@ -1228,6 +1228,10 @@ static int add_probe_trace_event(Dwarf_Die *sc_die, 
struct probe_finder *pf)
 
 end:
free(args);
+   if (ret) {
+   tev->nargs = 0;
+   zfree(>args);
+   }
return ret;
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 14/31] perf tools: Suppress probing messages when probing by BPF loading

2015-08-28 Thread Wang Nan
This patch suppresses the messages output by add_perf_probe_events() and
del_perf_probe_events() if they are triggered by BPF loading. Before
this patch, when using 'perf record' with a BPF object/source as event
selector, the following messages are output:

 Added new event:
   perf_bpf_probe:lock_page_ret (on __lock_page%return)
You can now use it in all perf tools, such as:
perf record -e perf_bpf_probe:lock_page_ret -aR sleep 1
 ...
 Removed event: perf_bpf_probe:lock_page_ret

This is misleading, especially 'use it in all perf tools', because the
events will be removed after 'perf record' exits.

In this patch, a 'silent' field is added to probe_conf to control the
output. bpf__{,un}probe() set it to true when calling
{add,del}_perf_probe_events().

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/n/1440151770-129878-12-git-send-email-wangn...@huawei.com
---
 tools/perf/util/bpf-loader.c  |  6 ++
 tools/perf/util/probe-event.c | 17 -
 tools/perf/util/probe-event.h |  1 +
 tools/perf/util/probe-file.c  |  5 -
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index c3bc0a8..77eeb99 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -188,6 +188,7 @@ static bool is_probed;
 int bpf__unprobe(void)
 {
struct strfilter *delfilter;
+   bool old_silent = probe_conf.silent;
int ret;
 
if (!is_probed)
@@ -199,7 +200,9 @@ int bpf__unprobe(void)
return -ENOMEM;
}
 
+   probe_conf.silent = true;
ret = del_perf_probe_events(delfilter);
+   probe_conf.silent = old_silent;
strfilter__delete(delfilter);
if (ret < 0 && is_probed)
pr_debug("Error: failed to delete events: %s\n",
@@ -215,6 +218,7 @@ int bpf__probe(void)
struct bpf_object *obj, *tmp;
struct bpf_program *prog;
struct perf_probe_event *pevs;
+   bool old_silent = probe_conf.silent;
 
pevs = calloc(MAX_PROBES, sizeof(pevs[0]));
if (!pevs)
@@ -235,9 +239,11 @@ int bpf__probe(void)
}
}
 
+   probe_conf.silent = true;
probe_conf.max_probes = MAX_PROBES;
/* Let add_perf_probe_events generates probe_trace_event (tevs) */
err = add_perf_probe_events(pevs, nr_events, false);
+   probe_conf.silent = old_silent;
 
/* add_perf_probe_events return negative when fail */
if (err < 0) {
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 57a7bae..e720913 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -52,7 +52,9 @@
 #define PERFPROBE_GROUP "probe"
 
 bool probe_event_dry_run;  /* Dry run flag */
-struct probe_conf probe_conf;
+struct probe_conf probe_conf = {
+   .silent = false,
+};
 
 #define semantic_error(msg ...) pr_err("Semantic error :" msg)
 
@@ -2192,10 +2194,12 @@ static int show_perf_probe_event(const char *group, 
const char *event,
 
ret = perf_probe_event__sprintf(group, event, pev, module, );
if (ret >= 0) {
-   if (use_stdout)
+   if (use_stdout && !probe_conf.silent)
printf("%s\n", buf.buf);
-   else
+   else if (!probe_conf.silent)
pr_info("%s\n", buf.buf);
+   else
+   pr_debug("%s\n", buf.buf);
}
strbuf_release();
 
@@ -2418,7 +2422,10 @@ static int __add_probe_trace_events(struct 
perf_probe_event *pev,
}
 
ret = 0;
-   pr_info("Added new event%s\n", (ntevs > 1) ? "s:" : ":");
+   if (!probe_conf.silent)
+   pr_info("Added new event%s\n", (ntevs > 1) ? "s:" : ":");
+   else
+   pr_debug("Added new event%s\n", (ntevs > 1) ? "s:" : ":");
for (i = 0; i < ntevs; i++) {
tev = [i];
/* Skip if the symbol is out of .text or blacklisted */
@@ -2454,7 +2461,7 @@ static int __add_probe_trace_events(struct 
perf_probe_event *pev,
warn_uprobe_event_compat(tev);
 
/* Note that it is possible to skip all events because of blacklist */
-   if (ret >= 0 && event) {
+   if (ret >= 0 && event && !probe_conf.silent) {
/* Show how to use the event. */
pr_info("\nYou can now use it in all perf tools, such as:\n\n");
pr_info("\tperf record -e %s:%s -aR sleep 1\n\n", group, event);
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 915f0d8..3ab9c3e 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h

[PATCH 15/31] perf record: Add clang options for compiling BPF scripts

2015-08-28 Thread Wang Nan
Although the previous patch allows setting BPF compiler related options
in perfconfig, in some ad-hoc situations it is still necessary to pass
options through the cmdline. This patch introduces two options to
'perf record' for this purpose: --clang-path and --clang-opt.
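
For example (the paths are illustrative only, and '--clang-opt' must be
passed before '--event'):

 # perf record --clang-path=/usr/local/bin/clang \
   --clang-opt="-DLINUX_VERSION_CODE=0x40200" \
   --event ./bpf-script-example.c -a sleep 1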

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-28-git-send-email-wangn...@huawei.com
---
 tools/perf/builtin-record.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index fd56a5b..212718c 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -30,6 +30,7 @@
 #include "util/auxtrace.h"
 #include "util/parse-branch-options.h"
 #include "util/bpf-loader.h"
+#include "util/llvm-utils.h"
 
 #include 
 #include 
@@ -1094,6 +1095,12 @@ struct option __record_options[] = {
"per thread proc mmap processing timeout in ms"),
OPT_BOOLEAN(0, "switch-events", _switch_events,
"Record context switch events"),
+#ifdef HAVE_LIBBPF_SUPPORT
+   OPT_STRING(0, "clang-path", _param.clang_path, "clang path",
+  "clang binary to use for compiling BPF scriptlets"),
+   OPT_STRING(0, "clang-opt", _param.clang_opt, "clang options",
+  "options passed to clang when compiling BPF scriptlets"),
+#endif
OPT_END()
 };
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 16/31] perf tools: Infrastructure for compiling scriptlets when passing '.c' to --event

2015-08-28 Thread Wang Nan
This patch provides infrastructure for passing source files to --event
directly using:

 # perf record --event bpf-file.c command

This patch does the following:

 1) Allow passing a '.c' file to '--event'. parse_events_load_bpf() is
expanded to let the caller tell it whether the passed file is a source
file or an object.

 2) llvm__compile_bpf() is called to compile the '.c' file, and the result
is saved into memory. bpf_object__open_buffer() is used to load the
in-memory object.

This patch introduces bpf-script-example.c so we can test it manually:

 # perf record --clang-opt "-DLINUX_VERSION_CODE=0x40200" --event 
./bpf-script-example.c sleep 1

Note that '--clang-opt' must be put before '--event'.

Further patches will merge it into a testcase so it can be tested
automatically.

Signed-off-by: Wang Nan 
Signed-off-by: He Kuang 
Acked-by: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/n/1436445342-1402-20-git-send-email-wangn...@huawei.com
[ wangnan: Pass name of source file to bpf_object__open_buffer(). ]
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/tests/bpf-script-example.c | 44 +++
 tools/perf/util/bpf-loader.c  | 25 +++-
 tools/perf/util/bpf-loader.h  | 10 
 tools/perf/util/parse-events.c|  8 +++
 tools/perf/util/parse-events.h|  3 ++-
 tools/perf/util/parse-events.l|  3 +++
 tools/perf/util/parse-events.y| 15 ++--
 7 files changed, 91 insertions(+), 17 deletions(-)
 create mode 100644 tools/perf/tests/bpf-script-example.c

diff --git a/tools/perf/tests/bpf-script-example.c 
b/tools/perf/tests/bpf-script-example.c
new file mode 100644
index 000..410a70b
--- /dev/null
+++ b/tools/perf/tests/bpf-script-example.c
@@ -0,0 +1,44 @@
+#ifndef LINUX_VERSION_CODE
+# error Need LINUX_VERSION_CODE
+# error Example: for 4.2 kernel, put 'clang-opt="-DLINUX_VERSION_CODE=0x40200" 
into llvm section of ~/.perfconfig'
+#endif
+#define BPF_ANY 0
+#define BPF_MAP_TYPE_ARRAY 2
+#define BPF_FUNC_map_lookup_elem 1
+#define BPF_FUNC_map_update_elem 2
+
+static void *(*bpf_map_lookup_elem)(void *map, void *key) =
+   (void *) BPF_FUNC_map_lookup_elem;
+static void *(*bpf_map_update_elem)(void *map, void *key, void *value, int 
flags) =
+   (void *) BPF_FUNC_map_update_elem;
+
+struct bpf_map_def {
+   unsigned int type;
+   unsigned int key_size;
+   unsigned int value_size;
+   unsigned int max_entries;
+};
+
+#define SEC(NAME) __attribute__((section(NAME), used))
+struct bpf_map_def SEC("maps") flip_table = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .max_entries = 1,
+};
+
+SEC("func=sys_epoll_pwait")
+int bpf_func__sys_epoll_pwait(void *ctx)
+{
+   int ind =0;
+   int *flag = bpf_map_lookup_elem(_table, );
+   int new_flag;
+   if (!flag)
+   return 0;
+   /* flip flag and store back */
+   new_flag = !*flag;
+   bpf_map_update_elem(_table, , _flag, BPF_ANY);
+   return new_flag;
+}
+char _license[] SEC("license") = "GPL";
+int _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 77eeb99..c2aafe2 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -11,6 +11,7 @@
 #include "bpf-loader.h"
 #include "probe-event.h"
 #include "probe-finder.h"
+#include "llvm-utils.h"
 
 #define DEFINE_PRINT_FN(name, level) \
 static int libbpf_##name(const char *fmt, ...) \
@@ -152,16 +153,28 @@ sync_bpf_program_pev(struct bpf_program *prog)
return 0;
 }
 
-int bpf__prepare_load(const char *filename)
+int bpf__prepare_load(const char *filename, bool source)
 {
struct bpf_object *obj;
+   int err;
 
if (!libbpf_initialized)
libbpf_set_print(libbpf_warning,
 libbpf_info,
 libbpf_debug);
 
-   obj = bpf_object__open(filename);
+   if (source) {
+   void *obj_buf;
+   size_t obj_buf_sz;
+
+   err = llvm__compile_bpf(filename, _buf, _buf_sz);
+   if (err)
+   return err;
+   obj = bpf_object__open_buffer(obj_buf, obj_buf_sz, filename);
+   free(obj_buf);
+   } else
+   obj = bpf_object__open(filename);
+
if (!obj) {
pr_debug("bpf: failed to load %s\n", filename);
return -EINVAL;
@@ -361,12 +374,12 @@ int bpf__foreach_tev(bpf_prog_iter_callback_t func, void 
*arg)
}\
buf[size - 1] = '\0';
 
-int bpf__strerror_prepare_load(const char *filename, int err,
-  char *buf, size_t size)
+int 

[PATCH 08/31] perf record, bpf: Parse and probe eBPF programs probe points

2015-08-28 Thread Wang Nan
This patch introduces bpf__{un,}probe() functions to enable callers to
create kprobe points based on section names of BPF programs. It parses
the section names of each eBPF program and creates corresponding 'struct
perf_probe_event' structures. The parse_perf_probe_command() function is
used to do the main parsing work.
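
For illustration (the probe point below is chosen arbitrarily), a program
section such as:

 SEC("lock_page=__lock_page")
 int bpf_func__lock_page(void *ctx)
 {
 	return 0;
 }

is parsed into a 'struct perf_probe_event' whose event name is
'lock_page', in group 'perf_bpf_probe', probing at __lock_page.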

The parsing result is stored into an array to satisfy
add_perf_probe_events(), which accepts an array of 'struct
perf_probe_event' and does all the work in one call.

Define PERF_BPF_PROBE_GROUP as "perf_bpf_probe", which will be used as
the group name of all eBPF probing points.

probe_conf.max_probes is set to MAX_PROBES to support glob matching.

Before bpf__probe() ends, the data in each 'struct perf_probe_event' is
cleaned. This will be changed by following patches, because they need the
'struct probe_trace_event's inside them.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/n/1436445342-1402-21-git-send-email-wangn...@huawei.com
Link: 
http://lkml.kernel.org/n/1436445342-1402-23-git-send-email-wangn...@huawei.com
[Merged by two patches]
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/builtin-record.c  |  19 ++-
 tools/perf/util/bpf-loader.c | 133 +++
 tools/perf/util/bpf-loader.h |  13 +
 3 files changed, 164 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 31934b1..8833186 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1140,7 +1140,23 @@ int cmd_record(int argc, const char **argv, const char 
*prefix __maybe_unused)
if (err)
goto out_bpf_clear;
 
-   err = -ENOMEM;
+   /*
+* bpf__probe must be called before symbol__init() because we
+* need init_symbol_maps. If called after symbol__init,
+* symbol_conf.sort_by_name won't take effect.
+*
+* bpf__unprobe() is safe even if bpf__probe() failed, and it
+* also calls symbol__init. Therefore, goto out_symbol_exit
+* is safe when probe failed.
+*/
+   err = bpf__probe();
+   if (err) {
+   bpf__strerror_probe(err, errbuf, sizeof(errbuf));
+
+   pr_err("Probing at events in BPF object failed.\n");
+   pr_err("\t%s\n", errbuf);
+   goto out_symbol_exit;
+   }
 
symbol__init(NULL);
 
@@ -1201,6 +1217,7 @@ out_symbol_exit:
perf_evlist__delete(rec->evlist);
symbol__exit();
auxtrace_record__free(rec->itr);
+   bpf__unprobe();
 out_bpf_clear:
bpf__clear();
return err;
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 88531ea..435f52e 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -9,6 +9,8 @@
 #include "perf.h"
 #include "debug.h"
 #include "bpf-loader.h"
+#include "probe-event.h"
+#include "probe-finder.h"
 
 #define DEFINE_PRINT_FN(name, level) \
 static int libbpf_##name(const char *fmt, ...) \
@@ -28,6 +30,58 @@ DEFINE_PRINT_FN(debug, 1)
 
 static bool libbpf_initialized;
 
+static int
+config_bpf_program(struct bpf_program *prog, struct perf_probe_event *pev)
+{
+   const char *config_str;
+   int err;
+
+   config_str = bpf_program__title(prog, false);
+   if (!config_str) {
+   pr_debug("bpf: unable to get title for program\n");
+   return -EINVAL;
+   }
+
+   pr_debug("bpf: config program '%s'\n", config_str);
+   err = parse_perf_probe_command(config_str, pev);
+   if (err < 0) {
+   pr_debug("bpf: '%s' is not a valid config string\n",
+config_str);
+   /* parse failed, don't need clear pev. */
+   return -EINVAL;
+   }
+
+   if (pev->group && strcmp(pev->group, PERF_BPF_PROBE_GROUP)) {
+   pr_debug("bpf: '%s': group for event is set and not '%s'.\n",
+config_str, PERF_BPF_PROBE_GROUP);
+   err = -EINVAL;
+   goto errout;
+   } else if (!pev->group)
+   pev->group = strdup(PERF_BPF_PROBE_GROUP);
+
+   if (!pev->group) {
+   pr_debug("bpf: strdup failed\n");
+   err = -ENOMEM;
+   goto errout;
+   }
+
+   if (!pev->event) {
+   pr_debug("bpf: '%s': event name is missing\n",
+config_str);
+   err = -EINVAL;
+   goto errout;
+   }
+
+   pr_debug("bpf: config '%s' is ok\n", config_str);
+
+   return 0;
+
+errout:
+   if (pev)
+   clear_perf_probe_event(pev);
+   return err;
+}
+
 int bpf__prepare_load(const char *filename)
 {
struct bpf_object *obj;
@@ -59,6 

[PATCH 04/31] perf tools: Make perf depend on libbpf

2015-08-28 Thread Wang Nan
By adding libbpf into perf's Makefile, this patch enables perf to build
libbpf during the build if libelf is found and neither NO_LIBELF nor
NO_LIBBPF is set. The newly introduced code is similar to the libapi and
libtraceevent building in Makefile.perf.

MANIFEST is also updated for 'make perf-*-src-pkg'.

make_no_libbpf is appended to tools/perf/tests/make.

A 'bpf' feature check is appended to the default FEATURE_TESTS and
FEATURE_DISPLAY, so perf will check the API version of bpf in
/path/to/kernel/include/uapi/linux/bpf.h. This should not fail except
when we are trying to port this code to an old kernel.

Error messages are also updated to notify users that BPF support in
'perf record' is disabled if libelf is missing or the BPF API check
fails.

tools/lib/bpf is added into TAG_FOLDERS to allow us to navigate the
libbpf files when working on perf using tools/perf/tags.
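
For example, BPF support can still be disabled explicitly at build time
(sketch of the usual invocation):

 $ make -C tools/perf NO_LIBBPF=1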

Signed-off-by: Wang Nan 
Acked-by: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/r/1435716878-189507-24-git-send-email-wangn...@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/build/Makefile.feature |  6 --
 tools/perf/MANIFEST  |  3 +++
 tools/perf/Makefile.perf | 19 +--
 tools/perf/config/Makefile   | 19 ++-
 tools/perf/tests/make|  4 +++-
 5 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature
index 2975632..5ec6b37 100644
--- a/tools/build/Makefile.feature
+++ b/tools/build/Makefile.feature
@@ -51,7 +51,8 @@ FEATURE_TESTS ?=  \
timerfd \
libdw-dwarf-unwind  \
zlib\
-   lzma
+   lzma\
+   bpf
 
 FEATURE_DISPLAY ?= \
dwarf   \
@@ -67,7 +68,8 @@ FEATURE_DISPLAY ?=\
libunwind   \
libdw-dwarf-unwind  \
zlib\
-   lzma
+   lzma\
+   bpf
 
 # Set FEATURE_CHECK_(C|LD)FLAGS-all for all FEATURE_TESTS features.
 # If in the future we need per-feature checks/flags for features not
diff --git a/tools/perf/MANIFEST b/tools/perf/MANIFEST
index af009bd..56fe0c9 100644
--- a/tools/perf/MANIFEST
+++ b/tools/perf/MANIFEST
@@ -17,6 +17,7 @@ tools/build
 tools/arch/x86/include/asm/atomic.h
 tools/arch/x86/include/asm/rmwcc.h
 tools/lib/traceevent
+tools/lib/bpf
 tools/lib/api
 tools/lib/bpf
 tools/lib/hweight.c
@@ -67,6 +68,8 @@ arch/*/lib/memset*.S
 include/linux/poison.h
 include/linux/hw_breakpoint.h
 include/uapi/linux/perf_event.h
+include/uapi/linux/bpf.h
+include/uapi/linux/bpf_common.h
 include/uapi/linux/const.h
 include/uapi/linux/swab.h
 include/uapi/linux/hw_breakpoint.h
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index d9863cb..a6a789e 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -145,6 +145,7 @@ AWK = awk
 
 LIB_DIR  = $(srctree)/tools/lib/api/
 TRACE_EVENT_DIR = $(srctree)/tools/lib/traceevent/
+BPF_DIR = $(srctree)/tools/lib/bpf/
 
 # include config/Makefile by default and rule out
 # non-config cases
@@ -180,6 +181,7 @@ strip-libs = $(filter-out -l%,$(1))
 
 ifneq ($(OUTPUT),)
   TE_PATH=$(OUTPUT)
+  BPF_PATH=$(OUTPUT)
 ifneq ($(subdir),)
   LIB_PATH=$(OUTPUT)/../lib/api/
 else
@@ -188,6 +190,7 @@ endif
 else
   TE_PATH=$(TRACE_EVENT_DIR)
   LIB_PATH=$(LIB_DIR)
+  BPF_PATH=$(BPF_DIR)
 endif
 
 LIBTRACEEVENT = $(TE_PATH)libtraceevent.a
@@ -199,6 +202,8 @@ LIBTRACEEVENT_DYNAMIC_LIST_LDFLAGS = -Xlinker 
--dynamic-list=$(LIBTRACEEVENT_DYN
 LIBAPI = $(LIB_PATH)libapi.a
 export LIBAPI
 
+LIBBPF = $(BPF_PATH)libbpf.a
+
 # python extension build directories
 PYTHON_EXTBUILD := $(OUTPUT)python_ext_build/
 PYTHON_EXTBUILD_LIB := $(PYTHON_EXTBUILD)lib/
@@ -251,6 +256,9 @@ export PERL_PATH
 LIB_FILE=$(OUTPUT)libperf.a
 
 PERFLIBS = $(LIB_FILE) $(LIBAPI) $(LIBTRACEEVENT)
+ifndef NO_LIBBPF
+  PERFLIBS += $(LIBBPF)
+endif
 
 # We choose to avoid "if .. else if .. else .. endif endif"
 # because maintaining the nesting to match is a pain.  If
@@ -420,6 +428,13 @@ $(LIBAPI)-clean:
$(call QUIET_CLEAN, libapi)
$(Q)$(MAKE) -C $(LIB_DIR) O=$(OUTPUT) clean >/dev/null
 
+$(LIBBPF): FORCE
+   $(Q)$(MAKE) -C $(BPF_DIR) O=$(OUTPUT) $(OUTPUT)libbpf.a
+
+$(LIBBPF)-clean:
+   $(call QUIET_CLEAN, libbpf)
+   $(Q)$(MAKE) -C $(BPF_DIR) O=$(OUTPUT) clean >/dev/null
+
 help:
@echo 'Perf make targets:'
@echo '  doc- make *all* documentation (see below)'
@@ -459,7 +474,7 @@ INSTALL_DOC_TARGETS += quick-install-doc quick-install-man 
quick-install-html
 $(DOC_TARGETS):

[GIT PULL 00/31] perf tools: filtering events using eBPF programs

2015-08-28 Thread Wang Nan
Hi Arnaldo and Ingo,

Several small problems are fixed based on yesterday's pull request. Please
see below. Since the patch order has changed (the original 20/32 and 32/32 are
dropped), I decided to send all of them again. Sorry for the noise.

In addition, I collected a cross-compiling fix I posted yesterday into this
cset (the last one).

The following changes since commit 2c07144dfce366e21465cc7b0ada9f0b6dc7b7ed:

  perf evlist: Add backpointer for perf_env to evlist (2015-08-28 14:54:14 
-0300)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pi3orama/linux 
tags/perf-ebpf-for-acme-20150829

for you to fetch changes up to d4a337392b3724899a084170d9ea36a8e2392097:

  tools lib traceevent: Support function __get_dynamic_array_len (2015-08-29 
02:57:40 +)


perf BPF related improvements and bugfix:

 - Rebase to Arnaldo's newest perf/core.

 - Fix a missing include in builtin-trace.c.

 - Drop patch 'perf tools: Fix probe-event.h include' since
   the problem has been fixed by commit 5a023b57.

 - Fix a cross-compiling error (introduced by Intel PT).

 - Drop patch 'bpf: Introduce function for outputing data to
   perf event' because we want to do better.

Signed-off-by: Wang Nan 


He Kuang (4):
  perf tools: Move linux/filter.h to tools/include
  perf tools: Introduce arch_get_reg_info() for x86
  perf record: Support custom vmlinux path
  tools lib traceevent: Support function __get_dynamic_array_len

Wang Nan (27):
  bpf tools: New API to get name from a BPF object
  perf tools: Don't set cmdline_group_boundary if no evsel is collected
  perf tools: Introduce dummy evsel
  perf tools: Make perf depend on libbpf
  perf ebpf: Add the libbpf glue
  perf tools: Enable passing bpf object file to --event
  perf probe: Attach trace_probe_event with perf_probe_event
  perf record, bpf: Parse and probe eBPF programs probe points
  perf bpf: Collect 'struct perf_probe_event' for bpf_program
  perf record: Load all eBPF object into kernel
  perf tools: Add bpf_fd field to evsel and config it
  perf tools: Allow filter option to be applied to bof object
  perf tools: Attach eBPF program to perf event
  perf tools: Suppress probing messages when probing by BPF loading
  perf record: Add clang options for compiling BPF scripts
  perf tools: Infrastructure for compiling scriptlets when passing '.c' to 
--event
  perf tests: Enforce LLVM test for BPF test
  perf test: Add 'perf test BPF'
  bpf tools: Load a program with different instances using preprocessor
  perf probe: Reset args and nargs for probe_trace_event when failure
  perf tools: Add BPF_PROLOGUE config options for further patches
  perf tools: Add prologue for BPF programs for fetching arguments
  perf tools: Generate prologue for BPF programs
  perf tools: Use same BPF program if arguments are identical
  perf probe: Init symbol as kprobe
  perf tools: Support attach BPF program on uprobe events
  perf tools: Fix cross compiling error

 tools/build/Makefile.feature   |   6 +-
 tools/include/linux/filter.h   | 237 +++
 tools/lib/bpf/libbpf.c | 168 -
 tools/lib/bpf/libbpf.h |  26 +-
 tools/lib/traceevent/event-parse.c |  56 +-
 tools/lib/traceevent/event-parse.h |   1 +
 tools/perf/MANIFEST|   4 +
 tools/perf/Makefile.perf   |  19 +-
 tools/perf/arch/x86/Makefile   |   1 +
 tools/perf/arch/x86/util/Build |   2 +
 tools/perf/arch/x86/util/dwarf-regs.c  | 104 ++-
 tools/perf/builtin-probe.c |   4 +-
 tools/perf/builtin-record.c|  64 +-
 tools/perf/builtin-stat.c  |   9 +-
 tools/perf/builtin-top.c   |  11 +-
 tools/perf/builtin-trace.c |   7 +-
 tools/perf/config/Makefile |  31 +-
 tools/perf/tests/Build |  10 +-
 tools/perf/tests/bpf-script-example.c  |  44 ++
 tools/perf/tests/bpf.c | 170 +
 tools/perf/tests/builtin-test.c|  12 +
 tools/perf/tests/llvm.c| 125 +++-
 tools/perf/tests/llvm.h|  15 +
 tools/perf/tests/make  |   4 +-
 tools/perf/tests/tests.h   |   3 +
 tools/perf/util/Build  |   4 +-
 tools/perf/util/bpf-loader.c   | 730 +
 tools/perf/util/bpf-loader.h   |  95 +++
 tools/perf/util/bpf-prologue.c   

Re: Problems loading firmware using built-in drivers with kernels that use initramfs.

2015-08-28 Thread Ming Lei
On Sat, Aug 29, 2015 at 9:11 AM, Luis R. Rodriguez  wrote:
> On Thu, Aug 27, 2015 at 08:55:13AM +0800, Ming Lei wrote:
>> On Thu, Aug 27, 2015 at 2:07 AM, Linus Torvalds
>>  wrote:
>> > On Wed, Aug 26, 2015 at 1:06 AM, Liam Girdwood
>> >  wrote:
>> >>
>> >> I think the options are to either :-
>> >>
>> >> 1) Don not support audio DSP drivers using topology data as built-in
>> >> drivers. Audio is not really a critical system required for booting
>> >> anyway.
>> >
>> > Yes, forcing it to be a module and not letting people compile it in by
>> > mistake (and then not have it work) is an option.
>> >
>> > That said, there are situations where people don't want to use
>> > modules. I used to eschew them for security reasons, for example - now
>> > I instead just do a one-time temporary key. But others may have other
>> > reasons to try to avoid modules.
>> >
>> >> 2) Create a default PCM for every driver that has topology data on the
>> >> assumption that every sound card will at least 1 PCM. This PCM can then
>> >> be re-configured when the FW is loaded.
>> >
>> > That would seem to be the better option if it is reasonably implementable.
>> >
>> > Of course, some kind of timer-based retry (limited *somehow*) of the
>> > fw loading could work too, but smells really really hacky.
>>
>> Yeah, years ago, we discussed to use -EPROBE_DEFER for the situation,
>> which should be one kind of fix, but looks there were objections at that 
>> time.
>
> That would still be a hack. I'll note there is also asynchronous probe support
> now but to use that would also be a hack for this issue. We don't want to

If we think of firmware as one kind of resource, like regulators, GPIOs and
others, -EPROBE_DEFER is a good match for the firmware loading case; it has
been used by lots of drivers already, so why can't it be used for firmware
loading too?

One problem is that we need to convert drivers to return -EPROBE_DEFER
when the firmware request fails. That may involve some work, but it
should be mechanical.
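
The conversion I mean is basically of this shape (just an illustrative
sketch -- the driver name, firmware name and error handling are made up):

    static int foo_probe(struct platform_device *pdev)
    {
            const struct firmware *fw;
            int err;

            err = request_firmware(&fw, "foo/foo-dsp.bin", &pdev->dev);
            if (err)
                    /* fw file not reachable yet, e.g. before pivot_root() */
                    return -EPROBE_DEFER;

            /* ... parse topology data, register PCMs ... */
            release_firmware(fw);
            return 0;
    }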

> encourage folks to go down that road.  They'd be hacks for this issue as you
> are simply delaying the driver probe for a later time and there is no 
> guarantee
> that any pivot_root() might have already been completed later to ensure your
> driver's fw file is present. So it may work or it may not.

We can trigger deferred probing explicitly once the root fs is set up or
some other condition is met.

>
> We should instead strive to be clear about expectations and requirements both
> through documentation and when possible through APIs. I'll send out an RFC
> which adds some grammar rules which can help us police this. I currently only
> spot two drivers that require fixing.
>
>   Luis


Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

2015-08-28 Thread Williams, Dan J
On Fri, 2015-08-28 at 15:48 -0600, Toshi Kani wrote:
> On Fri, 2015-08-28 at 14:47 -0700, Dan Williams wrote:
> > On Fri, Aug 28, 2015 at 2:41 PM, Toshi Kani  wrote:
> > > On Wed, 2015-08-26 at 21:34 +, Williams, Dan J wrote:
> > [..]
> > > > -#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
> > > 
> > > Should it be better to do:
> > > 
> > > #else   /* !CONFIG_ARCH_HAS_PMEM_API */
> > > #define ARCH_MEMREMAP_PMEM MEMREMAP_WT
> > > 
> > > so that you can remove all '#ifdef ARCH_MEMREMAP_PMEM' stuff?
> > 
> > Yeah, that seems like a nice incremental cleanup for memremap_pmem()
> > to just unconditionally use ARCH_MEMREMAP_PMEM, feel free to send it
> > along.
> 
> OK. Will do.
> 

Here's the re-worked patch with Toshi's fixes folded in:

8<-
Subject: x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB

From: Dan Williams 

Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM.  This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available.  When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fall back for persistent memory handling remains.  Namely,
map it with the WT (write-through) cache-type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum "writes are visible
outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()".  Code
that cares whether wmb_pmem() actually flushes writes to pmem must now
call arch_has_wmb_pmem() directly.
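
The minimum pattern a pmem consumer can rely on is thus roughly the
following (illustrative sketch only; 'dst', 'src', 'len' and 'dev' are
placeholders, not code from this patch):

    if (arch_has_pmem_api()) {
            memcpy_to_pmem(dst, src, len);
            wmb_pmem();     /* writes now visible outside the cache hierarchy */
    }

    if (!arch_has_wmb_pmem())
            dev_warn(dev, "unable to guarantee persistence of writes\n");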

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Reviewed-by: Ross Zwisler 
[hch: set ARCH_HAS_PMEM_API=n on x86_32]
Reviewed-by: Christoph Hellwig 
[toshi: x86_32 compile fixes]
Signed-off-by: Toshi Kani 
Signed-off-by: Dan Williams 
---
 arch/x86/Kconfig|2 +-
 arch/x86/include/asm/pmem.h |9 +
 drivers/acpi/nfit.c |3 ++-
 drivers/nvdimm/pmem.c   |2 +-
 include/linux/pmem.h|   36 ++--
 5 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 03ab6122325a..ef4c6bbb3af1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,7 +27,7 @@ config X86
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_PMEM_API
+   select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_SG_CHAIN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index bb026c5adf8a..d8ce3ec816ab 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -18,8 +18,6 @@
 #include 
 #include 
 
-#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
-
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 /**
  * arch_memcpy_to_pmem - copy data to persistent memory
@@ -143,18 +141,13 @@ static inline void arch_clear_pmem(void __pmem *addr, 
size_t size)
__arch_wb_cache_pmem(vaddr, size);
 }
 
-static inline bool arch_has_wmb_pmem(void)
+static inline bool __arch_has_wmb_pmem(void)
 {
-#ifdef CONFIG_X86_64
/*
 * We require that wmb() be an 'sfence', that is only guaranteed on
 * 64-bit builds
 */
return static_cpu_has(X86_FEATURE_PCOMMIT);
-#else
-   return false;
-#endif
 }
 #endif /* CONFIG_ARCH_HAS_PMEM_API */
-
 #endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 56fff0141636..f61e69fa2ad1 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "nfit.h"
 
 /*
@@ -1371,7 +1372,7 @@ static int acpi_nfit_blk_region_enable(struct nvdimm_bus 
*nvdimm_bus,
return -ENOMEM;
}
 
-   if (!arch_has_pmem_api() && !nfit_blk->nvdimm_flush)
+   if (!arch_has_wmb_pmem() && !nfit_blk->nvdimm_flush)
dev_warn(dev, "unable to guarantee persistence of writes\n");
 
if (mmio->line_size == 0)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b5b9cb758b6..20bf122328da 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -125,7 +125,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
-   if (!arch_has_pmem_api())
+   if (!arch_has_wmb_pmem())
dev_warn(dev, "unable to 

Re: [PATCH RFC RFT 0/3] clk: detect per-user enable imbalances and implement hand-off

2015-08-28 Thread Maxime Ripard
Hi Mike,

On Tue, Aug 25, 2015 at 02:50:51PM -0700, Michael Turquette wrote:
> Quoting Maxime Ripard (2015-08-20 08:15:10)
> > On Tue, Aug 18, 2015 at 09:43:56AM -0700, Michael Turquette wrote:
> > > Quoting Maxime Ripard (2015-08-18 08:45:52)
> > > > Hi Mike,
> > > > 
> > > > On Fri, Aug 07, 2015 at 12:09:27PM -0700, Michael Turquette wrote:
> > > > > All of the other kitchen sink stuff (DT binding, passing the flag back
> > > > > to the framework when the clock consumer driver calls clk_put) was 
> > > > > left
> > > > > out because I do not see a real use case for it. If one can 
> > > > > demonstrate
> > > > > a real use case (and not a hypothetical one) then this patch series 
> > > > > can
> > > > > be expanded further.
> > > > 
> > > > I think there is a very trivial use case for passing back the
> > > > reference to the framework, if during the probed, we have something
> > > > like:
> > > > 
> > > > clk = clk_get()
> > > > clk_prepare_enable(clk)
> > > > foo_framework_register()
> > > > 
> > > > if foo_framework_register fails, the sensible thing to do would be to
> > > > call clk_disable_unprepare. If the clock was a critical clock, you
> > > > just gated it.
> > > 
> > > Hmm, a good point. Creating the "pass the reference back" call is not
> > > hard technically. But how to keep from abusing it? E.g. I do not want
> > > that call to become an alternative to correct use of clk_enable.
> > > 
> > > Maybe I'll need a Coccinelle script or just some regular sed to
> > > occasionally search for new users of this api and audit them?
> > > 
> > > I was hoping to not add any new consumer api at all :-/
> > 
> > I don't think there's any abuse that can be done with the current API,
> > nor do I think you need to have new functions either.
> > 
> > If the clock is critical, when the customer calls
> > clk_unprepare_disable on it, simply take back the reference you gave
> > in the framework, and you're done. Or am I missing something?
> 
> Maybe I am the one missing something? My goal was to allow the consumer
> driver to gate the critical clock. So we need clk_disable_unused to
> actually disable the clock for that to work.

Yeah, but I guess consumer-driver clock gating is not the default
mode of operation.

Under normal circumstances, it should just leave the clock enabled
all the time.

> I think you are suggesting that clk_disable_unused should *not* disable
> the clock if it is critical. Can you confirm that?

By default, yes.

Now, we also have the case of a knowledgeable driver wanting to force the
clock gating. I think that's an orthogonal issue: we might have the same
use case for non-critical clocks, and since it's hard to get that done
with the current API, and we don't really know what a knowledgeable
driver will look like at this point, maybe we can just delay this
entirely until we actually have one in front of us?

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com




Re: [PATCH RFC RFT 0/3] clk: detect per-user enable imbalances and implement hand-off

2015-08-28 Thread Maxime Ripard
On Wed, Aug 26, 2015 at 07:54:23AM +0100, Lee Jones wrote:
> On Tue, 25 Aug 2015, Michael Turquette wrote:
> 
> > Quoting Maxime Ripard (2015-08-20 08:15:10)
> > > On Tue, Aug 18, 2015 at 09:43:56AM -0700, Michael Turquette wrote:
> > > > Quoting Maxime Ripard (2015-08-18 08:45:52)
> > > > > Hi Mike,
> > > > > 
> > > > > On Fri, Aug 07, 2015 at 12:09:27PM -0700, Michael Turquette wrote:
> > > > > > All of the other kitchen sink stuff (DT binding, passing the flag 
> > > > > > back
> > > > > > to the framework when the clock consumer driver calls clk_put) was 
> > > > > > left
> > > > > > out because I do not see a real use case for it. If one can 
> > > > > > demonstrate
> > > > > > a real use case (and not a hypothetical one) then this patch series 
> > > > > > can
> > > > > > be expanded further.
> > > > > 
> > > > > I think there is a very trivial use case for passing back the
> > > > > reference to the framework, if during the probed, we have something
> > > > > like:
> > > > > 
> > > > > clk = clk_get()
> > > > > clk_prepare_enable(clk)
> > > > > foo_framework_register()
> > > > > 
> > > > > if foo_framework_register fails, the sensible thing to do would be to
> > > > > call clk_disable_unprepare. If the clock was a critical clock, you
> > > > > just gated it.
> > > > 
> > > > Hmm, a good point. Creating the "pass the reference back" call is not
> > > > hard technically. But how to keep from abusing it? E.g. I do not want
> > > > that call to become an alternative to correct use of clk_enable.
> > > > 
> > > > Maybe I'll need a Coccinelle script or just some regular sed to
> > > > occasionally search for new users of this api and audit them?
> > > > 
> > > > I was hoping to not add any new consumer api at all :-/
> > > 
> > > I don't think there's any abuse that can be done with the current API,
> > > nor do I think you need to have new functions either.
> > > 
> > > If the clock is critical, when the customer calls
> > > clk_unprepare_disable on it, simply take back the reference you gave
> > > in the framework, and you're done. Or am I missing something?
> > 
> > Maybe I am the one missing something? My goal was to allow the consumer
> > driver to gate the critical clock. So we need clk_disable_unused to
> > actually disable the clock for that to work.
> > 
> > I think you are suggesting that clk_disable_unused should *not* disable
> > the clock if it is critical. Can you confirm that?
> 
> My take is that a critical clock should only be disabled when a
> knowledgeable driver wants to gate it for a specific purpose [probably
> using clk_disable()].  Once the aforementioned driver no longer has a
> use for the clock [whether that happens with clk_unprepare_disable()
> or clk_put() ...] the clock should be ungated and be provided with
> critical status once more.

Agreed.

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com




[PATCH RFC tip/core/rcu 6/9] locking/percpu-rwsem: Make use of the rcu_sync infrastructure

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

Currently down_write()/up_write() call synchronize_sched_expedited()
twice, which is evil.  Change this code to rely on the rcu_sync primitives.
This avoids the _expedited "big hammer", and it can be faster in
the contended case or even in the case when a single thread does
down_write()/up_write() in a loop.

Of course, a single down_write() will take more time, but on the other
hand it will be much more friendly to the whole system.

To simplify review, this patch doesn't update the comments; they are
fixed up by the next change.
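
Condensed from the diff below, the writer-side change is roughly:

    /* before */
    void percpu_down_write(struct percpu_rw_semaphore *brw)
    {
            atomic_inc(&brw->write_ctr);            /* readers -> slow path */
            synchronize_sched_expedited();          /* the "big hammer" */
            down_write(&brw->rw_sem);
            /* ... */
    }

    /* after */
    void percpu_down_write(struct percpu_rw_semaphore *brw)
    {
            rcu_sync_enter(&brw->rss);              /* normal grace period */
            down_write(&brw->rw_sem);
            /* ... */
    }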

Signed-off-by: Oleg Nesterov 
Signed-off-by: Paul E. McKenney 
---
 include/linux/percpu-rwsem.h  |  3 ++-
 kernel/locking/percpu-rwsem.c | 18 +++---
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 3e88c9a7d57f..1ab2cf130816 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -5,11 +5,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct percpu_rw_semaphore {
+   struct rcu_sync rss;
unsigned int __percpu   *fast_read_ctr;
-   atomic_t write_ctr;
struct rw_semaphore rw_sem;
atomic_t slow_read_ctr;
wait_queue_head_t   write_waitq;
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 67a758df1d7c..7abc0e150a22 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
 
/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
__init_rwsem(&brw->rw_sem, name, rwsem_key);
-   atomic_set(&brw->write_ctr, 0);
+   rcu_sync_init(&brw->rss, RCU_SCHED_SYNC);
atomic_set(&brw->slow_read_ctr, 0);
init_waitqueue_head(&brw->write_waitq);
return 0;
@@ -32,6 +32,7 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
if (!brw->fast_read_ctr)
return;
 
+   rcu_sync_dtor(&brw->rss);
free_percpu(brw->fast_read_ctr);
brw->fast_read_ctr = NULL; /* catch use after free bugs */
 }
@@ -61,13 +62,12 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
  */
 static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
 {
-   bool success = false;
+   bool success;
 
preempt_disable();
-   if (likely(!atomic_read(&brw->write_ctr))) {
+   success = rcu_sync_is_idle(&brw->rss);
+   if (likely(success))
__this_cpu_add(*brw->fast_read_ctr, val);
-   success = true;
-   }
preempt_enable();
 
return success;
@@ -133,8 +133,6 @@ static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
  */
 void percpu_down_write(struct percpu_rw_semaphore *brw)
 {
-   /* tell update_fast_ctr() there is a pending writer */
-   atomic_inc(>write_ctr);
/*
 * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
 *so that update_fast_ctr() can't succeed.
@@ -146,7 +144,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
 *fast-path, it executes a full memory barrier before we return.
 *See R_W case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
+   rcu_sync_enter(&brw->rss);
 
/* exclude other writers, and block the new readers completely */
down_write(&brw->rw_sem);
@@ -166,7 +164,5 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 * Insert the barrier before the next fast-path in down_read,
 * see W_R case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
-   /* the last writer unblocks update_fast_ctr() */
-   atomic_dec(>write_ctr);
+   rcu_sync_exit(&brw->rss);
 }
-- 
1.8.1.5



[PATCH RFC tip/core/rcu 2/9] rcu_sync: Simplify rcu_sync using new rcu_sync_ops structure

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

This commit adds the new struct rcu_sync_ops which holds sync/call
methods, and turns the function pointers in rcu_sync_struct into an array
of struct rcu_sync_ops.  This simplifies the "init" helpers by collapsing
a switch statement and explicit multiple definitions into a simple
assignment and a helper macro, respectively.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_sync.h | 60 +++-
 kernel/rcu/sync.c| 42 +
 2 files changed, 45 insertions(+), 57 deletions(-)

diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
index cb044df2e21c..c6d2272c4459 100644
--- a/include/linux/rcu_sync.h
+++ b/include/linux/rcu_sync.h
@@ -26,6 +26,8 @@
#include <linux/wait.h>
#include <linux/rcupdate.h>
 
+enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
+
 /* Structure to mediate between updaters and fastpath-using readers.  */
 struct rcu_sync {
int gp_state;
@@ -35,43 +37,9 @@ struct rcu_sync {
int cb_state;
struct rcu_head cb_head;
 
-   void (*sync)(void);
-   void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+   enum rcu_sync_type  gp_type;
 };
 
-#define ___RCU_SYNC_INIT(name) \
-   .gp_state = 0,  \
-   .gp_count = 0,  \
-   .gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait), \
-   .cb_state = 0
-
-#define __RCU_SCHED_SYNC_INIT(name) {  \
-   ___RCU_SYNC_INIT(name), \
-   .sync = synchronize_sched,  \
-   .call = call_rcu_sched, \
-}
-
-#define __RCU_BH_SYNC_INIT(name) { \
-   ___RCU_SYNC_INIT(name), \
-   .sync = synchronize_rcu_bh, \
-   .call = call_rcu_bh,\
-}
-
-#define __RCU_SYNC_INIT(name) {
\
-   ___RCU_SYNC_INIT(name), \
-   .sync = synchronize_rcu,\
-   .call = call_rcu,   \
-}
-
-#define DEFINE_RCU_SCHED_SYNC(name)\
-   struct rcu_sync name = __RCU_SCHED_SYNC_INIT(name)
-
-#define DEFINE_RCU_BH_SYNC(name)   \
-   struct rcu_sync name = __RCU_BH_SYNC_INIT(name)
-
-#define DEFINE_RCU_SYNC(name)  \
-   struct rcu_sync name = __RCU_SYNC_INIT(name)
-
 /**
  * rcu_sync_is_idle() - Are readers permitted to use their fastpaths?
  * @rsp: Pointer to rcu_sync structure to use for synchronization
@@ -85,10 +53,28 @@ static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
return !rsp->gp_state; /* GP_IDLE */
 }
 
-enum rcu_sync_type { RCU_SYNC, RCU_SCHED_SYNC, RCU_BH_SYNC };
-
 extern void rcu_sync_init(struct rcu_sync *, enum rcu_sync_type);
 extern void rcu_sync_enter(struct rcu_sync *);
 extern void rcu_sync_exit(struct rcu_sync *);
 
+#define __RCU_SYNC_INITIALIZER(name, type) {   \
+   .gp_state = 0,  \
+   .gp_count = 0,  \
+   .gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait), \
+   .cb_state = 0,  \
+   .gp_type = type,\
+   }
+
+#define __DEFINE_RCU_SYNC(name, type)   \
+   struct rcu_sync_struct name = __RCU_SYNC_INITIALIZER(name, type)
+
+#define DEFINE_RCU_SYNC(name)  \
+   __DEFINE_RCU_SYNC(name, RCU_SYNC)
+
+#define DEFINE_RCU_SCHED_SYNC(name)\
+   __DEFINE_RCU_SYNC(name, RCU_SCHED_SYNC)
+
+#define DEFINE_RCU_BH_SYNC(name)   \
+   __DEFINE_RCU_SYNC(name, RCU_BH_SYNC)
+
 #endif /* _LINUX_RCU_SYNC_H_ */
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index 0a11df43be23..5a9aa4c394f1 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -23,6 +23,24 @@
 #include 
 #include 
 
+static const struct {
+   void (*sync)(void);
+   void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+} gp_ops[] = {
+   [RCU_SYNC] = {
+   .sync = synchronize_rcu,
+   .call = call_rcu,
+   },
+   [RCU_SCHED_SYNC] = {
+   .sync = synchronize_sched,
+   .call = call_rcu_sched,
+   },
+   [RCU_BH_SYNC] = {
+   .sync = synchronize_rcu_bh,
+   .call = call_rcu_bh,
+ 

[PATCH RFC tip/core/rcu 1/9] rcu: Create rcu_sync infrastructure

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

The rcu_sync infrastructure can be thought of as a building block for
implementing reader-writer primitives that have extremely lightweight
readers during times when there are no writers.  The first use is in
the percpu_rwsem used by the VFS subsystem.

This infrastructure is functionally equivalent to

struct rcu_sync_struct {
atomic_t counter;
};

/* Check possibility of fast-path read-side operations. */
static inline bool rcu_sync_is_idle(struct rcu_sync_struct *rss)
{
return atomic_read(>counter) == 0;
}

/* Tell readers to use slowpaths. */
static inline void rcu_sync_enter(struct rcu_sync_struct *rss)
{
atomic_inc(>counter);
synchronize_sched();
}

/* Allow readers to once again use fastpaths. */
static inline void rcu_sync_exit(struct rcu_sync_struct *rss)
{
synchronize_sched();
atomic_dec(>counter);
}

The main difference is that it records the state and only calls
synchronize_sched() if required.  At least some of the calls to
synchronize_sched() will be optimized away when rcu_sync_enter() and
rcu_sync_exit() are invoked repeatedly in quick succession.
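
A minimal usage sketch (modeled on the percpu_rwsem conversion later in
this series, not itself part of this patch) looks like:

    static struct rcu_sync rss;         /* rcu_sync_init(&rss, RCU_SCHED_SYNC) */
    static int __percpu *fast_ctr;

    static bool reader_fastpath(void)   /* extremely lightweight reader */
    {
            bool idle;

            preempt_disable();          /* rcu-sched read-side section */
            idle = rcu_sync_is_idle(&rss);
            if (likely(idle))
                    __this_cpu_inc(*fast_ctr);
            preempt_enable();
            return idle;                /* false: caller takes the slow path */
    }

    static void writer_enter(void)
    {
            rcu_sync_enter(&rss);       /* force readers onto the slow path */
    }

    static void writer_exit(void)
    {
            rcu_sync_exit(&rss);        /* re-enable the reader fast path */
    }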

Signed-off-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_sync.h |  94 +
 kernel/rcu/Makefile  |   2 +-
 kernel/rcu/sync.c| 175 +++
 3 files changed, 270 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/rcu_sync.h
 create mode 100644 kernel/rcu/sync.c

diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
new file mode 100644
index ..cb044df2e21c
--- /dev/null
+++ b/include/linux/rcu_sync.h
@@ -0,0 +1,94 @@
+/*
+ * RCU-based infrastructure for lightweight reader-writer locking
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright (c) 2015, Red Hat, Inc.
+ *
+ * Author: Oleg Nesterov 
+ */
+
+#ifndef _LINUX_RCU_SYNC_H_
+#define _LINUX_RCU_SYNC_H_
+
+#include <linux/wait.h>
+#include <linux/rcupdate.h>
+
+/* Structure to mediate between updaters and fastpath-using readers.  */
+struct rcu_sync {
+   int gp_state;
+   int gp_count;
+   wait_queue_head_t   gp_wait;
+
+   int cb_state;
+   struct rcu_head cb_head;
+
+   void (*sync)(void);
+   void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+};
+
+#define ___RCU_SYNC_INIT(name) \
+   .gp_state = 0,  \
+   .gp_count = 0,  \
+   .gp_wait = __WAIT_QUEUE_HEAD_INITIALIZER(name.gp_wait), \
+   .cb_state = 0
+
+#define __RCU_SCHED_SYNC_INIT(name) {  \
+   ___RCU_SYNC_INIT(name), \
+   .sync = synchronize_sched,  \
+   .call = call_rcu_sched, \
+}
+
+#define __RCU_BH_SYNC_INIT(name) { \
+   ___RCU_SYNC_INIT(name), \
+   .sync = synchronize_rcu_bh, \
+   .call = call_rcu_bh,\
+}
+
+#define __RCU_SYNC_INIT(name) {
\
+   ___RCU_SYNC_INIT(name), \
+   .sync = synchronize_rcu,\
+   .call = call_rcu,   \
+}
+
+#define DEFINE_RCU_SCHED_SYNC(name)\
+   struct rcu_sync name = __RCU_SCHED_SYNC_INIT(name)
+
+#define DEFINE_RCU_BH_SYNC(name)   \
+   struct rcu_sync name = __RCU_BH_SYNC_INIT(name)
+
+#define DEFINE_RCU_SYNC(name)  \
+   struct rcu_sync name = __RCU_SYNC_INIT(name)
+
+/**
+ * rcu_sync_is_idle() - Are readers permitted to use their fastpaths?
+ * @rsp: Pointer to rcu_sync structure to 

[PATCH RFC tip/core/rcu 9/9] rcu: Change _wait_rcu_gp() to work around GCC bug 67055

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

Code like this in inline functions confuses some recent versions of gcc:

const int n = const-expr;
whatever_t array[n];

For more details, see:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67055#c13

This compiler bug results in the following failure after 114b7fd4b (rcu:
Create rcu_sync infrastructure):

In file included from include/linux/rcupdate.h:429:0,
  from include/linux/rcu_sync.h:5,
  from kernel/rcu/sync.c:1:
include/linux/rcutiny.h: In function 'rcu_barrier_sched':
include/linux/rcutiny.h:55:20: internal compiler error: Segmentation 
fault
  static inline void rcu_barrier_sched(void)

This commit therefore eliminates the constant local variable in favor of
direct use of the expression.

Reported-and-tested-by: Mark Salter 
Reported-by: Guenter Roeck 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcupdate.h | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index ff476515f716..581abf848566 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -230,12 +230,11 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t 
*crcu_array,
   struct rcu_synchronize *rs_array);
 
 #define _wait_rcu_gp(checktiny, ...) \
-do { \
-   call_rcu_func_t __crcu_array[] = { __VA_ARGS__ }; \
-   const int __n = ARRAY_SIZE(__crcu_array); \
-   struct rcu_synchronize __rs_array[__n]; \
-   \
-   __wait_rcu_gp(checktiny, __n, __crcu_array, __rs_array); \
+do {   \
+   call_rcu_func_t __crcu_array[] = { __VA_ARGS__ };   \
+   struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)];\
+   __wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array),  \
+   __crcu_array, __rs_array);  \
 } while (0)
 
 #define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__)
-- 
1.8.1.5



[PATCH RFC tip/core/rcu 7/9] locking/percpu-rwsem: Fix the comments outdated by rcu_sync

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

Update the comments broken by the previous change.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Paul E. McKenney 
---
 kernel/locking/percpu-rwsem.c | 50 ++-
 1 file changed, 11 insertions(+), 39 deletions(-)

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 7abc0e150a22..25b73448929c 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -38,27 +38,12 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 }
 
 /*
- * This is the fast-path for down_read/up_read, it only needs to ensure
- * there is no pending writer (atomic_read(write_ctr) == 0) and inc/dec the
- * fast per-cpu counter. The writer uses synchronize_sched_expedited() to
- * serialize with the preempt-disabled section below.
- *
- * The nontrivial part is that we should guarantee acquire/release semantics
- * in case when
- *
- * R_W: down_write() comes after up_read(), the writer should see all
- *  changes done by the reader
- * or
- * W_R: down_read() comes after up_write(), the reader should see all
- *  changes done by the writer
+ * This is the fast-path for down_read/up_read. If it succeeds we rely
+ * on the barriers provided by rcu_sync_enter/exit; see the comments in
+ * percpu_down_write() and percpu_up_write().
  *
  * If this helper fails the callers rely on the normal rw_semaphore and
  * atomic_dec_and_test(), so in this case we have the necessary barriers.
- *
- * But if it succeeds we do not have any barriers, atomic_read(write_ctr) or
- * __this_cpu_add() below can be reordered with any LOAD/STORE done by the
- * reader inside the critical section. See the comments in down_write and
- * up_write below.
  */
 static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
 {
@@ -120,29 +105,15 @@ static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
return sum;
 }
 
-/*
- * A writer increments ->write_ctr to force the readers to switch to the
- * slow mode, note the atomic_read() check in update_fast_ctr().
- *
- * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
- * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
- * counter it represents the number of active readers.
- *
- * Finally the writer takes ->rw_sem for writing and blocks the new readers,
- * then waits until the slow counter becomes zero.
- */
 void percpu_down_write(struct percpu_rw_semaphore *brw)
 {
/*
-* 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
-*so that update_fast_ctr() can't succeed.
-*
-* 2. Ensures we see the result of every previous this_cpu_add() in
-*update_fast_ctr().
+* Make rcu_sync_is_idle() == F and thus disable the fast-path in
+* percpu_down_read() and percpu_up_read(), and wait for gp pass.
 *
-* 3. Ensures that if any reader has exited its critical section via
-*fast-path, it executes a full memory barrier before we return.
-*See R_W case in the comment above update_fast_ctr().
+* The latter synchronises us with the preceding readers which used
+* the fast-past, so we can not miss the result of __this_cpu_add()
+* or anything else inside their criticial sections.
 */
rcu_sync_enter(&brw->rss);
 
@@ -161,8 +132,9 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
/* release the lock, but the readers can't use the fast-path */
up_write(&brw->rw_sem);
/*
-* Insert the barrier before the next fast-path in down_read,
-* see W_R case in the comment above update_fast_ctr().
+* Enable the fast-path in percpu_down_read() and percpu_up_read()
+* but only after another gp pass; this adds the necessary barrier
+* to ensure the reader can't miss the changes done by us.
 */
rcu_sync_exit(&brw->rss);
 }
-- 
1.8.1.5



[PATCH RFC tip/core/rcu 0/9] Add rcu_sync and implement percpu_rwsem in terms of it

2015-08-28 Thread Paul E. McKenney
Hello!

This series implements an rcu_sync primitive and updates percpu_rwsem to
be implemented in terms of it.  This is an updated version of the series
posted by Oleg Nesterov, responding to feedback from Ingo Molnar.  The
patches in this series, all courtesy of Oleg (and some in turn based
on work by Peter Zijlstra), are as follows:

1.  Create rcu_sync infrastructure.

2.  Simplify rcu_sync using new rcu_sync_ops structure.

3.  Add CONFIG_PROVE_RCU checks.

4.  Introduce rcu_sync_dtor().

5.  Make percpu_free_rwsem() after kzalloc() safe.

6.  Make use of the rcu_sync infrastructure for percpu_rwsem.

7.  Fix the percpu_rwsem comments outdated by rcu_sync.

8.  Clean up the lockdep annotations in percpu_down_read().

9.  Change _wait_rcu_gp() to work around GCC bug 67055.

Thanx, Paul



 b/include/linux/percpu-rwsem.h  |3 
 b/include/linux/rcu_sync.h  |  161 +++-
 b/include/linux/rcupdate.h  |   11 -
 b/kernel/locking/percpu-rwsem.c |   85 -
 b/kernel/rcu/Makefile   |2 
 b/kernel/rcu/sync.c |  259 
 6 files changed, 401 insertions(+), 120 deletions(-)



[PATCH RFC tip/core/rcu 5/9] locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

This is a temporary ugly hack which will be reverted later. We only
need it to ensure that the next patch will not break "change sb_writers
to use percpu_rw_semaphore" patches routed via the VFS tree.

The alloc_super()->destroy_super() error path assumes that it is safe
to call percpu_free_rwsem() after kzalloc() without percpu_init_rwsem(),
so let's not disappoint it.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Paul E. McKenney 
---
 kernel/locking/percpu-rwsem.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 652a8ee8efe9..67a758df1d7c 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -25,6 +25,13 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
 
 void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 {
+   /*
+* XXX: temporary kludge. The error path in alloc_super()
+* assumes that percpu_free_rwsem() is safe after kzalloc().
+*/
+   if (!brw->fast_read_ctr)
+   return;
+
free_percpu(brw->fast_read_ctr);
brw->fast_read_ctr = NULL; /* catch use after free bugs */
 }
-- 
1.8.1.5



[PATCH RFC tip/core/rcu 8/9] locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read()

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

Based on Peter Zijlstra's earlier patch.

Change percpu_down_read() to use __down_read(); this way we can
do rwsem_acquire_read() unconditionally at the start to make this
code more symmetric and clean.

Originally-From: Peter Zijlstra 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Paul E. McKenney 
---
 kernel/locking/percpu-rwsem.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 25b73448929c..61b678d784ce 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -69,14 +69,14 @@ static bool update_fast_ctr(struct percpu_rw_semaphore 
*brw, unsigned int val)
 void percpu_down_read(struct percpu_rw_semaphore *brw)
 {
might_sleep();
-   if (likely(update_fast_ctr(brw, +1))) {
-   rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);
+   rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);
+
+   if (likely(update_fast_ctr(brw, +1)))
return;
-   }
 
-   down_read(&brw->rw_sem);
+   /* Avoid rwsem_acquire_read() and rwsem_release() */
+   __down_read(&brw->rw_sem);
atomic_inc(&brw->slow_read_ctr);
-   /* avoid up_read()->rwsem_release() */
__up_read(&brw->rw_sem);
 }
 
-- 
1.8.1.5



[PATCH RFC tip/core/rcu 3/9] rcu_sync: Add CONFIG_PROVE_RCU checks

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

This commit validates that the caller of rcu_sync_is_idle() holds the
corresponding type of RCU read-side lock, but only in kernels built
with CONFIG_PROVE_RCU=y.  This validation is carried out via a new
rcu_sync_ops->held() method that is checked within rcu_sync_is_idle().

Note that although this does add code to the fast path, it only does so
in kernels built with CONFIG_PROVE_RCU=y.
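
For an RCU_SCHED_SYNC-typed rcu_sync, the check catches callers like the
first example below and is satisfied by the second (illustrative only;
'rss' and do_fastpath() are placeholders):

    if (rcu_sync_is_idle(&rss))         /* WARN: no rcu-sched read side held */
            do_fastpath();

    preempt_disable();                  /* rcu_read_lock_sched_held() is true */
    if (rcu_sync_is_idle(&rss))
            do_fastpath();
    preempt_enable();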

Suggested-by: "Paul E. McKenney" 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_sync.h |  6 ++
 kernel/rcu/sync.c| 20 
 2 files changed, 26 insertions(+)

diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
index c6d2272c4459..c55a070b2592 100644
--- a/include/linux/rcu_sync.h
+++ b/include/linux/rcu_sync.h
@@ -40,6 +40,8 @@ struct rcu_sync {
enum rcu_sync_type  gp_type;
 };
 
+extern bool __rcu_sync_is_idle(struct rcu_sync *);
+
 /**
  * rcu_sync_is_idle() - Are readers permitted to use their fastpaths?
  * @rsp: Pointer to rcu_sync structure to use for synchronization
@@ -50,7 +52,11 @@ struct rcu_sync {
  */
 static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
 {
+#ifdef CONFIG_PROVE_RCU
+   return __rcu_sync_is_idle(rsp);
+#else
return !rsp->gp_state; /* GP_IDLE */
+#endif
 }
 
 extern void rcu_sync_init(struct rcu_sync *, enum rcu_sync_type);
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index 5a9aa4c394f1..26b2629e731e 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -23,21 +23,33 @@
 #include 
 #include 
 
+#ifdef CONFIG_PROVE_RCU
+#define __INIT_HELD(func)  .held = func,
+#else
+#define __INIT_HELD(func)
+#endif
+
 static const struct {
void (*sync)(void);
void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+#ifdef CONFIG_PROVE_RCU
+   int  (*held)(void);
+#endif
 } gp_ops[] = {
[RCU_SYNC] = {
.sync = synchronize_rcu,
.call = call_rcu,
+   __INIT_HELD(rcu_read_lock_held)
},
[RCU_SCHED_SYNC] = {
.sync = synchronize_sched,
.call = call_rcu_sched,
+   __INIT_HELD(rcu_read_lock_sched_held)
},
[RCU_BH_SYNC] = {
.sync = synchronize_rcu_bh,
.call = call_rcu_bh,
+   __INIT_HELD(rcu_read_lock_bh_held)
},
 };
 
@@ -46,6 +58,13 @@ enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
 
#define rss_lock gp_wait.lock
 
+#ifdef CONFIG_PROVE_RCU
+bool __rcu_sync_is_idle(struct rcu_sync *rsp)
+{
+   WARN_ON(!gp_ops[rsp->gp_type].held());
+   return rsp->gp_state == GP_IDLE;
+}
+
 /**
  * rcu_sync_init() - Initialize an rcu_sync structure
  * @rsp: Pointer to rcu_sync structure to be initialized
@@ -57,6 +76,7 @@ void rcu_sync_init(struct rcu_sync *rsp, enum rcu_sync_type 
type)
init_waitqueue_head(&rsp->gp_wait);
rsp->gp_type = type;
 }
+#endif
 
 /**
  * rcu_sync_enter() - Force readers onto slowpath
-- 
1.8.1.5



[PATCH RFC tip/core/rcu 4/9] rcu_sync: Introduce rcu_sync_dtor()

2015-08-28 Thread Paul E. McKenney
From: Oleg Nesterov 

This commit allows rcu_sync structures to be safely deallocated.
The trick is to add a new ->wait field to the gp_ops array.
This field is a pointer to the rcu_barrier() function corresponding
to the flavor of RCU in question.  This allows a new rcu_sync_dtor()
to wait for any outstanding callbacks before freeing the rcu_sync
structure.
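
The intended caller pattern (as used by percpu_free_rwsem() earlier in
this series) is roughly:

    void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
    {
            /* ... */
            rcu_sync_dtor(&brw->rss);           /* flush any pending callback */
            free_percpu(brw->fast_read_ctr);    /* now safe to free the memory */
            brw->fast_read_ctr = NULL;
    }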

Signed-off-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_sync.h |  1 +
 kernel/rcu/sync.c| 22 ++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
index c55a070b2592..67a31ada392f 100644
--- a/include/linux/rcu_sync.h
+++ b/include/linux/rcu_sync.h
@@ -62,6 +62,7 @@ static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
 extern void rcu_sync_init(struct rcu_sync *, enum rcu_sync_type);
 extern void rcu_sync_enter(struct rcu_sync *);
 extern void rcu_sync_exit(struct rcu_sync *);
+extern void rcu_sync_dtor(struct rcu_sync *);
 
 #define __RCU_SYNC_INITIALIZER(name, type) {   \
.gp_state = 0,  \
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index 26b2629e731e..a1f87f1bb705 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -32,6 +32,7 @@
 static const struct {
void (*sync)(void);
void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
+   void (*wait)(void);
 #ifdef CONFIG_PROVE_RCU
int  (*held)(void);
 #endif
@@ -39,16 +40,19 @@ static const struct {
[RCU_SYNC] = {
.sync = synchronize_rcu,
.call = call_rcu,
+   .wait = rcu_barrier,
__INIT_HELD(rcu_read_lock_held)
},
[RCU_SCHED_SYNC] = {
.sync = synchronize_sched,
.call = call_rcu_sched,
+   .wait = rcu_barrier_sched,
__INIT_HELD(rcu_read_lock_sched_held)
},
[RCU_BH_SYNC] = {
.sync = synchronize_rcu_bh,
.call = call_rcu_bh,
+   .wait = rcu_barrier_bh,
__INIT_HELD(rcu_read_lock_bh_held)
},
 };
@@ -195,3 +199,21 @@ void rcu_sync_exit(struct rcu_sync *rsp)
}
spin_unlock_irq(&rsp->rss_lock);
 }
+
+void rcu_sync_dtor(struct rcu_sync *rsp)
+{
+   int cb_state;
+
+   BUG_ON(rsp->gp_count);
+
+   spin_lock_irq(&rsp->rss_lock);
+   if (rsp->cb_state == CB_REPLAY)
+   rsp->cb_state = CB_PENDING;
+   cb_state = rsp->cb_state;
+   spin_unlock_irq(&rsp->rss_lock);
+
+   if (cb_state != CB_IDLE) {
+   gp_ops[rsp->gp_type].wait();
+   BUG_ON(rsp->cb_state != CB_IDLE);
+   }
+}
-- 
1.8.1.5



Re: [PATCH] pmem, nfit: Fix ARCH_MEMREMAP_PMEM handling on x86_32

2015-08-28 Thread Dan Williams
On Fri, Aug 28, 2015 at 5:16 PM, Toshi Kani  wrote:
> ARCH_MEMREMAP_PMEM is defined on x86_64 only per ARCH_HAS_PMEM_API.
> The following compile error in __nfit_spa_map() was observed on
> x86_32 as it refers ARCH_MEMREMAP_PMEM without #ifdef.
>
>   drivers/acpi/nfit.c:1205:8: error: 'ARCH_MEMREMAP_PMEM'
>   undeclared (first use in this function)
>
> Fix it by defining ARCH_MEMREMAP_PMEM to MEMREMAP_WT in 
> when CONFIG_ARCH_HAS_PMEM_API is not set, i.e. x86_32.
>
> Remove '#ifdef ARCH_MEMREMAP_PMEM's that are no longer necessary
> with this change.
>
> Also remove the redundant definition of ARCH_MEMREMAP_PMEM in
> .
>
> Signed-off-by: Toshi Kani 
> Cc: Dan Williams 
> Cc: Ross Zwisler 
> Cc: Christoph Hellwig 
> 
> Apply on top of libnvdimm-for-next of the nvdimm tree.

Thanks Toshi, I'll fold this in to prevent bisection breakage.


Re: [PATCH] nfit: Fix undefined mmio_flush_range on x86_32

2015-08-28 Thread Dan Williams
On Fri, Aug 28, 2015 at 6:18 PM, Toshi Kani  wrote:
> The following compile error was observed on x86_32 since nfit.c
> relies on  to include , which only
> works when CONFIG_ARCH_HAS_PMEM_API is set on x86_64.
>
>   drivers/acpi/nfit.c:1085:5: error: implicit declaration of
>   function 'mmio_flush_range' [-Werror=implicit-function-declaration]
>
> Change nfit.c to include  directly for now.
>
> Signed-off-by: Toshi Kani 
> Cc: Dan Williams 
> Cc: Ross Zwisler 
> ---
> Apply on top of libnvdimm-for-next of the nvdimm tree.
> This is a temporary fix and please feel free to replace it with
> a better solution.

This looks correct to me, I'm going to fold it in to where the breakage occurs.

Thanks Toshi!


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Sat, 2015-08-29 at 08:27 +0530, Raghavendra K T wrote:
>  
>   /* Use put_unaligned() because stats may not be aligned for u64. */
>   put_unaligned(items, &stats[0]);


>   for (i = 1; i < items; i++)
> - put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);
> + put_unaligned(buff[i], &stats[i]);
>  

I believe Joe suggested the following code instead:

buff[0] = items;
memcpy(stats, buff, items * sizeof(u64));

Also please move the buff[] array into __snmp6_fill_stats64() to make it
clear it is used in a 'leaf' function.

(even if calling memcpy()/memset() makes it not a leaf function)




Re: [PATCH] task_work: remove fifo ordering guarantee

2015-08-28 Thread Linus Torvalds
On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet  wrote:
>
> We could add yet another cond_resched() in the reverse loop, or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.

So I think this should be ok, with things like file closing not really
caring about ordering as far as I can tell.

However, has anybody gone through all the task-work users? I looked
quickly at the task_work_add() cases, and didn't see anything that
looked like it would care, but others should look too. In the vfs,
there's the delayed fput and mnt freeing, and there's a keyring
installation one.

The threaded irq handlers use it as that exit-time hack, which
certainly shouldn't care, and there's some uprobe thing.

Can anybody see anything fishy?

   Linus


[PATCH] perf tools: Support bpf prologue for arm64

2015-08-28 Thread He Kuang
This patch implements arch_get_reg_info() for arm64 to enable the
HAVE_BPF_PROLOGUE feature. For arm64, struct pt_regs is not composed of
fields named after registers but of an array of regs, so here we simply
multiply the fixed register size by the index number to get the byte offset.

Signed-off-by: He Kuang 
---
 tools/perf/arch/arm64/Makefile  |  1 +
 tools/perf/arch/arm64/util/dwarf-regs.c | 26 ++
 2 files changed, 27 insertions(+)

diff --git a/tools/perf/arch/arm64/Makefile b/tools/perf/arch/arm64/Makefile
index 7fbca17..1256e6e 100644
--- a/tools/perf/arch/arm64/Makefile
+++ b/tools/perf/arch/arm64/Makefile
@@ -1,3 +1,4 @@
 ifndef NO_DWARF
 PERF_HAVE_DWARF_REGS := 1
 endif
+PERF_HAVE_ARCH_GET_REG_INFO := 1
diff --git a/tools/perf/arch/arm64/util/dwarf-regs.c 
b/tools/perf/arch/arm64/util/dwarf-regs.c
index d49efeb..cb2c50a 100644
--- a/tools/perf/arch/arm64/util/dwarf-regs.c
+++ b/tools/perf/arch/arm64/util/dwarf-regs.c
@@ -10,6 +10,10 @@
 
 #include 
 #include 
+#include 
+#include 
+
+#define PT_REG_SIZE (sizeof(((struct user_pt_regs *)0)->regs[0]))
 
 struct pt_regs_dwarfnum {
const char *name;
@@ -78,3 +82,25 @@ const char *get_arch_regstr(unsigned int n)
return roff->name;
return NULL;
 }
+
+#ifdef HAVE_BPF_PROLOGUE
+int arch_get_reg_info(const char *name, int *offset)
+{
+   const struct pt_regs_dwarfnum *roff;
+
+   if (!name || !offset)
+   return -1;
+
+   for (roff = regdwarfnum_table; roff->name != NULL; roff++) {
+   if (!strcmp(roff->name, name)) {
+   if (roff->dwarfnum < 0)
+   return -1;
+
+   *offset = roff->dwarfnum * PT_REG_SIZE;
+   return 0;
+   }
+   }
+
+   return -1;
+}
+#endif
-- 
1.8.5.2



Re: [PATCH v4 1/2] Add the driver of mbigen interrupt controller

2015-08-28 Thread Alexey Klimov
Hi Ma Jun,

On Wed, Aug 19, 2015 at 5:55 AM, MaJun  wrote:
> From: Ma Jun 
>
> Mbigen means Message Based Interrupt Generator(MBIGEN).
>
> Its a kind of interrupt controller that collects
>
> the interrupts from external devices and generate msi interrupt.
>
> Mbigen is applied to reduce the number of wire connected interrupts.
>
> As the peripherals increasing, the interrupts lines needed is
> increasing much, especially on the Arm64 server soc.
>
> Therefore, the interrupt pin in gic is not enough to cover so
> many peripherals.
>
> Mbigen is designed to fix this problem.
>
> Mbigen chip locates in ITS or outside of ITS.
>
> Mbigen chip hardware structure shows as below:
>
> mbigen chip
> |-|---|
> mgn_node0 mgn_node1 mgn_node2
>  |   |---|  |---|--|
> dev1dev1dev2dev1   dev3   dev4
>
> Each mbigen chip contains several mbigen nodes.
>
> External devices can connects to mbigen node through wire connecting way.

s/connects/connect

>
> Because a mbigen node only can support 128 interrupt maximum, depends
> on the interrupt lines number of devices, a device can connects to one
> more mbigen nodes.
>
> Also, several different devices can connect to a same mbigen node.
>
> When devices triggered interrupt, mbigen chip detects and collects
> the interrupts and generates the MBI interrupts by writing the ITS
> Translator register.
>
>
> Signed-off-by: Ma Jun 
> ---
>  drivers/irqchip/Kconfig  |8 +
>  drivers/irqchip/Makefile |1 +
>  drivers/irqchip/irq-mbigen.c |  732 
> ++
>  3 files changed, 741 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/irqchip/irq-mbigen.c
>
> diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
> index 120d815..356507f 100644
> --- a/drivers/irqchip/Kconfig
> +++ b/drivers/irqchip/Kconfig
> @@ -27,6 +27,14 @@ config ARM_GIC_V3_ITS
> bool
> select PCI_MSI_IRQ_DOMAIN
>
> +config HISILICON_IRQ_MBIGEN
> +   bool "Support mbigen interrupt controller"
> +   default n
> +   depends on ARM_GIC_V3 && ARM_GIC_V3_ITS
> +   help
> +Enable the mbigen interrupt controller used on
> +Hisilicon platform.
> +
>  config ARM_NVIC
> bool
> select IRQ_DOMAIN
> diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
> index 11d08c9..c6f3d66 100644
> --- a/drivers/irqchip/Makefile
> +++ b/drivers/irqchip/Makefile
> @@ -23,6 +23,7 @@ obj-$(CONFIG_ARM_GIC) += irq-gic.o 
> irq-gic-common.o
>  obj-$(CONFIG_ARM_GIC_V2M)  += irq-gic-v2m.o
>  obj-$(CONFIG_ARM_GIC_V3)   += irq-gic-v3.o irq-gic-common.o
>  obj-$(CONFIG_ARM_GIC_V3_ITS)   += irq-gic-v3-its.o 
> irq-gic-v3-its-pci-msi.o irq-gic-v3-its-platform-msi.o
> +obj-$(CONFIG_HISILICON_IRQ_MBIGEN) += irq-mbigen.o
>  obj-$(CONFIG_ARM_NVIC) += irq-nvic.o
>  obj-$(CONFIG_ARM_VIC)  += irq-vic.o
>  obj-$(CONFIG_ATMEL_AIC_IRQ)+= irq-atmel-aic-common.o 
> irq-atmel-aic.o
> diff --git a/drivers/irqchip/irq-mbigen.c b/drivers/irqchip/irq-mbigen.c
> new file mode 100644
> index 000..4bbbd76
> --- /dev/null
> +++ b/drivers/irqchip/irq-mbigen.c
> @@ -0,0 +1,732 @@
> +/*
> + * Copyright (C) 2014 Hisilicon Limited, All Rights Reserved.

maybe 2014-2015 or 2015?

> + * Author: Jun Ma 
> + * Author: Yun Wu 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

What do you think about sorting this?


> +#include "irqchip.h"
> +
> +#define MBIGEN_NODE_SHIFT   (8)
> +#define MBIGEN_DEV_SHIFT   (12)
> +
> +/*
> + * To avoid the duplicate hwirq number problem
> + * we use device id, mbigen node number and interrupt
> + * pin offset to generate a new hwirq number in mbigen
> + * domain.
> + *
> + * hwirq[32:12]: did. device id
> + * hwirq[11:8]: nid. mbigen node number
> + * hwirq[7:0]: pin. hardware pin offset of this interrupt
> + */
> +#define COMPOSE_MBIGEN_HWIRQ(did, nid, pin) \
> +   (((did) << MBIGEN_DEV_SHIFT) | \
> +   ((nid) << MBIGEN_NODE_SHIFT) | (pin))
> +
> +/* get the interrupt pin offset from mbigen hwirq */
> +#define 

Re: [PATCH] btrfs: trimming some start_transaction() code away

2015-08-28 Thread Alexandru Moise
On Fri, Aug 28, 2015 at 07:38:56PM +0200, David Sterba wrote:
> On Thu, Aug 27, 2015 at 11:53:45PM +, Alexandru Moise wrote:
> > Just call kmem_cache_zalloc() instead of calling kmem_cache_alloc().
> > We're just initializing most fields to 0, false and NULL later on
> > _anyway_, so to make the code mode readable and potentially gain
> > a bit of performance (completely untested claim), we should fill our
> > btrfs_trans_handle with zeros on allocation then just initialize
> > those five remaining fields (not counting the list_heads) as normal.
> > 
> > Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com>
> 
> The performance gain is arguable but the generated code should be
> smaller, which counts.
> 
> Reviewed-by: David Sterba 

Yeah, I ran a few iozone benchmarks on a Samsung 850 PRO SSD on 3 kernels:
the latest Arch Linux kernel; my custom kernel, which has:
CONFIG_BTRFS_ASSERT=y
CONFIG_BTRFS_DEBUG=y
CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
with the patch, and my custom kernel without the patch.
I ran iozone 5 times on each kernel. There were huge differences
between my custom kernels and Arch's kernel, but nothing conclusive
between my custom kernel with and without the patch. So it's safe
to say that the patch does not have much of a visible effect on performance.

Thank you for your time!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Raghavendra K T
* David Miller  [2015-08-28 11:24:13]:

> From: Raghavendra K T 
> Date: Fri, 28 Aug 2015 12:09:52 +0530
> 
> > On 08/28/2015 12:08 AM, David Miller wrote:
> >> From: Raghavendra K T 
> >> Date: Wed, 26 Aug 2015 23:07:33 +0530
> >>
> >>> @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64
> >>> *stats, void __percpu *mib,
> >>>   static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int
> >>>   attrtype,
> >>>int bytes)
> >>>   {
> >>> + u64 buff[IPSTATS_MIB_MAX] = {0,};
> >>> +
>  ...
> > hope you wanted to know the overhead than to change the current
> > patch. please let me know..
> 
> I want you to change that variable initializer to an explicit memset().
> 
> The compiler is emitting a memset() or similar _anyways_.
> 
> Not because it will have any impact at all upon performance, but because
> of how it looks to people trying to read and understand the code.
> 
> 

Hi David,
resending the patch with memset. Please let me know if you want me to
resend all the patches.

8<
From: Raghavendra K T 
Subject: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking
 all the percpu data at once

Docker container creation time increased linearly from around 1.6 sec to 7.5 sec
(at 1000 containers) and perf data showed 50% overhead in snmp_fold_field.

Reason: currently __snmp6_fill_stats64 calls snmp_fold_field, which walks
through the per-CPU data of an item (iteratively for around 90 items).

Idea: this patch aggregates the statistics by going through all the items
of each CPU sequentially, which reduces cache misses.
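
In pseudo-C, the change is just the loop order (snmp_get() here is a
stand-in for the per-CPU field accessors, not a real kernel helper):

/* before: item-major; each of the ~90 items walks every CPU's mib block */
for (i = 1; i < items; i++)
	for_each_possible_cpu(c)
		buff[i] += snmp_get(mib, c, i);

/* after: cpu-major; each CPU's mib block is walked once, sequentially */
for_each_possible_cpu(c)
	for (i = 1; i < items; i++)
		buff[i] += snmp_get(mib, c, i);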

Docker creation got faster by more than 2x after the patch.

Result:
                       Before    After
Docker creation time   6.836s    3.357s
cache miss             2.7%      1.38%

perf before:
50.73%  docker   [kernel.kallsyms]   [k] snmp_fold_field
 9.07%  swapper  [kernel.kallsyms]   [k] snooze_loop
 3.49%  docker   [kernel.kallsyms]   [k] veth_stats_one
 2.85%  swapper  [kernel.kallsyms]   [k] _raw_spin_lock

perf after:
10.56%  swapper  [kernel.kallsyms] [k] snooze_loop
 8.72%  docker   [kernel.kallsyms] [k] snmp_get_cpu_field
 7.59%  docker   [kernel.kallsyms] [k] veth_stats_one
 3.65%  swapper  [kernel.kallsyms] [k] _raw_spin_lock

Signed-off-by: Raghavendra K T 
---
 net/ipv6/addrconf.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

 Change in V2:
 - Allocate stat calculation buffer in stack (Eric)
 - Use memset to zero temp buffer (David)

Thanks David and Eric for the comments on V1; as both of them pointed out,
unfortunately we cannot get rid of the calculation buffer without
avoiding the unaligned ops.


diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 21c2c81..9bdfba3 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4624,16 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
- int items, int bytes, size_t syncpoff)
+   int items, int bytes, size_t syncpoff,
+   u64 *buff)
 {
-   int i;
+   int i, c;
int pad = bytes - sizeof(u64) * items;
BUG_ON(pad < 0);
 
/* Use put_unaligned() because stats may not be aligned for u64. */
	put_unaligned(items, &stats[0]);
+
+   for_each_possible_cpu(c)
+   for (i = 1; i < items; i++)
+   buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);
+
for (i = 1; i < items; i++)
-   put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);
+   put_unaligned(buff[i], &stats[i]);
 
	memset(&stats[items], 0, pad);
 }
@@ -4641,10 +4647,13 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
 static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 int bytes)
 {
+   u64 buff[IPSTATS_MIB_MAX];
+
switch (attrtype) {
case IFLA_INET6_STATS:
-   __snmp6_fill_stats64(stats, idev->stats.ipv6,
-        IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp));
+   memset(buff, 0, sizeof(buff));
+   __snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes,
+        offsetof(struct ipstats_mib, syncp), buff);
break;
case IFLA_INET6_ICMP6STATS:
	__snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes);
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  

Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Wangnan (F)



On 2015/8/29 10:49, Alexei Starovoitov wrote:

On 8/28/15 7:36 PM, Wangnan (F) wrote:
For the current patch 32/32, I think it is useful enough for some simple
cases, and we have already started using it internally. What about keeping
it as it is now and creating an independent method for your use case?


Well, though the patch is small and contained, I think we can do better
and define a more generic helper. I believe Namhyung back in July had
the same concern.


OK. I'll drop this one in my next pull request.

Thank you.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Alexei Starovoitov

On 8/28/15 7:36 PM, Wangnan (F) wrote:

For the current patch 32/32, I think it is useful enough for some simple
cases, and we have already started using it internally. What about keeping
it as it is now and creating an independent method for your use case?


Well, though the patch is small and contained, I think we can do better
and define a more generic helper. I believe Namhyung back in July had
the same concern.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/4] perf tools: Use the new ability of eBPF programs to access hardware PMU counter

2015-08-28 Thread Alexei Starovoitov

On 8/28/15 7:14 PM, xiakaixu wrote:

Right, this is just a little example. Actually, I have tested this
ability on the kernel side and the user-space side, that is, with kprobes and uprobes.


great to hear.


At this time i wish to get your comment on the current chosen implementation.
Now the struct perf_event_map_def is introduced and the user can directly
define the struct perf_event_attr, so we can skip the parse_events process
and call the sys_perf_event_open on the events directly. This is the most
simple implementation, but I am not sure it is the most appropriate.


I think it's a bit kludgy. You are trying to squeeze more and more
information into sections and pass them via elf.
It worked for samples early on, but now it's time to do better.
Like in bcc we just write normal C and extract all necessary information
by looking at C via clang:rewriter api. I think it's a cleaner approach.
In our use case we can compile on the host, so no intermediate files,
no elf files. If you have to cross-compile you can still use the same
approach and let llvm generate .o and emit all extra stuff as another
configuration file (say in .json), then let host load .o and use .json
to setup pmu events and everything else. It will work for higher number
of use cases, but at the end I don't see how you can avoid moving to
c+python or c+whatever approach, since static configuration (whether in
.json or in elf section) are not going to be enough. You'd need a
program in user space to deal with all the data that bpf program
in kernel is collecting.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] task_work: remove fifo ordering guarantee

2015-08-28 Thread Eric Dumazet
From: Eric Dumazet 

In commit f341861fb0b ("task_work: add a scheduling point in
task_work_run()") I fixed a latency problem by adding a cond_resched()
call.

Later, commit ac3d0da8f329 added yet another loop to reverse a list,
bringing back the latency spike:

I've seen in some cases this loop taking 275 ms, if for example a
process with 2,000,000 files is killed.

We could add yet another cond_resched() in the reverse loop, or we
can simply remove the reversal, as I do not think anything
would depend on order of task_work_add() submitted works.

Fixes: ac3d0da8f329 ("task_work: Make task_work_add() lockless")
Signed-off-by: Eric Dumazet 
Reported-by: Maciej Żenczykowski 
---
 kernel/task_work.c |   12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8727032e3a6f..53fa971d000d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -18,6 +18,8 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
  * This is like the signal handler which runs in kernel mode, but it doesn't
  * try to wake up the @task.
  *
+ * Note: there is no ordering guarantee on works queued here.
+ *
  * RETURNS:
  * 0 if succeeds or -ESRCH.
  */
@@ -108,16 +110,6 @@ void task_work_run(void)
	raw_spin_unlock_wait(&task->pi_lock);
smp_mb();
 
-   /* Reverse the list to run the works in fifo order */
-   head = NULL;
-   do {
-   next = work->next;
-   work->next = head;
-   head = work;
-   work = next;
-   } while (work);
-
-   work = head;
do {
next = work->next;
work->func(work);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
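
For comparison, the other option mentioned in the changelog above -- keeping
the FIFO reversal but adding a scheduling point inside it -- would look
roughly like this (a sketch only, not part of the patch):

	/* Reverse the list to run the works in fifo order, rescheduling
	 * as we go so that a huge list cannot create a latency spike.
	 * (In practice one would probably only resched every N entries.) */
	head = NULL;
	do {
		next = work->next;
		work->next = head;
		head = work;
		work = next;
		cond_resched();
	} while (work);
	work = head;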


Re: [PATCH net-next] openvswitch: Fix conntrack compilation without mark.

2015-08-28 Thread Simon Horman
On Fri, Aug 28, 2015 at 07:22:11PM -0700, Joe Stringer wrote:
> Fix build with !CONFIG_NF_CONNTRACK_MARK && CONFIG_OPENVSWITCH_CONNTRACK
> 
> Fixes: 182e304 ("openvswitch: Allow matching on conntrack mark")
> Reported-by: Simon Horman 
> Signed-off-by: Joe Stringer 

Thanks Joe,

this seems to solve the build problem that I observed.

Tested-by: Simon Horman 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Wangnan (F)



On 2015/8/29 10:22, Alexei Starovoitov wrote:

On 8/28/15 7:15 PM, Wangnan (F) wrote:

I'd like to see whether it is possible to create dynamic tracepoints so
different receivers can listen on different tracepoints.


see my proposal A. I think ftrace instances might work for this.

I'm not sure about 'format' part though. Kernel side shouldn't be
aware of it. It's only the contract between bpf program and user process
that deals with it.


It is an option. Let's keep an open mind now :)

For the current patch 32/32, I think it is useful enough for some simple cases,
and we have already started using it internally. What about keeping it as it
is now and creating an independent method for your use case?

Thank you.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv6 net-next 00/10] OVS conntrack support

2015-08-28 Thread Joe Stringer
On 28 August 2015 at 16:57, Simon Horman  wrote:
> On Wed, Aug 26, 2015 at 11:31:43AM -0700, Joe Stringer wrote:
>> The goal of this series is to allow OVS to send packets through the Linux
>> kernel connection tracker, and subsequently match on fields populated by
>> conntrack. This functionality is enabled through a new
>> CONFIG_OPENVSWITCH_CONNTRACK option.
>>
>> This version addresses the feedback from v5, primarily checking the behaviour
>> is correct with different configurations such as disabling
>> CONFIG_OPENVSWITCH_CONNTRACK or disabling individual conntrack features like
>> connlabels.
>>
>> The branch below has been updated with the corresponding userspace pieces:
>> https://github.com/joestringer/ovs dev/ct_20150818
>
> Hi Joe,
>
> Nice work getting this patchset in order.
>
> I am seeing the following when compiling without NF_CONNTRACK_MARK set.
>
>   CC [M]  net/openvswitch//conntrack.o
> net/openvswitch//conntrack.c: In function ‘__ovs_ct_update_key’:
> net/openvswitch//conntrack.c:127:24: error: ‘const struct nf_conn’ has no 
> member named ‘mark’
>   key->ct.mark = ct ? ct->mark : 0;
> ^
> net/openvswitch//conntrack.c: In function ‘ovs_ct_set_mark’:
> net/openvswitch//conntrack.c:195:26: error: ‘struct nf_conn’ has no member 
> named ‘mark’
>   new_mark = ct_mark | (ct->mark & ~(mask));
>   ^
> net/openvswitch//conntrack.c:196:8: error: ‘struct nf_conn’ has no member 
> named ‘mark’
>   if (ct->mark != new_mark) {
> ^
> net/openvswitch//conntrack.c:197:5: error: ‘struct nf_conn’ has no member 
> named ‘mark’
>ct->mark = new_mark;
>  ^
> scripts/Makefile.build:258: recipe for target 'net/openvswitch//conntrack.o' 
> failed
> make[1]: *** [net/openvswitch//conntrack.o] Error 1
> Makefile:1386: recipe for target '_module_net/openvswitch/' failed
> make: *** [_module_net/openvswitch/] Error 2

Thanks for reporting this, I sent a patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next] openvswitch: Fix conntrack compilation without mark.

2015-08-28 Thread Joe Stringer
Fix build with !CONFIG_NF_CONNTRACK_MARK && CONFIG_OPENVSWITCH_CONNTRACK

Fixes: 182e304 ("openvswitch: Allow matching on conntrack mark")
Reported-by: Simon Horman 
Signed-off-by: Joe Stringer 
---
 net/openvswitch/conntrack.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 886bd27..e8e524a 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -100,6 +100,15 @@ static u8 ovs_ct_get_state(enum ip_conntrack_info ctinfo)
return ct_state;
 }
 
+static u32 ovs_ct_get_mark(const struct nf_conn *ct)
+{
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK)
+   return ct ? ct->mark : 0;
+#else
+   return 0;
+#endif
+}
+
 static void ovs_ct_get_label(const struct nf_conn *ct,
 struct ovs_key_ct_label *label)
 {
@@ -124,7 +133,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
 {
key->ct.state = state;
key->ct.zone = zone->id;
-   key->ct.mark = ct ? ct->mark : 0;
+   key->ct.mark = ovs_ct_get_mark(ct);
	ovs_ct_get_label(ct, &key->ct.label);
 }
 
@@ -180,12 +189,11 @@ int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb)
 static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
   u32 ct_mark, u32 mask)
 {
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK)
enum ip_conntrack_info ctinfo;
struct nf_conn *ct;
u32 new_mark;
 
-   if (!IS_ENABLED(CONFIG_NF_CONNTRACK_MARK))
-   return -ENOTSUPP;
 
/* The connection could be invalid, in which case set_mark is no-op. */
	ct = nf_ct_get(skb, &ctinfo);
@@ -200,6 +208,9 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
}
 
return 0;
+#else
+   return -ENOTSUPP;
+#endif
 }
 
 static int ovs_ct_set_label(struct sk_buff *skb, struct sw_flow_key *key,
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Alexei Starovoitov

On 8/28/15 7:15 PM, Wangnan (F) wrote:

I'd like to see whether it is possible to create dynamic tracepoints so
different receivers can listen on different tracepoints.


see my proposal A. I think ftrace instances might work for this.

I'm not sure about 'format' part though. Kernel side shouldn't be
aware of it. It's only the contract between bpf program and user process
that deals with it.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux Firmware Signing

2015-08-28 Thread Luis R. Rodriguez
On Thu, Aug 27, 2015 at 07:54:33PM -0400, Mimi Zohar wrote:
> On Thu, 2015-08-27 at 23:29 +0200, Luis R. Rodriguez wrote:
> > On Thu, Aug 27, 2015 at 10:57:23AM -, David Woodhouse wrote:
> > > > Luis R. Rodriguez  wrote:
> > > >
> > > >> "PKCS#7: Add an optional authenticated attribute to hold firmware name"
> > > >> https://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fwsign-pkcs7=1448377a369993f864915743cfb34772e730213good
> > > >>
> > > >> 1.3.6.1.4.1.2312.16 Linux kernel
> > > >> 1.3.6.1.4.1.2312.16.2   - PKCS#7/CMS SignerInfo attribute types
> > > >> 1.3.6.1.4.1.2312.16.2.1   - firmwareName
> > > >>
> > > >> I take it you are referring to this?
> > > >
> > > > Yes.
> > > >
> > > >> If we follow this model we'd then need something like:
> > > >>
> > > >> 1.3.6.1.4.1.2312.16.2.2   - seLinuxPolicyName
> > > >>
> > > >> That should mean each OID that has different file names would need to 
> > > >> be
> > > >> explicit about and have a similar entry on the registry. I find that
> > > >> pretty redundant and would like to avoid that if possible.
> > > >
> > > > firmwareName is easy for people to understand - it's the name the kernel
> > > > asks for and the filename of the blob.  seLinuxPolicyName is, I think, a
> > > > lot more tricky since a lot of people don't use SELinux, and most that 
> > > > do
> > > > don't understand it (most people that use it aren't even really aware of
> > > > it).
> > > >
> > > > If you can use the firmwareName as the SELinux/LSM key, I would suggest
> > > > doing so - even if you dress it up as a path
> > > > (/lib/firmware/).
> > > 
> > > In conversation with Mimi last week she was very keen on the model where
> > > we load modules & firmware in such a fashion that the kernel has access to
> > > the original inode -- by passing in a fd,
> > 
> > Sure, so let's be specific to ensure what Mimi needs is there. I though 
> > there
> > was work needed on modules but that seems covered and work then seems only
> > needed for kexec and SELinux policy files (and a review of other possible 
> > file
> > consumers in the kernel) for what you describe. 

Correct me if I'm wrong:

> At last year's LSS linux-integrity status update, I mentioned 6
> measurement/appraisal gaps, kernel modules (linux-3.7), 

Done.

> firmware (linux-3.17), 

I'm working on it, but as far as LSMs are concerned the LSM hook
is in place.

> kexec,

I'll note kexec has both a kernel and initramfs :) so just keep that
in mind. Technically it should vet for both. It seems we just need
an LSM hook there.

> initramfs, 

Hm, what code path?

> eBPF/seccomp 

Same here, where's this?

> and policies,

Which ones?

>  that have
> been or need to be addressed.  Since then, a new kexec syscall, file
> descriptor based, was upstreamed that appraises the image.  Until we can
> preserve the measurement list across kexec,

I'm sorry, I do not follow; can you elaborate on what you mean by this?
It's not clear to me what you mean by the measurement list. Do you mean
all the above items?

> it doesn't make sense to
> measure the image just to have it thrown away.  (skipping initramfs as
> that isn't related to LSM hooks

Hrm, it can be; I mean at least for the kexec case it's an fd that is passed
as part of the syscall. I'm not sure about the other case you mentioned yet,
as I haven't reviewed that code.

>.)  Lastly, measuring/appraising policies
> (eg. IMA, SELinux, Smack, iptables/ebtables) 

OK for each of these:

how do we load the data? Is that the full list? Note we should
be able to use grammar rules to hunt these down; I just haven't
sat down to write them, but if this is important, we should.

> or any other files consumed
> by the kernel.

:D likewise

> > I also went ahead and studied
> > areas where we can share code now as I was looking at this code now, and 
> > also
> > would like to recap on the idea of possibly just sharing the same LSM hook
> > for all "read this special file from the fs in the kernel" cases. Details 
> > below.
> > 
> > Fortunately the LSM hooks use struct file, and with this you can get
> > the inode:
> > 
> > struct inode *inode = file_inode(file);
> > 
> > For modules we have this LSM hook:
> > 
> > int (*kernel_module_from_file)(struct file *file);
> > 
> > This can be used for finit_module(). Its used as follows, the fd comes from
> > finit_module() syscall.
> > 
> > SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, 
> > flags)
> > {
> > ...
> > err = copy_module_from_fd(fd, &info);
> > if (err)
> > return err; 
> > ...
> > }
> > 
> > static int copy_module_from_fd(int fd, struct load_info *info)
> > {
> > struct fd f = fdget(fd);
> > ...
> > err = security_kernel_module_from_file(f.file); 
> > if (err)  
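
The shared hook idea mentioned above could take a shape like the following;
this is purely a sketch, and the enum values and function name are
hypothetical, not an existing kernel API:

/* Hypothetical: one LSM hook covering every "kernel reads a file from the
 * filesystem" case, with an identifier saying what kind of file it is. */
enum kernel_read_file_id {
	READING_KERNEL_MODULE,
	READING_FIRMWARE,
	READING_KEXEC_IMAGE,
	READING_KEXEC_INITRAMFS,
	READING_SECURITY_POLICY,
};

int security_kernel_read_file(struct file *file, enum kernel_read_file_id id);

/* finit_module() above would then do: */
err = security_kernel_read_file(f.file, READING_KERNEL_MODULE);
if (err)
	return err;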

Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Wangnan (F)



On 2015/8/29 9:34, Alexei Starovoitov wrote:

On 8/28/15 6:19 PM, Wangnan (F) wrote:
For me, I use bpf_output_trace_data() to output information like PMU 
count
value. Perf is the only receiver, so global collector is perfect. 
Could you

please describe your usecase in more detail?


there is a special receiver in user space that only wants the data from
the bpf program that it loaded. It shouldn't conflict with any other
processes. Like when it's running, I still should be able to use perf
for other performance analysis. There is no way to share single
bpf:bpf_output_data event, since these user processes are completely
independent.


I'd like to see whether it is possible to create dynamic tracepoints so
different receivers can listen on different tracepoints. For my side, maybe
I can encode format information into the new tracepoints so I don't need
those LLVM patches.

For example:

# echo 'dynamic_tracepoint:mytracepoint ' >> /sys/kernel/debug/tracing/dynamic_trace_events

# perf list
  ...
  dynamic_tracepoint:mytracepoint
  ...

On the perf side we can encode the creation of the dynamic tracepoint into
the bpf-loader, like what we currently do for probing kprobes.

This approach requires us to create a fresh new event source, in parallel
with tracepoints. I'm not sure how much work it needs. What do you think?

Thank you.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/4] perf tools: Use the new ability of eBPF programs to access hardware PMU counter

2015-08-28 Thread xiakaixu
于 2015/8/29 9:28, Alexei Starovoitov 写道:
> On 8/27/15 3:42 AM, Kaixu Xia wrote:
>> An example is pasted at the bottom of this cover letter. In that example,
>> we can get the cpu_cycles and exception taken in sys_write.
>>
>>   $ cat /sys/kernel/debug/tracing/trace_pipe
>>   $ ./perf record --event perf-bpf.o ls
>> ...
>>   cat-1653  [003] d..1 88174.613854: : ente:  CPU-3
>> cyc:48746333exc:84
>>   cat-1653  [003] d..2 88174.613861: : exit:  CPU-3
>> cyc:48756041exc:84
> 
> nice. probably more complex example that computes the delta of the pmu
> counters on the kernel side would be even more interesting.

Right, this is just a little example. Actually, I have tested this
ability on the kernel side and the user-space side, that is, with kprobes
and uprobes. The collected delta of the PMU counters from the kernel and
glibc is correct and meets the expected goals. I will give them in the
next version.

At this time I wish to get your comments on the currently chosen implementation.
Now the struct perf_event_map_def is introduced and the user can directly
define the struct perf_event_attr, so we can skip the parse_events process
and call sys_perf_event_open on the events directly. This is the simplest
implementation, but I am not sure it is the most appropriate.
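
(For readers not familiar with that route: the kind of static definition
meant here is essentially a plain struct perf_event_attr, as in the generic
example below; the actual perf_event_map_def wrapper from the RFC is not
reproduced here.)

#include <linux/perf_event.h>

/* A statically defined hardware event (CPU cycles) that can be handed
 * straight to sys_perf_event_open() without going through parse_events(). */
struct perf_event_attr cycles_attr = {
	.type   = PERF_TYPE_HARDWARE,
	.config = PERF_COUNT_HW_CPU_CYCLES,
	.size   = sizeof(struct perf_event_attr),
};
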
> Do you think you can extend 'perf stat' with a flag that does
> stats collection for a given kernel or user function instead of the
> whole process ?
> Then we can use perf record/report to figure out hot functions and
> follow with 'perf stat -f my_hot_func my_process' to drill into
> particular function stats.

Good idea! I will consider it when this patchset is basically completed.
> 
> 
> .
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC / BUG] mtd: provide proper 32/64-bit compat_ioctl() support for BLKPG

2015-08-28 Thread Brian Norris
After a bit of poking around wondering why my 32-bit user-space can't
seem to send a proper ioctl(BLKPG) to an MTD on my 64-bit kernel
(ARM64), I noticed that struct blkpg_ioctl_arg is actually pretty
unsuitable for use in the ioctl() ABI, due to its use of raw pointers,
and its lack of alignment/packing restrictions (32-bit arch'es tend to
pack the 4 fields into 4 32-bit words, whereas 64-bit arch'es would add
padding after the third int, and make this 6 32-bit words).
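
Concretely (assuming the usual 32-bit and 64-bit Linux ABIs), the same
definition produces two incompatible layouts:

struct blkpg_ioctl_arg {
	int op;            /* 32-bit: bytes  0-3    64-bit: bytes  0-3  */
	int flags;         /* 32-bit: bytes  4-7    64-bit: bytes  4-7  */
	int datalen;       /* 32-bit: bytes  8-11   64-bit: bytes  8-11 */
	                   /*                       64-bit: 4 bytes pad */
	void __user *data; /* 32-bit: bytes 12-15   64-bit: bytes 16-23 */
};
/* sizeof() is 16 vs 24 bytes and the pointer width differs, so a 32-bit
 * caller's payload cannot be interpreted directly by a 64-bit kernel. */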

Anyway, this means BLKPG deserves some special compat_ioctl handling. Do
the conversion in a small shim for MTD.

The same bug applies to block/ioctl.c, but I wanted to get some comments
first. I can send a non-RFC with the same approach for the block
subsystem. But then: which tree should it go in?

Tested only on MTD, with an ARM32 user space on an ARM64 kernel.

Signed-off-by: Brian Norris 
---
 drivers/mtd/mtdchar.c  | 42 +-
 include/uapi/linux/blkpg.h | 10 ++
 2 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index 55fa27ecf4e1..bf966be09e79 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -498,21 +498,17 @@ static int shrink_ecclayout(const struct nand_ecclayout *from,
 }
 
 static int mtdchar_blkpg_ioctl(struct mtd_info *mtd,
-  struct blkpg_ioctl_arg __user *arg)
+  struct blkpg_ioctl_arg *arg)
 {
-   struct blkpg_ioctl_arg a;
struct blkpg_partition p;
 
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
 
-   if (copy_from_user(&a, arg, sizeof(struct blkpg_ioctl_arg)))
+   if (copy_from_user(&p, arg->data, sizeof(p)))
return -EFAULT;
 
-   if (copy_from_user(&p, a.data, sizeof(struct blkpg_partition)))
-   return -EFAULT;
-
-   switch (a.op) {
+   switch (arg->op) {
case BLKPG_ADD_PARTITION:
 
/* Only master mtd device must be used to add partitions */
@@ -966,8 +962,13 @@ static int mtdchar_ioctl(struct file *file, u_int cmd, u_long arg)
 
case BLKPG:
{
-   ret = mtdchar_blkpg_ioctl(mtd,
- (struct blkpg_ioctl_arg __user *)arg);
+   struct blkpg_ioctl_arg __user *blk_arg = argp;
+   struct blkpg_ioctl_arg a;
+
+   if (copy_from_user(&a, blk_arg, sizeof(a)))
+   ret = -EFAULT;
+   else
+   ret = mtdchar_blkpg_ioctl(mtd, &a);
break;
}
 
@@ -1046,6 +1047,29 @@ static long mtdchar_compat_ioctl(struct file *file, unsigned int cmd,
_user->start);
break;
}
+
+   case BLKPG:
+   {
+   /* Convert from blkpg_compat_ioctl_arg to blkpg_ioctl_arg */
+   struct blkpg_compat_ioctl_arg __user *uarg = argp;
+   struct blkpg_compat_ioctl_arg arg;
+   struct blkpg_ioctl_arg a;
+
+   if (copy_from_user(&arg, uarg, sizeof(arg))) {
+   ret = -EFAULT;
+   break;
+   }
+
+   memset(&a, 0, sizeof(a));
+   a.op = arg.op;
+   a.flags = arg.flags;
+   a.datalen = arg.datalen;
+   a.data = compat_ptr(arg.data);
+
+   ret = mtdchar_blkpg_ioctl(mtd, &a);
+   break;
+   }
+
default:
ret = mtdchar_ioctl(file, cmd, (unsigned long)argp);
}
diff --git a/include/uapi/linux/blkpg.h b/include/uapi/linux/blkpg.h
index a8519446c111..0574147f4490 100644
--- a/include/uapi/linux/blkpg.h
+++ b/include/uapi/linux/blkpg.h
@@ -26,6 +26,7 @@
  */
 #include 
 #include 
+#include 
 
 #define BLKPG  _IO(0x12,105)
 
@@ -37,6 +38,15 @@ struct blkpg_ioctl_arg {
 void __user *data;
 };
 
+#ifdef CONFIG_COMPAT
+struct blkpg_compat_ioctl_arg {
+   compat_int_t op;
+   compat_int_t flags;
+   compat_int_t datalen;
+   compat_uptr_t data;
+};
+#endif
+
 /* The subfunctions (for the op field) */
 #define BLKPG_ADD_PARTITION1
 #define BLKPG_DEL_PARTITION2
-- 
2.5.0.457.gab17608

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] asm-generic/pci_iomap.h: make custom PCI BAR requirements explicit

2015-08-28 Thread Randy Dunlap
On 08/28/15 17:17, Luis R. Rodriguez wrote:
> 
>  arch/s390/Kconfig |  8 +
>  arch/s390/include/asm/io.h| 11 ---
>  arch/s390/include/asm/pci.h   |  2 --
>  arch/s390/include/asm/pci_iomap.h | 33 +
>  arch/s390/pci/pci.c   |  2 ++
>  include/asm-generic/io.h  | 12 
>  include/asm-generic/iomap.h   | 10 ---
>  include/asm-generic/pci_iomap.h   | 62 
> +++
>  lib/Kconfig   |  1 +
>  lib/pci_iomap.c   |  5 
>  10 files changed, 105 insertions(+), 41 deletions(-)
>  create mode 100644 arch/s390/include/asm/pci_iomap.h
> 
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 1d57000b1b24..1217b7db4265 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -614,6 +614,14 @@ endif# PCI
>  config PCI_DOMAINS
>   def_bool PCI
>  
> +config ARCH_PCI_NON_DISJUNCTIVE
> + def_bool PCI
> + help
> +   On the S390 architecture PCI BAR spaces are not disjunctive, as such

are not disjoint?  may be overlapping?

> +   the PCI bar is required on a series of otherwise asm generic PCI
> +   routines, as such S390 requires itw own implemention for these

  its own implementation

> +   routines.
> +
>  config HAS_IOMEM
>   def_bool PCI
>  


-- 
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux Firmware Signing

2015-08-28 Thread Luis R. Rodriguez
On Fri, Aug 28, 2015 at 06:26:05PM -0400, Paul Moore wrote:
> On Fri, Aug 28, 2015 at 7:20 AM, Roberts, William C
>  wrote:
> > Even triggered updates make sense, since you can at least have some form of 
> > trust
> > of where that binary policy came from.
> 
> It isn't always that simple, see my earlier comments about
> customization and manipulation by the policy loading tools.

If the customization of the data is done in kernel then the kernel
can *first* verify the file's signature prior to doing any data
modification. If userspace does the modification then the signature
stuff won't work unless the tool will have access to the MOK and can
sign it pre-flight to the kernel selinuxfs.

> > Huh, not following? Perhaps, I am not following what your laying down here.
> >
> >  Right now there is no signing on the selinux policy file. We should be able
> > to just use the firmware signing api's as is (I have not looked on 
> > linux-next yet)
> > to unpack the blob.
> 
> I haven't looked at the existing fw signing hook in any detail to be
> able to comment on its use as a policy verification hook.  As long as
> we preserve backwards compatibility and don't introduce a new
> mechanism/API for loading SELinux policy I doubt I would have any
> objections.

You'd just have to implement a permissive model as we are doing with the
fw signing. No radical customizations, except one thing to note is
that on the fw signing side of things we're going to have the signature
of the file *detached* in a separate file. I think what you're alluding
to is the issue of where that signature would be stuffed in the SELinux
policy file, and it's correct that you'd need to address that. You could
just borrow the kernel's model and add a reader that strips out the
signature. Another possibility would be two files, but then I guess
you'd need a trigger to indicate both are in place.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
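
The appended-signature model referred to above (the one module signing uses)
boils down to something like the sketch below; the marker string and layout
here are illustrative only, mirroring the module case rather than defining a
real policy format:

/* Illustrative layout, module-signing style:
 *
 *   [ policy payload | PKCS#7 blob | 4-byte BE sig length | magic marker ]
 */
#define POLICY_SIG_MAGIC "~Policy signature appended~\n"

static int strip_appended_sig(const void *data, size_t *len,
			      const void **sig, size_t *sig_len)
{
	size_t magic_len = sizeof(POLICY_SIG_MAGIC) - 1;
	const unsigned char *p = data;
	u32 n;

	if (*len < magic_len + 4 ||
	    memcmp(p + *len - magic_len, POLICY_SIG_MAGIC, magic_len))
		return -EBADMSG;		/* not a signed blob */

	memcpy(&n, p + *len - magic_len - 4, 4);
	n = be32_to_cpu(n);
	if (n > *len - magic_len - 4)
		return -EBADMSG;

	*sig = p + *len - magic_len - 4 - n;
	*sig_len = n;
	*len -= magic_len + 4 + n;	/* verify *sig, then parse the payload */
	return 0;
}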


Re: Linux Firmware Signing

2015-08-28 Thread Luis R. Rodriguez
On Fri, Aug 28, 2015 at 11:20:10AM +, Roberts, William C wrote:
> > -Original Message-
> > From: Paul Moore [mailto:p...@paul-moore.com]
> > 
> > While I question the usefulness of a SELinux policy signature in the 
> > general case,
> > there are some situations where it might make sense, e.g. embedded systems
> > with no post-build customizations, and I'm not opposed to added a signature 
> > to
> > the policy file for that reason.
> 
> Even triggered updates make sense, since you can at least have some form of 
> trust
> of where that binary policy came from. 

The problem that Paul describes stems from the requirement of such trust
needing post-boot / install / setup keys.  It may be possible for an
environment to exist where there's a food chain that enables some CA's to
easily hand out keys to each install, but that seems impractical. This is why
Paul had mentioned the Machine Owner Key (MOK) thing.

> > However, I haven't given any serious thought yet to how we would structure 
> > the
> > new blob format so as to support both signed/unsigned policies as well as
> > existing policies which predate any PKCS #7 changes.
> > 
> 
> Huh, not following? Perhaps, I am not following what your laying down here.
> 
>  Right now there is no signing on the selinux policy file. We should be able
> to just use the firmware signing api's as is (I have not looked on linux-next 
> yet)

Nitpick: its the system_verify_data() API, the fw signing stuff will make use
of this API as well.

> to unpack the blob.

Nitpick: to verify the data.

> In the case of falling back to loading an unsigned blob, we could do it
> kernel-module style. If it fails due to invalid format, fall back to
> attempting to read it as a straight policy file. If it fails on signature
> verification, we could still unpack it and pass it on. So you would want
> to be able to control whether a failure of the signed unpacking from
> PKCS#7 is fatal or not.
> 
> We would also likely want to convey this state, and the ability to change
> this setting, to userspace in a controlled fashion via selinuxfs. I.e. I
> would want to know that I can load modules without valid signatures, and
> whether my current policy file is in fact valid or invalid.

Sure that would work. Its how the module stuff can work in permissive mode.
We'd embrace the same practice for permissive fw signing as well.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/2] x86: PCI bus specific MSI operations

2015-08-28 Thread Jiang Liu
On 2015/8/29 0:54, Thomas Gleixner wrote:
> On Thu, 27 Aug 2015, Keith Busch wrote:
> 
>> This patch adds struct x86_msi_ops to x86's PCI sysdata. This gives a
>> host bridge driver the option to provide alternate MSI Data Register
>> and MSI-X Table Entry programming for devices in PCI domains that do
>> not subscribe to usual "IOAPIC" format.
> 
> I'm not too fond about more ad hoc indirection and special casing. We
> should be able to handle this with hierarchical irq domains. Jiang
> might have an idea how to do that for your case.
Hi Thomas and Keith,
I noticed this patch set yesterday, but I am still investigating the
best way to handle this. Basically I think we should build
per-domain/per-bus/per-device PCI MSI irqdomains, just like what ARM
has done. That will give us a clear picture. But I need more
information about the hardware topology to correctly build up the
hierarchical irqdomain, especially the relationship between the
embedded host bridge and the IOMMU units.
Keith, could you please help by providing some documentation with
hardware details?
Thanks!
Gerry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 6/8] net: thunderx: Rework interrupt handler

2015-08-28 Thread Alexey Klimov
Hi Aleksey,

let me add few minor points below.

On Fri, Aug 28, 2015 at 5:59 PM, Aleksey Makarov
 wrote:
> From: Sunil Goutham 
>
> Rework interrupt handler to avoid checking IRQ affinity of
> CQ interrupts. Now separate handlers are registered for each IRQ
> including RBDR. Also register interrupt handlers for only those
> which are being used.

Also add nicvf_dump_intr_status() and use it in irq handler(s).
I suggest checking and extending the commit message and thinking about the
commit name. Maybe "net: thunderx: rework interrupt handling and
registration" at least?

Please also consider the possibility of splitting this patch into a few patches.

>
> Signed-off-by: Sunil Goutham 
> Signed-off-by: Aleksey Makarov 
> ---
>  drivers/net/ethernet/cavium/thunder/nic.h  |   1 +
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 172 
> -
>  drivers/net/ethernet/cavium/thunder/nicvf_queues.h |   2 +
>  3 files changed, 103 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
> b/drivers/net/ethernet/cavium/thunder/nic.h
> index a83f567..89b997e 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -135,6 +135,7 @@
>  #defineNICVF_TX_TIMEOUT(50 * HZ)
>
>  struct nicvf_cq_poll {
> +   struct  nicvf *nicvf;
> u8  cq_idx; /* Completion queue index */
> struct  napi_struct napi;
>  };
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
> b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> index de51828..2198f61 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> @@ -653,11 +653,20 @@ static void nicvf_handle_qs_err(unsigned long data)
> nicvf_enable_intr(nic, NICVF_INTR_QS_ERR, 0);
>  }
>
> +static inline void nicvf_dump_intr_status(struct nicvf *nic)
> +{
> +   if (netif_msg_intr(nic))
> +   netdev_info(nic->netdev, "%s: interrupt status 0x%llx\n",
> +   nic->netdev->name, nicvf_reg_read(nic, 
> NIC_VF_INT));
> +}

Please check if you really need to mark this 'inline' here.

>  static irqreturn_t nicvf_misc_intr_handler(int irq, void *nicvf_irq)
>  {
> struct nicvf *nic = (struct nicvf *)nicvf_irq;
> u64 intr;
>
> +   nicvf_dump_intr_status(nic);
> +
> intr = nicvf_reg_read(nic, NIC_VF_INT);
> /* Check for spurious interrupt */
> if (!(intr & NICVF_INTR_MBOX_MASK))
> @@ -668,59 +677,58 @@ static irqreturn_t nicvf_misc_intr_handler(int irq, 
> void *nicvf_irq)
> return IRQ_HANDLED;
>  }
>
> -static irqreturn_t nicvf_intr_handler(int irq, void *nicvf_irq)
> +static irqreturn_t nicvf_intr_handler(int irq, void *cq_irq)
> +{
> +   struct nicvf_cq_poll *cq_poll = (struct nicvf_cq_poll *)cq_irq;
> +   struct nicvf *nic = cq_poll->nicvf;
> +   int qidx = cq_poll->cq_idx;
> +
> +   nicvf_dump_intr_status(nic);
> +
> +   /* Disable interrupts */
> +   nicvf_disable_intr(nic, NICVF_INTR_CQ, qidx);
> +
> +   /* Schedule NAPI */
> +   napi_schedule(&cq_poll->napi);
> +
> +   /* Clear interrupt */
> +   nicvf_clear_intr(nic, NICVF_INTR_CQ, qidx);
> +
> +   return IRQ_HANDLED;
> +}

You're not considering spurious irqs in the new irq handlers here and
below, and you schedule napi/tasklets unconditionally. Is that correct?
To me it looks like the previous implementation relied on reading
NIC_VF_INT to understand the irq type and what actions should be
performed; it generally allowed for the case that no interrupt had
actually occurred.
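
Something along these lines is presumably what is wanted (a sketch built
only from the helpers already visible in this patch, not a tested fix):

static irqreturn_t nicvf_intr_handler(int irq, void *cq_irq)
{
	struct nicvf_cq_poll *cq_poll = (struct nicvf_cq_poll *)cq_irq;
	struct nicvf *nic = cq_poll->nicvf;
	u64 intr;

	/* Bail out without touching NAPI if this CQ has nothing pending. */
	intr = nicvf_reg_read(nic, NIC_VF_INT);
	if (!(intr & (1ULL << (NICVF_INTR_CQ_SHIFT + cq_poll->cq_idx))))
		return IRQ_NONE;

	/* ... otherwise proceed as in the patch: disable the interrupt,
	 * schedule NAPI and clear the interrupt bit. */
	return IRQ_HANDLED;
}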


> +
> +static irqreturn_t nicvf_rbdr_intr_handler(int irq, void *nicvf_irq)
>  {
> -   u64 qidx, intr, clear_intr = 0;
> -   u64 cq_intr, rbdr_intr, qs_err_intr;
> struct nicvf *nic = (struct nicvf *)nicvf_irq;
> -   struct queue_set *qs = nic->qs;
> -   struct nicvf_cq_poll *cq_poll = NULL;
> +   u8 qidx;
>
> -   intr = nicvf_reg_read(nic, NIC_VF_INT);
> -   if (netif_msg_intr(nic))
> -   netdev_info(nic->netdev, "%s: interrupt status 0x%llx\n",
> -   nic->netdev->name, intr);
> -
> -   qs_err_intr = intr & NICVF_INTR_QS_ERR_MASK;
> -   if (qs_err_intr) {
> -   /* Disable Qset err interrupt and schedule softirq */
> -   nicvf_disable_intr(nic, NICVF_INTR_QS_ERR, 0);
> -   tasklet_hi_schedule(&nic->qs_err_task);
> -   clear_intr |= qs_err_intr;
> -   }
>
> -   /* Disable interrupts and start polling */
> -   cq_intr = (intr & NICVF_INTR_CQ_MASK) >> NICVF_INTR_CQ_SHIFT;
> -   for (qidx = 0; qidx < qs->cq_cnt; qidx++) {
> -   if (!(cq_intr & (1 << qidx)))
> -   continue;
> -   if (!nicvf_is_intr_enabled(nic, NICVF_INTR_CQ, qidx))
> +   nicvf_dump_intr_status(nic);
> +
> +   /* Disable RBDR interrupt and schedule softirq */
> +   for (qidx = 0; qidx < 

Re: [PATCH net-next 7/8] net: thunderx: Support for upto 96 queues for a VF

2015-08-28 Thread Alexey Klimov
On Fri, Aug 28, 2015 at 5:59 PM, Aleksey Makarov
 wrote:
> From: Sunil Goutham 
>
> This patch adds support for handling multiple qsets assigned to a
> single VF. There by increasing no of queues from earlier 8 to max
> no of CPUs in the system i.e 48 queues on a single node and 96 on
> dual node system. User doesn't have option to assign which Qsets/VFs
>  to be merged. Upon request from VF, PF assigns next free Qsets as
> secondary qsets. To maintain current behavior no of queues is kept
> to 8 by default which can be increased via ethtool.
>
> If user wants to unbind NICVF driver from a secondary Qset then it
> should be done after tearing down primary VF's interface.
>
> Signed-off-by: Sunil Goutham 
> Signed-off-by: Aleksey Makarov 
> Signed-off-by: Robert Richter 
> ---
>  drivers/net/ethernet/cavium/thunder/nic.h  |  42 -
>  drivers/net/ethernet/cavium/thunder/nic_main.c | 173 +++--
>  .../net/ethernet/cavium/thunder/nicvf_ethtool.c| 136 +
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 210 
> +++--
>  drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  32 +++-
>  5 files changed, 507 insertions(+), 86 deletions(-)
>
> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
> b/drivers/net/ethernet/cavium/thunder/nic.h
> index 89b997e..35b2ee1 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -258,13 +258,23 @@ struct nicvf_drv_stats {
>  };
>
>  struct nicvf {
> +   struct nicvf*pnicvf;
> struct net_device   *netdev;
> struct pci_dev  *pdev;
> u8  vf_id;
> u8  node;
> -   u8  tns_mode;
> +   u8  tns_mode:1;
> +   u8  sqs_mode:1;
> u16 mtu;
> struct queue_set*qs;
> +#defineMAX_SQS_PER_VF_SINGLE_NODE  5
> +#defineMAX_SQS_PER_VF  11
> +   u8  sqs_id;
> +   u8  sqs_count; /* Secondary Qset count */
> +   struct nicvf*snicvf[MAX_SQS_PER_VF];
> +   u8  rx_queues;
> +   u8  tx_queues;
> +   u8  max_queues;
> void __iomem*reg_base;
> boollink_up;
> u8  duplex;
> @@ -330,14 +340,19 @@ struct nicvf {
>  #defineNIC_MBOX_MSG_RQ_SW_SYNC 0x0F/* Flush inflight 
> pkts to RQ */
>  #defineNIC_MBOX_MSG_BGX_STATS  0x10/* Get stats from BGX 
> */
>  #defineNIC_MBOX_MSG_BGX_LINK_CHANGE0x11/* BGX:LMAC link 
> status */
> -#define NIC_MBOX_MSG_CFG_DONE  0x12/* VF configuration done */
> -#define NIC_MBOX_MSG_SHUTDOWN  0x13/* VF is being shutdown */
> +#defineNIC_MBOX_MSG_ALLOC_SQS  0x12/* Allocate secondary 
> Qset */
> +#defineNIC_MBOX_MSG_NICVF_PTR  0x13/* Send nicvf ptr to 
> PF */
> +#defineNIC_MBOX_MSG_PNICVF_PTR 0x14/* Get primary qset 
> nicvf ptr */
> +#defineNIC_MBOX_MSG_SNICVF_PTR 0x15/* Send sqet nicvf 
> ptr to PVF */
> +#defineNIC_MBOX_MSG_CFG_DONE   0xF0/* VF configuration 
> done */
> +#defineNIC_MBOX_MSG_SHUTDOWN   0xF1/* VF is being 
> shutdown */
>
>  struct nic_cfg_msg {
> u8msg;
> u8vf_id;
> -   u8tns_mode;
> u8node_id;
> +   u8tns_mode:1;
> +   u8sqs_mode:1;
> u8mac_addr[ETH_ALEN];
>  };
>
> @@ -345,6 +360,7 @@ struct nic_cfg_msg {
>  struct qs_cfg_msg {
> u8msg;
> u8num;
> +   u8sqs_count;
> u64   cfg;
>  };
>
> @@ -361,6 +377,7 @@ struct sq_cfg_msg {
> u8msg;
> u8qs_num;
> u8sq_num;
> +   bool  sqs_mode;
> u64   cfg;
>  };
>
> @@ -420,6 +437,21 @@ struct bgx_link_status {
> u32   speed;
>  };
>
> +/* Get Extra Qset IDs */
> +struct sqs_alloc {
> +   u8msg;
> +   u8vf_id;
> +   u8qs_count;
> +};
> +
> +struct nicvf_ptr {
> +   u8msg;
> +   u8vf_id;
> +   bool  sqs_mode;
> +   u8sqs_id;
> +   u64   nicvf;
> +};
> +
>  /* 128 bit shared memory between PF and each VF */
>  union nic_mbx {
> struct { u8 msg; }  msg;
> @@ -434,6 +466,8 @@ union nic_mbx {
> struct rss_cfg_msg  rss_cfg;
> struct bgx_stats_msgbgx_stats;
> struct bgx_link_status  link_status;
> +   struct sqs_allocsqs_alloc;
> +   struct nicvf_ptrnicvf;
>  };
>
>  #define NIC_NODE_ID_MASK   0x03
> diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c 
> b/drivers/net/ethernet/cavium/thunder/nic_main.c
> index 7dfec4a..51f3048 100644
> --- 

Re: Persistent Reservation API V3

2015-08-28 Thread Jeremy Linton
Hello,
So, looking at this, I don't see how it supports the algorithm I've been
using for years. For that algorithm to successfully migrate PRs across
multiple paths
on a single machine without affecting other possible users (who may legitimately
have PR'ed the same device) I need PR_IN SA 1, READ RESERVATIONS to assure the
current node owns the reservation before attempting to preempt it on another
path. This can also assure that the device hasn't been reserved with a legacy
reservation.
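
(For reference, "PR_IN SA 1" is the SCSI PERSISTENT RESERVE IN command with
the READ RESERVATION service action; issued today through the SG_IO
pass-through, it looks roughly like the sketch below, which is exactly the
kind of hand-crafted command this series otherwise aims to make unnecessary.)

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* PERSISTENT RESERVE IN (0x5e), service action 0x01 = READ RESERVATION */
static int pr_read_reservation(int fd, unsigned char *buf, unsigned short len)
{
	unsigned char cdb[10] = { 0x5e, 0x01 };
	unsigned char sense[32];
	struct sg_io_hdr io;

	cdb[7] = len >> 8;		/* allocation length, big endian */
	cdb[8] = len & 0xff;

	memset(&io, 0, sizeof(io));
	io.interface_id = 'S';
	io.cmdp = cdb;
	io.cmd_len = sizeof(cdb);
	io.dxfer_direction = SG_DXFER_FROM_DEV;
	io.dxferp = buf;
	io.dxfer_len = len;
	io.sbp = sense;
	io.mx_sb_len = sizeof(sense);
	io.timeout = 10000;		/* milliseconds */

	if (ioctl(fd, SG_IO, &io) < 0 || io.status)
		return -1;	/* caller then parses the generation count,
				 * reservation key and scope/type */
	return 0;
}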

So, this leads me to two more general questions. The first is why isn't 
the PR
API simply exported to filesystems as a general reserve/release so that the PR
happens during mount/dismount. Then DM and friends can be setup to transparently
migrate or share the reservation, rather than depending on userspace to handle
these operations...
Also, it seems to me the use of CLEAR is extremely dangerous in any 
environment
where actual arbitration or sharing of the resource is taking place.


thanks,

On 8/26/2015 11:56 AM, Christoph Hellwig wrote:
> This series adds support for a simplified Persistent Reservation API
> to the block layer.  The intent is that both in-kernel and userspace
> consumers can use the API instead of having to hand craft SCSI or NVMe
> command through the various pass through interfaces.  It also adds
> DM support as getting reservations through dm-multipath is a major
> pain with the current scheme.
> 
> NVMe support currently isn't included as I don't have a multihost
> NVMe setup to test on, but Keith offered to test it and I'll have
> a patch for it shortly.
> 
> The ioctl API is documented in Documentation/block/pr.txt, but to
> fully understand the concept you'll have to read up the SPC spec,
> PRs are too complicated that trying to rephrase them into different
> terminology is just going to create confusion.
> 
> Note that Mike wants to include the DM patches so through the DM
> tree, so they are only included for reference.
> 
> I also have a set of simple test tools available at:
> 
>   git://git.infradead.org/users/hch/pr-tests.git
> 
> Changes since V2:
>   - added an ignore flag to the reserve opertion as well, and redid
> the ioctl API to have general flags fields
>   - rebased on top of the latest block layer tree updates
> Changes since V1:
>   - rename DM ->ioctl to ->prepare_ioctl
>   - rename dm_get_ioctl_table to dm_get_live_table_for_ioctl
>   - merge two DM patches into one
>   - various spelling fixes
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> .
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Alexei Starovoitov

On 8/28/15 6:19 PM, Wangnan (F) wrote:

For me, I use bpf_output_trace_data() to output information like PMU count
value. Perf is the only receiver, so global collector is perfect. Could you
please describe your usecase in more detail?


there is a special receiver in user space that only wants the data from
the bpf program that it loaded. It shouldn't conflict with any other
processes. Like when it's running, I still should be able to use perf
for other performance analysis. There is no way to share single
bpf:bpf_output_data event, since these user processes are completely
independent.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: futex atomic vs ordering constraints

2015-08-28 Thread Davidlohr Bueso
On Wed, 2015-08-26 at 20:16 +0200, Peter Zijlstra wrote:
> Of course, if anything else prior to futex_atomic_op_inuser() implies an
> (RCsc) RELEASE or stronger the primitive can do without providing
> anything itself.
> 
> This turns out to be the case, a successful get_futex_key() implies a
> full memory barrier; recent: 1d0dcb3ad9d3 ("futex: Implement lockless
> wakeups").

Hmm while it is certainly true that get_futex_key() implies a full
barrier, I don't see why you're referring to the recent wake_q stuff;
where the futex "wakeup" is done much after futex_atomic_op_inuser. Yes,
that too implies a barrier, but not wrt get_futex_key() -- which
fundamentally relies on get_futex_key_refs().

> 
> And since get_futex_key() is fundamental to doing _anything_ with a
> futex, I think its semi-sane to rely on this.

Right, and it wouldn't be the first thing that relies on get_futex_key()
implying a full barrier.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/4] perf tools: Use the new ability of eBPF programs to access hardware PMU counter

2015-08-28 Thread Alexei Starovoitov

On 8/27/15 3:42 AM, Kaixu Xia wrote:

An example is pasted at the bottom of this cover letter. In that example,
we can get the cpu_cycles and exception taken in sys_write.

  $ cat /sys/kernel/debug/tracing/trace_pipe
  $ ./perf record --event perf-bpf.o ls
...
  cat-1653  [003] d..1 88174.613854: : ente:  CPU-3 cyc:48746333
exc:84
  cat-1653  [003] d..2 88174.613861: : exit:  CPU-3 cyc:48756041
exc:84


nice. probably more complex example that computes the delta of the pmu
counters on the kernel side would be even more interesting.
Do you think you can extend 'perf stat' with a flag that does
stats collection for a given kernel or user function instead of the
whole process ?
Then we can use perf record/report to figure out hot functions and
follow with 'perf stat -f my_hot_func my_process' to drill into
particular function stats.
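
A kernel-side delta computation along those lines could look roughly like the
restricted-C program below (a sketch in the style of samples/bpf, assuming the
bpf_perf_event_read() helper and the perf-event array map type from the
kernel-side series; untested):

struct bpf_map_def SEC("maps") counters = {
	.type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size    = sizeof(int),
	.value_size  = sizeof(u32),
	.max_entries = 32,			/* one perf fd per CPU */
};

struct bpf_map_def SEC("maps") entry_val = {
	.type        = BPF_MAP_TYPE_ARRAY,
	.key_size    = sizeof(int),
	.value_size  = sizeof(u64),
	.max_entries = 32,
};

SEC("kprobe/sys_write")
int on_enter(struct pt_regs *ctx)
{
	int cpu = bpf_get_smp_processor_id();
	u64 val = bpf_perf_event_read(&counters, cpu);

	bpf_map_update_elem(&entry_val, &cpu, &val, BPF_ANY);
	return 0;
}

SEC("kretprobe/sys_write")
int on_exit(struct pt_regs *ctx)
{
	int cpu = bpf_get_smp_processor_id();
	u64 *start = bpf_map_lookup_elem(&entry_val, &cpu);

	if (start) {
		u64 delta = bpf_perf_event_read(&counters, cpu) - *start;
		char fmt[] = "cycles delta on CPU %d: %llu\n";

		bpf_trace_printk(fmt, sizeof(fmt), cpu, delta);
	}
	return 0;
}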

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 1/2] Documentation: dt: add Broadcom BCM7038 PWM controller binding

2015-08-28 Thread Florian Fainelli
Add a binding documentation for the Broadcom BCM7038 PWM controller found in
BCM7xxx chips.

Signed-off-by: Florian Fainelli 
---
Changes in v3:

- list 'clocks' property as mandatory

 .../devicetree/bindings/pwm/brcm,bcm7038-pwm.txt | 20 
 1 file changed, 20 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pwm/brcm,bcm7038-pwm.txt

diff --git a/Documentation/devicetree/bindings/pwm/brcm,bcm7038-pwm.txt 
b/Documentation/devicetree/bindings/pwm/brcm,bcm7038-pwm.txt
new file mode 100644
index ..d9254a6da5ed
--- /dev/null
+++ b/Documentation/devicetree/bindings/pwm/brcm,bcm7038-pwm.txt
@@ -0,0 +1,20 @@
+Broadcom BCM7038 PWM controller (BCM7xxx Set Top Box PWM controller)
+
+Required properties:
+
+- compatible: must be "brcm,bcm7038-pwm"
+- reg: physical base address and length for this controller
+- #pwm-cells: should be 2. See pwm.txt in this directory for a description
+  of the cells format
+- clocks: a phandle to the reference clock for this block which is fed through
+  its internal variable clock frequency generator
+
+
+Example:
+
+   pwm: pwm@f0408000 {
+   compatible = "brcm,bcm7038-pwm";
+   reg = <0xf0408000 0x28>;
+   #pwm-cells = <2>;
+   clocks = <_fixed>;
+   };
-- 
2.1.0



[PATCH v3 2/2] pwm: Add Broadcom BCM7038 PWM controller support

2015-08-28 Thread Florian Fainelli
Add support for the BCM7038-style PWM controller found in all BCM7xxx STB SoCs.
This controller is hardcoded to 2 channels per instance, and cascades a
variable frequency generator on top of a fixed frequency generator, which
offers a range from a 148ns period all the way up to ~622ms periods.
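
As a sanity check on the 148ns..~622ms figures, here is a small
back-of-the-envelope program. It assumes a 27 MHz input clock (the real rate
comes from the 'clocks' phandle) and one plausible reading of the register
limits: variable frequency Fv = Fosc * cword / 2^16 with cword between
CONST_VAR_F_MIN and CONST_VAR_F_MAX, and between 2 and 256 Fv ticks per
period. Both are assumptions about the hardware model, not statements taken
from the patch:

#include <stdio.h>

int main(void)
{
	const double fosc = 27e6;			/* assumed input clock */
	const double fv_max = fosc * 32768 / 65536;	/* cword = CONST_VAR_F_MAX */
	const double fv_min = fosc * 1 / 65536;		/* cword = CONST_VAR_F_MIN */

	printf("shortest period: %.1f ns\n", 2.0 / fv_max * 1e9);   /* ~148 ns */
	printf("longest period:  %.1f ms\n", 256.0 / fv_min * 1e3); /* ~621 ms */
	return 0;
}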

Signed-off-by: Florian Fainelli 
---
Changes in v3:

- make clock mandatory
- removed a remaining div64_u64 use

Changes in v2:

- properly format comments
- utilize do_div instead of div64_u64
- avoid using a "done" variable for the while loop
- utilize a parameterized register accessor
- remove a bunch of unnecessary assignments
- provide a module author
- update depends to build on BMIPS_GENERIC (the other user)
- removed artificial padding
- removed used only once variable: dn
- utilize devm_ioremap_resource
- do not print success message
- removed THIS_MODULE from platform_driver structure

 drivers/pwm/Kconfig   |  10 ++
 drivers/pwm/Makefile  |   1 +
 drivers/pwm/pwm-brcmstb.c | 324 ++
 3 files changed, 335 insertions(+)
 create mode 100644 drivers/pwm/pwm-brcmstb.c

diff --git a/drivers/pwm/Kconfig b/drivers/pwm/Kconfig
index b1541f40fd8d..363c22b22071 100644
--- a/drivers/pwm/Kconfig
+++ b/drivers/pwm/Kconfig
@@ -111,6 +111,16 @@ config PWM_CLPS711X
  To compile this driver as a module, choose M here: the module
  will be called pwm-clps711x.
 
+config PWM_BRCMSTB
+   tristate "Broadcom STB PWM support"
+   depends on ARCH_BRCMSTB || BMIPS_GENERIC
+   help
+ Generic PWM framework driver for the Broadcom Set-top-Box
+ SoCs (BCM7xxx).
+
+ To compile this driver as a module, choose M Here: the module
+ will be called pwm-brcmstb.c.
+
 config PWM_EP93XX
tristate "Cirrus Logic EP93xx PWM support"
depends on ARCH_EP93XX
diff --git a/drivers/pwm/Makefile b/drivers/pwm/Makefile
index ec50eb5b5a8f..dc7b1b82d47e 100644
--- a/drivers/pwm/Makefile
+++ b/drivers/pwm/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_PWM_ATMEL_TCB) += pwm-atmel-tcb.o
 obj-$(CONFIG_PWM_BCM_KONA) += pwm-bcm-kona.o
 obj-$(CONFIG_PWM_BCM2835)  += pwm-bcm2835.o
 obj-$(CONFIG_PWM_BFIN) += pwm-bfin.o
+obj-$(CONFIG_PWM_BRCMSTB)  += pwm-brcmstb.o
 obj-$(CONFIG_PWM_CLPS711X) += pwm-clps711x.o
 obj-$(CONFIG_PWM_EP93XX)   += pwm-ep93xx.o
 obj-$(CONFIG_PWM_FSL_FTM)  += pwm-fsl-ftm.o
diff --git a/drivers/pwm/pwm-brcmstb.c b/drivers/pwm/pwm-brcmstb.c
new file mode 100644
index ..9ea73755f281
--- /dev/null
+++ b/drivers/pwm/pwm-brcmstb.c
@@ -0,0 +1,324 @@
+/*
+ * Broadcom BCM7038 PWM driver
+ * Author: Florian Fainelli
+ *
+ * Copyright (C) 2015 Broadcom Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PWM_CTRL   0x00
+#define  CTRL_START		BIT(0)
+#define  CTRL_OEB		BIT(1)
+#define  CTRL_FORCE_HIGH	BIT(2)
+#define  CTRL_OPENDRAIN		BIT(3)
+#define  CTRL_CHAN_OFFS		4
+
+#define PWM_CTRL2  0x04
+#define  CTRL2_OUT_SELECT  BIT(0)
+
+#define PWM_CWORD_MSB  0x08
+#define PWM_CWORD_LSB  0x0C
+
+#define PWM_CH_SIZE		0x8
+
+/* Number of bits for the CWORD value */
+#define CWORD_BIT_SIZE 16
+
+/*
+ * Maximum control word value allowed when variable-frequency PWM is used as a
+ * clock for the constant-frequency PMW.
+ */
+#define CONST_VAR_F_MAX		32768
+#define CONST_VAR_F_MIN		1
+
+#define PWM_ON(ch) (0x18 + ((ch) * PWM_CH_SIZE))
+#define  PWM_ON_MIN		1
+#define PWM_PERIOD(ch) (0x1C + ((ch) * PWM_CH_SIZE))
+#define  PWM_PERIOD_MIN		0
+
+#define PWM_ON_PERIOD_MAX  0xff
+
+struct brcmstb_pwm_dev {
+   void __iomem *base;
+   struct clk *clk;
+   struct pwm_chip chip;
+};
+
+static inline u32 pwm_readl(struct brcmstb_pwm_dev *p, u32 off)
+{
+   if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+   return __raw_readl(p->base + off);
+   else
+   return readl_relaxed(p->base + off);
+}
+
+static inline void pwm_writel(struct brcmstb_pwm_dev *p, u32 val, u32 off)
+{
+   if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+   __raw_writel(val, p->base + off);
+   else
+   writel_relaxed(val, p->base + off);
+}

[PATCH v3 0/2] pwm: Broadcom BCM7038 PWM controller (v3)

2015-08-28 Thread Florian Fainelli
Hi,

This patch series adds PWM support for the Broadcom BCM7xxx
chips, which feature one or more PWM controllers capable of
outputting periods from 148ns to ~622ms using a combination of
variable and fixed frequency settings.

The controller does not support setting a polarity.

This is based on Thierry's pwm/next branch.

Florian Fainelli (2):
  Documentation: dt: add Broadcom BCM7038 PWM controller binding
  pwm: Add Broadcom BCM7038 PWM controller support

 .../devicetree/bindings/pwm/brcm,bcm7038-pwm.txt   |  20 ++
 drivers/pwm/Kconfig|  10 +
 drivers/pwm/Makefile   |   1 +
 drivers/pwm/pwm-brcmstb.c  | 324 +
 4 files changed, 355 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pwm/brcm,bcm7038-pwm.txt
 create mode 100644 drivers/pwm/pwm-brcmstb.c

-- 
2.1.0



[PATCH] nfit: Fix undefined mmio_flush_range on x86_32

2015-08-28 Thread Toshi Kani
The following compile error was observed on x86_32 since nfit.c
relies on  to include , which only
works when CONFIG_ARCH_HAS_PMEM_API is set on x86_64.

  drivers/acpi/nfit.c:1085:5: error: implicit declaration of
  function 'mmio_flush_range' [-Werror=implicit-function-declaration]

Change nfit.c to include  directly for now.

Signed-off-by: Toshi Kani 
Cc: Dan Williams 
Cc: Ross Zwisler 
---
Apply on top of libnvdimm-for-next of the nvdimm tree.
This is a temporary fix and please feel free to replace it with
a better solution.
---
 drivers/acpi/nfit.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 82d07e8..f61e69f 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "nfit.h"
 
 /*


Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Wangnan (F)



On 2015/8/29 8:45, Alexei Starovoitov wrote:

On 8/28/15 12:06 AM, Wang Nan wrote:

This patch adds a new trace event to establish infrastructure for bpf to
output data to perf. Userspace perf tools can detect and use this event
as using the existing tracepoint events.

New bpf trace event entry in debugfs:

  /sys/kernel/debug/tracing/events/bpf/bpf_output_data

Userspace perf tools detect the new tracepoint event as:

  bpf:bpf_output_data  [Tracepoint event]

Data in ring-buffer of perf events added to this event will be polled
out, sample types and other attributes can be adjusted to those events
directly without touching the original kprobe events.


Wang,
I have 2nd thoughts on this.
I've played with it, but global bpf:bpf_output_data event is limiting.
I'd like to use this bpf_output_trace_data() helper for tcp estats
gathering, but global collector will prevent other similar bpf programs
running in parallel.


So the current model works for you, but the problem is that all output goes
into one place, which prevents similar BPF programs from running in parallel
because the receiver is unable to tell which message was generated by whom.
So actually you want a publish-and-subscribe model, where a subscriber gets
messages only from the publishers it is interested in. Do I understand your
problem correctly?


So as a concept I think it's very useful, but we need a way to select
which ring-buffer to output data to.
proposal A:
Can we use ftrace:instances concept and make bpf_output_trace_data()
into that particular trace_pipe ?
proposal B:
bpf_perf_event_read() model is using nice concept of an array of
perf_events. Can we perf_event_open a 'new' event that can be mmaped
in user space and bpf_output_trace_data(idx, buf, buf_size) into it.
Where 'idx' will be an index of the FD from perf_event_open of such
new event?



I've also been thinking about adding an extra id parameter to
bpf_output_trace_data(), but it is for encoding the type of output data,
which is totally different from what you want.

For me, I use bpf_output_trace_data() to output information like PMU count
values. Perf is the only receiver, so a global collector is perfect. Could you
please describe your use case in more detail?

Thank you for using that feature!


Thanks!






[RFC] firmware: annotate thou shalt not request fw on init or probe

2015-08-28 Thread Luis R. Rodriguez
From: "Luis R. Rodriguez" 

We are phasing out the usermode helper from the kernel;
systemd already ripped out support for this a while ago, and
the only remaining valid user is the Dell rbu driver. The
firmware is now being read directly from the filesystem
by the kernel. What this means is that if you have a
device driver that needs firmware early, when it is
built into the kernel the firmware may not yet be
available. Folks building drivers that need firmware
early should either include it as part of the kernel or
stuff it into the initramfs used to boot.

In particular, since we are accessing the firmware directly,
folks cannot expect newly found firmware on a filesystem
after we switch away from an initramfs with pivot_root().
Supporting such dynamic changes to load drivers would
be possible but adds complexity; instead, let's document
the expectations properly and add a grammar rule to enable
folks to check / validate / police whether drivers are using
the request firmware API early on init or probe.

The SmPL rule used to check for the probe routine is
loose and is currently defined through a regexp that
can easily be extended to any other known bus probe
routine names.

Thou shalt not make firmware calls early on init or probe.

I spot only 2 offenders right now.

mcgrof@ergon ~/linux-next (git::20150805-pend-all)$ export 
COCCI=scripts/coccinelle/api/request_firmware.cocci
mcgrof@ergon ~/linux-next (git::20150805-pend-all)$ make coccicheck MODE=report

Please check for false positives in the output before submitting a patch.
When using "patch" mode, carefully review the patch before submitting
it.

./drivers/fmc/fmc-write-eeprom.c:136:7-23: ERROR: driver call request firmware 
call on its probe routine on line 136.
./drivers/tty/serial/rp2.c:796:6-29: ERROR: driver call request firmware call 
on its probe routine on line 796.
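
To illustrate the rule, a driver that needs firmware can defer the request to
the point where the device is first used instead of doing it in probe(). The
sketch below is illustrative only -- the foo_* names and structure are made
up -- while request_firmware()/release_firmware() are the real API:

#include <linux/cdev.h>
#include <linux/firmware.h>
#include <linux/fs.h>
#include <linux/kernel.h>

struct foo_dev {			/* hypothetical driver state */
	struct device *dev;
	struct cdev cdev;
	bool fw_loaded;
};

static int foo_open(struct inode *inode, struct file *file)
{
	struct foo_dev *foo = container_of(inode->i_cdev, struct foo_dev, cdev);
	const struct firmware *fw;
	int err;

	if (foo->fw_loaded)
		return 0;

	/* By the time userspace opens the device, /lib/firmware is mounted. */
	err = request_firmware(&fw, "foo/foo.bin", foo->dev);
	if (err)
		return err;

	/* ... push fw->data (fw->size bytes) into the hardware here ... */
	release_firmware(fw);
	foo->fw_loaded = true;
	return 0;
}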

Cc: Ming Lei 
Cc: Jonathan Corbet 
Cc: Julia Lawall 
Cc: Gilles Muller 
Cc: Nicolas Palix 
Cc: Michal Marek 
Cc: linux-...@vger.kernel.org
Cc: co...@systeme.lip6.fr
Cc: Alessandro Rubini 
Cc: Kevin Cernekee 
Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
Cc: linux-ser...@vger.kernel.org
Signed-off-by: Luis R. Rodriguez 
---
 Documentation/firmware_class/README   | 24 +--
 drivers/base/Kconfig  |  2 +-
 scripts/coccinelle/api/request_firmware.cocci | 90 +++
 3 files changed, 110 insertions(+), 6 deletions(-)
 create mode 100644 scripts/coccinelle/api/request_firmware.cocci

diff --git a/Documentation/firmware_class/README 
b/Documentation/firmware_class/README
index 71f86859d7d8..7c59f4d07f1d 100644
--- a/Documentation/firmware_class/README
+++ b/Documentation/firmware_class/README
@@ -33,7 +33,7 @@
than 256, user should pass 'firmware_class.path=$CUSTOMIZED_PATH'
if firmware_class is built in kernel(the general situation)
 
- 2), userspace:
+ 2), userspace: (DEPRECATED)
- /sys/class/firmware/xxx/{loading,data} appear.
- hotplug gets called with a firmware identifier in $FIRMWARE
  and the usual hotplug environment.
@@ -41,14 +41,14 @@
 
  3), kernel: Discard any previous partial load.
 
- 4), userspace:
+ 4), userspace: (DEPRECATED)
- hotplug: cat appropriate_firmware_image > \
/sys/class/firmware/xxx/data
 
  5), kernel: grows a buffer in PAGE_SIZE increments to hold the image as it
 comes in.
 
- 6), userspace:
+ 6), userspace: (DEPRECATED)
- hotplug: echo 0 > /sys/class/firmware/xxx/loading
 
  7), kernel: request_firmware() returns and the driver has the firmware
@@ -66,8 +66,8 @@
copy_fw_to_device(fw_entry->data, fw_entry->size);
 release_firmware(fw_entry);
 
- Sample/simple hotplug script:
- 
+ Sample/simple hotplug script: (DEPRECATED)
+ ==
 
# Both $DEVPATH and $FIRMWARE are already provided in the environment.
 
@@ -93,6 +93,20 @@
user contexts to request firmware asynchronously, but can't be called
in atomic contexts.
 
+Requirements:
+=
+
+You should avoid at all costs requesting firmware on both init and probe paths
+of your device driver. The reason for this is that, as the usermode helper
+functionality is being deprecated, we will only have direct firmware access.
+This means that any routines requesting firmware will need the filesystem which
+contains the firmware to be available and mounted. Device drivers' init and
+probe paths can be called early on, prior to /lib/firmware being available. If
+you might need access to firmware early, consider requiring your device driver
+to only be available as a module; this however has its own set of limitations.
+
+Folks building drivers that need firmware early should either include it as
+part of the kernel or stuff it into the initramfs used to boot.
 
  about in-kernel persistence:
  ---
diff --git a/drivers/base/Kconfig 

Re: Problems loading firmware using built-in drivers with kernels that use initramfs.

2015-08-28 Thread Luis R. Rodriguez
On Thu, Aug 27, 2015 at 08:55:13AM +0800, Ming Lei wrote:
> On Thu, Aug 27, 2015 at 2:07 AM, Linus Torvalds
>  wrote:
> > On Wed, Aug 26, 2015 at 1:06 AM, Liam Girdwood
> >  wrote:
> >>
> >> I think the options are to either :-
> >>
> >> 1) Don not support audio DSP drivers using topology data as built-in
> >> drivers. Audio is not really a critical system required for booting
> >> anyway.
> >
> > Yes, forcing it to be a module and not letting people compile it in by
> > mistake (and then not have it work) is an option.
> >
> > That said, there are situations where people don't want to use
> > modules. I used to eschew them for security reasons, for example - now
> > I instead just do a one-time temporary key. But others may have other
> > reasons to try to avoid modules.
> >
> >> 2) Create a default PCM for every driver that has topology data on the
> >> assumption that every sound card will at least 1 PCM. This PCM can then
> >> be re-configured when the FW is loaded.
> >
> > That would seem to be the better option if it is reasonably implementable.
> >
> > Of course, some kind of timer-based retry (limited *somehow*) of the
> > fw loading could work too, but smells really really hacky.
> 
> Yeah, years ago, we discussed to use -EPROBE_DEFER for the situation,
> which should be one kind of fix, but looks there were objections at that time.

That would still be a hack. I'll note there is also asynchronous probe support
now, but using that would also be a hack for this issue. We don't want to
encourage folks to go down that road. These would be hacks because you are
simply delaying the driver probe to a later time, and there is no guarantee
that any pivot_root() will already have completed by then to ensure your
driver's fw file is present. So it may work or it may not.

We should instead strive to be clear about expectations and requirements both
through documentation and when possible through APIs. I'll send out an RFC
which adds some grammar rules which can help us police this. I currently only
spot two drivers that require fixing.

  Luis


[PATCH] remoteproc: report error if resource table doesn't exist

2015-08-28 Thread Stefan Agner
Currently, if the resource table is completely missing in the
firmware, powering up the remoteproc fails silently. Add a message
indicating that the resource table is missing in the firmware.

Signed-off-by: Stefan Agner 
---
Hi Ohad,

I am currently working on remoteproc support for Freescale Vybrid's
secondary Cortex-M4 core. I stumbled upon this rough spot since the
little test firmware I am using now does not have a resource table
yet.

This also opens up a more general question: is it mandatory to have
a resource table in the firmware? Theoretically a remoteproc could
also work completely independently; all that would be used from the
remoteproc framework is the loading and starting capabilities...

--
Stefan

 drivers/remoteproc/remoteproc_core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/remoteproc_core.c 
b/drivers/remoteproc/remoteproc_core.c
index 8b3130f..29db8b3 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -823,8 +823,10 @@ static int rproc_fw_boot(struct rproc *rproc, const struct 
firmware *fw)
 
/* look for the resource table */
	table = rproc_find_rsc_table(rproc, fw, &tablesz);
-   if (!table)
+   if (!table) {
+   dev_err(dev, "Failed to find resource table\n");
goto clean_up;
+   }
 
/* Verify that resource table in loaded fw is unchanged */
if (rproc->table_csum != crc32(0, table, tablesz)) {
-- 
2.5.0



[PATCH] Input: cyttsp - Remove unnecessary MODULE_ALIAS()

2015-08-28 Thread Javier Martinez Canillas
The drivers have an I2C device ID table that is used to create the module
aliases, and "cyttsp" and "cyttsp4" are not supported I2C device IDs,
so these module aliases are never used.
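
For context, the aliases in question come from the ID table itself:
MODULE_DEVICE_TABLE() is what emits the "i2c:<id>" modaliases at build time,
so a hand-written MODULE_ALIAS() for an ID that is not in the table never
matches anything. A generic sketch with a hypothetical "foo-touch" ID (not
the real cyttsp IDs):

#include <linux/i2c.h>
#include <linux/module.h>

static const struct i2c_device_id foo_i2c_id[] = {
	{ "foo-touch", 0 },	/* hypothetical ID */
	{ }
};
/* This emits the "i2c:foo-touch" module alias; no MODULE_ALIAS() needed. */
MODULE_DEVICE_TABLE(i2c, foo_i2c_id);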

Signed-off-by: Javier Martinez Canillas 

---

 drivers/input/touchscreen/cyttsp4_i2c.c | 1 -
 drivers/input/touchscreen/cyttsp_i2c.c  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/input/touchscreen/cyttsp4_i2c.c 
b/drivers/input/touchscreen/cyttsp4_i2c.c
index 9a323dd915de..a9f95c7d3c00 100644
--- a/drivers/input/touchscreen/cyttsp4_i2c.c
+++ b/drivers/input/touchscreen/cyttsp4_i2c.c
@@ -86,4 +86,3 @@ module_i2c_driver(cyttsp4_i2c_driver);
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Cypress TrueTouch(R) Standard Product (TTSP) I2C driver");
 MODULE_AUTHOR("Cypress");
-MODULE_ALIAS("i2c:cyttsp4");
diff --git a/drivers/input/touchscreen/cyttsp_i2c.c 
b/drivers/input/touchscreen/cyttsp_i2c.c
index 519e2de2f8df..eee51b3f2e3f 100644
--- a/drivers/input/touchscreen/cyttsp_i2c.c
+++ b/drivers/input/touchscreen/cyttsp_i2c.c
@@ -86,4 +86,3 @@ module_i2c_driver(cyttsp_i2c_driver);
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Cypress TrueTouch(R) Standard Product (TTSP) I2C driver");
 MODULE_AUTHOR("Cypress");
-MODULE_ALIAS("i2c:cyttsp");
-- 
2.4.3



Re: [RFC v6 03/40] vfs: Add MAY_DELETE_SELF and MAY_DELETE_CHILD permission flags

2015-08-28 Thread Andy Lutomirski
On Aug 28, 2015 2:54 PM, "Andreas Grünbacher"
 wrote:
>
> 2015-08-28 23:36 GMT+02:00 Andy Lutomirski :
> > Silly question from the peanut gallery: is there any such thing as
> > opening an fd pointing at a file such that the "open file description"
> > (i.e. the struct file) captures the right to delete the file?
> >
> > IOW do we need FMODE_DELETE_SELF?
>
> When would that permission be checked, what syscall would you use to
> unlink an open file descriptor?

Good point.  It's remotely plausible that there's some trick with bind
mounts, it's likely possible to unlink a directory by fd (using
unlinkat), and you can *link* a file (with linkat or /proc), but
unlinkat doesn't appear to allow you to unlink a file by fd.

--Andy

>
> Thanks,
> Andreas
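
A quick userspace illustration of the unlinkat() point: the call always takes
a path relative to a directory fd (plus AT_REMOVEDIR for directories); there
is no form that unlinks an already-open fd. Paths and names below are
placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int dfd = open("/tmp/testdir", O_DIRECTORY | O_RDONLY);

	if (dfd < 0) {
		perror("open");
		return 1;
	}
	if (unlinkat(dfd, "some-file", 0))		/* unlink a file */
		perror("unlinkat file");
	if (unlinkat(dfd, "some-subdir", AT_REMOVEDIR))	/* rmdir-like */
		perror("unlinkat dir");
	close(dfd);
	return 0;
}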


Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Eric Dumazet
On Fri, 2015-08-28 at 17:35 -0700, Joe Perches wrote:

> That of course depends on what a "leaf" is and
> whether or not any other function call in the
> "leaf" consumes stack.
> 
> inet6_fill_ifla6_attrs does call other functions
> (none of which has the stack frame size of k.alloc)

Just define/use this automatic array in the damn leaf function.

That should not be hard, and maybe no one will complain and we can
work on more complex issues.
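
A sketch of that suggestion, with placeholder names and an assumed upper
bound rather than the actual net/ipv6 code: the scratch buffer becomes an
automatic array in the leaf function, so there is no kmalloc()/kfree() per
call and nothing that can be leaked on an error path:

#include <linux/percpu.h>
#include <linux/string.h>

#define MIB_MAX	64	/* assumed upper bound on the number of counters */

/* caller guarantees items <= MIB_MAX */
static void fill_stats(u64 *stats, const u64 __percpu *mib, int items)
{
	u64 buff[MIB_MAX];	/* automatic array in the leaf function */
	int cpu, i;

	memset(buff, 0, sizeof(buff));
	for_each_possible_cpu(cpu)
		for (i = 0; i < items; i++)
			buff[i] += per_cpu_ptr(mib, cpu)[i];
	memcpy(stats, buff, items * sizeof(u64));
}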




Re: [PATCH 06/32] perf tools: Enable passing bpf object file to --event

2015-08-28 Thread Wangnan (F)



On 2015/8/28 15:05, Wang Nan wrote:

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index ef5fde6..24c8b63 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -3090,6 +3090,7 @@ int cmd_trace(int argc, const char **argv, const char 
*prefix __maybe_unused)
if (trace.evlist->nr_entries > 0)
evlist__set_evsel_handler(trace.evlist, trace__event_handler);
  
+	/* trace__record calls cmd_record, which calls bpf__clear() */

if ((argc >= 1) && (strcmp(argv[0], "record") == 0))
		return trace__record(&trace, argc-1, &argv[1]);
  
@@ -3100,7 +3101,8 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)

if (!trace.trace_syscalls && !trace.trace_pgfaults &&
trace.evlist->nr_entries == 0 /* Was --events used? */) {
pr_err("Please specify something to trace.\n");
-   return -1;
+   err = -1;
+   goto out;
}
  
  	if (output_name != NULL) {

@@ -3159,5 +3161,6 @@ out_close:
if (output_name != NULL)
fclose(trace.output);
  out:
+   bpf__clear();
return err;
  }



Sorry, there is a silly mistake here: I missed

#include "bpf-loader.h"

at the head of builtin-trace.c. In my default environment builtin-trace.c
is not compiled, so I only found this problem today when I compiled it on
another machine. I'll fix it in my tree.


Arnaldo, since you suggested that Ingo pull directly, shall I make another pull
request with the whole 32 patches to fix that line?

Thank you.





Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Alexei Starovoitov

On 8/28/15 12:06 AM, Wang Nan wrote:

This patch adds a new trace event to establish infrastructure for bpf to
output data to perf. Userspace perf tools can detect and use this event
as using the existing tracepoint events.

New bpf trace event entry in debugfs:

  /sys/kernel/debug/tracing/events/bpf/bpf_output_data

Userspace perf tools detect the new tracepoint event as:

  bpf:bpf_output_data  [Tracepoint event]

Data in ring-buffer of perf events added to this event will be polled
out, sample types and other attributes can be adjusted to those events
directly without touching the original kprobe events.


Wang,
I have 2nd thoughts on this.
I've played with it, but global bpf:bpf_output_data event is limiting.
I'd like to use this bpf_output_trace_data() helper for tcp estats
gathering, but global collector will prevent other similar bpf programs
running in parallel.
So as a concept I think it's very useful, but we need a way to select
which ring-buffer to output data to.
proposal A:
Can we use ftrace:instances concept and make bpf_output_trace_data()
into that particular trace_pipe ?
proposal B:
bpf_perf_event_read() model is using nice concept of an array of
perf_events. Can we perf_event_open a 'new' event that can be mmaped
in user space and bpf_output_trace_data(idx, buf, buf_size) into it.
Where 'idx' will be an index of the FD from perf_event_open of such
new event?
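
To make proposal B concrete, here is a purely hypothetical sketch of what the
BPF side could look like. The three-argument bpf_output_trace_data() below
does not exist -- that is exactly what is being proposed -- while the
perf-event-array map type and the other helpers are real:

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"	/* samples/bpf conventions */

struct bpf_map_def SEC("maps") out_events = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,	/* FDs from perf_event_open() */
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = 64,
};

SEC("kprobe/sys_write")
int probe_sys_write(struct pt_regs *ctx)
{
	struct {
		u64 ts;
		u32 cpu;
	} sample = {
		.ts  = bpf_ktime_get_ns(),
		.cpu = bpf_get_smp_processor_id(),
	};
	int idx = sample.cpu;	/* pick this cpu's ring buffer */

	/* Hypothetical helper from proposal B: route the record to 'idx'. */
	bpf_output_trace_data(&out_events, idx, &sample, sizeof(sample));
	return 0;
}

char _license[] SEC("license") = "GPL";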

Thanks!



[PATCH] soc: qcom: smem: Handle big endian CPUs

2015-08-28 Thread Stephen Boyd
The contents of smem are always in little endian, but the smem
driver is not capable of being used on big endian CPUs. Annotate
the little endian data members and update the code to do the
proper byte swapping.

Cc: Bjorn Andersson 
Signed-off-by: Stephen Boyd 
---
 drivers/soc/qcom/smem.c | 230 +---
 1 file changed, 139 insertions(+), 91 deletions(-)

diff --git a/drivers/soc/qcom/smem.c b/drivers/soc/qcom/smem.c
index 4c347e7c5880..eba4cbaaacbe 100644
--- a/drivers/soc/qcom/smem.c
+++ b/drivers/soc/qcom/smem.c
@@ -92,9 +92,9 @@
   * @params:   parameters to the command
   */
 struct smem_proc_comm {
-   u32 command;
-   u32 status;
-   u32 params[2];
+   __le32 command;
+   __le32 status;
+   __le32 params[2];
 };
 
 /**
@@ -106,10 +106,10 @@ struct smem_proc_comm {
  * the default region. bits 0,1 are reserved
  */
 struct smem_global_entry {
-   u32 allocated;
-   u32 offset;
-   u32 size;
-   u32 aux_base; /* bits 1:0 reserved */
+   __le32 allocated;
+   __le32 offset;
+   __le32 size;
+   __le32 aux_base; /* bits 1:0 reserved */
 };
#define AUX_BASE_MASK  0xfffffffc
 
@@ -125,11 +125,11 @@ struct smem_global_entry {
  */
 struct smem_header {
struct smem_proc_comm proc_comm[4];
-   u32 version[32];
-   u32 initialized;
-   u32 free_offset;
-   u32 available;
-   u32 reserved;
+   __le32 version[32];
+   __le32 initialized;
+   __le32 free_offset;
+   __le32 available;
+   __le32 reserved;
struct smem_global_entry toc[SMEM_ITEM_COUNT];
 };
 
@@ -143,12 +143,12 @@ struct smem_header {
  * @reserved:  reserved entries for later use
  */
 struct smem_ptable_entry {
-   u32 offset;
-   u32 size;
-   u32 flags;
-   u16 host0;
-   u16 host1;
-   u32 reserved[8];
+   __le32 offset;
+   __le32 size;
+   __le32 flags;
+   __le16 host0;
+   __le16 host1;
+   __le32 reserved[8];
 };
 
 /**
@@ -160,13 +160,14 @@ struct smem_ptable_entry {
  * @entry: list of @smem_ptable_entry for the @num_entries partitions
  */
 struct smem_ptable {
-   u32 magic;
-   u32 version;
-   u32 num_entries;
-   u32 reserved[5];
+   u8 magic[4];
+   __le32 version;
+   __le32 num_entries;
+   __le32 reserved[5];
struct smem_ptable_entry entry[];
 };
-#define SMEM_PTABLE_MAGIC  0x434f5424 /* "$TOC" */
+
+static const u8 SMEM_PTABLE_MAGIC[] = { 0x24, 0x54, 0x4f, 0x43 }; /* "$TOC" */
 
 /**
  * struct smem_partition_header - header of the partitions
@@ -181,15 +182,16 @@ struct smem_ptable {
  * @reserved:  for now reserved entries
  */
 struct smem_partition_header {
-   u32 magic;
-   u16 host0;
-   u16 host1;
-   u32 size;
-   u32 offset_free_uncached;
-   u32 offset_free_cached;
-   u32 reserved[3];
+   u8 magic[4];
+   __le16 host0;
+   __le16 host1;
+   __le32 size;
+   __le32 offset_free_uncached;
+   __le32 offset_free_cached;
+   __le32 reserved[3];
 };
-#define SMEM_PART_MAGIC0x54525024 /* "$PRT" */
+
+static const u8 SMEM_PART_MAGIC[] = { 0x24, 0x50, 0x52, 0x54 };
 
 /**
  * struct smem_private_entry - header of each item in the private partition
@@ -201,12 +203,12 @@ struct smem_partition_header {
  * @reserved:  for now reserved entry
  */
 struct smem_private_entry {
-   u16 canary;
-   u16 item;
-   u32 size; /* includes padding bytes */
-   u16 padding_data;
-   u16 padding_hdr;
-   u32 reserved;
+   u16 canary; /* bytes are the same so no swapping needed */
+   __le16 item;
+   __le32 size; /* includes padding bytes */
+   __le16 padding_data;
+   __le16 padding_hdr;
+   __le32 reserved;
 };
 #define SMEM_PRIVATE_CANARY0xa5a5
 
@@ -242,6 +244,45 @@ struct qcom_smem {
struct smem_region regions[0];
 };
 
+static struct smem_private_entry *
+phdr_to_last_private_entry(struct smem_partition_header *phdr)
+{
+   void *p = phdr;
+
+   return p + le32_to_cpu(phdr->offset_free_uncached);
+}
+
+static void *phdr_to_first_cached_entry(struct smem_partition_header *phdr)
+{
+   void *p = phdr;
+
+   return p + le32_to_cpu(phdr->offset_free_cached);
+}
+
+static struct smem_private_entry *
+phdr_to_first_private_entry(struct smem_partition_header *phdr)
+{
+   void *p = phdr;
+
+   return p + sizeof(*phdr);
+}
+
+static struct smem_private_entry *
+private_entry_next(struct smem_private_entry *e)
+{
+   void *p = e;
+
+   return p + sizeof(*e) + le16_to_cpu(e->padding_hdr) +
+  le32_to_cpu(e->size);
+}
+
+static void *entry_to_item(struct smem_private_entry *e)
+{
+   void *p = e;
+
+   return p + sizeof(*e) + le16_to_cpu(e->padding_hdr);
+}
+
 /* Pointer to the one and only smem handle */
 static struct qcom_smem *__smem;
 
@@ -254,20 +295,20 @@ static int 

[PATCH] regulator: pfuze100: Remove unnecessary MODULE_ALIAS()

2015-08-28 Thread Javier Martinez Canillas
The driver has an I2C device ID table that is used to create the modaliases,
and "pfuze100-regulator" is not a supported I2C ID, so it is never used.

Signed-off-by: Javier Martinez Canillas 

---

 drivers/regulator/pfuze100-regulator.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/regulator/pfuze100-regulator.c 
b/drivers/regulator/pfuze100-regulator.c
index 2f66821d53cb..2a44e5dd9c2a 100644
--- a/drivers/regulator/pfuze100-regulator.c
+++ b/drivers/regulator/pfuze100-regulator.c
@@ -652,4 +652,3 @@ module_i2c_driver(pfuze_driver);
 MODULE_AUTHOR("Robin Gong ");
 MODULE_DESCRIPTION("Regulator Driver for Freescale PFUZE100/PFUZE200 PMIC");
 MODULE_LICENSE("GPL v2");
-MODULE_ALIAS("i2c:pfuze100-regulator");
-- 
2.4.3



Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once

2015-08-28 Thread Joe Perches
On Fri, 2015-08-28 at 17:06 -0700, Eric Dumazet wrote:
> On Fri, 2015-08-28 at 16:12 -0700, Joe Perches wrote: 
> > Generally true.  It's always difficult to know how much
> > stack has been consumed though and smaller stack frames
> > are generally better.
[] 
> So for a _leaf_ function, it is better to declare an automatic variable,
> as you in fact reduce max stack depth.

That of course depends on what a "leaf" is and
whether or not any other function call in the
"leaf" consumes stack.

inet6_fill_ifla6_attrs does call other functions
(none of which has the stack frame size of k.alloc)

> Not only it uses less kernel stack, it is also way faster, as you avoid
> kmalloc()/kfree() overhead and reuse probably already hot cache lines in
> kernel stack.

yup.

You'll also never neglect to free stack like the
original RFC patch neglected to free the alloc.

cheers, Joe



Re: [PATCH] Input: max8997_haptic - Fix module alias

2015-08-28 Thread Dmitry Torokhov
On Wed, Aug 26, 2015 at 02:19:41AM +0200, Javier Martinez Canillas wrote:
> The driver is a platform driver and not a I2C driver so its modalias
> should be exported with MODULE_DEVICE_TABLE(platform,...) instead of
> MODULE_DEVICE_TABLE(i2c,...).
> 
> Also, remove the unnecessary MODULE_ALIAS("platform:max8997-haptic")
> now that the correct module alias is created.
> 
> Signed-off-by: Javier Martinez Canillas 

Applied, thank you.

> 
> ---
> 
>  drivers/input/misc/max8997_haptic.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/input/misc/max8997_haptic.c 
> b/drivers/input/misc/max8997_haptic.c
> index d0f687281339..a806ba3818f7 100644
> --- a/drivers/input/misc/max8997_haptic.c
> +++ b/drivers/input/misc/max8997_haptic.c
> @@ -394,7 +394,7 @@ static const struct platform_device_id 
> max8997_haptic_id[] = {
>   { "max8997-haptic", 0 },
>   { },
>  };
> -MODULE_DEVICE_TABLE(i2c, max8997_haptic_id);
> +MODULE_DEVICE_TABLE(platform, max8997_haptic_id);
>  
>  static struct platform_driver max8997_haptic_driver = {
>   .driver = {
> @@ -407,7 +407,6 @@ static struct platform_driver max8997_haptic_driver = {
>  };
>  module_platform_driver(max8997_haptic_driver);
>  
> -MODULE_ALIAS("platform:max8997-haptic");
>  MODULE_AUTHOR("Donggeun Kim ");
>  MODULE_DESCRIPTION("max8997_haptic driver");
>  MODULE_LICENSE("GPL");
> -- 
> 2.4.3
> 

-- 
Dmitry


Re: [PATCH] Input: elan_i2c - Fix typos for validpage_count

2015-08-28 Thread Dmitry Torokhov
On Wed, Aug 26, 2015 at 11:19:31AM -0700, Benson Leung wrote:
> Search for "vaildpage_count" and replace with "validpage_count".
> 
> Signed-off-by: Benson Leung 

Applied, thank you.

> ---
>  drivers/input/mouse/elan_i2c_core.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/input/mouse/elan_i2c_core.c 
> b/drivers/input/mouse/elan_i2c_core.c
> index bbdaedc..d4a38ca 100644
> --- a/drivers/input/mouse/elan_i2c_core.c
> +++ b/drivers/input/mouse/elan_i2c_core.c
> @@ -84,7 +84,7 @@ struct elan_tp_data {
>   int pressure_adjustment;
>   u8  mode;
>   u8  ic_type;
> - u16 fw_vaildpage_count;
> + u16 fw_validpage_count;
>   u16 fw_signature_address;
>  
>   boolirq_wake;
> @@ -94,28 +94,28 @@ struct elan_tp_data {
>   boolbaseline_ready;
>  };
>  
> -static int elan_get_fwinfo(u8 ic_type, u16 *vaildpage_count,
> +static int elan_get_fwinfo(u8 ic_type, u16 *validpage_count,
>  u16 *signature_address)
>  {
>   switch(ic_type) {
>   case 0x08:
> - *vaildpage_count = 512;
> + *validpage_count = 512;
>   break;
>   case 0x09:
> - *vaildpage_count = 768;
> + *validpage_count = 768;
>   break;
>   case 0x0D:
> - *vaildpage_count = 896;
> + *validpage_count = 896;
>   break;
>   default:
>   /* unknown ic type clear value */
> - *vaildpage_count = 0;
> + *validpage_count = 0;
>   *signature_address = 0;
>   return -ENXIO;
>   }
>  
>   *signature_address =
> - (*vaildpage_count * ETP_FW_PAGE_SIZE) - ETP_FW_SIGNATURE_SIZE;
> + (*validpage_count * ETP_FW_PAGE_SIZE) - ETP_FW_SIGNATURE_SIZE;
>  
>   return 0;
>  }
> @@ -264,7 +264,7 @@ static int elan_query_device_info(struct elan_tp_data 
> *data)
>   if (error)
>   return error;
>  
> - error = elan_get_fwinfo(data->ic_type, &data->fw_vaildpage_count,
> + error = elan_get_fwinfo(data->ic_type, &data->fw_validpage_count,
>   &data->fw_signature_address);
>   if (error) {
>   dev_err(&data->client->dev,
> @@ -356,7 +356,7 @@ static int __elan_update_firmware(struct elan_tp_data 
> *data,
>   iap_start_addr = get_unaligned_le16(&fw->data[ETP_IAP_START_ADDR * 2]);
>  
>   boot_page_count = (iap_start_addr * 2) / ETP_FW_PAGE_SIZE;
> - for (i = boot_page_count; i < data->fw_vaildpage_count; i++) {
> + for (i = boot_page_count; i < data->fw_validpage_count; i++) {
>   u16 checksum = 0;
>   const u8 *page = &fw->data[i * ETP_FW_PAGE_SIZE];
>  
> -- 
> 2.5.0.457.gab17608
> 

-- 
Dmitry


Re: [PATCH 31/32] tools lib traceevent: Support function __get_dynamic_array_len

2015-08-28 Thread Alexei Starovoitov

On 8/28/15 12:06 AM, Wang Nan wrote:

From: He Kuang

Support the helper function __get_dynamic_array_len() in libtraceevent.
This function is used together with __print_array() or __print_hex(), but
currently it is not available in the function list of process_function().

The total allocated length of the dynamic array is embedded in the top
half of the __data_loc_##item field. This patch adds a new arg type,
PRINT_DYNAMIC_ARRAY_LEN, to return the length to eval_num_arg().
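
For reference, the __data_loc encoding packs the buffer offset into the low
16 bits and the allocated length into the high 16 bits of the u32 field, so
pulling the length out is just a shift. A minimal illustration:

/* offset of the dynamic array within the record payload */
static inline unsigned int data_loc_offset(unsigned int data_loc)
{
	return data_loc & 0xffff;
}

/* total allocated length of the dynamic array (the "top half") */
static inline unsigned int data_loc_len(unsigned int data_loc)
{
	return data_loc >> 16;
}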

Signed-off-by: He Kuang
Acked-by: Namhyung Kim


Tested-by: Alexei Starovoitov 

this patch fixes the perf crash:
  Warning: [bpf:bpf_output_data] function __get_dynamic_array_len not 
defined

  Warning: Error: expected type 5 but read 0
*** glibc detected *** perf_4.2.0: double free or corruption (fasttop): 
0x032caf20 ***

=== Backtrace: =
/lib/x86_64-linux-gnu/libc.so.6(+0x7ec96)[0x7f0d5d2d3c96]

it's not strictly necessary until patch 32 lands, but I think it's
a good fix regardless.
Steven, could you take it into your tree?



Re: [GIT PULL 00/32] perf tools: filtering events using eBPF programs

2015-08-28 Thread Alexei Starovoitov

On 8/28/15 12:05 AM, Wang Nan wrote:

Hi Arnaldo,

This time I adjust all Cc and Link field in each patch.

Four new patches (1, 2, 3, 12/32) are newly introduced to fix a bug
related to the '--filter' option. Patch 06/32 is also modified. Please keep
an eye on it.


Arnaldo, what is the latest news on this set?
I think you've looked at most of them over the last months and a few patch
reorders were necessary. Is it all addressed? All further work is
sadly blocked, because these core patches need to come in first.
I took another look today and to me patches 1-30 look good.
Thanks!


