date:20150723

[PATCH 16/18] perf record: Add option --switch-events to select PERF_RECORD_SWITCH events

2015-07-23 Thread Arnaldo Carvalho de Melo

From: Adrian Hunter 

Add an option to select PERF_RECORD_SWITCH events.

Signed-off-by: Adrian Hunter 
Acked-by: Peter Zijlstra (Intel) 
Tested-by: Arnaldo Carvalho de Melo 
Tested-by: Jiri Olsa 
Cc: Andi Kleen 
Cc: Mathieu Poirier 
Cc: Pawel Moll 
Cc: Stephane Eranian 
Link: 
http://lkml.kernel.org/r/1437471846-26995-4-git-send-email-adrian.hun...@intel.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/Documentation/perf-record.txt |  4 
 tools/perf/builtin-record.c  |  7 +++
 tools/perf/perf.h|  1 +
 tools/perf/util/evlist.h |  1 +
 tools/perf/util/evsel.c  |  3 +++
 tools/perf/util/record.c | 10 ++
 6 files changed, 26 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt 
b/tools/perf/Documentation/perf-record.txt
index 29e5307945bf..63ee0408761d 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -293,6 +293,10 @@ When processing pre-existing threads /proc/XXX/mmap, it 
may take a long time,
 because the file may be huge. A time out is needed in such cases.
 This option sets the time out limit. The default value is 500 ms.
 
+--switch-events::
+Record context switch events i.e. events of type PERF_RECORD_SWITCH or
+PERF_RECORD_SWITCH_CPU_WIDE.
+
 SEE ALSO
 
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 1932e27c00d8..445a64d19625 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1075,6 +1075,8 @@ struct option __record_options[] = {
  "opts", "AUX area tracing Snapshot Mode", ""),
OPT_UINTEGER(0, "proc-map-timeout", &record.opts.proc_map_timeout,
"per thread proc mmap processing timeout in ms"),
+   OPT_BOOLEAN(0, "switch-events", &record.opts.record_switch_events,
+   "Record context switch events"),
OPT_END()
 };
 
@@ -1102,6 +1104,11 @@ int cmd_record(int argc, const char **argv, const char 
*prefix __maybe_unused)
  " system-wide mode\n");
usage_with_options(record_usage, record_options);
}
+   if (rec->opts.record_switch_events &&
+   !perf_can_record_switch_events()) {
+   ui__error("kernel does not support recording context switch 
events (--switch-events option)\n");
+   usage_with_options(record_usage, record_options);
+   }
 
if (!rec->itr) {
rec->itr = auxtrace_record__init(rec->evlist, &err);
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 937b16aa0300..cf459f89fc9b 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -57,6 +57,7 @@ struct record_opts {
bool running_time;
bool full_auxtrace;
bool auxtrace_snapshot_mode;
+   bool record_switch_events;
unsigned int freq;
unsigned int mmap_pages;
unsigned int auxtrace_mmap_pages;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 406a8216a51e..a8930b68456b 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -114,6 +114,7 @@ void perf_evlist__close(struct perf_evlist *evlist);
 
 void perf_evlist__set_id_pos(struct perf_evlist *evlist);
 bool perf_can_sample_identifier(void);
+bool perf_can_record_switch_events(void);
 void perf_evlist__config(struct perf_evlist *evlist, struct record_opts *opts);
 int record_opts__config(struct record_opts *opts);
 
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 9e6e6f40b787..71f6905c7cb9 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -738,6 +738,9 @@ void perf_evsel__config(struct perf_evsel *evsel, struct 
record_opts *opts)
attr->mmap2 = track && !perf_missing_features.mmap2;
attr->comm  = track;
 
+   if (opts->record_switch_events)
+   attr->context_switch = track;
+
if (opts->sample_transaction)
perf_evsel__set_sample_bit(evsel, TRANSACTION);
 
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index 1f7becbe5e18..0d228a29526d 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -85,6 +85,11 @@ static void perf_probe_comm_exec(struct perf_evsel *evsel)
evsel->attr.comm_exec = 1;
 }
 
+static void perf_probe_context_switch(struct perf_evsel *evsel)
+{
+   evsel->attr.context_switch = 1;
+}
+
 bool perf_can_sample_identifier(void)
 {
return perf_probe_api(perf_probe_sample_identifier);
@@ -95,6 +100,11 @@ static bool perf_can_comm_exec(void)
return perf_probe_api(perf_probe_comm_exec);
 }
 
+bool perf_can_record_switch_events(void)
+{
+   return perf_probe_api(perf_probe_context_switch);
+}
+
 void perf_evlist__config(struct perf_evlist *evlist, struct record_opts *opts)
 {
struct perf_evsel *evsel;
-- 
2.1.0
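
The patch above probes for kernel support by handing a small setter to perf_probe_api(), whose body is not part of this diff. The following is a minimal, self-contained sketch of that probing pattern, assuming a helper that sets one attribute bit and then checks whether an event can still be opened. All names in it (fake_attr, fake_kernel, fake_event_open, probe_api, can_record_switch_events) are hypothetical stand-ins, not the real perf code or syscall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct perf_event_attr: only the bit probed here. */
struct fake_attr {
	unsigned int context_switch : 1;
};

/* Hypothetical "kernel" that either knows the context_switch bit or not. */
struct fake_kernel {
	bool has_context_switch;
};

typedef void (*setup_probe_fn)(struct fake_attr *attr);

/* Setter in the style of perf_probe_context_switch(): turn on a single bit. */
static void probe_context_switch(struct fake_attr *attr)
{
	attr->context_switch = 1;
}

/* Stand-in for opening an event: fails when an unsupported bit is set. */
static int fake_event_open(const struct fake_kernel *k,
			   const struct fake_attr *attr)
{
	if (attr->context_switch && !k->has_context_switch)
		return -1;
	return 0;
}

/* Build a fresh attr, let the setter enable the feature, try to open. */
static bool probe_api(const struct fake_kernel *k, setup_probe_fn fn)
{
	struct fake_attr attr = { 0 };

	assert(fn != NULL);
	fn(&attr);
	return fake_event_open(k, &attr) == 0;
}

static bool can_record_switch_events(const struct fake_kernel *k)
{
	return probe_api(k, probe_context_switch);
}
```

In this sketch a caller would check can_record_switch_events() before honouring the option and report an error otherwise, which is the shape of the check added to cmd_record() above.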

--
To unsubscribe from this list: send the line

[PATCH 09/18] perf symbols: Provide libtraceevent callback to resolve kernel symbols

2015-07-23 Thread Arnaldo Carvalho de Melo

From: Arnaldo Carvalho de Melo 

That provides the function signature expected by libtraceevent's
pevent_set_function_resolver().

Acked-by: David Ahern 
Cc: Adrian Hunter 
Cc: Borislav Petkov 
Cc: Frederic Weisbecker 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: Steven Rostedt 
Link: http://lkml.kernel.org/n/tip-ie6hvlb6u15y4ulg9j161...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/machine.c | 14 ++
 tools/perf/util/machine.h |  4 
 tools/perf/util/trace-event.c | 45 ---
 tools/perf/util/trace-event.h |  1 +
 4 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index d0bf1e590479..22006c15edf4 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1993,3 +1993,17 @@ struct dso *machine__findnew_dso(struct machine 
*machine, const char *filename)
 {
return dsos__findnew(&machine->dsos, filename);
 }
+
+char *machine__resolve_kernel_addr(void *vmachine, unsigned long long *addrp, 
char **modp)
+{
+   struct machine *machine = vmachine;
+   struct map *map;
+   struct symbol *sym = map_groups__find_symbol(&machine->kmaps, MAP__FUNCTION, *addrp, &map, NULL);
+
+   if (sym == NULL)
+   return NULL;
+
+   *modp = __map__is_kmodule(map) ? (char *)map->dso->short_name : NULL;
+   *addrp = map->unmap_ip(map, sym->start);
+   return sym->name;
+}
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 887798e511e9..ff5f78c886e0 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -237,5 +237,9 @@ int machine__synthesize_threads(struct machine *machine, 
struct target *target,
 pid_t machine__get_current_tid(struct machine *machine, int cpu);
 int machine__set_current_tid(struct machine *machine, int cpu, pid_t pid,
 pid_t tid);
+/*
+ * For use with libtraceevent's pevent_set_function_resolver()
+ */
+char *machine__resolve_kernel_addr(void *vmachine, unsigned long long *addrp, 
char **modp);
 
 #endif /* __PERF_MACHINE_H */
diff --git a/tools/perf/util/trace-event.c b/tools/perf/util/trace-event.c
index 6322d37164c5..667bd109d16f 100644
--- a/tools/perf/util/trace-event.c
+++ b/tools/perf/util/trace-event.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include "trace-event.h"
+#include "machine.h"
 #include "util.h"
 
 /*
@@ -19,6 +20,7 @@
  * there.
  */
 static struct trace_event tevent;
+static bool tevent_initialized;
 
 int trace_event__init(struct trace_event *t)
 {
@@ -32,6 +34,32 @@ int trace_event__init(struct trace_event *t)
return pevent ? 0 : -1;
 }
 
+static int trace_event__init2(void)
+{
+   int be = traceevent_host_bigendian();
+   struct pevent *pevent;
+
+   if (trace_event__init(&tevent))
+   return -1;
+
+   pevent = tevent.pevent;
+   pevent_set_flag(pevent, PEVENT_NSEC_OUTPUT);
+   pevent_set_file_bigendian(pevent, be);
+   pevent_set_host_bigendian(pevent, be);
+   tevent_initialized = true;
+   return 0;
+}
+
+int trace_event__register_resolver(struct machine *machine)
+{
+   if (!tevent_initialized && trace_event__init2())
+   return -1;
+
+   return pevent_set_function_resolver(tevent.pevent,
+   machine__resolve_kernel_addr,
+   machine);
+}
+
 void trace_event__cleanup(struct trace_event *t)
 {
traceevent_unload_plugins(t->plugin_list, t->pevent);
@@ -62,21 +90,8 @@ tp_format(const char *sys, const char *name)
 struct event_format*
 trace_event__tp_format(const char *sys, const char *name)
 {
-   static bool initialized;
-
-   if (!initialized) {
-   int be = traceevent_host_bigendian();
-   struct pevent *pevent;
-
-   if (trace_event__init(&tevent))
-   return NULL;
-
-   pevent = tevent.pevent;
-   pevent_set_flag(pevent, PEVENT_NSEC_OUTPUT);
-   pevent_set_file_bigendian(pevent, be);
-   pevent_set_host_bigendian(pevent, be);
-   initialized = true;
-   }
+   if (!tevent_initialized && trace_event__init2())
+   return NULL;
 
return tp_format(sys, name);
 }
diff --git a/tools/perf/util/trace-event.h b/tools/perf/util/trace-event.h
index d5168f0be4ec..568128c3284a 100644
--- a/tools/perf/util/trace-event.h
+++ b/tools/perf/util/trace-event.h
@@ -18,6 +18,7 @@ struct trace_event {
 
 int trace_event__init(struct trace_event *t);
 void trace_event__cleanup(struct trace_event *t);
+int trace_event__register_resolver(struct machine *machine);
 struct event_format*
 trace_event__tp_format(const char *sys, const char *name);
 
-- 
2.1.0
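
The trace-event.c change above moves a function-local "initialized" flag to file scope so that two entry points can share one lazy initialization. Below is a minimal sketch of that pattern under stated assumptions: the names (expensive_init, init_once, sketch_register_resolver, sketch_tp_format, init_calls) are hypothetical, and the counter exists only to make the single initialization observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool initialized;   /* file scope, so every entry point can see it */
static int init_calls;     /* how many times the costly setup actually ran */

/* Stand-in for the real setup work; returns 0 on success. */
static int expensive_init(void)
{
	init_calls++;
	return 0;
}

/* Runs the setup and records success; callers guard it with the flag. */
static int init_once(void)
{
	assert(!initialized);
	if (expensive_init())
		return -1;
	initialized = true;
	return 0;
}

/* First entry point: initializes on demand, then does its own work. */
static int sketch_register_resolver(void)
{
	if (!initialized && init_once())
		return -1;
	return 0;
}

/* Second entry point: same guard, so the setup is never repeated. */
static const char *sketch_tp_format(void)
{
	if (!initialized && init_once())
		return NULL;
	return "format";
}
```

Whichever entry point is called first pays for the setup; the other one only tests the flag.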


[PATCH 10/18] perf trace: Provide libtracevent with a kernel symbol resolver

2015-07-23 Thread Arnaldo Carvalho de Melo

From: Arnaldo Carvalho de Melo 

So that beautifiers wanting to resolve kernel function addresses to
names can do their work. Now, for instance, the 'timer' tracepoint
beautifiers work with 'perf trace'; see the "function=tick..." part:

 # perf trace --event timer:hrtimer_start

  0.000 timer:hrtimer_start:hrtimer=0x88026f3101c0 function=tick_sched_timer/0x0 expires=5209833900 softexpires=5209833900)
  0.003 timer:hrtimer_start:hrtimer=0x88026f3101c0 function=tick_sched_timer/0x0 expires=5209833900 softexpires=5209833900)


Reported-by: Thomas Gleixner 
Acked-by: David Ahern 
Cc: Adrian Hunter 
Cc: Borislav Petkov 
Cc: Frederic Weisbecker 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: Steven Rostedt 
Link: http://lkml.kernel.org/n/tip-n4i0hxpbl1tnleiqkok47...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/builtin-trace.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 32b4d280af4f..282841b10f24 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1489,6 +1489,9 @@ static int trace__symbols_init(struct trace *trace, 
struct perf_evlist *evlist)
if (trace->host == NULL)
return -ENOMEM;
 
+   if (trace_event__register_resolver(trace->host) < 0)
+   return -errno;
+
err = __machine__synthesize_threads(trace->host, &trace->tool, &trace->opts.target,
evlist->threads, 
trace__tool_process, false,
trace->opts.proc_map_timeout);
-- 
2.1.0


[PATCH 08/18] tools lib traceevent: Allow setting an alternative symbol resolver

2015-07-23 Thread Arnaldo Carvalho de Melo

From: Arnaldo Carvalho de Melo 

The perf tools have a symbol resolver that resolves kernel symbols
using either kallsyms or ELF symtabs. They also use libtraceevent to
format trace event fields, including via subsystem-specific plugins
such as the "timer" one.

To resolve fields like the "function" field of "timer:hrtimer_start",
libtraceevent needs a way to map the field's value to a function name
and address.

This patch provides a way for tools that already have symbol resolution
facilities to ask libtraceevent to use them when it needs to resolve
kernel symbols.

Reviewed-by: Steven Rostedt 
Acked-by: David Ahern 
Cc: Adrian Hunter 
Cc: Borislav Petkov 
Cc: Frederic Weisbecker 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Link: http://lkml.kernel.org/n/tip-fdx1fazols17w5py26ia3...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/lib/traceevent/event-parse.c | 68 +-
 tools/lib/traceevent/event-parse.h |  8 +
 2 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/tools/lib/traceevent/event-parse.c 
b/tools/lib/traceevent/event-parse.c
index cc25f059ab3d..fcd8a9e3d2e1 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -418,7 +418,7 @@ static int func_map_init(struct pevent *pevent)
 }
 
 static struct func_map *
-find_func(struct pevent *pevent, unsigned long long addr)
+__find_func(struct pevent *pevent, unsigned long long addr)
 {
struct func_map *func;
struct func_map key;
@@ -434,6 +434,71 @@ find_func(struct pevent *pevent, unsigned long long addr)
return func;
 }
 
+struct func_resolver {
+   pevent_func_resolver_t *func;
+   void   *priv;
+   struct func_mapmap;
+};
+
+/**
+ * pevent_set_function_resolver - set an alternative function resolver
+ * @pevent: handle for the pevent
+ * @resolver: function to be used
+ * @priv: resolver function private state.
+ *
+ * Some tools may have already a way to resolve kernel functions, allow them to
+ * keep using it instead of duplicating all the entries inside
+ * pevent->funclist.
+ */
+int pevent_set_function_resolver(struct pevent *pevent,
+pevent_func_resolver_t *func, void *priv)
+{
+   struct func_resolver *resolver = malloc(sizeof(*resolver));
+
+   if (resolver == NULL)
+   return -1;
+
+   resolver->func = func;
+   resolver->priv = priv;
+
+   free(pevent->func_resolver);
+   pevent->func_resolver = resolver;
+
+   return 0;
+}
+
+/**
+ * pevent_reset_function_resolver - reset alternative function resolver
+ * @pevent: handle for the pevent
+ *
+ * Stop using whatever alternative resolver was set, use the default
+ * one instead.
+ */
+void pevent_reset_function_resolver(struct pevent *pevent)
+{
+   free(pevent->func_resolver);
+   pevent->func_resolver = NULL;
+}
+
+static struct func_map *
+find_func(struct pevent *pevent, unsigned long long addr)
+{
+   struct func_map *map;
+
+   if (!pevent->func_resolver)
+   return __find_func(pevent, addr);
+
+   map = &pevent->func_resolver->map;
+   map->mod  = NULL;
+   map->addr = addr;
+   map->func = pevent->func_resolver->func(pevent->func_resolver->priv,
+   &map->addr, &map->mod);
+   if (map->func == NULL)
+   return NULL;
+
+   return map;
+}
+
 /**
  * pevent_find_function - find a function by a given address
  * @pevent: handle for the pevent
@@ -6564,6 +6629,7 @@ void pevent_free(struct pevent *pevent)
free(pevent->trace_clock);
free(pevent->events);
free(pevent->sort_events);
+   free(pevent->func_resolver);
 
free(pevent);
 }
diff --git a/tools/lib/traceevent/event-parse.h 
b/tools/lib/traceevent/event-parse.h
index 063b1971eb35..204befb05a17 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -453,6 +453,10 @@ struct cmdline_list;
 struct func_map;
 struct func_list;
 struct event_handler;
+struct func_resolver;
+
+typedef char *(pevent_func_resolver_t)(void *priv,
+  unsigned long long *addrp, char **modp);
 
 struct pevent {
int ref_count;
@@ -481,6 +485,7 @@ struct pevent {
int cmdline_count;
 
struct func_map *func_map;
+   struct func_resolver *func_resolver;
struct func_list *funclist;
unsigned int func_count;
 
@@ -611,6 +616,9 @@ enum trace_flag_type {
TRACE_FLAG_SOFTIRQ  = 0x10,
 };
 
+int pevent_set_function_resolver(struct pevent *pevent,
+pevent_func_resolver_t *func, void *priv);
+void pevent_reset_function_resolver(struct pevent *pevent);
 int pevent_register_comm(struct pevent *pevent, const char *comm, int pid);
 int pevent_register_trace_clock(struct pevent *pevent, const char 
*trace_clock);
 int pevent_register_function(struct pevent
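
The find_func() change in this patch dispatches to a caller-supplied resolver when one is set and falls back to the built-in table otherwise. The sketch below illustrates that dispatch in isolation. It is not the libtraceevent code: the handle, the one-entry built-in table, and the toy resolver (struct handle, builtin_find, toy_symbol, toy_resolve) are hypothetical, and the callback uses const char * rather than the char * of the real typedef.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct func_map {
	unsigned long long addr;
	const char *func;
	const char *mod;
};

/* Callback: may rewrite *addrp to the symbol start and set *modp. */
typedef const char *(resolver_fn)(void *priv, unsigned long long *addrp,
				  const char **modp);

struct func_resolver {
	resolver_fn *func;
	void *priv;
	struct func_map map;	/* scratch entry handed back to callers */
};

struct handle {
	struct func_resolver *resolver;
};

/* Stand-in for the built-in lookup table: it knows exactly one function. */
static struct func_map builtin_entry = { 0x1000, "builtin_func", NULL };

static struct func_map *builtin_find(unsigned long long addr)
{
	return addr == builtin_entry.addr ? &builtin_entry : NULL;
}

static int set_resolver(struct handle *h, resolver_fn *func, void *priv)
{
	struct func_resolver *r = malloc(sizeof(*r));

	assert(func != NULL);
	if (r == NULL)
		return -1;
	r->func = func;
	r->priv = priv;
	free(h->resolver);	/* replace any previously set resolver */
	h->resolver = r;
	return 0;
}

static void reset_resolver(struct handle *h)
{
	free(h->resolver);
	h->resolver = NULL;
}

static struct func_map *find_func(struct handle *h, unsigned long long addr)
{
	struct func_map *map;

	if (!h->resolver)
		return builtin_find(addr);

	map = &h->resolver->map;
	map->mod = NULL;
	map->addr = addr;
	map->func = h->resolver->func(h->resolver->priv, &map->addr, &map->mod);
	return map->func ? map : NULL;
}

/* Toy resolver: priv points at one symbol covering [start, end). */
struct toy_symbol {
	unsigned long long start, end;
	const char *name;
};

static const char *toy_resolve(void *priv, unsigned long long *addrp,
			       const char **modp)
{
	const struct toy_symbol *sym = priv;

	if (*addrp < sym->start || *addrp >= sym->end)
		return NULL;
	*addrp = sym->start;	/* report the start of the enclosing symbol */
	*modp = NULL;
	return sym->name;
}
```

Note that while a resolver is installed in this sketch, the built-in table is not consulted at all, which matches the early return in the patched find_func().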

[GIT PULL 00/18] perf/core improvements and fixes

2015-07-23 Thread Arnaldo Carvalho de Melo

Hi Ingo,

Please consider pulling,

- Arnaldo

The following changes since commit a11c51acc52822754d66a11c15f6f6edd4d23c55:

  Merge tag 'perf-core-for-mingo' of 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
(2015-07-21 07:58:06 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
tags/perf-core-for-mingo

for you to fetch changes up to 7c14898ba9386ee5c939bb418643ac6baff52840:

  perf script: Add option --show-switch-events (2015-07-23 22:51:14 -0300)


perf/core improvements and fixes:

New features:

- Introduce PERF_RECORD_SWITCH(_CPU_WIDE) and use it in 'record' to
  ask for context switches, allowing non-privileged tasks to know
  when they are switched in and out, which wasn't possible with
  the other context switch tracepoint and software events. See the
  patch description for a comprehensive justification (Adrian Hunter)

- Stop collecting /proc/kallsyms in perf.data files, saving about
  4.5MB on a typical x86-64 system, and use the symbol resolution
  routines used in all the other tools (report, top, etc) now that
  we can ask libtraceevent to use perf's symbol resolution code.
  (Arnaldo Carvalho de Melo)

User visible fixes:

- Expose perf's symbol resolver to libtraceevent, so that its plugins can
  resolve tracepoint fields to kernel functions, like the 'function' field
  in the "timer:hrtimer_start" tracepoint (Arnaldo Carvalho de Melo)

Infrastructure:

- Improvements to the propagation of thread and cpu maps, prep work for
  new 'perf stat' features (Jiri Olsa)

Signed-off-by: Arnaldo Carvalho de Melo 


Adrian Hunter (5):
  perf: Add PERF_RECORD_SWITCH to indicate context switches
  perf tools: Add new PERF_RECORD_SWITCH event
  perf record: Add option --switch-events to select PERF_RECORD_SWITCH 
events
  perf script: Don't assume evsel position of tracking events
  perf script: Add option --show-switch-events

Arnaldo Carvalho de Melo (8):
  perf symbols: Add front end cache for DSO symbol lookup
  perf symbols: Introduce map__is_(kernel,kmodule)()
  tools lib traceevent: Allow setting an alternative symbol resolver
  perf symbols: Provide libtraceevent callback to resolve kernel symbols
  perf trace: Provide libtracevent with a kernel symbol resolver
  perf script: Switch from perf.data's kallsyms to perf's symbol resolver
  perf tools: Stop reading the kallsyms data from perf.data
  perf tools: Stop copying kallsyms into the perf.data file header

Jiri Olsa (5):
  perf test: Check for refcnt in thread_map test
  perf evlist: Force perf_evlist__set_maps to propagate maps through events
  perf evlist: Use bool instead of target argument in propagate_maps()
  perf evlist: Tolerate NULL maps in propagate_maps
  perf header: Use argv style storage for cmdline feature data

 include/uapi/linux/perf_event.h  |  31 +-
 kernel/events/core.c | 103 +++
 tools/lib/traceevent/event-parse.c   |  68 +++-
 tools/lib/traceevent/event-parse.h   |   8 +++
 tools/perf/Documentation/perf-record.txt |   4 ++
 tools/perf/Documentation/perf-script.txt |   4 ++
 tools/perf/builtin-inject.c  |   1 +
 tools/perf/builtin-record.c  |   7 +++
 tools/perf/builtin-script.c  |  48 --
 tools/perf/builtin-trace.c   |   3 +
 tools/perf/perf.h|   1 +
 tools/perf/tests/thread-map.c|   4 ++
 tools/perf/util/dso.h|   4 ++
 tools/perf/util/event.c  |  28 +
 tools/perf/util/event.h  |  12 
 tools/perf/util/evlist.c |  28 +++--
 tools/perf/util/evlist.h |  12 ++--
 tools/perf/util/evsel.c  |   4 ++
 tools/perf/util/header.c |  35 ++-
 tools/perf/util/header.h |   1 +
 tools/perf/util/machine.c|  25 
 tools/perf/util/machine.h|   6 ++
 tools/perf/util/map.c|  14 +
 tools/perf/util/map.h|   7 +++
 tools/perf/util/record.c |  10 +++
 tools/perf/util/session.c|  21 +++
 tools/perf/util/symbol.c |   7 ++-
 tools/perf/util/tool.h   |   1 +
 tools/perf/util/trace-event-info.c   |  22 +++
 tools/perf/util/trace-event-parse.c  |  30 -
 tools/perf/util/trace-event-read.c   |  28 -
 tools/perf/util/trace-event.c|  45 +-
 tools/perf/util/trace-event.h|   1 +
 33 files changed, 513 insertions(+), 110 deletions(-)

[PATCH 01/18] perf test: Check for refcnt in thread_map test

2015-07-23 Thread Arnaldo Carvalho de Melo

From: Jiri Olsa 

Also check the refcnt in the thread_map test.

Signed-off-by: Jiri Olsa 
Cc: David Ahern 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/1437481927-29538-2-git-send-email-jo...@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/tests/thread-map.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/tests/thread-map.c b/tools/perf/tests/thread-map.c
index 5acf000939ea..138a0e3431fa 100644
--- a/tools/perf/tests/thread-map.c
+++ b/tools/perf/tests/thread-map.c
@@ -20,6 +20,8 @@ int test__thread_map(void)
TEST_ASSERT_VAL("wrong comm",
thread_map__comm(map, 0) &&
!strcmp(thread_map__comm(map, 0), "perf"));
+   TEST_ASSERT_VAL("wrong refcnt",
+   atomic_read(&map->refcnt) == 1);
thread_map__put(map);
 
/* test dummy pid */
@@ -33,6 +35,8 @@ int test__thread_map(void)
TEST_ASSERT_VAL("wrong comm",
thread_map__comm(map, 0) &&
!strcmp(thread_map__comm(map, 0), "dummy"));
+   TEST_ASSERT_VAL("wrong refcnt",
+   atomic_read(&map->refcnt) == 1);
thread_map__put(map);
return 0;
 }
-- 
2.1.0


[PATCH 02/18] perf evlist: Force perf_evlist__set_maps to propagate maps through events

2015-07-23 Thread Arnaldo Carvalho de Melo

From: Jiri Olsa 

Force perf_evlist__set_maps to propagate maps through events, so
that cpu/thread maps get set within the evlist.

Signed-off-by: Jiri Olsa 
Cc: David Ahern 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/1437481927-29538-11-git-send-email-jo...@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/evlist.c | 17 +
 tools/perf/util/evlist.h | 11 +++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index f7d9c77ee31b..6bfcab9b7108 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1150,6 +1150,23 @@ out_delete_threads:
return -1;
 }
 
+int perf_evlist__set_maps(struct perf_evlist *evlist,
+ struct cpu_map *cpus,
+ struct thread_map *threads)
+{
+   if (evlist->cpus)
+   cpu_map__put(evlist->cpus);
+
+   evlist->cpus = cpus;
+
+   if (evlist->threads)
+   thread_map__put(evlist->threads);
+
+   evlist->threads = threads;
+
+   return perf_evlist__propagate_maps(evlist, false);
+}
+
 int perf_evlist__apply_filters(struct perf_evlist *evlist, struct perf_evsel 
**err_evsel)
 {
struct perf_evsel *evsel;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 037633c1da9d..406a8216a51e 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -152,14 +152,9 @@ int perf_evlist__enable_event_idx(struct perf_evlist 
*evlist,
 void perf_evlist__set_selected(struct perf_evlist *evlist,
   struct perf_evsel *evsel);
 
-static inline void perf_evlist__set_maps(struct perf_evlist *evlist,
-struct cpu_map *cpus,
-struct thread_map *threads)
-{
-   evlist->cpus= cpus;
-   evlist->threads = threads;
-}
-
+int perf_evlist__set_maps(struct perf_evlist *evlist,
+ struct cpu_map *cpus,
+ struct thread_map *threads);
 int perf_evlist__create_maps(struct perf_evlist *evlist, struct target 
*target);
 int perf_evlist__apply_filters(struct perf_evlist *evlist, struct perf_evsel 
**err_evsel);
 
-- 
2.1.0
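
The new perf_evlist__set_maps() above drops the reference held on the previous map before storing the new pointer. Here is a minimal sketch of that put-then-replace step with a plain integer refcount. The types and helpers (struct map, map_new, map_put, evlist_set_cpus, live_maps) are hypothetical and only model the ownership rule, not the real cpu_map or thread_map API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct map {
	int refcnt;
};

static int live_maps;	/* number of maps currently allocated */

static struct map *map_new(void)
{
	struct map *m = malloc(sizeof(*m));

	if (m) {
		m->refcnt = 1;	/* the creator holds the first reference */
		live_maps++;
	}
	return m;
}

/* Drop one reference; free the map when the last one goes away. */
static void map_put(struct map *m)
{
	if (!m)
		return;
	assert(m->refcnt > 0);
	if (--m->refcnt == 0) {
		free(m);
		live_maps--;
	}
}

struct evlist {
	struct map *cpus;
};

/* Release the old map, then take over the caller's reference to the new one. */
static void evlist_set_cpus(struct evlist *l, struct map *cpus)
{
	map_put(l->cpus);
	l->cpus = cpus;
}
```

Without the map_put() call, replacing a map would leak the old one, which is what the removed inline helper in evlist.h would have done if called on an evlist that already owned maps.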


[PATCH 04/18] perf evlist: Tolerate NULL maps in propagate_maps

2015-07-23 Thread Arnaldo Carvalho de Melo

From: Jiri Olsa 

Tolerate NULL maps in perf_evlist__propagate_maps, so we don't need to
pass an evlist with both cpus and threads maps defined.

Signed-off-by: Jiri Olsa 
Cc: David Ahern 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/1437481927-29538-10-git-send-email-jo...@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/evlist.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 80ab942afa8a..3b9f411a6b46 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1119,7 +1119,8 @@ static int perf_evlist__propagate_maps(struct perf_evlist 
*evlist,
 
evsel->threads = thread_map__get(evlist->threads);
 
-   if (!evsel->cpus || !evsel->threads)
+   if ((evlist->cpus && !evsel->cpus) ||
+   (evlist->threads && !evsel->threads))
return -ENOMEM;
}
 
-- 
2.1.0
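
The one-line change above relaxes the error check so that a map the evlist never had is no longer treated as a failed copy. The sketch below contrasts the old and new conditions. It assumes, as a simplification, that taking a reference on a NULL map yields NULL; the names (map_get, propagate_strict, propagate_tolerant) are hypothetical and the real function does more than this.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct map {
	int refcnt;
};

/* Take a reference; a NULL map stays NULL (assumption of this sketch). */
static struct map *map_get(struct map *m)
{
	if (m) {
		assert(m->refcnt > 0);
		m->refcnt++;
	}
	return m;
}

struct evlist {
	struct map *cpus, *threads;
};

struct evsel {
	struct map *cpus, *threads;
};

/* Old condition: any NULL result is reported as an error. */
static int propagate_strict(const struct evlist *l, struct evsel *e)
{
	e->cpus = map_get(l->cpus);
	e->threads = map_get(l->threads);
	if (!e->cpus || !e->threads)
		return -ENOMEM;
	return 0;
}

/* New condition: only a map the evlist has but the evsel lacks is an error. */
static int propagate_tolerant(const struct evlist *l, struct evsel *e)
{
	e->cpus = map_get(l->cpus);
	e->threads = map_get(l->threads);
	if ((l->cpus && !e->cpus) || (l->threads && !e->threads))
		return -ENOMEM;
	return 0;
}
```

With an evlist that carries only a cpus map, the strict variant fails while the tolerant one succeeds and leaves the evsel's threads pointer NULL.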


Re: [PATCH v3 2/3] bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter

2015-07-23 Thread xiakaixu

On 2015/7/24 6:56, Alexei Starovoitov wrote:
> On 7/23/15 2:42 AM, Kaixu Xia wrote:
>> According to the perf_event_map_fd and index, the function
>> bpf_perf_event_read() can convert the corresponding map
>> value to the pointer to struct perf_event and return the
>> Hardware PMU counter value.
>>
>> Signed-off-by: Kaixu Xia 
> ...
>> +static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
>> +{
>> +struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
>> +struct bpf_array *array = container_of(map, struct bpf_array, map);
>> +struct perf_event *event;
>> +
>> +if (index >= array->map.max_entries)
>> +return -E2BIG;
>> +
>> +event = array->events[index];
>> +if (!event)
>> +return -EBADF;
> 
> probably ENOENT makes more sense here.
> 
>> +
>> +if (event->state != PERF_EVENT_STATE_ACTIVE)
>> +return -ENOENT;
> 
> and -EINVAL here?

Yeah, those errno values are better.

Thanks!
> 
> 
> .
> 
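
The review above suggests returning ENOENT for an empty slot and EINVAL for an event that is not active. A user-space sketch of the lookup with those suggested codes follows. It only models the control flow discussed in this thread: the types (struct event, struct event_array) and the function event_array_read are hypothetical, and this is not the kernel helper itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum event_state { EVENT_INACTIVE, EVENT_ACTIVE };

struct event {
	enum event_state state;
	unsigned long long count;
};

struct event_array {
	size_t max_entries;
	struct event **events;
};

/* Returns the counter value, or a negative errno describing the failure. */
static long long event_array_read(const struct event_array *a, size_t index)
{
	const struct event *e;

	assert(a != NULL && a->events != NULL);

	if (index >= a->max_entries)
		return -E2BIG;		/* index outside the array */

	e = a->events[index];
	if (!e)
		return -ENOENT;		/* slot exists but holds no event */

	if (e->state != EVENT_ACTIVE)
		return -EINVAL;		/* event present but not counting */

	return (long long)e->count;
}
```

Distinct codes let a caller tell an out-of-range index, an empty slot, and an inactive event apart without extra queries.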



Re: [PATCH v3 3/3] samples/bpf: example of get selected PMU counter value

2015-07-23 Thread xiakaixu

On 2015/7/24 6:59, Alexei Starovoitov wrote:
> On 7/23/15 2:42 AM, Kaixu Xia wrote:
>> This is a simple example and shows how to use the new ability
>> to get the selected Hardware PMU counter value.
>>
>> Signed-off-by: Kaixu Xia 
> ...
>> +struct bpf_map_def SEC("maps") my_map = {
>> +.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
>> +.key_size = sizeof(int),
>> +.value_size = sizeof(unsigned long),
>> +.max_entries = 32,
>> +};
> 
> wait. how did it work here? value_size should be u32.

I tested the whole thing on an ARM board. You are right, it
should be u32.
When creating the array map, we choose array->elem_size as
round_up(attr->value_size, 8). Why 8?

Thanks!
> 
> 
> 
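
For reference, the arithmetic of round_up(attr->value_size, 8) mentioned above can be sketched as follows. This shows only what the rounding computes, not why 8 was chosen, which the thread leaves open. The function name round_up_pow2 is hypothetical and the sketch assumes a power-of-two alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}
```

Under this arithmetic a 4-byte value_size and an 8-byte value_size both produce an 8-byte element size.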



linux-next: build failure after merge of the input tree

2015-07-23 Thread Stephen Rothwell

Hi Dmitry,

After merging the input tree, today's linux-next build (arm
multi_v7_defconfig) failed like this:

drivers/input/keyboard/samsung-keypad.c: In function 'samsung_keypad_parse_dt':
drivers/input/keyboard/samsung-keypad.c:302:40: error: 'pp' undeclared (first 
use in this function)
  pdata->wakeup = of_property_read_bool(pp, "wakeup-source") ||
^

Caused by commit

  7e324dd6cc21 ("Input: samsung-keypad - change name of wakeup property")

Please at least build test these changes :-(

I have used the input tree from next-20150723 for today.

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au

Re: [RFC PATCH] tools lib traceevent: Allow setting an alternative symbol resolver

2015-07-23 Thread Steven Rostedt

On Thu, 23 Jul 2015 22:34:48 -0300
Arnaldo Carvalho de Melo  wrote:


> > > One more try:
> 
> > Third time's a charm, or was this the forth?
> 
> As many as needed would be put forth!

And the Lord said unto John, "Come forth and you will receive eternal life" 
But John came fifth, and won a toaster.

/me hides.

-- Steve

Re: [PATCH v3] mtd: nand support for Toshiba BENAND (Built-in ECC NAND)

2015-07-23 Thread Brian Norris

On Fri, Jul 24, 2015 at 09:52:18AM +0900, KOBAYASHI Yoshitake wrote:
> This patch enables support for Toshiba BENAND. BENAND is an SLC NAND
> solution that automatically generates ECC inside the NAND chip.
> 
> I considered using the on-die ECC patch, but I believe its approach of
> reading the same page twice will affect read performance. Additionally,
> BENAND does not support disabling ECC, so I cannot use the same approach.
> 
> The BENAND Read Status CMD(70h) can report Rewrite Recommend. BENAND
> also has an extended ECC Status CMD(7Ah), which can report the number of
> bit errors per sector. When the MTD_NAND_BENAND_ENABLE option is enabled,
> this driver uses the extended ECC status CMD(7Ah) to report the real number
> of bitflips. Where CMD(7Ah) is difficult to use, this driver uses
> "mtd->bitflip_threshold" to estimate the number of bitflips.
> 
> Signed-off-by: KOBAYASHI Yoshitake 
> ---

Where's the changelog? I see that this is version 3...

Brian

Re: [PATCH 2/7] tty: core: Add tty_debug() for printk(KERN_DEBUG) messages

2015-07-23 Thread Greg Kroah-Hartman

On Sun, Jul 12, 2015 at 10:49:08PM -0400, Peter Hurley wrote:
> Introduce tty_debug() macro to output uniform debug information for
> tty core debug messages (function name and tty name).
> 
> Note: printk(KERN_DEBUG) is retained here over pr_debug() since
> messages can be enabled in non-DEBUG builds.

But pr_debug() is the "standard" way to enable/disable debugging
messages, so I'd really like to see that be used here.

Even better, this is a tty device, so it should be using dev_dbg(),
which gives us tons of good information built-in for the tty and can
properly be parsed by userspace tools to know exactly what device caused
what message at what point in time.

So I'll take this for now, but moving it to use dev_dbg() would be best
eventually.

thanks,

greg k-h

Re: [RFC PATCH] tools lib traceevent: Allow setting an alternative symbol resolver

2015-07-23 Thread Arnaldo Carvalho de Melo

On Thu, Jul 23, 2015 at 06:07:30PM -0400, Steven Rostedt wrote:
> On Thu, 23 Jul 2015 18:58:36 -0300
> Arnaldo Carvalho de Melo  wrote:
> 
> > On Thu, Jul 23, 2015 at 06:52:46PM -0300, Arnaldo Carvalho de Melo wrote:
> > > On Thu, Jul 23, 2015 at 05:35:24PM -0400, Steven Rostedt wrote:
> > > > Also I wonder if we should add a way to clear the resolver. That is,
> > > > you want to use the default resolver?

> > > I am adding a reset_function_resolver(pevent);

> > > > Not really a necessity, as I don't see any current programs using it,
> > > > but it would complete the interface.

> > One more try:

> Third time's a charm, or was this the forth?

As many as needed would be put forth!
 
> Reviewed-by: Steven Rostedt 

Thanks!

- Arnaldo

RE: [PATCH v2] thermal: consistently use int for temperatures

2015-07-23 Thread Zhang, Rui

As the original patch has not been in upstream, I'd prefer a refreshed patch, 
rather than an incremental fix.

Thanks,
rui

> -Original Message-
> From: Sascha Hauer [mailto:s.ha...@pengutronix.de]
> Sent: Thursday, July 23, 2015 6:38 PM
> To: Zhang, Rui
> Cc: Punit Agrawal; linux...@vger.kernel.org; Eduardo Valentin; linux-
> ker...@vger.kernel.org; Jean Delvare; Peter Feuerer; Heiko Stuebner;
> Lukasz Majewski; Stephen Warren; Thierry Reding; linux-
> a...@vger.kernel.org; platform-driver-...@vger.kernel.org; linux-arm-
> ker...@lists.infradead.org; linux-o...@vger.kernel.org; linux-samsung-
> s...@vger.kernel.org; Guenter Roeck; Rafael J. Wysocki; Maxime Ripard;
> Darren Hart; lm-sens...@lm-sensors.org
> Subject: Re: [PATCH v2] thermal: consistently use int for temperatures
> Importance: High
> 
> Hi Zhang,
> 
> On Tue, Jul 21, 2015 at 01:35:31PM +, Zhang, Rui wrote:
> > > >
> > Patch applied.
> 
> Thanks for applying. I missed converting another place, so we get a new
> compiler warning. The attached patch fixes this (suitable for
> git rebase --autosquash). Please let me know if you can handle this or if
> you prefer a new patch instead.
> 
> Thanks
>  Sascha
> 
> 
> -8<-
> 
> From 4907a7c32fd16eaf9f31d9f904276c9a0176b717 Mon Sep 17 00:00:00 2001
> From: Sascha Hauer 
> Date: Thu, 23 Jul 2015 12:32:31 +0200
> Subject: [PATCH] fixup! thermal: consistently use int for temperatures
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> fixes:
> drivers/power/charger-manager.c: In function
> ‘cm_get_battery_temperature’:
> drivers/power/charger-manager.c:622:45: warning: passing argument 2 of
> ‘thermal_zone_get_temp’ from incompatible pointer type
>ret = thermal_zone_get_temp(cm->tzd_batt, (unsigned long *)temp);
> 
> Signed-off-by: Sascha Hauer 
> ---
>  drivers/power/charger-manager.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/power/charger-manager.c b/drivers/power/charger-manager.c
> index 1c202cc..907293e 100644
> --- a/drivers/power/charger-manager.c
> +++ b/drivers/power/charger-manager.c
> @@ -619,7 +619,7 @@ static int cm_get_battery_temperature(struct charger_manager *cm,
> 
>  #ifdef CONFIG_THERMAL
>   if (cm->tzd_batt) {
> - ret = thermal_zone_get_temp(cm->tzd_batt, (unsigned long *)temp);
> + ret = thermal_zone_get_temp(cm->tzd_batt, temp);
>   if (!ret)
>   /* Calibrate temperature unit */
>   *temp /= 100;
> --
> 2.1.4
> 
> 
> --
> Pengutronix e.K.   | |
> Industrial Linux Solutions | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0|
> Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |
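
The fixup quoted above drops a cast so that the caller's int pointer is passed straight through. A minimal userspace sketch of that calling pattern follows. sketch_zone_get_temp is a hypothetical stand-in for thermal_zone_get_temp(), and the raw value is assumed to be in millidegrees Celsius.

```c
/* Hypothetical stand-in for thermal_zone_get_temp(): reports through an int *. */
static int sketch_zone_get_temp(int raw, int *temp)
{
	*temp = raw;
	return 0;
}

/* Mirrors the shape of cm_get_battery_temperature() after the fixup. */
int sketch_get_battery_temperature(int raw, int *temp)
{
	int ret;

	/* Both sides use int, so no pointer cast is needed. */
	ret = sketch_zone_get_temp(raw, temp);
	if (!ret)
		/* Calibrate temperature unit */
		*temp /= 100;

	return ret;
}
```

With a signed int throughout, a below-zero reading such as -5000 simply comes back as -50, which is the kind of value the "consistently use int" conversion is meant to carry.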

RE: [RFC v2 0/6] IRQ bypass manager and irqfd consumer

2015-07-23 Thread Wu, Feng



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Eric Auger
> Sent: Monday, July 06, 2015 8:11 PM
> To: eric.au...@st.com; eric.au...@linaro.org;
> linux-arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu;
> k...@vger.kernel.org; christoffer.d...@linaro.org; marc.zyng...@arm.com;
> alex.william...@redhat.com; pbonz...@redhat.com; avi.kiv...@gmail.com;
> mtosa...@redhat.com; Wu, Feng; j...@8bytes.org;
> b.rey...@virtualopensystems.com
> Cc: linux-kernel@vger.kernel.org; patc...@linaro.org
> Subject: [RFC v2 0/6] IRQ bypass manager and irqfd consumer
> 
> This series introduces and extends the IRQ bypass manager written
> by Alex and transforms irqfd into an IRQ bypass manager consumer.
> The producer part, in my case the VFIO platform driver will be introduced
> separately. That code should be usable by both ARM IRQ forwarding
> and Intel Posted Interrupts.
> 
> The IRQ bypass manager enables setting up a negotiated link between an
> IRQ producer and an IRQ consumer (typically the VFIO driver and KVM irqfd).
> 
> The series currently includes Alex's code which was sent by email.
> Its structure obviously will adapt to Alex's will.
> 
> Also the irq bypass manager gets compiled/linked on arm/arm64 along
> with KVM and VFIO platform driver.
> 
> can be found at:
> https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.2-rc1-bypass-fwd-v2
> 
> Best Regards
> 
> Eric
> 
> History:
> v1 -> v2:
> - isolate the bypass manager and irqfd consumer in this series
> - take into account Paolo's comments and use container_of strategy and
>   remove additional fields introduced in v1.
> - create kvm_irqfd.h
> - add unregistration in irqfd_shutdown

Hi Eric,

[4/6], [5/6], and [6/6] of this series are common to forwarded irq and posted
interrupts, did you have a chance to get a new version of them based on
Alex's latest irqbypass manager patch:

https://lkml.org/lkml/2015/7/16/810

Thanks a lot!

Thanks,
Feng

> 
> v1: originally part of [RFC 00/17] ARM IRQ forward control based on IRQ
> bypass manager (https://lkml.org/lkml/2015/7/2/268)
> 
> 
> Eric Auger (6):
>   KVM: arm/arm64: select IRQ_BYPASS_MANAGER
>   VFIO: platform: select IRQ_BYPASS_MANAGER
>   irq: bypass: Extend skeleton for ARM forwarding control
>   KVM: create kvm_irqfd.h
>   KVM: introduce kvm_arch functions for IRQ bypass
>   KVM: eventfd: add irq bypass consumer management
> 
>  arch/arm/kvm/Kconfig  |   1 +
>  arch/arm64/kvm/Kconfig|   1 +
>  drivers/vfio/platform/Kconfig |   1 +
>  include/linux/irqbypass.h |  19 ++--
>  include/linux/kvm_host.h  |  37 ++
>  include/linux/kvm_irqfd.h |  70 +++
>  kernel/irq/bypass.c   |  44 +++--
>  virt/kvm/Kconfig  |   3 ++
>  virt/kvm/eventfd.c| 109 
> +-
>  9 files changed, 203 insertions(+), 82 deletions(-)
>  create mode 100644 include/linux/kvm_irqfd.h
> 
> --
> 1.9.1
> 

Re: Revisiting patch dependencies

2015-07-23 Thread Minfei Huang

On 07/23/15 at 01:07pm, Josh Poimboeuf wrote:
> On Thu, Jul 23, 2015 at 12:02:06PM +0800, Minfei Huang wrote:
> > On 07/22/15 at 09:40am, Josh Poimboeuf wrote:
> > > Is it really safe to assume that there are no dependencies between
> > > patches which patch different objects?
> > > 
> > 
> > I think so.
> 
> What about the following scenario:
> 
> 1. register and enable patch A, which patches vmlinux_func() and changes
>its call signature
> 2. register and enable patch B, which patches a (not yet loaded) module
>M so that it will call vmlinux_func() with its new call signature
> 3. load module M, which is immediately patched by patch B
> 4. disable patch A.  Now the patched module M calls the unpatched
>version of vmlinux_func() with the wrong call signature - BOOM
> 
> In this case B, a patch to a module, would have an implicit dependency
> on A, a patch to vmlinux.
> 
> So I don't think the approach in the above patch would work.  But I *do*
> think we may need to revisit how we handle dependencies...
> 
> Note that our current patch stacking protects against unloading out of
> order, but it assumes that the user loaded them in the correct order in
> the first place.  If M and B are loaded before A, then it would still go
> boom even with today's code.
> 
> So IMO the way we handle dependencies today is incomplete.  Some options
> for improvement are:
> 
> a) Don't allow dependencies between patches.  Instead all dependencies
>must be contained within the patch itself.  So patch A and patch B
>are combined into a single patch AB.  If, later, a new patch C is
>needed, which also depends on A, then create a new cumulative patch
>ABC which replaces AB.
> 
>Note there's no way to enforce the fact there are no dependencies,
>because they can be hidden.  So it would just have to be a documented
>rule that the patch author must follow, as part of the (yet to be
>written) patch creation guidelines.  This actually isn't a big deal
>because there are several other (still undocumented) rules the patch
>author must already follow.
> 
>This would mean that klp code can assume there are no dependencies,
>and so patch stacking would no longer be necessary.  We'd probably
>have to rework the ops->func_stack code a bit so that it's ordered by
>when the patches were registered instead of when they were enabled,
>so that disabling and re-enabling an older patch wouldn't override a
>newer cumulative one which replaces it.
> 
> b) Create a way for the patch author to specify explicit patch
>dependencies.
> 
> Note that both options a and b delegate responsibility to the patch
> author to ensure that dependencies are handled appropriately.
> Ultimately I don't think there's any way around that.
> 
> My vote would be option a for now, by removing patch stacking and
> documenting the guidelines.  With the eventual possibility of adding b
> if needed.

Thanks for your explanation.

Yes, the kernel may crash if module M calls the unpatched, exported
function vmlinux_func().

I would prefer option b, since users can define their own rules to restrict
which patches are enabled or disabled. That may keep the livepatch code
simpler.

Thanks
Minfei
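
To make option b from the quoted proposal concrete, here is a minimal userspace sketch of explicit patch dependencies. It is not livepatch code. struct sketch_patch, sketch_enable and sketch_disable are hypothetical names, and a single depends_on pointer is assumed for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical patch descriptor with one explicit dependency. */
struct sketch_patch {
	const char *name;
	bool enabled;
	const struct sketch_patch *depends_on;	/* NULL if independent */
};

/* Refuse to enable a patch whose dependency is not enabled yet. */
int sketch_enable(struct sketch_patch *p)
{
	if (p->depends_on && !p->depends_on->enabled)
		return -1;

	p->enabled = true;
	return 0;
}

/* Refuse to disable a patch while an enabled patch still depends on it. */
int sketch_disable(struct sketch_patch *p,
		   struct sketch_patch *const all[], size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (all[i]->enabled && all[i]->depends_on == p)
			return -1;

	p->enabled = false;
	return 0;
}
```

In the scenario from the thread, B (the module patch) would name A (the vmlinux patch) as its dependency, so both "enable B before A" and "disable A while B is enabled" are rejected instead of ending in a call with the wrong signature.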

[RFC V2 PATCH 1/1] panic/x86: Replace smp_send_stop() with crash_kexec version

2015-07-23 Thread Hidehiro Kawai

This patch fixes one of the problems reported by Daniel Walker
(https://lkml.org/lkml/2015/6/24/44).

If "crash_kexec_post_notifiers" boot option is specified,
other cpus are stopped by smp_send_stop() before entering
crash_kexec(), while usually machine_crash_shutdown() called by
crash_kexec() does that.  This behavior change leads to two problems.

 Problem 1:
 Some functions in the crash_kexec() path depend on other cpus being
 still online.  If other cpus have been offlined already, they
 don't work properly.

  Example (MIPS OCTEON case):
   panic()
crash_kexec()
 machine_crash_shutdown()
  octeon_generic_shutdown() // shutdown watchdog for ONLINE cpus
 machine_kexec()

 Problem 2:
 Most architectures stop other cpus in the machine_crash_shutdown()
 path and save register information at that time.  However, if
 smp_send_stop() is called before that, we can't save the register
 information.

This patch solves Problem 2 by replacing smp_send_stop() in
panic() with panic_smp_stop_cpus(), which is a weak function and can be
replaced with a suitable version for the crash_kexec context.  In fact,
x86 replaces it with a function based on kdump_nmi_shootdown_cpus() to
stop other cpus and save their states.

Please note that crash_kexec() can be called directly without
entering panic().  A stop-other-cpus procedure is still needed
by crash_kexec().

Changes in V2:
- Replace smp_send_stop() call with crash_kexec version which
  saves cpu states and cleans up VMX/SVM
- Drop a fix for Problem 1 at this moment

Reported-by: Daniel Walker 
Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" option
Signed-off-by: Hidehiro Kawai 
Cc: Andrew Morton 
Cc: Eric Biederman 
Cc: Vivek Goyal 
---
 arch/x86/kernel/crash.c |   16 +++-
 kernel/panic.c  |   29 +++--
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..913c621 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -130,16 +130,22 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
disable_local_APIC();
 }
 
-static void kdump_nmi_shootdown_cpus(void)
+/* Please see the comment on the weak version in kernel/panic.c */
+void panic_smp_stop_cpus(void)
 {
+   static int cpus_stopped;
+
in_crash_kexec = 1;
-   nmi_shootdown_cpus(kdump_nmi_callback);
 
-   disable_local_APIC();
+   if (!cpus_stopped) {
+   nmi_shootdown_cpus(kdump_nmi_callback);
+   disable_local_APIC();
+   cpus_stopped = 1;
+   }
 }
 
 #else
-static void kdump_nmi_shootdown_cpus(void)
+void panic_smp_stop_cpus(void)
 {
/* There are no cpus to shootdown */
 }
@@ -158,7 +164,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
/* The kernel is broken so disable interrupts */
local_irq_disable();
 
-   kdump_nmi_shootdown_cpus();
+   panic_smp_stop_cpus();
 
/*
 * VMCLEAR VMCSs loaded on this cpu if needed.
diff --git a/kernel/panic.c b/kernel/panic.c
index 04e91ff..a507637 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -60,6 +60,28 @@ void __weak panic_smp_self_stop(void)
cpu_relax();
 }
 
+/*
+ * Stop other cpus in panic.  Architecture code may override this
+ * with a more suitable version.  Moreover, if the architecture supports
+ * crash dump, it should also save the states of stopped cpus.
+ *
+ * This function should be called only once.
+ */
+void __weak panic_smp_stop_cpus(void)
+{
+   static int cpus_stopped;
+
+   if (!cpus_stopped) {
+   /*
+* Note smp_send_stop is the usual smp shutdown function,
+* which unfortunately means it may not be hardened to
+* work in a panic situation.
+*/
+   smp_send_stop();
+   cpus_stopped = 1;
+   }
+}
+
 /**
  * panic - halt the system
  * @fmt: The text string to print
@@ -120,12 +142,7 @@ void panic(const char *fmt, ...)
if (!crash_kexec_post_notifiers)
crash_kexec(NULL);
 
-   /*
-* Note smp_send_stop is the usual smp shutdown function, which
-* unfortunately means it may not be hardened to work in a panic
-* situation.
-*/
-   smp_send_stop();
+   panic_smp_stop_cpus();
 
/*
 * Run any panic handlers, including those that might need to



[RFC V2 PATCH 0/1] kexec: crash_kexec_post_notifiers boot option related fixes

2015-07-23 Thread Hidehiro Kawai

This is a bugfix patch for crash_kexec_post_notifiers boot option
which allows users to call panic notifiers and kmsg dumpers before
kdump.

This fixes one of the problems reported by Daniel Walker
(https://lkml.org/lkml/2015/6/24/44). 

 Problem 1:
 If crash_kexec_post_notifiers boot option is specified, some
 shutdown processing which assumes other cpus are still alive
 doesn't work properly.

 Problem 2 (addressed by this patch):
 If crash_kexec_post_notifiers boot option is specified, register
 information of other cpus is not saved to crash dumps.

Following Vivek's opinion, this patch replaces smp_send_stop()
in panic() with a suitable version for crash_kexec which saves
cpu states and does other things like cleaning up VMX/SVM.  Since this
needs an architecture-specific implementation and it's not so trivial,
this version supports only x86.  So Problem 1, known to
happen on MIPS/OCTEON, is not addressed now.

To keep the modification impact low, this patch basically doesn't change
the logic if crash_kexec_post_notifiers is not specified.

Please note that crash_kexec() can be called directly without
entering panic().  The functionality to stop other cpus is still
needed in crash_kexec().

Changes in V2:
- Replace smp_send_stop() call with crash_kexec version which
  saves cpu states and does cleanups instead of changing execution
  flow
- Drop a fix for Problem 1
- Drop other patches because they aren't needed anymore

V1: https://lkml.org/lkml/2015/7/10/316

---

Hidehiro Kawai (1):
  panic/x86: Replace smp_send_stop() with crash_kexec version


 arch/x86/kernel/crash.c |   16 +++-
 kernel/panic.c  |   29 +++--
 2 files changed, 34 insertions(+), 11 deletions(-)


-- 
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group
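
The patch in this series guards the stop-other-cpus step with a static flag so that it runs only once, whether it is reached from panic() or from the crash_kexec() path. A minimal userspace sketch of that run-once guard is below. sketch_send_stop is a hypothetical stand-in for smp_send_stop(), and the sketch assumes a single caller at a time, since the flag is not atomic.

```c
/* Counts how often the stand-in stop routine really ran. */
static int sketch_stop_calls;

/* Hypothetical stand-in for smp_send_stop(). */
static void sketch_send_stop(void)
{
	sketch_stop_calls++;
}

/* Same shape as the weak panic_smp_stop_cpus(): later calls are no-ops. */
void sketch_panic_stop_cpus(void)
{
	static int cpus_stopped;

	if (!cpus_stopped) {
		sketch_send_stop();
		cpus_stopped = 1;
	}
}

/* Accessor so a caller can check how many real stops happened. */
int sketch_stop_call_count(void)
{
	return sketch_stop_calls;
}
```

Calling sketch_panic_stop_cpus() twice leaves the counter at one, which is the property the x86 override relies on when native_machine_crash_shutdown() reaches the same function after panic() already has.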



Re: [PATCH v3 7/8] perf: Define PMU_TXN_READ interface

2015-07-23 Thread Sukadev Bhattiprolu

Peter Zijlstra [pet...@infradead.org] wrote:
| On Wed, Jul 22, 2015 at 04:19:16PM -0700, Sukadev Bhattiprolu wrote:
| > Peter Zijlstra [pet...@infradead.org] wrote:
| > | I've not woken up yet, and not actually fully read the email, but can
| > | you stuff the entire above chunk inside the IPI?
| > | 
| > | I think you could then actually optimize __perf_event_read() as well,
| > | because all these events should be on the same context, so no point in
| > | calling update_*time*() for every event or so.
| > | 
| > 
| > Do you mean something like this (will move the rename to a separate
| > patch before posting):
| 
| More like so.. please double check, I've not even had tea yet.

Yeah, I realized I had ignored the 'event->cpu' spec.
Will try this out. Thanks,

Sukadev


[PATCH] HID: hid-lg: Add USBID for Logitech G29 Wheel

2015-07-23 Thread Simon Wood

Since this wheel is now available, and the USBID is listed on their website,
this patch adds it to allow the hid-lg4ff force feedback driver to find it.

I do not have this wheel to test with, but this should at least get it working
in emulation mode.

Note: There is probably more work required to adjust the HID descriptor and handle
switching between emulation and native modes.

Signed-off-by: Simon Wood 
---
 drivers/hid/hid-ids.h | 1 +
 drivers/hid/hid-lg.c  | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/hid/hid-ids.h b/drivers/hid/hid-ids.h
index b04b082..653bfd4 100644
--- a/drivers/hid/hid-ids.h
+++ b/drivers/hid/hid-ids.h
@@ -599,6 +599,7 @@
 #define USB_DEVICE_ID_LOGITECH_DUAL_ACTION 0xc216
 #define USB_DEVICE_ID_LOGITECH_RUMBLEPAD2  0xc218
 #define USB_DEVICE_ID_LOGITECH_RUMBLEPAD2_2  0xc219
+#define USB_DEVICE_ID_LOGITECH_G29_WHEEL   0xc24f
 #define USB_DEVICE_ID_LOGITECH_WINGMAN_F3D 0xc283
 #define USB_DEVICE_ID_LOGITECH_FORCE3D_PRO 0xc286
 #define USB_DEVICE_ID_LOGITECH_FLIGHT_SYSTEM_G940  0xc287
diff --git a/drivers/hid/hid-lg.c b/drivers/hid/hid-lg.c
index 429340d..5332fb7 100644
--- a/drivers/hid/hid-lg.c
+++ b/drivers/hid/hid-lg.c
@@ -776,6 +776,8 @@ static const struct hid_device_id lg_devices[] = {
.driver_data = LG_FF },
    { HID_USB_DEVICE(USB_VENDOR_ID_LOGITECH, USB_DEVICE_ID_LOGITECH_RUMBLEPAD2_2),
        .driver_data = LG_FF },
+   { HID_USB_DEVICE(USB_VENDOR_ID_LOGITECH, USB_DEVICE_ID_LOGITECH_G29_WHEEL),
+       .driver_data = LG_FF4 },
    { HID_USB_DEVICE(USB_VENDOR_ID_LOGITECH, USB_DEVICE_ID_LOGITECH_WINGMAN_F3D),
        .driver_data = LG_FF },
    { HID_USB_DEVICE(USB_VENDOR_ID_LOGITECH, USB_DEVICE_ID_LOGITECH_FORCE3D_PRO),
-- 
2.1.4


[PATCH v2] Yama: remove needless CONFIG_SECURITY_YAMA_STACKED

2015-07-23 Thread Kees Cook

Now that minor LSMs can cleanly stack with major LSMs, remove the unneeded
config for Yama to be made to explicitly stack. Just selecting the main
Yama CONFIG will allow it to work, regardless of the major LSM. Since
distros using Yama are already forcing it to stack, this is effectively
a no-op change.

Additionally add MAINTAINERS entry.

Signed-off-by: Kees Cook 
---
v2:
- add MAINTAINERS entry
- drop CONFIG_DEFAULT_SECURITY_YAMA
- explicitly use yama_add_hooks to designate execution order
---
 Documentation/security/Yama.txt   | 10 --
 MAINTAINERS   |  6 ++
 arch/mips/configs/pistachio_defconfig |  1 -
 include/linux/lsm_hooks.h |  6 --
 security/Kconfig  |  5 -
 security/security.c   | 11 +++
 security/yama/Kconfig |  9 +
 security/yama/yama_lsm.c  | 32 ++--
 8 files changed, 28 insertions(+), 52 deletions(-)

diff --git a/Documentation/security/Yama.txt b/Documentation/security/Yama.txt
index 227a63f018a2..d9ee7d7a6c7f 100644
--- a/Documentation/security/Yama.txt
+++ b/Documentation/security/Yama.txt
@@ -1,9 +1,7 @@
-Yama is a Linux Security Module that collects a number of system-wide DAC
-security protections that are not handled by the core kernel itself. To
-select it at boot time, specify "security=yama" (though this will disable
-any other LSM).
-
-Yama is controlled through sysctl in /proc/sys/kernel/yama:
+Yama is a Linux Security Module that collects system-wide DAC security
+protections that are not handled by the core kernel itself. This is
+selectable at build-time with CONFIG_SECURITY_YAMA, and can be controlled
+at run-time through sysctls in /proc/sys/kernel/yama:
 
 - ptrace_scope
 
diff --git a/MAINTAINERS b/MAINTAINERS
index 2d3d55c8f5be..f013d89d4c61 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9101,6 +9101,12 @@ T:   git git://git.kernel.org/pub/scm/linux/kernel/git/jj/apparmor-dev.git
 S: Supported
 F: security/apparmor/
 
+YAMA SECURITY MODULE
+M: Kees Cook 
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git yama/tip
+S: Supported
+F: security/yama/
+
 SENSABLE PHANTOM
 M: Jiri Slaby 
 S: Maintained
diff --git a/arch/mips/configs/pistachio_defconfig b/arch/mips/configs/pistachio_defconfig
index 1646cce032c3..642b50946943 100644
--- a/arch/mips/configs/pistachio_defconfig
+++ b/arch/mips/configs/pistachio_defconfig
@@ -320,7 +320,6 @@ CONFIG_KEYS=y
 CONFIG_SECURITY=y
 CONFIG_SECURITY_NETWORK=y
 CONFIG_SECURITY_YAMA=y
-CONFIG_SECURITY_YAMA_STACKED=y
 CONFIG_DEFAULT_SECURITY_DAC=y
 CONFIG_CRYPTO_AUTHENC=y
 CONFIG_CRYPTO_HMAC=y
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 9429f054c323..ec3a6bab29de 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1881,8 +1881,10 @@ static inline void security_delete_hooks(struct security_hook_list *hooks,
 
 extern int __init security_module_enable(const char *module);
 extern void __init capability_add_hooks(void);
-#ifdef CONFIG_SECURITY_YAMA_STACKED
-void __init yama_add_hooks(void);
+#ifdef CONFIG_SECURITY_YAMA
+extern void __init yama_add_hooks(void);
+#else
+static inline void __init yama_add_hooks(void) { }
 #endif
 
 #endif /* ! __LINUX_LSM_HOOKS_H */
diff --git a/security/Kconfig b/security/Kconfig
index bf4ec46474b6..e45237897b43 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -132,7 +132,6 @@ choice
default DEFAULT_SECURITY_SMACK if SECURITY_SMACK
default DEFAULT_SECURITY_TOMOYO if SECURITY_TOMOYO
default DEFAULT_SECURITY_APPARMOR if SECURITY_APPARMOR
-   default DEFAULT_SECURITY_YAMA if SECURITY_YAMA
default DEFAULT_SECURITY_DAC
 
help
@@ -151,9 +150,6 @@ choice
config DEFAULT_SECURITY_APPARMOR
bool "AppArmor" if SECURITY_APPARMOR=y
 
-   config DEFAULT_SECURITY_YAMA
-   bool "Yama" if SECURITY_YAMA=y
-
config DEFAULT_SECURITY_DAC
bool "Unix Discretionary Access Controls"
 
@@ -165,7 +161,6 @@ config DEFAULT_SECURITY
default "smack" if DEFAULT_SECURITY_SMACK
default "tomoyo" if DEFAULT_SECURITY_TOMOYO
default "apparmor" if DEFAULT_SECURITY_APPARMOR
-   default "yama" if DEFAULT_SECURITY_YAMA
default "" if DEFAULT_SECURITY_DAC
 
 endmenu
diff --git a/security/security.c b/security/security.c
index 595fffab48b0..e693ffcf9266 100644
--- a/security/security.c
+++ b/security/security.c
@@ -56,18 +56,13 @@ int __init security_init(void)
pr_info("Security Framework initialized\n");
 
/*
-* Always load the capability module.
+* Load minor LSMs, with the capability module always first.
 */
capability_add_hooks();
-#ifdef CONFIG_SECURITY_YAMA_STACKED
-   /*
-* If Yama is configured for stacking load it next.
-*/
yama_add_hooks();
-#endif
+
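
The lsm_hooks.h hunk above uses a common kernel idiom: when the config option is off, the hook-registration function becomes an empty static inline stub, so the caller needs no #ifdef. A minimal userspace sketch of that idiom follows. CONFIG_SKETCH_YAMA and all sketch_* names are hypothetical, and registration is modeled as appending to a small array.

```c
#include <stddef.h>

#define CONFIG_SKETCH_YAMA 1	/* comment out to build without the minor LSM */

enum sketch_lsm { SKETCH_CAPABILITY = 1, SKETCH_YAMA };

/* Records the order in which the modules registered their hooks. */
static enum sketch_lsm sketch_order[4];
static size_t sketch_count;

static void sketch_register(enum sketch_lsm id)
{
	if (sketch_count < sizeof(sketch_order) / sizeof(sketch_order[0]))
		sketch_order[sketch_count++] = id;
}

static void sketch_capability_add_hooks(void)
{
	sketch_register(SKETCH_CAPABILITY);
}

#ifdef CONFIG_SKETCH_YAMA
static void sketch_yama_add_hooks(void)
{
	sketch_register(SKETCH_YAMA);
}
#else
/* Empty stub: the call below compiles away when the option is off. */
static inline void sketch_yama_add_hooks(void) { }
#endif

/* Caller stays free of #ifdef; capability always registers first. */
void sketch_security_init(void)
{
	sketch_capability_add_hooks();
	sketch_yama_add_hooks();
}
```

The explicit call order in sketch_security_init() plays the role that the v2 changelog describes as using yama_add_hooks to designate execution order.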

[PATCH] LSM: LoadPin for module and firmware loading restrictions

2015-07-23 Thread Kees Cook

This LSM enforces that kernel-loaded modules and firmware must all come
from the same filesystem, with the expectation that such a filesystem
is backed by a read-only device such as dm-verity or CDROM. This allows
systems that have a verified and/or unchangeable filesystem to enforce
module and firmware loading restrictions without needing to sign the
files individually.

Signed-off-by: Kees Cook 
---
 MAINTAINERS|   6 +
 include/linux/lsm_hooks.h  |   5 +
 security/Kconfig   |   1 +
 security/Makefile  |   2 +
 security/loadpin/Kconfig   |   9 ++
 security/loadpin/Makefile  |   1 +
 security/loadpin/loadpin.c | 279 +
 security/security.c|   2 +
 8 files changed, 305 insertions(+)
 create mode 100644 security/loadpin/Kconfig
 create mode 100644 security/loadpin/Makefile
 create mode 100644 security/loadpin/loadpin.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 2d3d55c8f5be..671e760cbe85 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9101,6 +9101,12 @@ T:   git git://git.kernel.org/pub/scm/linux/kernel/git/jj/apparmor-dev.git
 S: Supported
 F: security/apparmor/
 
+LOADPIN SECURITY MODULE
+M: Kees Cook 
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git lsm/loadpin
+S: Supported
+F: security/loadpin/
+
 SENSABLE PHANTOM
 M: Jiri Slaby 
 S: Maintained
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 9429f054c323..d8ceb8099bc0 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1884,5 +1884,10 @@ extern void __init capability_add_hooks(void);
 #ifdef CONFIG_SECURITY_YAMA_STACKED
 void __init yama_add_hooks(void);
 #endif
+#ifdef CONFIG_SECURITY_LOADPIN
+void __init loadpin_add_hooks(void);
+#else
+static inline void loadpin_add_hooks(void) { };
+#endif
 
 #endif /* ! __LINUX_LSM_HOOKS_H */
diff --git a/security/Kconfig b/security/Kconfig
index bf4ec46474b6..f09b58ef43af 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -122,6 +122,7 @@ source security/selinux/Kconfig
 source security/smack/Kconfig
 source security/tomoyo/Kconfig
 source security/apparmor/Kconfig
+source security/loadpin/Kconfig
 source security/yama/Kconfig
 
 source security/integrity/Kconfig
diff --git a/security/Makefile b/security/Makefile
index c9bfbc84ff50..f2d71cdb8e19 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -8,6 +8,7 @@ subdir-$(CONFIG_SECURITY_SMACK) += smack
 subdir-$(CONFIG_SECURITY_TOMOYO)+= tomoyo
 subdir-$(CONFIG_SECURITY_APPARMOR) += apparmor
 subdir-$(CONFIG_SECURITY_YAMA) += yama
+subdir-$(CONFIG_SECURITY_LOADPIN)  += loadpin
 
 # always enable default capabilities
 obj-y  += commoncap.o
@@ -22,6 +23,7 @@ obj-$(CONFIG_AUDIT)   += lsm_audit.o
 obj-$(CONFIG_SECURITY_TOMOYO)  += tomoyo/
 obj-$(CONFIG_SECURITY_APPARMOR)+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)+= yama/
+obj-$(CONFIG_SECURITY_LOADPIN) += loadpin/
 obj-$(CONFIG_CGROUP_DEVICE)+= device_cgroup.o
 
 # Object integrity file lists
diff --git a/security/loadpin/Kconfig b/security/loadpin/Kconfig
new file mode 100644
index ..8efb8458a9a2
--- /dev/null
+++ b/security/loadpin/Kconfig
@@ -0,0 +1,9 @@
+config SECURITY_LOADPIN
+   bool "Pin loading of kernel modules and firmware to one filesystem"
+   depends on SECURITY && BLOCK
+   help
+ Kernel module and firmware loading will be pinned to the first
+ filesystem used for loading. Any files that come from other
+ filesystems will be rejected. This is best used on systems
+ without an initrd that have a root filesystem backed by a
+ read-only device such as dm-verity or a CDROM.
diff --git a/security/loadpin/Makefile b/security/loadpin/Makefile
new file mode 100644
index ..c2d77f83037b
--- /dev/null
+++ b/security/loadpin/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_SECURITY_LOADPIN) += loadpin.o
diff --git a/security/loadpin/loadpin.c b/security/loadpin/loadpin.c
new file mode 100644
index ..60efa69c9dfb
--- /dev/null
+++ b/security/loadpin/loadpin.c
@@ -0,0 +1,279 @@
+/*
+ * Module and Firmware Pinning Security Module
+ *
+ * Copyright 2011-2015 Google Inc.
+ *
+ * Author: Kees Cook 
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "LoadPin: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include   /* get_cmdline() */
+#include 
+#include 
+#include
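
The pinning rule described in the changelog and Kconfig help can be sketched in a few lines of userspace C: the first filesystem used for a load becomes the pinned one, and loads from any other filesystem are rejected. This is not the loadpin.c implementation. struct sketch_sb and sketch_load_allowed are hypothetical names, and rejecting a load with no backing filesystem is an assumption of the sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a filesystem's superblock. */
struct sketch_sb {
	int id;
};

/* The filesystem that the first successful load came from. */
static const struct sketch_sb *sketch_pinned;

/* Return 0 if a load from sb is allowed, -EPERM otherwise. */
int sketch_load_allowed(const struct sketch_sb *sb)
{
	/* Assumption: a load with no backing filesystem is refused. */
	if (!sb)
		return -EPERM;

	/* The first load pins the filesystem. */
	if (!sketch_pinned) {
		sketch_pinned = sb;
		return 0;
	}

	return sb == sketch_pinned ? 0 : -EPERM;
}
```

This also shows why the help text recommends systems without an initrd: whichever filesystem is used first gets pinned, so an initrd load would pin the initrd rather than the verified root filesystem.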

[PATCH v3] mtd: nand support for Toshiba BENAND (Built-in ECC NAND)

2015-07-23 Thread KOBAYASHI Yoshitake

This patch enables support for Toshiba BENAND. BENAND is an SLC NAND
solution that automatically generates ECC inside the NAND chip.

I considered using the on-die ECC patch, but I believe its approach of reading
the same page twice would hurt read performance. Additionally, BENAND
does not support disabling ECC, so I cannot use the same approach.

BENAND Read Status CMD(70h) is able to report Rewrite Recommend. BENAND
also has an extended ECC Status CMD(7Ah), which is able to report the
number of bit errors per sector. When the MTD_NAND_BENAND_ENABLE option is
enabled, this driver uses the extended ECC status CMD(7Ah) to report the real
number of bitflips. When CMD(7Ah) is difficult to use, this driver uses
"mtd->bitflip_threshold" to estimate the number of bitflips.

Signed-off-by: KOBAYASHI Yoshitake 
---
 drivers/mtd/nand/Kconfig|   21 +
 drivers/mtd/nand/Makefile   |1 +
 drivers/mtd/nand/nand_base.c|   30 +++-
 drivers/mtd/nand/nand_benand.c  |  154 +++
 include/linux/mtd/nand.h|1 +
 include/linux/mtd/nand_benand.h |   48 
 6 files changed, 253 insertions(+), 2 deletions(-)
 create mode 100644 drivers/mtd/nand/nand_benand.c
 create mode 100644 include/linux/mtd/nand_benand.h

diff --git a/drivers/mtd/nand/Kconfig b/drivers/mtd/nand/Kconfig
index 5b2806a..67e13c2 100644
--- a/drivers/mtd/nand/Kconfig
+++ b/drivers/mtd/nand/Kconfig
@@ -22,6 +22,27 @@ menuconfig MTD_NAND
 
 if MTD_NAND
 
+config MTD_NAND_BENAND
+   tristate
+   depends on MTD_NAND_BENAND_ENABLE
+   default MTD_NAND
+
+config MTD_NAND_BENAND_ENABLE
+   bool "Support for Toshiba BENAND (Built-in ECC NAND)"
+   default y
+   help
+ This enables support for Toshiba BENAND.
+ Toshiba BENAND is a SLC NAND solution that automatically
+ generates ECC inside NAND chip.
+
+config MTD_NAND_BENAND_ECC_STATUS
+   bool "Enable ECC Status Read Command(0x7A)"
+   depends on MTD_NAND_BENAND_ENABLE
+   help
+ This enables support for ECC Status Read Command(0x7A) of BENAND.
+ When this is enabled, the real number of bitflips is reported.
+ Otherwise, an assumed number is reported.
+
 config MTD_NAND_BCH
tristate
select BCH
diff --git a/drivers/mtd/nand/Makefile b/drivers/mtd/nand/Makefile
index 1f897ec..d019309 100644
--- a/drivers/mtd/nand/Makefile
+++ b/drivers/mtd/nand/Makefile
@@ -3,6 +3,7 @@
 #
 
 obj-$(CONFIG_MTD_NAND) += nand.o
+obj-$(CONFIG_MTD_NAND_BENAND)  += nand_benand.o
 obj-$(CONFIG_MTD_NAND_ECC) += nand_ecc.o
 obj-$(CONFIG_MTD_NAND_BCH) += nand_bch.o
 obj-$(CONFIG_MTD_NAND_IDS) += nand_ids.o
diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index ceb68ca..1ffdb5c 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -3522,8 +3523,13 @@ static void nand_decode_ext_id(struct mtd_info *mtd, struct nand_chip *chip,
if (id_len >= 6 && id_data[0] == NAND_MFR_TOSHIBA &&
nand_is_slc(chip) &&
(id_data[5] & 0x7) == 0x6 /* 24nm */ &&
-   !(id_data[4] & 0x80) /* !BENAND */) {
-   mtd->oobsize = 32 * mtd->writesize >> 9;
+   (id_data[4] & 0x80) /* BENAND */) {
+
+   if (IS_ENABLED(CONFIG_MTD_NAND_BENAND))
+   chip->ecc.mode = NAND_ECC_BENAND;
+
+   } else {
+   mtd->oobsize = 32 * mtd->writesize >> 9; /* !BENAND */
}
 
}
@@ -4111,6 +4117,26 @@ int nand_scan_tail(struct mtd_info *mtd)
}
break;
 
+   case NAND_ECC_BENAND:
+   if (!mtd_nand_has_benand()) {
+   pr_warn("CONFIG_MTD_NAND_BENAND not enabled\n");
+   BUG();
+   }
+   ecc->calculate = NULL;
+   ecc->correct = NULL;
+   ecc->read_page = nand_read_page_benand;
+   ecc->read_subpage = nand_read_subpage_benand;
+   ecc->write_page = nand_write_page_raw;
+   ecc->read_page_raw = nand_read_page_raw;
+   ecc->write_page_raw = nand_write_page_raw;
+   ecc->read_oob = nand_read_oob_std;
+   ecc->write_oob = nand_write_oob_std;
+   ecc->size = mtd->writesize;
+   ecc->strength = 8;
+
+   nand_benand_init(mtd);
+   break;
+
case NAND_ECC_NONE:
pr_warn("NAND_ECC_NONE selected by board driver. This is not recommended!\n");
ecc->read_page = nand_read_page_raw;
diff --git a/drivers/mtd/nand/nand_benand.c b/drivers/mtd/nand/nand_benand.c
new file mode 100644
index 000..43f89ae
--- /dev/null
+++

Re: [PATCH v3 1/4] clk: samsung: exynos5250: add cpu clock configuration data and instantiate cpu clock

2015-07-23 Thread Michael Turquette

Quoting Bartlomiej Zolnierkiewicz (2015-07-01 06:10:35)
> From: Thomas Abraham 
> 
> With the addition of the new Samsung specific cpu-clock type, the
> arm clock can be represented as a cpu-clock type. Add the CPU clock
> configuration data and instantiate the CPU clock type for Exynos5250.
> 
> Changes by Bartlomiej:
> - split Exynos5250 support from the original patch
> - moved E5250_CPU_DIV[0,1]() macros to clk-exynos5250.c
> 
> Cc: Tomasz Figa 
> Cc: Michael Turquette 
> Cc: Javier Martinez Canillas 
> Signed-off-by: Thomas Abraham 
> Signed-off-by: Bartlomiej Zolnierkiewicz 

Acked-by: Michael Turquette 

If Kukjin wants to merge this through the samsung tree then an immutable
branch would be much appreciated.

Regards,
Mike

> ---
>  drivers/clk/samsung/clk-exynos5250.c   | 31 +++
>  include/dt-bindings/clock/exynos5250.h |  1 +
>  2 files changed, 32 insertions(+)
> 
> diff --git a/drivers/clk/samsung/clk-exynos5250.c b/drivers/clk/samsung/clk-exynos5250.c
> index 70ec3d2..d87f34d 100644
> --- a/drivers/clk/samsung/clk-exynos5250.c
> +++ b/drivers/clk/samsung/clk-exynos5250.c
> @@ -19,6 +19,7 @@
>  #include 
>  
>  #include "clk.h"
> +#include "clk-cpu.h"
>  
>  #define APLL_LOCK  0x0
>  #define APLL_CON0  0x100
> @@ -748,6 +749,32 @@ static struct samsung_pll_clock exynos5250_plls[nr_plls] __initdata = {
> VPLL_LOCK, VPLL_CON0, NULL),
>  };
>  
> +#define E5250_CPU_DIV0(apll, pclk_dbg, atb, periph, acp, cpud) \
> +   ((((apll) << 24) | ((pclk_dbg) << 20) | ((atb) << 16) | \
> +((periph) << 12) | ((acp) << 8) | ((cpud) << 4)))
> +#define E5250_CPU_DIV1(hpm, copy)  \
> +   (((hpm) << 4) | (copy))
> +
> +static const struct exynos_cpuclk_cfg_data exynos5250_armclk_d[] __initconst = {
> +   { 170, E5250_CPU_DIV0(5, 3, 7, 7, 7, 3), E5250_CPU_DIV1(2, 0), },
> +   { 160, E5250_CPU_DIV0(4, 1, 7, 7, 7, 3), E5250_CPU_DIV1(2, 0), },
> +   { 150, E5250_CPU_DIV0(4, 1, 7, 7, 7, 2), E5250_CPU_DIV1(2, 0), },
> +   { 140, E5250_CPU_DIV0(4, 1, 6, 7, 7, 2), E5250_CPU_DIV1(2, 0), },
> +   { 130, E5250_CPU_DIV0(3, 1, 6, 7, 7, 2), E5250_CPU_DIV1(2, 0), },
> +   { 120, E5250_CPU_DIV0(3, 1, 5, 7, 7, 2), E5250_CPU_DIV1(2, 0), },
> +   { 110, E5250_CPU_DIV0(3, 1, 5, 7, 7, 3), E5250_CPU_DIV1(2, 0), },
> +   { 100, E5250_CPU_DIV0(2, 1, 4, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  90, E5250_CPU_DIV0(2, 1, 4, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  80, E5250_CPU_DIV0(2, 1, 4, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  70, E5250_CPU_DIV0(1, 1, 3, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  60, E5250_CPU_DIV0(1, 1, 3, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  50, E5250_CPU_DIV0(1, 1, 2, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  40, E5250_CPU_DIV0(1, 1, 2, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  30, E5250_CPU_DIV0(1, 1, 1, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  20, E5250_CPU_DIV0(1, 1, 1, 7, 7, 1), E5250_CPU_DIV1(2, 0), },
> +   {  0 },
> +};
> +
>  static const struct of_device_id ext_clk_match[] __initconst = {
> { .compatible = "samsung,clock-xxti", .data = (void *)0, },
> { },
> @@ -797,6 +824,10 @@ static void __init exynos5250_clk_init(struct device_node *np)
> ARRAY_SIZE(exynos5250_div_clks));
> samsung_clk_register_gate(ctx, exynos5250_gate_clks,
> ARRAY_SIZE(exynos5250_gate_clks));
> +   exynos_register_cpu_clock(ctx, CLK_ARM_CLK, "armclk",
> +   mout_cpu_p[0], mout_cpu_p[1], 0x200,
> +   exynos5250_armclk_d, ARRAY_SIZE(exynos5250_armclk_d),
> +   CLK_CPU_HAS_DIV1);
>  
> /*
>  * Enable arm clock down (in idle) and set arm divider
> diff --git a/include/dt-bindings/clock/exynos5250.h b/include/dt-bindings/clock/exynos5250.h
> index 4273891d..8183d1c 100644
> --- a/include/dt-bindings/clock/exynos5250.h
> +++ b/include/dt-bindings/clock/exynos5250.h
> @@ -21,6 +21,7 @@
>  #define CLK_FOUT_CPLL  6
>  #define CLK_FOUT_EPLL  7
>  #define CLK_FOUT_VPLL  8
> +#define CLK_ARM_CLK9
>  
>  /* gate for special clocks (sclk) */
>  #define CLK_SCLK_CAM_BAYER 128
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v6 0/3] cpufreq: Use cpufreq-dt driver for Exynos3250

2015-07-23 Thread Michael Turquette

Quoting Kukjin Kim (2015-07-07 07:43:31)
> Bartlomiej Zolnierkiewicz wrote:
> > 
> > Hi,
> >
> Hi,
> 
> > On Thursday, July 02, 2015 09:42:38 AM Chanwoo Choi wrote:
> > > This patchset uses the cpufreq-dt driver to support Exynos3250 cpufreq.
> > > It was tested on the Exynos3250-based Rinato board.
> > >
> > > Depends on:
> > > - next-20150701 tag (master branch) of linux-next kernel tree
> > > - This patch-set is based on the Exynos5250 patch-set[1] because both patch-sets
> > >   modify 'arch/arm/mach-exynos/exynos.c' to add the compatible string.
> > >   [1] https://lkml.org/lkml/2015/6/29/361
> > >   : [PATCH v2 0/4] cpufreq: use generic cpufreq drivers for Exynos5250 platform
> > >
> > > Changes from v5:
> > > (https://lkml.org/lkml/2015/7/1/324)
> > > - Reorder the cpu dt node in exynos3250-rinato/monk.dts alphabetically.
> > > - Add reviewed-by tag of Krzysztof Kozlowski 
> > >
> > > Changes from v4:
> > > (https://lkml.org/lkml/2014/10/20/215)
> > > - Rebased on latest linux-next git repository.
> > > - Remove unnecessary divider clock flag from divider of DIV_CPU0/DIV_CPU1 register
> > >
> > > Changes from v3:
> > > - This patchset is based on 3.18-rc1 with new patchset[3] of Thomas Abraham
> > >   [3] [PATCH v11 0/6] cpufreq: use generic cpufreq drivers for exynos platforms
> > >   - http://www.spinics.net/lists/arm-kernel/msg370412.html
> > >
> > > Changes from v2:
> > > - Rebased on new patchset of Thomas Abraham
> > >   and for-next branch of samsung-clk.git of Tomasz Figa
> > >
> > > Changes from v1:
> > > - Rebased on new patchset[1] by Thomas Abraham
> > >   [1] [PATCH v10 0/6] cpufreq: use generic cpufreq drivers for exynos platforms
> > >   - http://www.spinics.net/lists/arm-kernel/msg364790.html
> > > - Modify clk-cpu.c to support Exynos3250
> > > - Drop documentation patch on previous patchset[2]
> > >   [2] http://www.spinics.net/lists/cpufreq/msg10265.html
> > > - Add only operating-points for Exynos3250 without armclk-divider-table
> > >
> > > Chanwoo Choi (3):
> > >   clk: samsung: exynos3250: Add cpu clock configuration data and instaniate cpu clock
> > >   ARM: dts: Add CPU OPP and regulator supply property for Exynos3250
> > >   ARM: exynos: Add exynos3250 compatible to use generic cpufreq driver
> > >
> > >  arch/arm/boot/dts/exynos3250-monk.dts   |  4 
> > >  arch/arm/boot/dts/exynos3250-rinato.dts |  4 
> > >  arch/arm/boot/dts/exynos3250.dtsi   | 15 +++
> > >  arch/arm/mach-exynos/exynos.c   |  1 +
> > >  drivers/clk/samsung/clk-exynos3250.c| 32 ++--
> > >  include/dt-bindings/clock/exynos3250.h  |  1 +
> > >  6 files changed, 55 insertions(+), 2 deletions(-)
> > 
> > Reviewed-by: Bartlomiej Zolnierkiewicz 
> > 
> > Thank you for working on this.
> > 
> +1 Thanks.
> 
> Mike and Sylwester, if you're OK with this series, I'd like to pick it up in the
> Samsung tree together. And if you want, I could provide a topic branch for the clk tree.

Kukjin,

A topic branch would be great.

Thanks,
Mike

> 
> Thanks,
> Kukjin
> 

Re: [RFC][PATCH] ipc: Use private shmem or hugetlbfs inodes for shm segments.

2015-07-23 Thread Dave Chinner

On Thu, Jul 23, 2015 at 12:28:33PM -0400, Stephen Smalley wrote:
> The shm implementation internally uses shmem or hugetlbfs inodes
> for shm segments.  As these inodes are never directly exposed to
> userspace and only accessed through the shm operations which are
> already hooked by security modules, mark the inodes with the
> S_PRIVATE flag so that inode security initialization and permission
> checking is skipped.
> 
> This was motivated by the following lockdep warning:
> ===
> [ INFO: possible circular locking dependency detected ]
> 4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: GW
> ---
> httpd/1597 is trying to acquire lock:
> (>rwsem){+.}, at: [] shm_close+0x34/0x130
> (>mmap_sem){++}, at: [] SyS_shmdt+0x4b/0x180
>   [] lock_acquire+0xc7/0x270
>   [] __might_fault+0x7a/0xa0
>   [] filldir+0x9e/0x130
>   [] xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
>   [] xfs_readdir+0x1b4/0x330 [xfs]
>   [] xfs_file_readdir+0x2b/0x30 [xfs]
>   [] iterate_dir+0x97/0x130
>   [] SyS_getdents+0x91/0x120
>   [] entry_SYSCALL_64_fastpath+0x12/0x76
>   [] lock_acquire+0xc7/0x270
>   [] down_read_nested+0x57/0xa0
>   [] xfs_ilock+0x167/0x350 [xfs]
>   [] xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
>   [] xfs_attr_get+0xbd/0x190 [xfs]
>   [] xfs_xattr_get+0x3d/0x70 [xfs]
>   [] generic_getxattr+0x4f/0x70
>   [] inode_doinit_with_dentry+0x162/0x670
>   [] sb_finish_set_opts+0xd9/0x230
>   [] selinux_set_mnt_opts+0x35c/0x660
>   [] superblock_doinit+0x77/0xf0
>   [] delayed_superblock_init+0x10/0x20
>   [] iterate_supers+0xb3/0x110
>   [] selinux_complete_init+0x2f/0x40
>   [] security_load_policy+0x103/0x600
>   [] sel_write_load+0xc1/0x750
>   [] __vfs_write+0x37/0x100
>   [] vfs_write+0xa9/0x1a0
>   [] SyS_write+0x58/0xd0
>   [] entry_SYSCALL_64_fastpath+0x12/0x76
>   [] lock_acquire+0xc7/0x270
>   [] mutex_lock_nested+0x7f/0x3e0
>   [] inode_doinit_with_dentry+0xb9/0x670
>   [] selinux_d_instantiate+0x1c/0x20
>   [] security_d_instantiate+0x36/0x60
>   [] d_instantiate+0x54/0x70
>   [] __shmem_file_setup+0xdc/0x240
>   [] shmem_file_setup+0x10/0x20
>   [] newseg+0x290/0x3a0
>   [] ipcget+0x208/0x2d0
>   [] SyS_shmget+0x54/0x70
>   [] entry_SYSCALL_64_fastpath+0x12/0x76
>   [] __lock_acquire+0x1a78/0x1d00
>   [] lock_acquire+0xc7/0x270
>   [] down_write+0x5a/0xc0
>   [] shm_close+0x34/0x130
>   [] remove_vma+0x45/0x80
>   [] do_munmap+0x2b0/0x460
>   [] SyS_shmdt+0xb5/0x180
>   [] entry_SYSCALL_64_fastpath+0x12/0x76

That's a completely screwed up stack trace. There are *4* syscall
entry points with 4 separate, unrelated syscall chains on that
stack trace, all starting at the same address. How is this a valid
stack trace and not a lockdep bug of some kind?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 3/3] ARM: errata 430973: move !ARCH_MULTIPLATFORM to Kconfig

2015-07-23 Thread Sebastian Reichel

Hi Russell,

On Thu, Jul 23, 2015 at 01:35:53PM +0100, Russell King - ARM Linux wrote:
> On Thu, Jul 23, 2015 at 02:48:03AM +0200, Sebastian Reichel wrote:
> > Having the !ARCH_MULTIPLATFORM dependency in the Kconfig file results
> > in one option less to think about when configuring the kernel.
> 
> > -#if defined(CONFIG_ARM_ERRATA_430973) && !defined(CONFIG_ARCH_MULTIPLATFORM)
> > +#ifdef CONFIG_ARM_ERRATA_430973
> > teq r3, #0x0010 @ only present in r1p*
> > mrceq   p15, 0, r0, c1, c0, 1   @ read aux control register
> > orreq   r0, r0, #(1 << 6)   @ set IBE to 1
> 
> NAK.  Please read the mailing list history, I'm not repeating myself
> again on this.  Thanks.

It's a bit hard to search the mailing list history without a bit
more information.

I guess you prefer to just add the !ARCH_MULTIPLATFORM dependency to
the Kconfig entry without removing the additional check in the code?

-- Sebastian


signature.asc
Description: Digital signature

Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

2015-07-23 Thread Kees Cook

On Thu, Jul 23, 2015 at 4:58 PM, Willy Tarreau  wrote:
> On Thu, Jul 23, 2015 at 04:40:14PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 23, 2015 at 4:36 PM, Kees Cook  wrote:
>> > I've been pondering something like this that is even MORE generic, for
>> > any syscall. Something like a "syscalls" directory under
>> > /proc/sys/kernel, with 1 entry per syscall. "0" is "available", "1" is
>> > disabled, and "-1" disabled until next boot.
>> >
>>
>> It might want to be /proc/sys/kernel/syscalls/[abi]/[name], possibly
>> with more than just those options.  We might want "disabled, returns
>> ENOSYS", "disabled, returns EPERM", and a lock bit.
>>
>> On x86 at least, the implementation's easy -- we can just poke the
>> syscall table.
>
> I wouldn't do it these days. Around 2000-2001, with a friend we designed
> a module with its userland counterpart which was called "overloader". The
> principle was to intercept syscalls in order to enforce some form of
> policies, log values, or remap paths, etc. The first use was to log all
> file creations during a "make install" to more easily build packages. It
> was at the era where it was easy to modify the syscall table from a module,
> in kernel 2.2.
>
> We quickly found that beyond logging/rewriting syscall arguments, it had
> limited use cases when used as a "syscall firewall" because many syscalls
> are still too coarse to decide whether you want to enable/disable them.
> I remember that socketcall() and ioctl() were among the annoying ones.
> Either you totally enable or totally disable. In the end, the only valid
> use cases we found for enabling/disabling a syscall were limited to a very
> small set for debugging purposes, in order to force some application code
> to detect a missing implementation and switch to an alternative (eg: these
> days if you suspect a bug in epoll you could disable it and force the app
> to use poll instead). It was still useful to disable module loading and
> FS mounting but that was about all by then.
>
> All this to say that probably only a handful of tricky syscalls would
> need an on/off switch but clearly not all of them at all, so I'd rather
> add a few entries just for the relevant ones, mainly to fix compatibility
> issues and nothing more. Eg: what's the point of disabling exit(), wait(),
> kill(), fork() or getpid()... It would only increase the difficulty to
> sort out bug reports.
>
> Just my opinion,

Well, I would really like to have something like this around so that I
can trivially globally disable syscalls when they have security risks.
My hack[1] to disable kexec_load, for example, was terrible while I
waited for a kernel that supported the disable_kexec_load sysctl.

-Kees

[1] https://outflux.net/blog/archives/2013/12/10/live-patching-the-kernel/

-- 
Kees Cook
Chrome OS Security

Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

2015-07-23 Thread Willy Tarreau

On Thu, Jul 23, 2015 at 04:40:14PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 23, 2015 at 4:36 PM, Kees Cook  wrote:
> > I've been pondering something like this that is even MORE generic, for
> > any syscall. Something like a "syscalls" directory under
> > /proc/sys/kernel, with 1 entry per syscall. "0" is "available", "1" is
> > disabled, and "-1" disabled until next boot.
> >
> 
> It might want to be /proc/sys/kernel/syscalls/[abi]/[name], possibly
> with more than just those options.  We might want "disabled, returns
> ENOSYS", "disabled, returns EPERM", and a lock bit.
> 
> On x86 at least, the implementation's easy -- we can just poke the
> syscall table.

I wouldn't do it these days. Around 2000-2001, with a friend we designed
a module with its userland counterpart which was called "overloader". The
principle was to intercept syscalls in order to enforce some form of
policies, log values, or remap paths, etc. The first use was to log all
file creations during a "make install" to more easily build packages. It
was at the era where it was easy to modify the syscall table from a module,
in kernel 2.2.

We quickly found that beyond logging/rewriting syscall arguments, it had
limited use cases when used as a "syscall firewall" because many syscalls
are still too coarse to decide whether you want to enable/disable them.
I remember that socketcall() and ioctl() were among the annoying ones.
Either you totally enable or totally disable. In the end, the only valid
use cases we found for enabling/disabling a syscall were limited to a very
small set for debugging purposes, in order to force some application code
to detect a missing implementation and switch to an alternative (eg: these
days if you suspect a bug in epoll you could disable it and force the app
to use poll instead). It was still useful to disable module loading and
FS mounting but that was about all by then.

All this to say that probably only a handful of tricky syscalls would
need an on/off switch but clearly not all of them at all, so I'd rather
add a few entries just for the relevant ones, mainly to fix compatibility
issues and nothing more. Eg: what's the point of disabling exit(), wait(),
kill(), fork() or getpid()... It would only increase the difficulty to
sort out bug reports.

Just my opinion,
Willy


Re: [PATCH 0/7] Initial support for user namespace owned mounts

2015-07-23 Thread Dave Chinner

On Thu, Jul 23, 2015 at 09:19:28AM -0400, J. Bruce Fields wrote:
> On Thu, Jul 23, 2015 at 11:51:35AM +1000, Dave Chinner wrote:
> > On Wed, Jul 22, 2015 at 01:41:00PM -0400, J. Bruce Fields wrote:
> > > On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> > > > On 2015-07-22 10:09, J. Bruce Fields wrote:
> > > > >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> > > > >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > > > >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > > > >>>So, for example, a screwed up on-disk directory structure shouldn't
> > > > >>>result in creating a cycle in the dcache and then deadlocking.
> > > > >>
> > > > >>Therein lies the problem: how do you detect such structural defects
> > > > >>without doing a full structure validation?
> > > > >
> > > > >You can prevent cycles in a graph if you can prevent adding an edge
> > > > >which would be part of a cycle.
> > > > >
> > > > Except if the user can write to the filesystem's backing storage (be
> > > > it a device or a file), and has sufficient knowledge of the on-disk
> > > > structures, they can create all the cycles they want in the
> > > > metadata. So unless the kernel builds the graph internally by
> > > > parsing the metadata _and_ has some way to detect that the on-disk
> > > > metadata has hit a cycle (which may not just involve 2 items),
> > > 
> > > Understood.  Again, see the d_ancestor call in d_splice_alias, this is
> > > exactly what it checks for.
> > 
> > But that only addresses one type of loop in one specific metadata
> > structure.
> 
> Yep, agreed!
> 
> > There's plenty of other ways you could construct metadata
> > loops that are essentially undetected and result in either deadlock
> > or livelock within the filesystem code itself. e.g. just make btree
> > sibling pointers loop over a range of entries that have the same
> > index key (e.g. free space extents of the same size). If allocation
> > then falls into this loop, the kernel will just spin searching the
> > same blocks for something it will never find.  Such resource
> > consumption attacks are trivial to construct but extremely difficult
> > to detect because they exploit normal behaviour of the structure and
> > algorithms by mangling trusted pointers.
> 
> Interesting example, thanks!  I doubt this particular example would be
> *that* hard to detect?

Yes, it can be detected, but it's not as easy as it sounds because
of abstractions between tree walking and record parsing.

>  But understood that there may be lots of others.

Yeah, that's just one of many, many ways I can think of modifying
on disk structures to screw up the kernel.
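
For reference, the edge check mentioned earlier in this thread (the
d_ancestor call in d_splice_alias) amounts to walking parent pointers before
linking a node. Below is a minimal user-space sketch of that idea. It uses a
hypothetical struct node and hypothetical helper names, not the kernel's
dentry code, and it assumes the existing parent chain is already acyclic. It
covers only this one in-memory structure. It does nothing for cases such as
looping btree sibling pointers on disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory tree node; not the kernel's struct dentry. */
struct node {
    struct node *parent;    /* NULL for the root */
};

/* Return true if 'a' is 'n' itself or one of its ancestors. */
static bool is_self_or_ancestor(const struct node *a, const struct node *n)
{
    for (; n != NULL; n = n->parent)
        if (n == a)
            return true;
    return false;
}

/* Attach 'child' under 'new_parent' unless that edge would close a cycle. */
static bool attach(struct node *child, struct node *new_parent)
{
    assert(child != NULL && new_parent != NULL);

    if (is_self_or_ancestor(child, new_parent))
        return false;       /* child already sits above new_parent */
    child->parent = new_parent;
    return true;
}
```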

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

2015-07-23 Thread Andy Lutomirski

On Thu, Jul 23, 2015 at 4:36 PM, Kees Cook  wrote:
> On Thu, Jul 23, 2015 at 3:24 AM, Willy Tarreau  wrote:
>>  #ifdef CONFIG_SMP
>>  static void flush_ldt(void *current_mm)
>>  {
>> @@ -254,6 +260,9 @@ asmlinkage int sys_modify_ldt(int func, void __user *ptr,
>>  {
>> int ret = -ENOSYS;
>>
>> +   if (!sysctl_modify_ldt)
>> +   return ret;
>> +
>> switch (func) {
>> case 0:
>> ret = read_ldt(ptr, bytecount);
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 2082b1a..60270c6 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -111,6 +111,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
>>  #ifndef CONFIG_MMU
>>  extern int sysctl_nr_trim_pages;
>>  #endif
>> +#ifdef CONFIG_X86
>> +extern int sysctl_modify_ldt;
>> +#endif
>>
>>  /* Constants used for minimum and  maximum */
>>  #ifdef CONFIG_LOCKUP_DETECTOR
>> @@ -962,6 +965,13 @@ static struct ctl_table kern_table[] = {
>> .mode   = 0644,
>> .proc_handler   = proc_dointvec,
>> },
>> +   {
>> +   .procname   = "modify_ldt",
>> +   .data   = _modify_ldt,
>> +   .maxlen = sizeof(int),
>> +   .mode   = 0644,
>> +   .proc_handler   = proc_dointvec,
>> +   },
>>  #endif
>>  #if defined(CONFIG_MMU)
>> {
>
> I've been pondering something like this that is even MORE generic, for
> any syscall. Something like a "syscalls" directory under
> /proc/sys/kernel, with 1 entry per syscall. "0" is "available", "1" is
> disabled, and "-1" disabled until next boot.
>

It might want to be /proc/sys/kernel/syscalls/[abi]/[name], possibly
with more than just those options.  We might want "disabled, returns
ENOSYS", "disabled, returns EPERM", and a lock bit.

On x86 at least, the implementation's easy -- we can just poke the
syscall table.
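
To make the proposed states concrete, here is a small user-space model of
such a per-syscall policy. Everything in it (the enum, the struct, the
function names) is hypothetical and only illustrates "available", "disabled,
returns ENOSYS", "disabled, returns EPERM" and a lock bit. It is not an
existing kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-syscall policy states; names are illustrative only. */
enum sc_state {
    SC_AVAILABLE = 0,       /* syscall runs normally */
    SC_DISABLED_ENOSYS,     /* fails as if not implemented */
    SC_DISABLED_EPERM       /* fails as a permission error */
};

struct sc_policy {
    enum sc_state state;
    int locked;             /* once set, the state is frozen until reboot */
};

/* Model of a sysctl write: refuse any change once the lock bit is set. */
static int sc_policy_set(struct sc_policy *p, enum sc_state state, int lock)
{
    assert(p != NULL);

    if (p->locked)
        return -EPERM;
    p->state = state;
    p->locked = lock;
    return 0;
}

/* Model of the check at syscall entry: 0 means the call may proceed. */
static int sc_policy_check(const struct sc_policy *p)
{
    assert(p != NULL);

    switch (p->state) {
    case SC_DISABLED_ENOSYS:
        return -ENOSYS;
    case SC_DISABLED_EPERM:
        return -EPERM;
    default:
        return 0;
    }
}
```

In this model a locked entry rejects every later write, which matches the
"disabled until next boot" behaviour suggested above.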

--Andy

Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

2015-07-23 Thread Kees Cook

On Thu, Jul 23, 2015 at 3:24 AM, Willy Tarreau  wrote:
> Hi Andy,
>
> On Wed, Jul 22, 2015 at 12:23:47PM -0700, Andy Lutomirski wrote:
>> The modify_ldt syscall exposes a large attack surface and is
>> unnecessary for modern userspace.  Make it optional.
>
> Wouldn't you prefer something like this which makes it possible to re-enable
> it at runtime so that we can hope distros ship with it disabled by default ?
>
> It's pretty efficient on your ldtgdt testcase :
>
> # echo 1 > /proc/sys/kernel/modify_ldt
> # ./a.out
> [OK]LDT entry 0 has AR 0x0040FA00 and limit 0x000A
> [OK]LDT entry 0 has AR 0x00C0FA00 and limit 0xAFFF
> [OK]LDT entry 1 is invalid
> [OK]LDT entry 2 has AR 0x00C0FA00 and limit 0xAFFF
> [OK]LDT entry 1 is invalid
> [OK]LDT entry 2 has AR 0x00C0FA00 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D0FA00 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D07A00 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00907A00 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D07200 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D07000 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D07400 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00507600 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507E00 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507C00 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507A00 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507800 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507800 and limit 0x000A
> [RUN]   Test fork
> [OK]LDT entry 2 has AR 0x00507800 and limit 0x000A
> [OK]LDT entry 1 is invalid
> [OK]LDT entry 0 has AR 0x0040FA00 and limit 0x000A
> [OK]LDT entry 0 has AR 0x00C0FA00 and limit 0xAFFF
> [OK]LDT entry 1 is invalid
> [OK]LDT entry 2 has AR 0x00C0FA00 and limit 0xAFFF
> [OK]LDT entry 1 is invalid
> [OK]LDT entry 2 has AR 0x00C0FA00 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D0FA00 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D07A00 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00907A00 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D07200 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D07000 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00D07400 and limit 0xAFFF
> [OK]LDT entry 2 has AR 0x00507600 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507E00 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507C00 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507A00 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507800 and limit 0x000A
> [OK]LDT entry 2 has AR 0x00507800 and limit 0x000A
> [RUN]   Test fork
> [OK]Child succeeded
> [OK]modify_ldt failure 22
> [OK]LDT entry 0 has AR 0xF200 and limit 0x
> [OK]LDT entry 0 has AR 0x7200 and limit 0x
> [OK]LDT entry 0 has AR 0xF000 and limit 0x
> [OK]LDT entry 0 has AR 0x7200 and limit 0x
> [OK]LDT entry 0 has AR 0x7000 and limit 0x0001
> [OK]LDT entry 0 has AR 0x7000 and limit 0x
> [OK]LDT entry 0 is invalid
> [OK]LDT entry 0 has AR 0x0040F200 and limit 0x
> [OK]LDT entry 0 is invalid
> [SKIP]  Cannot set affinity to CPU 1
>
>
> # echo 0 > /proc/sys/kernel/modify_ldt
> # ./a.out
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]LDT entry 1 is invalid
> [OK]modify_ldt is returned -ENOSYS
> [OK]LDT entry 1 is invalid
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [SKIP]  Skipping fork test because have no LDT
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [OK]modify_ldt is returned -ENOSYS
> [SKIP]  Cannot set affinity to CPU 1
>
> The patch is quite small (I stole your comment for the config option).
>
> Willy
>
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 226d569..b926f65 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1012,6 +1012,23 @@ config X86_16BIT
>   this option saves about 300 bytes on i386, or around 6K text
>   plus 16K runtime memory on x86-64,
>
>

Re: [PATCH v2 0/5] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

2015-07-23 Thread Daniel Borkmann


On 07/22/2015 10:09 AM, Kaixu Xia wrote:

Previous patch v1 url:
https://lkml.org/lkml/2015/7/17/287


[ Sorry to chime in late, just noticed this series now as I wasn't in Cc for
  the core BPF changes. More below ... ]


This patchset allows users to read PMU events in the following way:
  1. Open the PMU using perf_event_open() (for each CPU or for
     each process they would like to watch);
  2. Create a BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map;
  3. Insert FDs into the map with some key-value mapping scheme
     (i.e. cpuid -> event on that CPU);
  4. Load and attach eBPF programs as usual;
  5. In the eBPF program, get the perf_event_map_fd and key (i.e.
     the cpuid obtained from bpf_get_smp_processor_id()), then use
     bpf_perf_event_read() to read from it;
  6. Do anything they want with the value.
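
Step 1 of the list above can be sketched in user space as follows. This is
only an illustration under stated assumptions: the helper names
init_cycles_attr() and open_cycles_on_cpu() are made up here, and the sketch
picks the CPU-cycles hardware event, one of the PERF_TYPE_HARDWARE events
the series allows. The map creation and the bpf_perf_event_read() side are
not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill 'attr' for a CPU-cycles hardware counter. */
static void init_cycles_attr(struct perf_event_attr *attr)
{
    assert(attr != NULL);

    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
}

/*
 * Open the counter for all tasks on one CPU (pid == -1, cpu >= 0).
 * Returns a file descriptor, or -1 with errno set.
 */
static int open_cycles_on_cpu(int cpu)
{
    struct perf_event_attr attr;

    init_cycles_attr(&attr);
    /* glibc has no wrapper for perf_event_open, so go through syscall(2). */
    return (int)syscall(SYS_perf_event_open, &attr, (pid_t)-1, cpu, -1, 0UL);
}
```

The returned descriptor is what step 3 would store in the map, keyed by the
CPU number. Opening a CPU-wide counter usually needs elevated privileges,
depending on the perf_event_paranoid setting.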

changes in V2:
  - put atomic_long_inc_not_zero() between fdget() and fdput();
  - limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
  - Only read the event counter on current CPU or on current
process;
  - add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
pointer to the struct perf_event;
  - according to the perf_event_map_fd and key, the function
bpf_perf_event_read() can get the Hardware PMU counter value;

Patch 5/5 is a simple example that shows how to use this new eBPF
program ability. The PMU counter data can be found in
/sys/kernel/debug/tracing/trace (or trace_pipe). The values below are
the cycles PMU counter sampled at 'kprobe/sys_write'.

   $ cat /sys/kernel/debug/tracing/trace_pipe
   $ ./tracex6
...
  cat-677   [002] d..1   210.299270: : bpf count: CPU-2  5316659
  cat-677   [002] d..1   210.299316: : bpf count: CPU-2  5378639
  cat-677   [002] d..1   210.299362: : bpf count: CPU-2  5440654
  cat-677   [002] d..1   210.299408: : bpf count: CPU-2  5503211
  cat-677   [002] d..1   210.299454: : bpf count: CPU-2  5565438
  cat-677   [002] d..1   210.299500: : bpf count: CPU-2  5627433
  cat-677   [002] d..1   210.299547: : bpf count: CPU-2  5690033
  cat-677   [002] d..1   210.299593: : bpf count: CPU-2  5752184
  cat-677   [002] d..1   210.299639: : bpf count: CPU-2  5814543
<...>-548   [009] d..1   210.299667: : bpf count: CPU-9  605418074
<...>-548   [009] d..1   210.299692: : bpf count: CPU-9  605452692
  cat-677   [002] d..1   210.299700: : bpf count: CPU-2  5896319
<...>-548   [009] d..1   210.299710: : bpf count: CPU-9  605477824
<...>-548   [009] d..1   210.299728: : bpf count: CPU-9  605501726
<...>-548   [009] d..1   210.299745: : bpf count: CPU-9  605525279
<...>-548   [009] d..1   210.299762: : bpf count: CPU-9  605547817
<...>-548   [009] d..1   210.299778: : bpf count: CPU-9  605570433
<...>-548   [009] d..1   210.299795: : bpf count: CPU-9  605592743
...

The details of the patches are as follows:

Patch 1/5 introduces a new bpf map type. This map only stores the
pointer to struct perf_event;

Patch 2/5 introduces a map_traverse_elem() function for further use;

Patch 3/5 converts event file descriptors into perf_event structures when
adding new elements to the map;


So far all the map backends are of generic nature, knowing absolutely nothing
about a particular consumer/subsystem of eBPF (tc, socket filters, etc). The
tail call is a bit special, but nevertheless generic for each user and [very]
useful, so it makes sense to inherit from the array map and move the code there.

I don't really like that we start adding new _special_-cased maps here into
the eBPF core code, it seems quite hacky. :( From your rather terse commit
description where you introduce the maps, I failed to see a detailed
elaboration on this, i.e. why can't it be abstracted any differently?


Patch 4/5 implements function bpf_perf_event_read() that gets the selected
hardware PMU counter;

Patch 5/5 gives a simple example.

Kaixu Xia (5):
   bpf: Add new bpf map type to store the pointer to struct perf_event
   bpf: Add function map->ops->map_traverse_elem() to traverse map elems
   bpf: Save the pointer to struct perf_event to map
   bpf: Implement function bpf_perf_event_read() that get the selected
 hardware PMU conuter
   samples/bpf: example of get selected PMU counter value

  include/linux/bpf.h|   6 +++
  include/linux/perf_event.h |   5 ++-
  include/uapi/linux/bpf.h   |   3 ++
  kernel/bpf/arraymap.c  | 110 +
  kernel/bpf/helpers.c   |  42 +
  kernel/bpf/syscall.c   |  26 +++
  kernel/events/core.c   |  30 -
  kernel/trace/bpf_trace.c   |   2 +
  samples/bpf/Makefile   |   4 ++
  samples/bpf/bpf_helpers.h  |   2 +
  samples/bpf/tracex6_kern.c |  27 +++
  samples/bpf/tracex6_user.c |  67 +++
  12 files changed, 321 insertions(+), 3 deletions(-)
  create mode

Re: [PATCH] mm: add resched points to remap_pmd_range/ioremap_pmd_range

2015-07-23 Thread Toshi Kani

On Thu, 2015-07-23 at 14:54 -0700, Spencer Baugh wrote:
> From: Joern Engel 
> 
> Mapping large memory spaces can be slow and prevent high-priority
> realtime threads from preempting lower-priority threads for a long time.

Yes, and one of the goals of large page ioremap support is to address such
a problem.

> In my case it was a 256GB mapping causing at least 950ms scheduler
> delay.  Problem detection is ratelimited and depends on interrupts
> happening at the right time, so actual delay is likely worse.

ioremap supports 1GB and 2MB mappings now.  If you create 1GB mappings, you
only need to initialize 256 pud entries, which should not take a long time.

Is the 256GB range aligned to 1GB (or 2MB)?  From the log below, it appears
that you ended up with 4KB mappings, which is the problem.

> [ cut here ]
> WARNING: at arch/x86/kernel/irq.c:182 do_IRQ+0x126/0x140()
> Thread not rescheduled for 36 jiffies
> CPU: 14 PID: 6684 Comm: foo Tainted: G   O 3.10.59+
>  0009 883f7fbc3ee0 8163a12c 883f7fbc3f18
>  8103f131 887f48275ac0 0012 007c
>   887f5bc11fd8 883f7fbc3f78 8103f19c
> Call Trace:
>[] dump_stack+0x19/0x1b
>  [] warn_slowpath_common+0x61/0x80
>  [] warn_slowpath_fmt+0x4c/0x50
>  [] ? rcu_irq_exit+0x77/0xc0
>  [] do_IRQ+0x126/0x140
>  [] common_interrupt+0x6f/0x6f
>[] ? set_pageblock_migratetype+0x28/0x30
>  [] ? clear_page_c_e+0x7/0x10
>  [] ? get_page_from_freelist+0x5b3/0x880
>  [] __alloc_pages_nodemask+0xe3/0x810
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
>  [] alloc_pages_current+0x86/0x120
>  [] __get_free_pages+0xe/0x50
>  [] pte_alloc_one_kernel+0x15/0x20
>  [] __pte_alloc_kernel+0x1d/0xf0

This shows that you created 4KB (pte) mappings.

>  [] ioremap_page_range+0x2cc/0x320
>  [] __ioremap_caller+0x1e9/0x2b0
>  [] ioremap_nocache+0x17/0x20
>  [] pci_iomap+0x55/0xb0
>  [] vfio_pci_mmap+0x1ea/0x210 [vfio_pci]
>  [] vfio_device_fops_mmap+0x23/0x30 [vfio]
>  [] mmap_region+0x3d8/0x5e0
>  [] do_mmap_pgoff+0x305/0x3c0
>  [] ? call_rwsem_down_write_failed+0x13/0x20
>  [] vm_mmap_pgoff+0x67/0xa0
>  [] SyS_mmap_pgoff+0x272/0x2e0
>  [] SyS_mmap+0x22/0x30
>  [] system_call_fastpath+0x16/0x1b
> ---[ end trace 6b0a8d2341444bdd ]---
> [ cut here ]
> WARNING: at arch/x86/kernel/irq.c:182 do_IRQ+0x126/0x140()
> Thread not rescheduled for 95 jiffies
> CPU: 14 PID: 6684 Comm: foo Tainted: GW  O 3.10.59+
>  0009 883f7fbc3ee0 8163a12c 883f7fbc3f18
>  8103f131 887f48275ac0 002f 007c
>   7fadd1e0 883f7fbc3f78 8103f19c
> Call Trace:
>[] dump_stack+0x19/0x1b
>  [] warn_slowpath_common+0x61/0x80
>  [] warn_slowpath_fmt+0x4c/0x50
>  [] ? rcu_irq_exit+0x77/0xc0
>  [] do_IRQ+0x126/0x140
>  [] common_interrupt+0x6f/0x6f
>[] ? _raw_spin_lock+0x13/0x30
>  [] __pte_alloc+0x31/0xc0
>  [] remap_pfn_range+0x45c/0x470

remap_pfn_range() does not have large page mappings support yet.  So, yes,
this can still take a long time at this point.  We can extend large page
support for this interface if necessary.

>  [] vfio_pci_mmap+0x148/0x210 [vfio_pci]
>  [] vfio_device_fops_mmap+0x23/0x30 [vfio]
>  [] mmap_region+0x3d8/0x5e0
>  [] do_mmap_pgoff+0x305/0x3c0
>  [] ? call_rwsem_down_write_failed+0x13/0x20
>  [] vm_mmap_pgoff+0x67/0xa0
>  [] SyS_mmap_pgoff+0x272/0x2e0
>  [] SyS_mmap+0x22/0x30
>  [] system_call_fastpath+0x16/0x1b
> ---[ end trace 6b0a8d2341444bde ]---
> [ cut here ]
> WARNING: at arch/x86/kernel/irq.c:182 do_IRQ+0x126/0x140()
> Thread not rescheduled for 45 jiffies
> CPU: 18 PID: 21726 Comm: foo Tainted: G   O 3.10.59+
>  0009 88203f203ee0 8163a13c 88203f203f18
>  8103f131 881ec5f1ad60 0016 006e
>   c939a6dd8000 88203f203f78 8103f19c
> Call Trace:
>[] dump_stack+0x19/0x1b
>  [] warn_slowpath_common+0x61/0x80
>  [] warn_slowpath_fmt+0x4c/0x50
>  [] ? rcu_irq_exit+0x77/0xc0
>  [] do_IRQ+0x126/0x140
>  [] common_interrupt+0x6f/0x6f
>[] ? retint_restore_args+0x13/0x13
>  [] ? free_memtype+0x87/0x150
>  [] ? vunmap_page_range+0x1e6/0x2a0
>  [] remove_vm_area+0x51/0x70
>  [] iounmap+0x67/0xa0

iounmap() should be fast if you created 1GB mappings.

Thanks,
-Toshi

>  [] pci_iounmap+0x35/0x40
>  [] vfio_pci_release+0x9a/0x150 [vfio_pci]
>  [] vfio_device_fops_release+0x1c/0x40 [vfio]
>  [] __fput+0xdb/0x220
>  [] fput+0xe/0x10
>  [] task_work_run+0xbc/0xe0
>  [] do_exit+0x3ce/0xe50
>  [] do_group_exit+0x3f/0xa0
>  [] get_signal_to_deliver+0x1a9/0x5b0
>  [] do_signal+0x48/0x5e0
>  [] ? k_getrusage+0x368/0x3d0
>  [] ? default_wake_function+0x12/0x20
>  [] ? kprobe_flush_task+0xc0/0x150
>  [] ? finish_task_switch+0xc4/0xe0
>  [] do_notify_resume+0x65/0x80
>  [] retint_signal+0x4d/0x9f
> ---[ end trace 3506c05e4a0af3e5 ]---

Re: [PATCH v1 4/4] mm/memory-failure: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_*

2015-07-23 Thread Naoya Horiguchi

On Thu, Jul 23, 2015 at 01:37:02PM -0700, Andrew Morton wrote:
> On Thu, 16 Jul 2015 01:41:56 + Naoya Horiguchi 
>  wrote:
> 
> > The race condition addressed in commit add05cecef80 ("mm: soft-offline:
> > don't free target page in successful page migration") was not closed
> > completely, because that can happen not only for soft-offline, but also
> > for hard-offline. Consider that a slab page is about to be freed into
> > buddy pool, and then an uncorrected memory error hits the page just after
> > entering __free_one_page(), then
> > VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP) is triggered,
> > despite the fact that it's not necessary because the data on the affected
> > page is not consumed.
> >
> > To solve it, this patch drops __PG_HWPOISON from page flag checks at
> > allocation/free time. I think it's justified because __PG_HWPOISON flags is
> > defined to prevent the page from being reused and setting it outside the
> > page's alloc-free cycle is a designed behavior (not a bug.)
> >
> > And the patch reverts most of the changes from commit add05cecef80 about
> > the new refcounting rule of soft-offlined pages, which is no longer
> > necessary.
> > 
> > ...
> >
> > --- v4.2-rc2.orig/mm/memory-failure.c
> > +++ v4.2-rc2/mm/memory-failure.c
> > @@ -1723,6 +1723,9 @@ int soft_offline_page(struct page *page, int flags)
> >  
> > get_online_mems();
> >  
> > +   if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
> > +   set_migratetype_isolate(page, true);
> > +
> > ret = get_any_page(page, pfn, flags);
> > put_online_mems();
> > if (ret > 0) { /* for in-use pages */
> 
> This patch gets build-broken by your
> mm-page_isolation-make-set-unset_migratetype_isolate-file-local.patch,
> which I shall drop.

I apologize for this build failure. At first I planned to add another hwpoison
patch after this one to remove this migratetype thing separately, but I was not
100% sure of its correctness, so I did not include it in this version.
But Vlastimil's cleanup patch showed me that using MIGRATE_ISOLATE at free time
(which is what the soft offline code does now) is wrong (or not an expected
usage). So I shouldn't have reverted the above part.

So I want the patch "mm, page_isolation: make set/unset_migratetype_isolate()
file-local" to be merged first, and I'd like to update this hwpoison series
before it goes into mmotm. Could you drop this series from your tree for now?
I'll repost the next version probably next week.

Thanks,
Naoya Horiguchi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] lib/string_helpers: clarify esc arg in string_escape_mem

2015-07-23 Thread Kees Cook

The esc argument is used to reduce which characters will be escaped.
For example, using " " with ESCAPE_SPACE will not produce any escaped
spaces.
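
To illustrate the @esc semantics described above, here is a minimal
user-space sketch. It is a simplified model and not the kernel
implementation: model_space_letter() and model_escape_space() are
hypothetical names, only the ESCAPE_SPACE class is modelled, and the model
NUL-terminates its output purely for convenience.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the ESCAPE_SPACE class: returns the letter that
 * follows the backslash, or 0 if c does not belong to the class.
 */
static char model_space_letter(unsigned char c)
{
	switch (c) {
	case '\f': return 'f';
	case '\n': return 'n';
	case '\r': return 'r';
	case '\t': return 't';
	case '\v': return 'v';
	default:   return 0;
	}
}

/*
 * Hypothetical user-space model (not lib/string_helpers.c): escape src into
 * dst using only the ESCAPE_SPACE class. A non-empty esc limits escaping to
 * the characters it lists; every other character is copied as-is.
 * Returns the number of bytes written, excluding the terminating NUL.
 */
static size_t model_escape_space(const char *src, char *dst, size_t osz,
				 const char *esc)
{
	size_t n = 0;
	int limited = esc && *esc;

	assert(osz > 0);
	for (; *src; src++) {
		unsigned char c = (unsigned char)*src;
		char letter = 0;

		/* Only characters allowed by esc are candidates for escaping. */
		if (!limited || strchr(esc, c))
			letter = model_space_letter(c);

		if (letter) {
			if (n + 2 >= osz)
				break;	/* keep room for the NUL */
			dst[n++] = '\\';
			dst[n++] = letter;
		} else {
			if (n + 1 >= osz)
				break;
			dst[n++] = (char)c;
		}
	}
	dst[n] = '\0';
	return n;
}
```

In this model, passing "\n" as esc leaves a tab in the input untouched and
escapes only the newline, while passing " " as esc escapes nothing at all,
because a plain space is not part of the ESCAPE_SPACE class.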

Signed-off-by: Kees Cook 
---
 lib/string_helpers.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index c98ae818eb4e..0a307a97d489 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -410,7 +410,7 @@ static bool escape_hex(unsigned char c, char **dst, char *end)
  * @dst:   destination buffer (escaped)
  * @osz:   destination buffer size
  * @flags: combination of the flags (bitwise OR):
- * %ESCAPE_SPACE:
+ * %ESCAPE_SPACE: (special white space, not space itself)
  * '\f' - form feed
  * '\n' - new line
  * '\r' - carriage return
@@ -432,8 +432,10 @@ static bool escape_hex(unsigned char c, char **dst, char *end)
  * all previous together
  * %ESCAPE_HEX:
  * '\xHH' - byte with hexadecimal value HH (2 digits)
- * @esc:   NULL-terminated string of characters any of which, if found in
- * the source, has to be escaped
+ * @esc:   NULL-terminated string containing characters used to limit
+ * the selected escape class. If characters are included in @esc
+ * that would not normally be escaped by the classes selected
+ * in @flags, they will be copied to @dst unescaped.
  *
  * Description:
  * The process of escaping byte buffer includes several parts. They are applied
@@ -441,7 +443,7 @@ static bool escape_hex(unsigned char c, char **dst, char *end)
  * 1. The character is matched to the printable class, if asked, and in
  *case of match it passes through to the output.
  * 2. The character is not matched to the one from @esc string and thus
- *must go as is to the output.
+ *must go as-is to the output.
  * 3. The character is checked if it falls into the class given by @flags.
  *%ESCAPE_OCTAL and %ESCAPE_HEX are going last since they cover any
  *character. Note that they actually can't go together, otherwise
-- 
1.9.1


-- 
Kees Cook
Chrome OS Security
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] tools: iio: print error message when buffer enable fails

2015-07-23 Thread Hartmut Knaack

Irina Tirdea schrieb am 23.07.2015 um 19:22:
> Running generic_buffer without enabling any channel of the
> sensor will fail without printing any error message.
> 
> Add an error message that indicates buffer enable failed.

Hi,
please make use of the error code stored in ret (with negative sign), as
in most cases the value of errno has already changed since the original
error has occurred.
Thanks,

Hartmut

> 
> Signed-off-by: Irina Tirdea 
> ---
>  tools/iio/generic_buffer.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/iio/generic_buffer.c b/tools/iio/generic_buffer.c
> index 32f389eb..936469c 100644
> --- a/tools/iio/generic_buffer.c
> +++ b/tools/iio/generic_buffer.c
> @@ -364,8 +364,11 @@ int main(int argc, char **argv)
>  
>   /* Enable the buffer */
>   ret = write_sysfs_int("enable", buf_dir_name, 1);
> - if (ret < 0)
> + if (ret < 0) {
> + fprintf(stderr,
> + "Failed to enable buffer: %s\n", strerror(errno));
>   goto error_free_buf_dir_name;
> + }
>  
>   scan_size = size_from_channelarray(channels, num_channels);
>   data = malloc(scan_size * buf_len);
> 
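
The suggestion above (build the message from ret rather than from errno) can
be sketched as follows. This is only an illustration under stated
assumptions: model_enable_buffer() is a hypothetical stand-in for a helper
that returns a negative errno value on failure, and the later call that
clobbers errno is simulated by overwriting errno directly.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical stand-in for a helper that reports failure as a negative
 * errno value. Resetting errno here simulates later library calls that
 * change it before the caller gets to print anything.
 */
static int model_enable_buffer(int fail_with)
{
	if (fail_with) {
		errno = 0;	/* errno no longer describes the failure */
		return -fail_with;
	}
	return 0;
}

/* Derive the message from the returned code, not from errno. */
static const char *model_error_text(int ret)
{
	return ret < 0 ? strerror(-ret) : "success";
}
```

A caller would then print something like
fprintf(stderr, "Failed to enable buffer: %s\n", model_error_text(ret));
which stays correct even if errno has changed in the meantime.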

Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] Avoid usb reset crashes by making tty_io cdevs truly dynamic

2015-07-23 Thread Richard Watts


Hi,

 Sure - sorry, my description was a little .. basic.

 So, I have a client who was having problems with machines hanging in
the field. Very rare, associated with a h/w change that introduced
more cores. Kernel dumps implied that the timer list was getting
corrupted.

 This configuration of machine is an SBC on a board which communicates
with the SBC (partly) via a USB CDC device, which pops up as
/dev/ttyACM0.

 So one of the things we turned on was CONFIG_DEBUG_KOBJECT_RELEASE.
One of the side-effects of this is to delay kobject destruction.

 When we did that, we could reproduce the crash by performing a
USB reset on the CDC device -  and logs suggest that this was
happening in the field too.

 When the USB reset happens, we get a bunch of complaints from the
kernel.

 Some of these are to do with races on the kobjects associated with the
sysfs entries for the ttyACM0 device. They turn out not to be fatal,
and have their own patch series ('Attempt to cope with device changes
and delayed kobject deallocation' on linux-kernel).

 The fatal one turns out to be an execution path that goes like this:

 1 USB device declares itself to be CDC
 2 tty driver fires up and allocates a cdev for the relevant tty.
 3 driver->cdevs[0].kobj gets initialised as part of the cdev_alloc()
 4 USB reset happens, queueing driver->cdevs[0].kobj for release.
 5 The tty driver calls cdev_init(&driver->cdevs[0]), which
 reinitialises driver->cdevs[0].kobj with a refcount of 1.
 6 tty driver starts using that new cdev, queueing an operation on it.
This causes a timer entry to be added including the kobj.
 7 At this point, the release we scheduled in (4) happens and the
members of kobj are deallocated.
 8 Someone allocates the newly released memory for one of the members of
 driver->cdevs[0].kobj somewhere else and overwrites it.
 9 The timer goes off.
10 Boom

 My patch (ham-fistedly) fixes this by ensuring that because we
never reuse the cdev pointer, we are never fooled into reinitialising
a kobject queued for deletion.

 I'm not all that familiar with how the locking should go here, and
there is a definite argument that under non CONFIG_DEBUG_KOBJECT_RELEASE
conditions, the kobject_release() would have happened by 5, and
therefore this situation should never exist "for real".

 .. but (a) that makes it rather hard to test kernels with
CONFIG_DEBUG_KOBJECT_RELEASE, and (b) my customer's crashes have
(allegedly) now gone away even without CONFIG_DEBUG_KOBJECT_RELEASE
set.

 Does that help at all? I've attached my 0/1, just in case that
got lost somewhere.


Richard.



--- Begin Message ---

Sometimes, usb buses on which CDC ACM devices sit encounter a usb reset.

When this happens, particularly when CONFIG_DEBUG_KOBJECT_RELEASE is on,
we attempt to destroy the cdev for the associated tty and then
rapidly re-initialise it. Since kobject destruction is not immediate,
this potentially leaves us with cdev_init() calling kobject_init() on a
kobject that is about to be destroyed.

This turns out not to be such a good thing and this patch solves the
problem by making the cdevs tty_operations->cdevs dynamically
allocated.

This may not be a problem in the wild (though I have some circumstantial
evidence that it is), but I submit that we might want to think about
fixing it anyway, since it makes debugging on systems with
CONFIG_DEBUG_KOBJECT_RELEASE=y and USB resets rather difficult
(guess what I have been doing lately .. ).

Patch is against e26081808edadfd257c6c9d81014e3b25e9a6118 (head of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git ).

 (in fact, you will still get an oops - which is the subject of
another, more controversial, patchset ..)



Richard.
--- End Message ---

Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-23 Thread Jörn Engel

On Thu, Jul 23, 2015 at 03:54:43PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Jörn Engel wrote:
> 
> > > This is wrong, you'd want to do any cond_resched() before the page 
> > > allocation to avoid racing with an update to h->nr_huge_pages or 
> > > h->surplus_huge_pages while hugetlb_lock was dropped that would result in 
> > > the page having been uselessly allocated.
> > 
> > There are three options.  Either
> > 	/* some allocation */
> > 	cond_resched();
> > or
> > 	cond_resched();
> > 	/* some allocation */
> > or
> > 	if (cond_resched()) {
> > 		spin_lock(&hugetlb_lock);
> > 		continue;
> > 	}
> > 	/* some allocation */
> > 
> > I think you want the second option instead of the first.  That way we
> > have a little less memory allocation for the time we are scheduled out.
> > Sure, we can do that.  It probably doesn't make a big difference either
> > way, but why not.
> > 
> 
> The loop is dropping the lock simply to do the allocation and it needs to 
> compare with the user-written number of hugepages to allocate.

And at this point the existing code is racy.  Page allocation might
block for minutes trying to free some memory.  A cond_resched doesn't
change that - it only increases the odds of hitting the race window.

> What we don't want is to allocate, reschedule, and check if we really 
> needed to allocate.  That's what your patch does because it races with 
> persistent_huge_page().  It's probably the worst place to do it.
> 
> Rather, what you want to do is check if you need to allocate, reschedule 
> if needed (and if so, recheck), and then allocate.
> 
> > If you are asking for the third option, I would rather avoid that.  It
> > makes the code more complex and doesn't change the fact that we have a
> > race and better be able to handle the race.  The code size growth will
> > likely cost us more performance that we would ever gain.  nr_huge_pages
> > tends to get updated once per system boot.
> 
> Your third option is nonsensical, you didn't save the state of whether you 
> locked the lock so you can't reliably unlock it, and you cannot hold a 
> spinlock while allocating in this context.

Are we looking at the same code?  Mine looks like this:
	while (count > persistent_huge_pages(h)) {
		/*
		 * If this allocation races such that we no longer need the
		 * page, free_huge_page will handle it by freeing the page
		 * and reducing the surplus.
		 */
		spin_unlock(&hugetlb_lock);
		if (hstate_is_gigantic(h))
			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
		else
			ret = alloc_fresh_huge_page(h, nodes_allowed);
		spin_lock(&hugetlb_lock);
		if (!ret)
			goto out;

		/* Bail for signals. Probably ctrl-c from user */
		if (signal_pending(current))
			goto out;
	}

What state is there to save?  We just called spin_unlock, we did a
schedule and if we want to continue without doing page allocation we
better take the lock again.  Or do you want to go even more complex and
check for signals as well?

The case you are concerned about is rare.  It is so rare that it doesn't
matter from a performance point of view, only for correctness.  And if
we hit the rare case, the worst harm would be an unnecessary allocation
that we return back to the system.  How much complexity do you think it
is worth to avoid this allocation?  How much runtime will the bigger
text size cost you in the common cases?

What matters to me is the scheduler latency.  That is real and happens
reliably once per boot.

Jörn

--
Chance favors only the prepared mind.
-- Louis Pasteur
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] tools: iio: fix mask for 32 bit sensor data

2015-07-23 Thread Hartmut Knaack

Irina Tirdea schrieb am 23.07.2015 um 19:22:
> When the sensor data uses 32 bits out of 32, generic_buffer prints
> the value 0 for all data read.
> 
> In this case, the mask is shifted 32 bits, which is beyond the size of
> an integer. This will lead to the mask always being 0. Before printing,
> the mask is applied to the raw value, thus generating a final value of 0.
> 
> Fix the mask by shifting a 64 bit value instead of an integer.
> 
> Signed-off-by: Irina Tirdea 
Acked-by: Hartmut Knaack 
> ---
>  tools/iio/iio_utils.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/iio/iio_utils.c b/tools/iio/iio_utils.c
> index 1dcdf03..a95270f 100644
> --- a/tools/iio/iio_utils.c
> +++ b/tools/iio/iio_utils.c
> @@ -168,7 +168,7 @@ int iioutils_get_type(unsigned *is_signed, unsigned *bytes, unsigned *bits_used,
>   if (*bits_used == 64)
>   *mask = ~0;
>   else
> - *mask = (1 << *bits_used) - 1;
> + *mask = (1ULL << *bits_used) - 1;
>  
>   *is_signed = (signchar == 's');
>   if (fclose(sysfsfp)) {
> 
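
As a standalone illustration of the fix quoted above, the mask computation
can be sketched like this. model_bits_mask() is a hypothetical helper name,
not a function from tools/iio; the point is only that the shift is done on a
64-bit operand, so bits_used == 32 no longer shifts by the full width of an
int.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the patched logic: build a mask with the
 * low bits_used bits set. The shift operand is 64 bits wide, so shift
 * counts from 0 to 63 are well defined; 64 is handled separately.
 */
static uint64_t model_bits_mask(unsigned int bits_used)
{
	assert(bits_used <= 64);
	if (bits_used == 64)
		return ~UINT64_C(0);
	return (1ULL << bits_used) - 1;
}
```

With this, a 32-bit sample gets the mask 0xffffffff instead of a mask that
zeroes every value.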

Please read the FAQ at  http://www.tux.org/lkml/

Re: [dm-devel] [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Dave Chinner

On Thu, Jul 23, 2015 at 01:08:36PM -0400, Mikulas Patocka wrote:
> On Wed, 22 Jul 2015, Dave Chinner wrote:
> > On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > | $ cat
> > > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > > | 1

[...]

> You can just stop retrying the I/Os when the user attempts to unmount the 
> filesystem - then, you don't need any configuration option.

See above - the default will do that, but there are users who do not
want that unmount behaviour

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 3/3] RTC: switch to using is_visible() to control sysfs attributes

2015-07-23 Thread Dmitry Torokhov

Instead of creating the wakealarm attribute manually, after the device has
been registered, let's rely on the facilities provided by attribute groups to
control which attributes are visible and which are not. This allows us to
create all needed attributes at once, at the same time that we register the
RTC class device.
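
The idea can be modelled in user space as follows. This is a simplified
sketch, not the driver code: struct model_attr, struct model_rtc and the
model_* functions are hypothetical stand-ins for the sysfs attribute, the RTC
device and the is_visible() callback, and the attribute list is shortened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a sysfs attribute and an RTC device. */
struct model_attr { const char *name; unsigned int mode; };
struct model_rtc { bool parent_can_wakeup; bool has_set_alarm; };

/* All attributes are declared up front, including the optional one. */
static const struct model_attr model_rtc_attrs[] = {
	{ "name", 0444 },
	{ "date", 0444 },
	{ "wakealarm", 0644 },
};

static bool model_does_wakealarm(const struct model_rtc *rtc)
{
	if (!rtc->parent_can_wakeup)
		return false;

	return rtc->has_set_alarm;
}

/* Return the mode to expose, or 0 to hide the attribute entirely. */
static unsigned int model_attr_is_visible(const struct model_rtc *rtc,
					  const struct model_attr *attr)
{
	unsigned int mode = attr->mode;

	if (strcmp(attr->name, "wakealarm") == 0 &&
	    !model_does_wakealarm(rtc))
		mode = 0;

	return mode;
}

/* Count the attributes that would be created for this device. */
static size_t model_count_visible(const struct model_rtc *rtc)
{
	size_t i, n = 0;

	for (i = 0; i < sizeof(model_rtc_attrs) / sizeof(model_rtc_attrs[0]); i++)
		if (model_attr_is_visible(rtc, &model_rtc_attrs[i]))
			n++;

	return n;
}
```

The decision about the optional attribute is made once, when the group is
instantiated, so no separate add/remove path is needed for it.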

Signed-off-by: Dmitry Torokhov 
---
 drivers/rtc/class.c |  4 +---
 drivers/rtc/rtc-core.h  | 19 +++-
 drivers/rtc/rtc-sysfs.c | 59 +
 3 files changed, 34 insertions(+), 48 deletions(-)

diff --git a/drivers/rtc/class.c b/drivers/rtc/class.c
index de7707f..de86578 100644
--- a/drivers/rtc/class.c
+++ b/drivers/rtc/class.c
@@ -202,6 +202,7 @@ struct rtc_device *rtc_device_register(const char *name, struct device *dev,
rtc->max_user_freq = 64;
rtc->dev.parent = dev;
rtc->dev.class = rtc_class;
+   rtc->dev.groups = rtc_get_dev_attribute_groups();
rtc->dev.release = rtc_device_release;
 
mutex_init(&rtc->ops_lock);
@@ -240,7 +241,6 @@ struct rtc_device *rtc_device_register(const char *name, struct device *dev,
}
 
rtc_dev_add_device(rtc);
-   rtc_sysfs_add_device(rtc);
rtc_proc_add_device(rtc);
 
dev_info(dev, "rtc core: registered %s as %s\n",
@@ -271,7 +271,6 @@ void rtc_device_unregister(struct rtc_device *rtc)
 * Remove innards of this RTC, then disable it, before
 * letting any rtc_class_open() users access it again
 */
-   rtc_sysfs_del_device(rtc);
rtc_dev_del_device(rtc);
rtc_proc_del_device(rtc);
device_del(&rtc->dev);
@@ -360,7 +359,6 @@ static int __init rtc_init(void)
}
rtc_class->pm = RTC_CLASS_DEV_PM_OPS;
rtc_dev_init();
-   rtc_sysfs_init(rtc_class);
return 0;
 }
 
diff --git a/drivers/rtc/rtc-core.h b/drivers/rtc/rtc-core.h
index 5f9df74..a098aea 100644
--- a/drivers/rtc/rtc-core.h
+++ b/drivers/rtc/rtc-core.h
@@ -48,23 +48,10 @@ static inline void rtc_proc_del_device(struct rtc_device *rtc)
 #endif
 
 #ifdef CONFIG_RTC_INTF_SYSFS
-
-extern void __init rtc_sysfs_init(struct class *);
-extern void rtc_sysfs_add_device(struct rtc_device *rtc);
-extern void rtc_sysfs_del_device(struct rtc_device *rtc);
-
+const struct attribute_group **rtc_get_dev_attribute_groups(void);
 #else
-
-static inline void rtc_sysfs_init(struct class *rtc)
-{
-}
-
-static inline void rtc_sysfs_add_device(struct rtc_device *rtc)
+static inline const struct attribute_group **rtc_get_dev_attribute_groups(void)
 {
+   return NULL;
 }
-
-static inline void rtc_sysfs_del_device(struct rtc_device *rtc)
-{
-}
-
 #endif
diff --git a/drivers/rtc/rtc-sysfs.c b/drivers/rtc/rtc-sysfs.c
index e3ce1dc..7273855 100644
--- a/drivers/rtc/rtc-sysfs.c
+++ b/drivers/rtc/rtc-sysfs.c
@@ -122,17 +122,6 @@ hctosys_show(struct device *dev, struct device_attribute *attr, char *buf)
 }
 static DEVICE_ATTR_RO(hctosys);
 
-static struct attribute *rtc_attrs[] = {
-   &dev_attr_name.attr,
-   &dev_attr_date.attr,
-   &dev_attr_time.attr,
-   &dev_attr_since_epoch.attr,
-   &dev_attr_max_user_freq.attr,
-   &dev_attr_hctosys.attr,
-   NULL,
-};
-ATTRIBUTE_GROUPS(rtc);
-
 static ssize_t
 wakealarm_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
@@ -222,6 +211,16 @@ wakealarm_store(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RW(wakealarm);
 
+static struct attribute *rtc_attrs[] = {
+   &dev_attr_name.attr,
+   &dev_attr_date.attr,
+   &dev_attr_time.attr,
+   &dev_attr_since_epoch.attr,
+   &dev_attr_max_user_freq.attr,
+   &dev_attr_hctosys.attr,
+   &dev_attr_wakealarm.attr,
+   NULL,
+};
 
 /* The reason to trigger an alarm with no process watching it (via sysfs)
  * is its side effect:  waking from a system state like suspend-to-RAM or
@@ -236,29 +235,31 @@ static bool rtc_does_wakealarm(struct rtc_device *rtc)
return rtc->ops->set_alarm != NULL;
 }
 
-
-void rtc_sysfs_add_device(struct rtc_device *rtc)
+static umode_t rtc_attr_is_visible(struct kobject *kobj,
+  struct attribute *attr, int n)
 {
-   int err;
+   struct device *dev = container_of(kobj, struct device, kobj);
+   struct rtc_device *rtc = to_rtc_device(dev);
+   umode_t mode = attr->mode;
 
-   /* not all RTCs support both alarms and wakeup */
-   if (!rtc_does_wakealarm(rtc))
-   return;
+   if (attr == &dev_attr_wakealarm.attr)
+   if (!rtc_does_wakealarm(rtc))
+   mode = 0;
 
-   err = device_create_file(&rtc->dev, &dev_attr_wakealarm);
-   if (err)
-   dev_err(rtc->dev.parent,
-   "failed to create alarm attribute, %d\n", err);
+   return mode;
 }
 
-void rtc_sysfs_del_device(struct rtc_device *rtc)
-{
-   /* REVISIT did we add it successfully? */
-   if (rtc_does_wakealarm(rtc))
-   device_remove_file(&rtc->dev, &dev_attr_wakealarm);

[PATCH v2 2/3] RTC: switch wakealarm attribute to DEVICE_ATTR_RW

2015-07-23 Thread Dmitry Torokhov

Instead of using the older style DEVICE_ATTR for the wakealarm attribute,
let's switch to DEVICE_ATTR_RW, which ensures that the permissions on the
attribute are consistent across the kernel.

Signed-off-by: Dmitry Torokhov 
---
 drivers/rtc/rtc-sysfs.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/rtc/rtc-sysfs.c b/drivers/rtc/rtc-sysfs.c
index 2fbc11b..e3ce1dc 100644
--- a/drivers/rtc/rtc-sysfs.c
+++ b/drivers/rtc/rtc-sysfs.c
@@ -134,8 +134,7 @@ static struct attribute *rtc_attrs[] = {
 ATTRIBUTE_GROUPS(rtc);
 
 static ssize_t
-rtc_sysfs_show_wakealarm(struct device *dev, struct device_attribute *attr,
-   char *buf)
+wakealarm_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
ssize_t retval;
unsigned long alarm;
@@ -159,7 +158,7 @@ rtc_sysfs_show_wakealarm(struct device *dev, struct device_attribute *attr,
 }
 
 static ssize_t
-rtc_sysfs_set_wakealarm(struct device *dev, struct device_attribute *attr,
+wakealarm_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t n)
 {
ssize_t retval;
@@ -221,8 +220,7 @@ rtc_sysfs_set_wakealarm(struct device *dev, struct device_attribute *attr,
retval = rtc_set_alarm(rtc, );
return (retval < 0) ? retval : n;
 }
-static DEVICE_ATTR(wakealarm, S_IRUGO | S_IWUSR,
-   rtc_sysfs_show_wakealarm, rtc_sysfs_set_wakealarm);
+static DEVICE_ATTR_RW(wakealarm);
 
 
 /* The reason to trigger an alarm with no process watching it (via sysfs)
-- 
2.5.0.rc2.392.g76e840b

Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 1/3] RTC: make rtc_does_wakealarm() return boolean

2015-07-23 Thread Dmitry Torokhov

Users of rtc_does_wakealarm() return value treat it as boolean so let's
change the signature accordingly.

Signed-off-by: Dmitry Torokhov 
---
 drivers/rtc/rtc-sysfs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/rtc/rtc-sysfs.c b/drivers/rtc/rtc-sysfs.c
index babd43b..2fbc11b 100644
--- a/drivers/rtc/rtc-sysfs.c
+++ b/drivers/rtc/rtc-sysfs.c
@@ -230,10 +230,11 @@ static DEVICE_ATTR(wakealarm, S_IRUGO | S_IWUSR,
  * suspend-to-disk.  So: no attribute unless that side effect is possible.
  * (Userspace may disable that mechanism later.)
  */
-static inline int rtc_does_wakealarm(struct rtc_device *rtc)
+static bool rtc_does_wakealarm(struct rtc_device *rtc)
 {
if (!device_can_wakeup(rtc->dev.parent))
-   return 0;
+   return false;
+
return rtc->ops->set_alarm != NULL;
 }
 
-- 
2.5.0.rc2.392.g76e840b

Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Dave Chinner

On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote:
> On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:
> 
> [..]
> > I don't think knowing the bdev timeout is necessary because the
> > default is most likely to be "fail fast" in this case. i.e. no
> > retries, just shut down.  IOWs, if we describe the configs and
> > actions in neutral terms, then the default configurations are easy for
> > users to understand. i.e:
> > 
> > bdev enospc     XFS default
> > -----------     -----------
> > Fail slow       Fail fast
> > Fail fast       Fail slow
> > Fail never      Fail never, Record in log
> > EOPNOTSUPP      Fail never
> > 
> > With that in mind, I'm thinking I should drop the
> > "permanent/transient" error classifications, and change it to "failure
> > behaviour" with the options "fast slow [never]" and only the slow
> > option has retry/timeout configuration options.  I think the "never"
> > option still needs a "fail at unmount" config variable, but we
> > enable it by default rather than hanging unmount and requiring a
> > manual shutdown like we do now
> 
> I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
> we just do with one knob per error type and that is retry-timeout.

"retry-timeout" == "fail slow". i.e. a 5 minute retry timeout is
configured as:

# echo slow > fail_method
# echo 0 > max_retries
# echo 300 > retry_timeout

> retry-timeout=0 (Fail fast)
> retry-timeout=X (Fail slow)
> retry-timeout=-1 (Never Give up).

What do we do when we want to add a different failure type
with different configuration requirements?

> Also do we really need this timeout per error type.

I don't follow your logic here.  What do you need a timeout for with
either the "never" or "fast" failure configurations?

> Also would be nice if this timeout was configurable using a mount
> option. Then we can just specify it during mount time and be done
> with it.

That way lies madness.  The error configuration infrastructure we
need is not just for ENOSPC errors on metadata buffers.  We need
configurable error behaviour for multiple different errors in
multiple different subsystems (e.g. data IO failure vs metadata
buffer IO failure vs memory allocation failure vs inode corruption
vs freespace corruption vs ).

And we still would need the sysfs interface for querying and
configuring at runtime, so mount options are just a bad idea.  And
with sysfs, the potential future route for automatic configuration
at mount time is via udev events and configuration files, similar to
block devices.

> Idea of auto tuning based on what block device is doing sounds reasonable
> but that should not be a requirement for this patch and can go in even
> later. It is one of those nice to have features.

"this patch"? Just the core infrastructure so far:

11 files changed, 290 insertions(+), 60 deletions(-)

and that will need to be split into 4-5 patches for review. There's
a bunch of cleanup that precedes this, and then there's a patch per
error type we are going to handle in metadata buffer IO completion.
IOWs, the dm-thinp autotuning is just a simple, small patch at the
end of a much larger series - it's maybe 10 lines of code in XFS...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH v3 3/3] samples/bpf: example of get selected PMU counter value

2015-07-23 Thread Alexei Starovoitov


On 7/23/15 2:42 AM, Kaixu Xia wrote:

This is a simple example and shows how to use the new ability
to get the selected Hardware PMU counter value.

Signed-off-by: Kaixu Xia 

...

+struct bpf_map_def SEC("maps") my_map = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(unsigned long),
+   .max_entries = 32,
+};


wait. how did it work here? value_size should be u32.


Re: Thanks

2015-07-23 Thread bbaumann

Hi,

I've an offer for you.

Re: [PATCH v3 2/3] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU counter

2015-07-23 Thread Alexei Starovoitov


On 7/23/15 2:42 AM, Kaixu Xia wrote:

According to the perf_event_map_fd and index, the function
bpf_perf_event_read() can convert the corresponding map
value to the pointer to struct perf_event and return the
Hardware PMU counter value.

Signed-off-by: Kaixu Xia 

...

+static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
+{
+   struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+   struct bpf_array *array = container_of(map, struct bpf_array, map);
+   struct perf_event *event;
+
+   if (index >= array->map.max_entries)
+   return -E2BIG;
+
+   event = array->events[index];
+   if (!event)
+   return -EBADF;


probably ENOENT makes more sense here.


+
+   if (event->state != PERF_EVENT_STATE_ACTIVE)
+   return -ENOENT;


and -EINVAL here?
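
For illustration, the error-code mapping suggested in the two comments above can be sketched in plain userspace C. This is only a sketch under stated assumptions: struct ev, enum ev_state and check_slot() are hypothetical stand-ins for struct perf_event, its state field and the checks in the helper, not the patch's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the perf_event state and object. */
enum ev_state { EV_INACTIVE = 0, EV_ACTIVE = 1 };

struct ev {
	enum ev_state state;
};

/*
 * Validate a slot lookup with the error codes proposed in the review:
 * index out of range -> -E2BIG, empty slot -> -ENOENT,
 * event present but not active -> -EINVAL.
 */
static int check_slot(struct ev *const *slots, size_t nslots, size_t index)
{
	assert(slots != NULL);

	if (index >= nslots)
		return -E2BIG;
	if (!slots[index])
		return -ENOENT;
	if (slots[index]->state != EV_ACTIVE)
		return -EINVAL;
	return 0;
}
```

With this mapping a caller can tell "nothing stored at this index" apart from "something stored, but not usable right now".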


Re: string_escape_mem ESCAPE_SPACE

2015-07-23 Thread Kees Cook

On Thu, Jul 23, 2015 at 1:50 PM, Andy Shevchenko
 wrote:
> On Thu, 2015-07-23 at 13:36 -0700, Kees Cook wrote:
>> On Thu, Jul 23, 2015 at 1:27 PM, Andy Shevchenko
>>  wrote:
>> > On Thu, 2015-07-23 at 12:59 -0700, Kees Cook wrote:
>> > > Hi,
>> > >
>> > > I'm curious why ESCAPE_SPACE doesn't escape spaces (0x20)?
>> >
>> > Space is a printable character.
>> > You perhaps want something like ESCAPE_SPACE | ESCAPE_HEX.
>>
>> Yeah, I can get the effect I want with:
>>
>> flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NULL | ESCAPE_HEX;
>> esc = "\f\n\r\t\v\\\a\e\0 ";
>
> esc can't contain '\0' in the middle.

Ah, yes, of course.

> So, you would like to convert only space to hex and leave everything
> else printable as is?

That was one idea I was having, yes.

>> This isn't reachable via kasprintf, though (it always has a NULL
>> esc).
>> I will consider some options and send patches.
>
> Before doing this, describe your use case in detail, please.

Sure. I'd like to be able to hex-escape everything <0x20, >0x7f, and
". (I'm working on building a quotable string that is safe to log.) I
think I've settled for a subset of this as:

string_escape_mem(src, slen, dst, 0, ESCAPE_HEX, "\f\n\r\t\v\a\e\\\"")
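
To make the intended result concrete, here is a minimal userspace sketch of the escaping described above (hex-escape bytes below 0x20, above 0x7f, and the double quote). It does not call string_escape_mem(); hex_escape() and needs_hex() are hypothetical names, the \xHH output form is an assumption, and the backslash is escaped as well so that the output stays unambiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Bytes that must not appear verbatim in a quoted log string. */
static int needs_hex(unsigned char c)
{
	return c < 0x20 || c > 0x7f || c == '"' || c == '\\';
}

/*
 * Escape slen bytes of src into dst as a NUL-terminated string.
 * Returns the length the complete output needs (excluding the NUL),
 * so a return value >= dst_size means the output was truncated.
 */
static size_t hex_escape(const unsigned char *src, size_t slen,
			 char *dst, size_t dst_size)
{
	size_t need = 0, used = 0;
	int full = 0;

	for (size_t i = 0; i < slen; i++) {
		char buf[5];
		size_t n;

		if (needs_hex(src[i])) {
			n = (size_t)snprintf(buf, sizeof(buf), "\\x%02x",
					     (unsigned int)src[i]);
		} else {
			buf[0] = (char)src[i];
			n = 1;
		}

		/* Never emit a partial escape sequence. */
		if (!full && used + n < dst_size) {
			memcpy(dst + used, buf, n);
			used += n;
		} else {
			full = 1;
		}
		need += n;
	}

	if (dst_size > 0)
		dst[used] = '\0';
	return need;
}
```

A plain space (0x20) passes through unchanged here, which matches the behaviour discussed in this thread.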

>> > >  That is
>> > > surprising to me, especially since things like isspace() include
>> > > 0x20.
>> >
>> > Moreover, there are test cases in test-string_helpers.c module and
>> > they are based on the real use cases (before helpers were introduced
>> > and users were converted). So, there is no user which expects hex
>> > conversion of the printable character if not asked explicitly.
>>
>> Yeah, I saw it was testing for space to be excluded. I guess I just
>> think the name "ESCAPE_SPACE" is misleading. :)
>
> For sake of name shortness I suppose. The idea is to escape *special*
> spaces by this.

Makes sense. I'll send a patch with some clarifications on the comments.

Thanks!

-Kees

-- 
Kees Cook
Chrome OS Security

Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-23 Thread David Rientjes

On Thu, 23 Jul 2015, Jörn Engel wrote:

> > This is wrong, you'd want to do any cond_resched() before the page 
> > allocation to avoid racing with an update to h->nr_huge_pages or 
> > h->surplus_huge_pages while hugetlb_lock was dropped that would result in 
> > the page having been uselessly allocated.
> 
> There are three options.  Either
>   /* some allocation */
>   cond_resched();
> or
>   cond_resched();
>   /* some allocation */
> or
>   if (cond_resched()) {
>   spin_lock(&hugetlb_lock);
>   continue;
>   }
>   /* some allocation */
> 
> I think you want the second option instead of the first.  That way we
> have a little less memory allocation for the time we are scheduled out.
> Sure, we can do that.  It probably doesn't make a big difference either
> way, but why not.
> 

The loop is dropping the lock simply to do the allocation and it needs to 
compare with the user-written number of hugepages to allocate.

What we don't want is to allocate, reschedule, and check if we really 
needed to allocate.  That's what your patch does because it races with 
persistent_huge_page().  It's probably the worst place to do it.

Rather, what you want to do is check if you need to allocate, reschedule 
if needed (and if so, recheck), and then allocate.

> If you are asking for the third option, I would rather avoid that.  It
> makes the code more complex and doesn't change the fact that we have a
> race and better be able to handle the race.  The code size growth will
> likely cost us more performance than we would ever gain.  nr_huge_pages
> tends to get updated once per system boot.
> 

Your third option is nonsensical, you didn't save the state of whether you 
locked the lock so you can't reliably unlock it, and you cannot hold a 
spinlock while allocating in this context.

Re: [PATCH v3 1/3] bpf: Add new bpf map type to store the pointer to struct perf_event

2015-07-23 Thread Alexei Starovoitov


On 7/23/15 2:42 AM, Kaixu Xia wrote:

Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'.
This map only stores the pointer to struct perf_event. The
user space event FDs from perf_event_open() syscall are converted
to the pointer to struct perf_event and stored in map.

...

+static struct bpf_map *perf_event_array_map_alloc(union bpf_attr *attr)
+{
+   /* only the pointer to struct perf_event can be stored in
+* perf_event_array map
+*/
+   if (attr->value_size != sizeof(u32))
+   return ERR_PTR(-EINVAL);
+
+   return array_map_alloc(attr);
+}


since it's exactly the same as prog_array_map_alloc(),
just rename it to something like 'fd_array_map_alloc'
and use for both types.


+static int perf_event_array_map_get_next_key(struct bpf_map *map, void *key,
+void *next_key)
+{
+   return -EINVAL;
+}
+
+static void *perf_event_array_map_lookup_elem(struct bpf_map *map, void *key)
+{
+   return NULL;
+}


same for the above two.
rename prog_array_map_* into fd_array_map_* and use for both map types.


+static struct perf_event *convert_map_with_perf_event(void *value)
+{
+   struct perf_event *event;
+   u32 fd;
+
+   fd = *(u32 *)value;
+
+   event = perf_event_get(fd);
+   if (IS_ERR(event))
+   return NULL;


don't lose error code, do 'return event' instead.


+
+   /* limit the event type to PERF_TYPE_RAW
+* and PERF_TYPE_HARDWARE.
+*/
+   if (event->attr.type != PERF_TYPE_RAW &&
+   event->attr.type != PERF_TYPE_HARDWARE)
+   return NULL;


perf_event refcnt leak? need to do put_event.
and return ERR_PTR(-EINVAL)


+
+   return event;
+}
+
+/* only called from syscall */
+static int perf_event_array_map_update_elem(struct bpf_map *map, void *key,
+   void *value, u64 map_flags)
+{
+   struct bpf_array *array = container_of(map, struct bpf_array, map);
+   struct perf_event *event;
+   u32 index = *(u32 *)key;
+
+   if (map_flags != BPF_ANY)
+   return -EINVAL;
+
+   if (index >= array->map.max_entries)
+   return -E2BIG;
+
+   /* check if the value is already stored */
+   if (array->events[index])
+   return -EINVAL;
+
+   /* convert the fd to the pointer to struct perf_event */
+   event = convert_map_with_perf_event(value);


imo helper name is misleading and it's too short to be separate
function. Just inline it and you can reuse 'index' variable.


+   if (!event)
+   return -EBADF;
+
+   xchg(array->events + index, event);


refcnt leak of old event! Please think it through.
This type of bugs I shouldn't be finding.
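
The pattern being asked for can be sketched in userspace C11: an exchange hands back the previous pointer, and the reference held on it has to be dropped. This is a hedged illustration only; struct event, event_put() and slot_replace() are hypothetical names, and the plain integer refcount stands in for the kernel's real reference counting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for struct perf_event. */
struct event {
	int refcnt;
};

/* Drop one reference; a NULL pointer is accepted and ignored. */
static void event_put(struct event *e)
{
	if (e)
		e->refcnt--;
}

/*
 * Install new_ev in the slot. The exchange returns the previous
 * occupant, whose reference must be released or it leaks.
 */
static void slot_replace(struct event *_Atomic *slot, struct event *new_ev)
{
	struct event *old = atomic_exchange(slot, new_ev);

	event_put(old);
}
```

Discarding the return value of the exchange, as the quoted xchg() call does, is what loses the old reference.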


+static int perf_event_array_map_delete_elem(struct bpf_map *map, void *key)
+{
+   return -EINVAL;
+}


no way to dec refcnt of perf_event from user space?
why not to do the same as prog_array_delete?


Re: [PATCH 2/2] locktorture: 'tis a slow death

2015-07-23 Thread Paul E. McKenney

On Thu, Jul 23, 2015 at 12:54:42PM -0700, Davidlohr Bueso wrote:
> On Wed, 2015-07-22 at 17:13 -0700, Paul E. McKenney wrote:
> > I need to see something more than what I am seeing for me to be able
> > to accept this, cute though it unarguably is.
> 
> heh I didn't consider copyright for this kind of stuff. And was naive to
> think that keeping his (what I assume to be) initials was enough. I've
> contacted the author for permission to use his work.
> 
> But yeah, I had to laugh when I saw this. Although I probably triggered
> some red flag by googling 'torture' and 'weapons' :-)

OK, once you get permission from the author to use under GPLv2, no
problem.

Thanx, Paul


[PATCH] irqchip: bcm7120-l2: Fix interrupt status for multiple parent IRQs

2015-07-23 Thread Florian Fainelli

Our irq-bcm7120-l2 interrupt controller driver utilizes the same handler
function for the different parent interrupts it services: UPG_MAIN, UPG_BSC for
instance.

The problem is that this function reads the IRQSTAT register which can combine
interrupt causes for different parent interrupts, such that we can end up in
the following situation:

- CPU takes an interrupt
- bcm7120_l2_intc_irq_handle() reads IRQSTAT
- generic_handle_irq() is invoked
- there are still pending interrupts flagged in IRQSTAT from a different parent
- handle_bad_irq() is invoked for these since they come from a different 
irq_desc/irq

In order to fix this, make sure that we always mask IRQSTAT with the
appropriate bits that correspond to the parent interrupt source this is coming
from. To simplify things, associate a unique structure per parent interrupt
handler to avoid multiplying the number of lookups.
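
The masking described above can be illustrated with a small userspace sketch. The names struct parent_irq_mask and pending_for_parent() are hypothetical and only model the idea; the real driver reads the status register through the generic irqchip helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-parent data: which status bits route to this parent. */
struct parent_irq_mask {
	uint32_t map_mask;
};

/*
 * Keep only the status bits that are both enabled and owned by the
 * parent interrupt that actually fired. Bits belonging to another
 * parent stay pending for that parent's own handler invocation.
 */
static uint32_t pending_for_parent(uint32_t irqstat, uint32_t enabled,
				   const struct parent_irq_mask *parent)
{
	assert(parent != NULL);

	return irqstat & enabled & parent->map_mask;
}
```

With two parents owning bits 0x03 and 0x0c, a status word of 0x0f yields 0x03 for the first handler call and 0x0c for the second, so no bit is dispatched under the wrong parent.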

Fixes: a5042de2688d ("irqchip: bcm7120-l2: Add Broadcom BCM7120-style Level 2 
interrupt controller")
Signed-off-by: Florian Fainelli 
---
 drivers/irqchip/irq-bcm7120-l2.c | 51 ++--
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/drivers/irqchip/irq-bcm7120-l2.c b/drivers/irqchip/irq-bcm7120-l2.c
index 3ba5cc780fcb..8302d45d13ac 100644
--- a/drivers/irqchip/irq-bcm7120-l2.c
+++ b/drivers/irqchip/irq-bcm7120-l2.c
@@ -47,14 +47,20 @@ struct bcm7120_l2_intc_data {
struct irq_domain *domain;
bool can_wake;
u32 irq_fwd_mask[MAX_WORDS];
-   u32 irq_map_mask[MAX_WORDS];
+   struct bcm7120_l1_intc_data *l1_data;
int num_parent_irqs;
const __be32 *map_mask_prop;
 };
 
+struct bcm7120_l1_intc_data {
+   struct bcm7120_l2_intc_data *b;
+   u32 irq_map_mask[MAX_WORDS];
+};
+
 static void bcm7120_l2_intc_irq_handle(unsigned int irq, struct irq_desc *desc)
 {
-   struct bcm7120_l2_intc_data *b = irq_desc_get_handler_data(desc);
+   struct bcm7120_l1_intc_data *data = irq_desc_get_handler_data(desc);
+   struct bcm7120_l2_intc_data *b = data->b;
struct irq_chip *chip = irq_desc_get_chip(desc);
unsigned int idx;
 
@@ -69,7 +75,8 @@ static void bcm7120_l2_intc_irq_handle(unsigned int irq, 
struct irq_desc *desc)
 
irq_gc_lock(gc);
pending = irq_reg_readl(gc, b->stat_offset[idx]) &
-   gc->mask_cache;
+   gc->mask_cache &
+   data->irq_map_mask[idx];
irq_gc_unlock(gc);
 
for_each_set_bit(hwirq, &pending, IRQS_PER_WORD) {
@@ -107,8 +114,9 @@ static void bcm7120_l2_intc_resume(struct irq_data *d)
 
 static int bcm7120_l2_intc_init_one(struct device_node *dn,
struct bcm7120_l2_intc_data *data,
-   int irq)
+   int irq, u32 *valid_mask)
 {
+   struct bcm7120_l1_intc_data *l1_data = &data->l1_data[irq];
int parent_irq;
unsigned int idx;
 
@@ -120,18 +128,27 @@ static int bcm7120_l2_intc_init_one(struct device_node 
*dn,
 
/* For multiple parent IRQs with multiple words, this looks like:
 * 
+*
+* We need to associate a given parent interrupt with its corresponding
+* map_mask in order to mask the status register with it because we
+* have the same handler being called for multiple parent interrupts.
+*
+* This is typically something needed on BCM7xxx (STB chips).
 */
for (idx = 0; idx < data->n_words; idx++) {
if (data->map_mask_prop) {
-   data->irq_map_mask[idx] |=
+   l1_data->irq_map_mask[idx] |=
be32_to_cpup(data->map_mask_prop +
 irq * data->n_words + idx);
} else {
-   data->irq_map_mask[idx] = 0x;
+   l1_data->irq_map_mask[idx] = 0x;
}
+   valid_mask[idx] |= l1_data->irq_map_mask[idx];
}
 
-   irq_set_handler_data(parent_irq, data);
+   l1_data->b = data;
+
+   irq_set_handler_data(parent_irq, l1_data);
irq_set_chained_handler(parent_irq, bcm7120_l2_intc_irq_handle);
 
return 0;
@@ -214,6 +231,7 @@ int __init bcm7120_l2_intc_probe(struct device_node *dn,
struct irq_chip_type *ct;
int ret = 0;
unsigned int idx, irq, flags;
+   u32 valid_mask[MAX_WORDS] = { };
 
data = kzalloc(sizeof(*data), GFP_KERNEL);
if (!data)
@@ -226,9 +244,16 @@ int __init bcm7120_l2_intc_probe(struct device_node *dn,
goto out_unmap;
}
 
+   data->l1_data = kcalloc(data->num_parent_irqs, sizeof(*data->l1_data),
+   GFP_KERNEL);
+   if (!data->l1_data) {
+   ret = -ENOMEM;
+

[RFC 2/3] dts: zynq: Add devicetree entry for PL reset controller.

2015-07-23 Thread Moritz Fischer

Signed-off-by: Moritz Fischer 
---
 arch/arm/boot/dts/zynq-7000.dtsi | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm/boot/dts/zynq-7000.dtsi b/arch/arm/boot/dts/zynq-7000.dtsi
index b429e1d..a56fe11 100644
--- a/arch/arm/boot/dts/zynq-7000.dtsi
+++ b/arch/arm/boot/dts/zynq-7000.dtsi
@@ -258,6 +258,12 @@
reg = <0x100 0x100>;
};
 
+   rstc: rstc@240 {
+   #reset-cells = <1>;
+   compatible = "xlnx,zynq-reset-pl";
+   syscon = <>;
+   };
+
pinctrl0: pinctrl@700 {
compatible = "xlnx,pinctrl-zynq";
reg = <0x700 0x200>;
-- 
2.4.3


[RFC 1/3] docs: dts: Added documentation for Xilinx Zynq PL Reset bindings.

2015-07-23 Thread Moritz Fischer

Signed-off-by: Moritz Fischer 
---
 Documentation/devicetree/bindings/reset/zynq-reset-pl.txt | 13 +
 1 file changed, 13 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/reset/zynq-reset-pl.txt

diff --git a/Documentation/devicetree/bindings/reset/zynq-reset-pl.txt 
b/Documentation/devicetree/bindings/reset/zynq-reset-pl.txt
new file mode 100644
index 000..ac4499e
--- /dev/null
+++ b/Documentation/devicetree/bindings/reset/zynq-reset-pl.txt
@@ -0,0 +1,13 @@
+Xilinx Zynq PL Reset Manager
+
+Required properties:
+- compatible: "xlnx,zynq-reset-pl"
+- syscon <>;
+- #reset-cells: 1
+
+Example:
+   rstc: rstc@240 {
+   #reset-cells = <1>;
+   compatible = "xlnx,zynq-reset-pl";
+   syscon = <>;
+   };
-- 
2.4.3


[RFC 0/3] Adding support for Zynq PL reset controller.

2015-07-23 Thread Moritz Fischer

Hi all,

while trying to get the devicetree overlays working using Alan's
simple-fpga-bus I couldn't find a way to independently reset
parts of the PL logic. I might have missed something and this
exists already somewhere, in that case, oh well ...

If Sören or Michael could take a look at this to let me know
if this is fundamentally wrong, that would be great.

Thanks,

Moritz

Moritz Fischer (3):
  docs: dts: Added documentation for Xilinx Zynq PL Reset bindings.
  dts: zynq: Add devicetree entry for PL reset controller.
  reset: reset-zynq-pl: Adding support for Xilinx Zynq PL reset.

 .../devicetree/bindings/reset/zynq-reset-pl.txt|  13 ++
 arch/arm/boot/dts/zynq-7000.dtsi   |   6 +
 drivers/reset/Makefile |   1 +
 drivers/reset/reset-zynq-pl.c  | 142 +
 4 files changed, 162 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/reset/zynq-reset-pl.txt
 create mode 100644 drivers/reset/reset-zynq-pl.c

-- 
2.4.3


[RFC 3/3] reset: reset-zynq-pl: Adding support for Xilinx Zynq PL reset.

2015-07-23 Thread Moritz Fischer

The Zynq PL reset controller allows controlling the four
FCLK{0..3}_RESETN signals that can be used to reset custom IP in
the PL.
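
As a rough userspace model of how one reset line is asserted and deasserted through a masked read-modify-write, consider the sketch below. update_bits() mirrors the masked-update semantics of regmap_update_bits(), where only the bits in the mask change; the helper names and the 32-bit register width are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (UINT32_C(1) << (n))

/* Masked update: bits inside mask take their value from val. */
static uint32_t update_bits(uint32_t old, uint32_t mask, uint32_t val)
{
	return (old & ~mask) | (val & mask);
}

/* Setting the line's bit holds that line in reset. */
static uint32_t reset_assert(uint32_t reg, unsigned int line)
{
	assert(line < 32);
	return update_bits(reg, BIT(line), BIT(line));
}

/* Clearing the line's bit releases that line from reset. */
static uint32_t reset_deassert(uint32_t reg, unsigned int line)
{
	assert(line < 32);
	return update_bits(reg, BIT(line), 0);
}
```

Because the value is ANDed with the mask, passing ~BIT(line) as the value clears the same single bit as passing 0.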

Signed-off-by: Moritz Fischer 
---
 drivers/reset/Makefile|   1 +
 drivers/reset/reset-zynq-pl.c | 142 ++
 2 files changed, 143 insertions(+)
 create mode 100644 drivers/reset/reset-zynq-pl.c

diff --git a/drivers/reset/Makefile b/drivers/reset/Makefile
index 157d421..5c86f92 100644
--- a/drivers/reset/Makefile
+++ b/drivers/reset/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_ARCH_SOCFPGA) += reset-socfpga.o
 obj-$(CONFIG_ARCH_BERLIN) += reset-berlin.o
 obj-$(CONFIG_ARCH_SUNXI) += reset-sunxi.o
 obj-$(CONFIG_ARCH_STI) += sti/
+obj-$(CONFIG_ARCH_ZYNQ) += reset-zynq-pl.o
diff --git a/drivers/reset/reset-zynq-pl.c b/drivers/reset/reset-zynq-pl.c
new file mode 100644
index 000..3e04ab0
--- /dev/null
+++ b/drivers/reset/reset-zynq-pl.c
@@ -0,0 +1,142 @@
+/*
+ * Xilinx Zynq PL Reset Controller
+ *
+ * Copyright (c) 2015, National Instruments Corp.
+ * Author: Moritz Fischer 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Offsets into SLCR regmap */
+#define SLCR_FPGA_RST_CTRL_OFFSET  0x240 /* FPGA Software Reset Control */
+
+struct zynq_pl_reset_data {
+   struct regmap *slcr;
+   struct reset_controller_dev rcdev;
+};
+
+static int zynq_pl_reset_assert(struct reset_controller_dev *rcdev,
+   unsigned long id)
+{
+   struct zynq_pl_reset_data *priv = container_of(rcdev,
+   struct zynq_pl_reset_data,
+   rcdev);
+
+   int offset = id % BITS_PER_LONG;
+
+   regmap_update_bits(priv->slcr,
+  SLCR_FPGA_RST_CTRL_OFFSET,
+  BIT(offset),
+  BIT(offset));
+
+   return 0;
+}
+
+static int zynq_pl_reset_deassert(struct reset_controller_dev *rcdev,
+ unsigned long id)
+{
+   struct zynq_pl_reset_data *priv = container_of(rcdev,
+   struct zynq_pl_reset_data,
+   rcdev);
+
+   int offset = id % BITS_PER_LONG;
+
+   regmap_update_bits(priv->slcr,
+  SLCR_FPGA_RST_CTRL_OFFSET,
+  BIT(offset),
+  ~BIT(offset));
+
+   return 0;
+}
+
+static int zynq_pl_reset_status(struct reset_controller_dev *rcdev,
+   unsigned long id)
+{
+   struct zynq_pl_reset_data *priv = container_of(rcdev,
+   struct zynq_pl_reset_data,
+   rcdev);
+   int offset = id % BITS_PER_LONG;
+   u32 reg;
+
+   regmap_read(priv->slcr, SLCR_FPGA_RST_CTRL_OFFSET, &reg);
+
+   return !(reg & BIT(offset));
+}
+
+static const struct reset_control_ops zynq_pl_reset_ops = {
+   .assert = zynq_pl_reset_assert,
+   .deassert   = zynq_pl_reset_deassert,
+   .status = zynq_pl_reset_status,
+};
+
+static int zynq_pl_reset_probe(struct platform_device *pdev)
+{
+   struct zynq_pl_reset_data *priv;
+
+   priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
+   if (!priv)
+   return -ENOMEM;
+   platform_set_drvdata(pdev, priv);
+
+   priv->slcr = syscon_regmap_lookup_by_phandle(pdev->dev.of_node,
+   "syscon");
+   if (IS_ERR(priv->slcr)) {
+   dev_err(&pdev->dev, "unable to get zynq-slcr regmap");
+   return PTR_ERR(priv->slcr);
+   }
+
+   priv->rcdev.owner = THIS_MODULE;
+   priv->rcdev.nr_resets = BITS_PER_LONG;
+   priv->rcdev.ops = &zynq_pl_reset_ops;
+   priv->rcdev.of_node = pdev->dev.of_node;
+   reset_controller_register(&priv->rcdev);
+
+   return 0;
+}
+
+static int zynq_pl_reset_remove(struct platform_device *pdev)
+{
+   struct zynq_pl_reset_data *priv = platform_get_drvdata(pdev);
+
+   reset_controller_unregister(&priv->rcdev);
+
+   return 0;
+}
+
+static const struct of_device_id zynq_pl_reset_dt_ids[] = {
+   { .compatible = "xlnx,zynq-reset-pl", },
+   { /* sentinel */ },
+};
+
+static struct platform_driver zynq_pl_reset_driver = {
+   .probe  = zynq_pl_reset_probe,
+   .remove = zynq_pl_reset_remove,
+   .driver = {
+

[linux41] Kernel panic at i686

2015-07-23 Thread Philip Müller

Hi all,

I started to test linux 4.1 series with rc6. However, I was never able
to boot that kernel in i686 architecture. Trying it again with
VirtualBox gave me more conclusions. Using one core it simply boots up.
Using more than one CPU core it crashes with:

Failed to access perfctr msr (MSR c0010007 is 0)

task: f58e ti: f58e8000 task.ti: f58e800
EIP: 0060:[] EFLAGS: 00010206 CPU: 0
EIP is at free_cache_attributes+0x83/0xd0
EAX: 0001 EBX: f589d46c ECX: 0090 EDX: 360c2000
ESI:  EDI: c1724a80 EBP: f58e9ec0 ESP: f58e9ea0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00ac CR3: 01731000 CR4: 06d0

In more rich detail you can find that problem on my bug-tracker for
Manjaro Linux:

https://github.com/manjaro/packages-core/issues/14

I just want to know if you are aware of it. With current 4.1.3 release I
still face that issue ...

kind regards
Philip Müller
--
Manjaro Project-Lead

Re: [PATCH 0/3] x86_64: Make int3 non-magical

2015-07-23 Thread Andy Lutomirski

On Thu, Jul 23, 2015 at 3:37 PM, Andy Lutomirski  wrote:
> int3 uses IST and the paranoid gsbase path.  Neither is necessary,
> although the IST stack may currently be necessary to avoid stack
> overruns.
>
> Clean up IRQ stacks, make them NMI safe, teach idtentry to use
> irqstacks if requested, and move int3 to the IRQ stack.
>
> This prepares us to return from int3 using RET.  While we could,
> in principle, return from an IST entry using RET, making that work
> seems likely to be much messier and more fragile than this approach.

Also, don't let the diffstat fool you.  If this works and if we can do
the same thing to do_debug, then we can do this:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist=1bc1f0ae8f1ea76486059a98cdbdfbdbc668aaf9

which makes it a big net win in complexity.

--Andy

[PATCH 1/3] x86/entry/64: Refactor IRQ stacks and make them NMI-safe

2015-07-23 Thread Andy Lutomirski

This will allow IRQ stacks to nest inside NMIs or similar entries
that can happen during IRQ stack setup or teardown.

The Xen code here has a confusing comment.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/entry_64.S| 72 ++--
 arch/x86/kernel/cpu/common.c |  2 +-
 arch/x86/kernel/process_64.c |  4 +++
 3 files changed, 47 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d3033183ed70..5f7df8949fa7 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -491,6 +491,39 @@ ENTRY(irq_entries_start)
 END(irq_entries_start)
 
 /*
+ * Enters the IRQ stack if we're not already using it.  NMI-safe.  Clobbers
+ * flags and puts old RSP into old_rsp, and leaves all other GPRs alone.
+ * Requires kernel GSBASE.
+ *
+ * The invariant is that, if irq_count != 0, then we're either on the
+ * IRQ stack or an IST stack, even if an NMI interrupts IRQ stack entry
+ * or exit.
+ */
+.macro ENTER_IRQ_STACK old_rsp
+   movq    %rsp, \old_rsp
+   cmpl    $0, PER_CPU_VAR(irq_count)
+   jne     694f
+   movq    PER_CPU_VAR(irq_stack_ptr), %rsp
+   /*
+    * Right now, we're on the irq stack with irq_count == 0.  A nested
+    * IRQ stack switch could clobber the stack.  That's fine: the stack
+    * is empty.
+    */
+694:
+   incl    PER_CPU_VAR(irq_count)
+   pushq   \old_rsp
+.endm
+
+/*
+ * Undoes ENTER_IRQ_STACK
+ */
+.macro LEAVE_IRQ_STACK
+   /* We need to be off the IRQ stack before decrementing irq_count. */
+   popq    %rsp
+   decl    PER_CPU_VAR(irq_count)
+.endm
+
+/*
  * Interrupt entry/exit.
  *
  * Interrupt entry points save only callee clobbered registers in fast path.
@@ -518,17 +551,7 @@ END(irq_entries_start)
 #endif
 
 1:
-   /*
-* Save previous stack pointer, optionally switch to interrupt stack.
-* irq_count is used to check if a CPU is already on an interrupt stack
-* or not. While this is essentially redundant with preempt_count it is
-* a little cheaper to use a separate counter in the PDA (short of
-* moving irq_enter into assembly, which would be too much work)
-*/
-   movq    %rsp, %rdi
-   incl    PER_CPU_VAR(irq_count)
-   cmovzq  PER_CPU_VAR(irq_stack_ptr), %rsp
-   pushq   %rdi
+   ENTER_IRQ_STACK old_rsp=%rdi
/* We entered an interrupt context - irqs are off: */
TRACE_IRQS_OFF
 
@@ -548,10 +571,8 @@ common_interrupt:
 ret_from_intr:
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
-   decl    PER_CPU_VAR(irq_count)
 
-   /* Restore saved previous stack */
-   popq    %rsp
+   LEAVE_IRQ_STACK
 
testb   $3, CS(%rsp)
jz  retint_kernel
@@ -863,14 +884,9 @@ bad_gs:
 
 /* Call softirq on interrupt stack. Interrupts are off. */
 ENTRY(do_softirq_own_stack)
-   pushq   %rbp
-   mov     %rsp, %rbp
-   incl    PER_CPU_VAR(irq_count)
-   cmove   PER_CPU_VAR(irq_stack_ptr), %rsp
-   push    %rbp    /* frame pointer backlink */
+   ENTER_IRQ_STACK old_rsp=%r11
    call    __do_softirq
-   leaveq
-   decl    PER_CPU_VAR(irq_count)
+   LEAVE_IRQ_STACK
    ret
 END(do_softirq_own_stack)
 
@@ -889,25 +905,21 @@ idtentry xen_hypervisor_callback 
xen_do_hypervisor_callback has_error_code=0
  * So, on entry to the handler we detect whether we interrupted an
  * existing activation in its critical region -- if so, we pop the current
  * activation and restart the handler using the previous one.
+ *
+ * XXX: I have no idea what this comment is talking about.  --luto
  */
 ENTRY(xen_do_hypervisor_callback)  /* 
do_hypervisor_callback(struct *pt_regs) */
-
+   ENTER_IRQ_STACK old_rsp=%r11
 /*
  * Since we don't modify %rdi, evtchn_do_upall(struct *pt_regs) will
  * see the correct pointer to the pt_regs
  */
-   movq    %rdi, %rsp  /* we don't return, adjust the stack frame */
-11:incl    PER_CPU_VAR(irq_count)
-   movq    %rsp, %rbp
-   cmovzq  PER_CPU_VAR(irq_stack_ptr), %rsp
-   pushq   %rbp    /* frame pointer backlink */
    call    xen_evtchn_do_upcall
-   popq    %rsp
-   decl    PER_CPU_VAR(irq_count)
+   LEAVE_IRQ_STACK
 #ifndef CONFIG_PREEMPT
    call    xen_maybe_preempt_hcall
-   jmp error_exit
+   ret
 END(xen_do_hypervisor_callback)
 
 /*
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 1c528b06f802..e9968531ce56 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1161,7 +1161,7 @@ EXPORT_PER_CPU_SYMBOL(current_task);
 DEFINE_PER_CPU(char *, irq_stack_ptr) =
init_per_cpu_var(irq_stack_union.irq_stack) + IRQ_STACK_SIZE - 64;
 
-DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
+DEFINE_PER_CPU(unsigned int, irq_count) __visible;

[PATCH 3/3] x86/entry/64: Move #BP from IST to the IRQ stack

2015-07-23 Thread Andy Lutomirski

There's nothing IST-worthy about #BP/int3.  We don't allow kprobes
in the small handful of places in the kernel that run at CPL0 with
an invalid stack, and 32-bit kernels have used normal interrupt
gates for #BP forever.

Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/entry_64.S |  2 +-
 arch/x86/kernel/traps.c   | 26 +-
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce72beba6045..fb3253ae7ecc 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -990,7 +990,7 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
+idtentry int3 do_int3 has_error_code=0 irqstack=1
 idtentry stack_segment do_stack_segment has_error_code=1
 
 #ifdef CONFIG_XEN
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8e65d8a9b8db..d823db70f492 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -479,7 +479,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
 }
 NOKPROBE_SYMBOL(do_general_protection);
 
-/* May run on IST stack. */
+/* Runs on IRQ stack. */
 dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 {
 #ifdef CONFIG_DYNAMIC_FTRACE
@@ -494,7 +494,15 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, 
long error_code)
if (poke_int3_handler(regs))
return;
 
+   /*
+* Use ist_enter despite the fact that we don't use an IST stack.
+* We can be called from a kprobe in non-CONTEXT_KERNEL kernel
+* mode or even during context tracking state changes.
+*
+* This means that we can't schedule.  That's okay.
+*/
ist_enter(regs);
+
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
@@ -511,15 +519,10 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, 
long error_code)
SIGTRAP) == NOTIFY_STOP)
goto exit;
 
-   /*
-* Let others (NMI) know that the debug stack is in use
-* as we may switch to the interrupt stack.
-*/
-   debug_stack_usage_inc();
preempt_conditional_sti(regs);
do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, error_code, NULL);
preempt_conditional_cli(regs);
-   debug_stack_usage_dec();
+
 exit:
ist_exit(regs);
 }
@@ -885,19 +888,16 @@ void __init trap_init(void)
cpu_init();
 
/*
-* X86_TRAP_DB and X86_TRAP_BP have been set
-* in early_trap_init(). However, ITS works only after
-* cpu_init() loads TSS. See comments in early_trap_init().
+* X86_TRAP_DB was installed in early_trap_init(). However,
+* IST works only after cpu_init() loads TSS. See comments
+* in early_trap_init().
 */
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
-   /* int3 can be called from all */
-   set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
 
x86_init.irqs.trap_init();
 
 #ifdef CONFIG_X86_64
memcpy(&debug_idt_table, &idt_table, IDT_ENTRIES * 16);
set_nmi_gate(X86_TRAP_DB, &debug);
-   set_nmi_gate(X86_TRAP_BP, &int3);
 #endif
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/3] x86/entry/64: Teach idtentry to use the IRQ stack

2015-07-23 Thread Andy Lutomirski

We don't specifically need IST for things like kprobes, but we do
want to avoid rare, surprising extra stack usage if a kprobe hits
with a deep stack.

Teach idtentry to use the IRQ stack for selected entries.

This implementation uses the IRQ stack even if we entered from user
mode.  This disallows tricks like ist_begin_non_atomic.  If we ever
need such a trick in one of these entries, we can rework this.  For
now, let's keep it simple.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/entry_64.S | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 5f7df8949fa7..ce72beba6045 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -739,13 +739,17 @@ apicinterrupt IRQ_WORK_VECTOR 
irq_work_interrupt  smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
 
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req irqstack=0 paranoid=0 shift_ist=-1
 ENTRY(\sym)
/* Sanity check */
.if \shift_ist != -1 && \paranoid == 0
.error "using shift_ist requires paranoid=1"
.endif
 
+   .if \irqstack && \paranoid
+   .error "using irqstack requires !paranoid"
+   .endif
+
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
 
@@ -787,8 +791,16 @@ ENTRY(\sym)
subq    $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
 
+   .if \irqstack
+   ENTER_IRQ_STACK old_rsp=%rcx
+   .endif
+
call    \do_sym
 
+   .if \irqstack
+   LEAVE_IRQ_STACK
+   .endif
+
.if \shift_ist != -1
addq    $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
-- 
2.4.3


[PATCH 0/3] x86_64: Make int3 non-magical

2015-07-23 Thread Andy Lutomirski

int3 uses IST and the paranoid gsbase path.  Neither is necessary,
although the IST stack may currently be necessary to avoid stack
overruns.

Clean up IRQ stacks, make them NMI safe, teach idtentry to use
irqstacks if requested, and move int3 to the IRQ stack.

This prepares us to return from int3 using RET.  While we could,
in principle, return from an IST entry using RET, making that work
seems likely to be much messier and more fragile than this approach.

Andy Lutomirski (3):
  x86/entry/64: Refactor IRQ stacks and make them NMI-safe
  x86/entry/64: Teach idtentry to use the IRQ stack
  x86/entry/64: Move #BP from IST to the IRQ stack

 arch/x86/entry/entry_64.S| 88 
 arch/x86/kernel/cpu/common.c |  2 +-
 arch/x86/kernel/process_64.c |  4 ++
 arch/x86/kernel/traps.c  | 26 ++---
 4 files changed, 74 insertions(+), 46 deletions(-)

-- 
2.4.3


Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-23 Thread Jörn Engel

On Thu, Jul 23, 2015 at 03:08:58PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Spencer Baugh wrote:
> > From: Joern Engel 
> > 
> > ~150ms scheduler latency for both observed in the wild.
> > 
> > Signed-off-by: Joern Engel 
> > Signed-off-by: Spencer Baugh 
> > ---
> >  mm/hugetlb.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a8c3087..2eb6919 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1836,6 +1836,7 @@ static unsigned long set_max_huge_pages(struct hstate 
> > *h, unsigned long count,
> > ret = alloc_fresh_gigantic_page(h, nodes_allowed);
> > else
> > ret = alloc_fresh_huge_page(h, nodes_allowed);
> > +   cond_resched();
> > spin_lock(&hugetlb_lock);
> > if (!ret)
> > goto out;
> 
> This is wrong, you'd want to do any cond_resched() before the page 
> allocation to avoid racing with an update to h->nr_huge_pages or 
> h->surplus_huge_pages while hugetlb_lock was dropped that would result in 
> the page having been uselessly allocated.

There are three options.  Either
	/* some allocation */
	cond_resched();
or
	cond_resched();
	/* some allocation */
or
	if (cond_resched()) {
		spin_lock(&hugetlb_lock);
		continue;
	}
	/* some allocation */

I think you want the second option instead of the first.  That way we
have a little less memory allocation for the time we are scheduled out.
Sure, we can do that.  It probably doesn't make a big difference either
way, but why not.

If you are asking for the third option, I would rather avoid that.  It
makes the code more complex and doesn't change the fact that we have a
race and better be able to handle the race.  The code size growth will
likely cost us more performance than we would ever gain.  nr_huge_pages
tends to get updated once per system boot.
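
The second option can be modelled in userspace roughly as follows.  This
is only a sketch under stated assumptions: struct pool, grow_pool(),
model_cond_resched() and the hook are hypothetical stand-ins for the
hstate, the allocation loop, cond_resched() and a concurrent updater.
None of it is the kernel code.  It shows the ordering only: drop the
lock, offer a scheduling point, allocate, retake the lock, and let the
loop condition re-check the target.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct hstate. */
struct pool {
	bool locked;				/* stands in for hugetlb_lock */
	unsigned long nr_pages;			/* stands in for h->nr_huge_pages */
	unsigned long allocs;			/* allocations done by this caller */
	void (*resched_hook)(struct pool *);	/* simulated "other task" */
};

static void pool_lock(struct pool *p)
{
	assert(!p->locked);
	p->locked = true;
}

static void pool_unlock(struct pool *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Stand-in for cond_resched(): lets the simulated other task run. */
static void model_cond_resched(struct pool *p)
{
	assert(!p->locked);	/* never yield while holding the lock */
	if (p->resched_hook)
		p->resched_hook(p);
}

/* Option 2: scheduling point first, then the allocation, then relock. */
static unsigned long grow_pool(struct pool *p, unsigned long target)
{
	pool_lock(p);
	while (p->nr_pages < target) {	/* re-checked under the lock */
		pool_unlock(p);
		model_cond_resched(p);
		p->allocs++;		/* "some allocation", still unlocked */
		pool_lock(p);
		p->nr_pages++;		/* publish the page under the lock */
	}
	pool_unlock(p);
	return p->nr_pages;
}

/* Example hook: another task adds one page while we are scheduled out. */
static void other_task_adds_page(struct pool *p)
{
	pool_lock(p);
	p->nr_pages++;
	pool_unlock(p);
}
```

In this model an update that lands during the scheduling point is seen
by the next loop test, but an allocation already under way can still
overshoot the target, so the race has to be handled in any case.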

> > @@ -3521,6 +3522,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
> > vm_area_struct *vma,
> > spin_unlock(ptl);
> > ret = hugetlb_fault(mm, vma, vaddr,
> > (flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> > +   cond_resched();
> > if (!(ret & VM_FAULT_ERROR))
> > continue;
> >  
> 
> This is almost certainly the wrong placement as well since it's inserted 
> inside a conditional inside a while loop and there's no reason to 
> hugetlb_fault(), schedule, and then check the return value.  You need to 
> insert your cond_resched()'s in legitimate places.

I assume you want the second option here as well.  Am I right?

Jörn

--
Sometimes it pays to stay in bed on Monday, rather than spending the rest
of the week debugging Monday's code.
-- Christopher Thompson

Re: [PATCH 1/2] serial: 8250: Fix autoconfig_irq() to avoid race conditions

2015-07-23 Thread gre...@linuxfoundation.org

On Fri, Jun 05, 2015 at 09:57:40AM +, Taichi Kageyama wrote:
> The following race conditions can happen if a serial is used as console.
>   Case1. CPU_B handles an interrupt from a serial
> autoconfig_irq() fails whether the interrupt is raised or not
> if CPU_B is disabled to handle interrupts for longer than it expects.
>   Case2. CPU_B clears UART_IER just after CPU_A sets UART_IER
> A serial may not make an interrupt.
> autoconfig_irq() can fail if the interrupt is not raised.
>   Case3. CPU_A sets UART_IER just after CPU_B clears UART_IER
> This is an unexpected behavior for uart_console_write().
> 
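>   CPU_A [autoconfig_irq]          CPU_B [serial8250_console_write]
>   ----------------------------------------------------------------
>   probe_irq_on()                  spin_lock_irqsave(&port->lock,)
>   serial_outp(,UART_IER,0x0f)     serial_out(,UART_IER,0)
>   udelay(20);                     uart_console_write()
>   probe_irq_off()
>                                   spin_unlock_irqrestore(&port->lock,)
>   ----------------------------------------------------------------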
> 
> If autoconfig_irq() fails, the console doesn't work in interrupt mode,
> the mode cannot be changed anymore, and "input overrun"
> (which can make operation mistakes) happens easily.
> This problem happens with high rate every boot once it occurs
> because the boot sequence is always almost same.
> 
> Signed-off-by: Taichi Kageyama 
> Cc: Naoya Horiguchi 
> Reviewed-by: Peter Hurley 
> ---
>   drivers/tty/serial/8250/8250_core.c |6 ++
>   1 files changed, 6 insertions(+), 0 deletions(-)

Does not apply to 4.2-rc3 :(

[RESEND][PATCHv2] v4l2-ioctl: Give more information when device_caps are missing

2015-07-23 Thread Laura Abbott

Currently, the warning for missing device_caps gives a backtrace like so:

[] dump_stack+0x45/0x57
[] warn_slowpath_common+0x8a/0xc0
[] warn_slowpath_null+0x1a/0x20
[] v4l_querycap+0x43/0x80 [videodev]
[] __video_do_ioctl+0x2a4/0x320 [videodev]
[] ? do_last+0x195/0x1210
[] video_usercopy+0x22e/0x5b0 [videodev]
[] ? v4l_querycap+0x80/0x80 [videodev]
[] video_ioctl2+0x15/0x20 [videodev]
[] v4l2_ioctl+0x113/0x150 [videodev]
[] do_vfs_ioctl+0x2f8/0x4f0
[] ? __audit_syscall_entry+0xb4/0x110
[] ? do_audit_syscall_entry+0x6c/0x70
[] SyS_ioctl+0x81/0xa0
[] ? __audit_syscall_exit+0x1f6/0x2a0
[] system_call_fastpath+0x12/0x17

This indicates that device_caps are missing but doesn't give
much of a clue which driver is actually at fault. Improve
the warning output by showing the capabilities and the
responsible driver.

Signed-off-by: Laura Abbott 
---
I sent this out right before I went on vacation but I never saw any follow up.
---
 drivers/media/v4l2-core/v4l2-ioctl.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/media/v4l2-core/v4l2-ioctl.c 
b/drivers/media/v4l2-core/v4l2-ioctl.c
index 85de455..ad7e929 100644
--- a/drivers/media/v4l2-core/v4l2-ioctl.c
+++ b/drivers/media/v4l2-core/v4l2-ioctl.c
@@ -1025,8 +1025,9 @@ static int v4l_querycap(const struct v4l2_ioctl_ops *ops,
 * Drivers MUST fill in device_caps, so check for this and
 * warn if it was forgotten.
 */
-   WARN_ON(!(cap->capabilities & V4L2_CAP_DEVICE_CAPS) ||
-   !cap->device_caps);
+   WARN(!(cap->capabilities & V4L2_CAP_DEVICE_CAPS) ||
+   !cap->device_caps, "Bad caps for driver %s, %x %x",
+   cap->driver, cap->capabilities, cap->device_caps);
cap->device_caps |= V4L2_CAP_EXT_PIX_FORMAT;
 
return ret;
-- 
2.4.3


Re: [PATCH] target: add support for START_STOP_UNIT SCSI opcode

2015-07-23 Thread Spencer Baugh

From: Brian Bunker 

AIX servers using VIOS servers that virtualize FC cards will have a
problem booting without support for START_STOP_UNIT.

v2: Cite sb3r36 exactly, clean up if conditions

Signed-off-by: Brian Bunker 
Signed-off-by: Spencer Baugh 
---
 drivers/target/target_core_sbc.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index e318ddb..85c3c0a 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -154,6 +154,38 @@ sbc_emulate_readcapacity_16(struct se_cmd *cmd)
return 0;
 }
 
+static sense_reason_t
+sbc_emulate_startstop(struct se_cmd *cmd)
+{
+   unsigned char *cdb = cmd->t_task_cdb;
+
+   /*
+* See sb3r36 section 5.25
+* Immediate bit should be set since there is nothing to complete
+* POWER CONDITION MODIFIER 0h
+*/
+   if (!(cdb[1] & 1) || cdb[2] || cdb[3])
+   return TCM_INVALID_CDB_FIELD;
+
+   /*
+* See sb3r36 section 5.25
+* POWER CONDITION 0h START_VALID - process START and LOEJ
+*/
+   if (cdb[4] >> 4 & 0xf)
+   return TCM_INVALID_CDB_FIELD;
+
+   /*
+* See sb3r36 section 5.25
+* LOEJ 0h - nothing to load or unload
+* START 1h - we are ready
+*/
+   if (!(cdb[4] & 1) || (cdb[4] & 2) || (cdb[4] & 4))
+   return TCM_INVALID_CDB_FIELD;
+
+   target_complete_cmd(cmd, SAM_STAT_GOOD);
+   return 0;
+}
+
 sector_t sbc_get_write_same_sectors(struct se_cmd *cmd)
 {
u32 num_blocks;
@@ -1069,6 +1101,10 @@ sbc_parse_cdb(struct se_cmd *cmd, struct sbc_ops *ops)
size = 0;
cmd->execute_cmd = sbc_emulate_noop;
break;
+   case START_STOP:
+   size = 0;
+   cmd->execute_cmd = sbc_emulate_startstop;
+   break;
default:
ret = spc_parse_cdb(cmd, &size);
if (ret)
-- 
2.5.0.rc3
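
For readers who want to try the acceptance rule outside the kernel, the
three checks in sbc_emulate_startstop() can be restated as a small
userspace predicate.  This is a sketch: the function name
start_stop_cdb_accepted() is hypothetical, and the field descriptions in
the comments follow the patch's own comments rather than an independent
reading of the SBC specification.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace restatement of the checks in the patch.
 * Returns true where the patch would complete the command with GOOD
 * status, false where it would return TCM_INVALID_CDB_FIELD.
 */
static bool start_stop_cdb_accepted(const unsigned char cdb[6])
{
	assert(cdb);

	/* IMMED (byte 1, bit 0) must be set; bytes 2 and 3 must be zero. */
	if (!(cdb[1] & 0x01) || cdb[2] || cdb[3])
		return false;

	/* POWER CONDITION (byte 4, bits 7..4) must be 0h. */
	if ((cdb[4] >> 4) & 0x0f)
		return false;

	/*
	 * START (bit 0) must be 1, LOEJ (bit 1) must be 0, and bit 2,
	 * which the patch comment does not name, must also be 0.
	 */
	if (!(cdb[4] & 0x01) || (cdb[4] & 0x02) || (cdb[4] & 0x04))
		return false;

	return true;
}
```

With these checks, a CDB of 1b 01 00 00 01 00 is accepted, while one
with IMMED clear or LOEJ set is rejected.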

Re: [PATCH -next] serial: etraxfs-uart: Update gpiod API

2015-07-23 Thread Greg Kroah-Hartman

On Tue, Jul 21, 2015 at 01:34:52PM -0700, Guenter Roeck wrote:
> Commit b17d1bf16cc7 ("gpio: make flags mandatory for gpiod_get functions")
> makes the flags argument to devm_gpiod_get_optional mandatory but does not
> update all users. This results in the following build error.
> 
> drivers/tty/serial/etraxfs-uart.c:933:16: error:
>   too few arguments to function ‘devm_gpiod_get_optional’
> 
> Fixes: b17d1bf16cc7 ("gpio: make flags mandatory for gpiod_get functions")

This patch isn't in Linus's tree, so whatever tree this commit is in,
needs to also take this fix.

thanks,

greg k-h

Re: [PATCH v6 1/2] serial_core: add pci uart early console support

2015-07-23 Thread Greg Kroah-Hartman

On Mon, Jun 08, 2015 at 11:17:11AM -0700, Bin Gao wrote:
> On some Intel Atom SoCs, the legacy IO port UART(0x3F8) is not available.
> Instead, a 8250 compatible PCI uart can be used as early console.
> This patch adds pci support to the 8250 early console driver uart8250.
> For example, to enable pci uart(00:21.3) as early console on these
> platforms, append the following line to the kernel command line
> (assume baud rate is 115200):
> earlyprintk=uart8250,pci32,0:24.2,115200n8
> 
> Signed-off-by: Bin Gao 
> ---
> Changes in v6: None
> Changes in v5:
>  - updated Documentation/kernel-parameters.txt.
>  - moved earlyprintk= to patch 2/2 (requires x86 people's review).
>  - rolled back to simple_strto* APIs.
>  - separate pci/pci32 format description.
>  - minor error and debug message changes.
>  - if/else statements in uart_parse_earlycon() were refactored to avoid
>logic steering locals.
> Changes in v4:
>  - moved PCI_EARLY definition from arch/x86/Kconfig to drivers/pci/Kconfig
>  - made earlycon= for all archs but earlyprintk= only for x86 by changing
>"#ifdef #else #endif" to "#if #endif".
> Changes in v3:
>  - introduced CONFIG_EARLY_PCI to protect pci codes in serial_core.c.
>  - added earlyprintk= as an alias for earlycon= to keep x86 compatibility.
> changes in v2:
>  - added the second patch (2/2) to remove existed pci early console support
>from arch/x86/kernel/early_printk.c.
>  Documentation/kernel-parameters.txt |  15 +
>  arch/x86/Kconfig|   1 +
>  drivers/pci/Kconfig |  11 
>  drivers/tty/serial/serial_core.c| 106 
> +++-
>  4 files changed, 132 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 61ab162..598606e 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -969,6 +969,16 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   same format described for "console=ttyS"; if
>   unspecified, the h/w is not initialized.
>  
> + uart8250,pci,[,options]
> + uart8250,pci32,[,options]
> + Start an early, polled-mode console on the 8250/16550
> + UART at the specified PCI device (bus:dev.func).
> + The io or memory mmaped register width is either 8-bit
> + (pci) or 32-bit (pci32).
> + 'options' are specified in the same format described
> + for "console=ttyS"; if unspecified, the h/w is not
> + initialized.
> +
>   pl011,
>   Start an early, polled-mode console on a pl011 serial
>   port at the specified address. The pl011 serial port
> @@ -1009,6 +1019,8 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   earlyprintk=serial[,0x...[,baudrate]]
>   earlyprintk=ttySn[,baudrate]
>   earlyprintk=dbgp[debugController#]
> + earlyprintk=uart8250,pci,[,options]
> + earlyprintk=uart8250,pci32,[,options]
>  
>   earlyprintk is useful when the kernel crashes before
>   the normal console is initialized. It is not enabled by
> @@ -1037,6 +1049,9 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>  
>   The xen output can only be used by Xen PV guests.
>  
> + The uart8250,pci and uart8250,pci32 output share the
> + same definition that is in earlycon= section.
> +
>   edac_report=[HW,EDAC] Control how to report EDAC event
>   Format: {"on" | "off" | "force"}
>   on: enable EDAC to report H/W event. May be overridden
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 226d569..bdedd61 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -143,6 +143,7 @@ config X86
>   select ACPI_LEGACY_TABLES_LOOKUP if ACPI
>   select X86_FEATURE_NAMES if PROC_FS
>   select SRCU
> + select PCI_EARLY if PCI
>  
>  config INSTRUCTION_DECODER
>   def_bool y
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 7a8f1c5..4f0f055 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -114,4 +114,15 @@ config PCI_LABEL
>   def_bool y if (DMI || ACPI)
>   select NLS
>  
> +config PCI_EARLY
> + bool "Early PCI access"
> + depends on PCI
> + default n

Default is always 'n' so this isn't needed here.


> + help
> +   This option indicates that a group of APIs are available (in
> +   asm/pci-direct.h) so the kernel can access pci config registers
> +   before the PCI subsystem is initialized. Any arch that supports
> +   early pci APIs should

Re: round_up integer underflow

2015-07-23 Thread Alexey Dobriyan

On Thu, Jul 23, 2015 at 11:10:30AM -0700, Jörn Engel wrote:
> On Thu, Jul 23, 2015 at 11:02:55AM -0700, Jörn Engel wrote:
> > Spencer spotted something nasty in the round_up macro.  We were
> > wondering why round_up() worked differently from ALIGN.  The only real
> > difference between the two patterns is overflow behaviour.  And both
> > versions are buggy when used for signed integer types, round_up will
> > underflow on INT_MIN, ALIGN will overflow on INT_MAX.  Since signed
> > integer under/overflows are undefined, we might have subtle bugs lurking
> > in the kernel.
> > 
> > This example program produces a warning when compiling with gcc -O2 or
> > higher.  Clang doesn't warn.  Compiled code behaves correctly with both
> > compilers, but that is largely luck and the same compilers may create
> > wrong behaviour if the surrounding code changes.
> > 
> > #include <stdio.h>
> > #include <limits.h>
> > 
> > #define __round_mask(x, y) ((__typeof__(x))((y)-1))
> > #define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
> > #define round_down(x, y) ((x) & ~__round_mask(x, y))
> > 
> > int main(void)
> > {
> > 	int i, r = 8;
> > 
> > 	for (i = INT_MIN; i; i++) {
> > 		printf("%2x: %2x %2x\n", i, round_down(i, r), round_up(i, r));
> > 	}
> > 	return 0;
> > }
> > 
> > I don't have a good answer yet.  We could make round_up check for
> > negative numbers, but I would prefer unconditional code that optimizes
> > down to nothing.  We could rewrite it in assembly, once for each
> > architecture.
> > 
> > Does anyone have better ideas?

You can fix overflow issues but the ALIGN(INT_MAX, a) creating much
smaller value is probably a bug anyway. It should BUG_ON or something
(yes, I'm aware of recent memo).

> #define round_up(x, y) (__typeof__(x)(__round_up((unsigned __typeof__(x)(x)), 
> (y
^^^

If only... :-(

> I.e. cast x to the matching unsigned type where overflows are
> well-defined, do the rounding, then cast the result back to the original
> type.
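
The cast-to-unsigned idea quoted above can be sketched for a single
fixed type.  The helper round_up_int() is hypothetical and handles only
int, which sidesteps the missing "unsigned __typeof__" construct.  It
assumes y is a power of two, as round_up() does.

```c
#include <assert.h>

/*
 * Hypothetical int-only helper: round x up to a multiple of y, where y
 * is a power of two.  The arithmetic runs in unsigned int, where
 * wrap-around is well-defined, and the result is converted back.
 */
static int round_up_int(int x, unsigned int y)
{
	unsigned int mask = y - 1;
	unsigned int ux = (unsigned int)x;	/* well-defined conversion */
	unsigned int r;

	assert(y != 0 && (y & mask) == 0);	/* y must be a power of two */

	r = ((ux - 1) | mask) + 1;		/* wraps modulo 2^N, no UB */

	/*
	 * Converting a value above INT_MAX back to int is
	 * implementation-defined in ISO C; GCC and Clang reduce it
	 * modulo 2^N, which gives the expected two's complement result.
	 */
	return (int)r;
}
```

With that conversion behaviour, round_up_int(-5, 8) yields 0 and an
input of INT_MIN comes back unchanged, with no signed overflow in
between.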

[PATCH] target: fix crash in cmd tracing when cmd didn't match a LUN

2015-07-23 Thread Spencer Baugh

From: Alexei Potashnik 

If command didn't match a LUN and we're sending check condition, the
target_cmd_complete ftrace point will crash because it assumes that
cmd->t_task_cdb has been set.

The fix will temporarily set t_task_cdb to the se_cmd buffer
and copy first 6 bytes of cdb in there as soon as possible.
At a later point t_task_cdb is reset to the correct buffer,
but until then traces and printks don't cause a crash.
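
The idea can be illustrated with a minimal userspace model.  The names
cmd_model, inline_cdb, early_cdb_setup() and traced_opcode() are
hypothetical stand-ins for se_cmd, __t_task_cdb, the new code in the
lookup function and a trace point.  INLINE_CDB_LEN is an assumed size,
not taken from this patch.

```c
#include <assert.h>
#include <string.h>

#define INLINE_CDB_LEN 32	/* assumed size of the inline buffer */

/* Hypothetical model of the two se_cmd fields involved. */
struct cmd_model {
	unsigned char *t_task_cdb;			/* read by tracing */
	unsigned char inline_cdb[INLINE_CDB_LEN];	/* models __t_task_cdb */
};

/*
 * Point t_task_cdb at the inline buffer and save the first 6 bytes of
 * the CDB, so the pointer is valid before the LUN lookup can fail.
 */
static void early_cdb_setup(struct cmd_model *cmd, const unsigned char *cdb)
{
	cmd->t_task_cdb = cmd->inline_cdb;
	memcpy(cmd->t_task_cdb, cdb, 6);
}

/* Models a trace point that reads the opcode through t_task_cdb. */
static unsigned char traced_opcode(const struct cmd_model *cmd)
{
	assert(cmd->t_task_cdb != NULL);
	return cmd->t_task_cdb[0];
}
```

Once early_cdb_setup() has run, traced_opcode() has a valid pointer to
read even if the command never reaches the later CDB setup.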

Signed-off-by: Alexei Potashnik 
Signed-off-by: Spencer Baugh 
---
 drivers/target/iscsi/iscsi_target.c|  4 ++--
 drivers/target/target_core_device.c| 11 +--
 drivers/target/target_core_transport.c |  9 +
 include/target/target_core_fabric.h|  2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target.c 
b/drivers/target/iscsi/iscsi_target.c
index f615d75..98899f1 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -1003,8 +1003,8 @@ int iscsit_setup_scsi_cmd(struct iscsi_conn *conn, struct 
iscsi_cmd *cmd,
 
target_get_sess_cmd(&cmd->se_cmd, true);
 
-   cmd->sense_reason = transport_lookup_cmd_lun(&cmd->se_cmd,
-                                                scsilun_to_int(&hdr->lun));
+   cmd->sense_reason = transport_lookup_cmd_lun_cdb(&cmd->se_cmd,
+                           scsilun_to_int(&hdr->lun), hdr->cdb);
if (cmd->sense_reason)
goto attach_cmd;
 
diff --git a/drivers/target/target_core_device.c 
b/drivers/target/target_core_device.c
index c4a8db6..acf84cf 100644
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -56,13 +56,20 @@ static struct se_hba *lun0_hba;
 struct se_device *g_lun0_dev;
 
 sense_reason_t
-transport_lookup_cmd_lun(struct se_cmd *se_cmd, u64 unpacked_lun)
+transport_lookup_cmd_lun_cdb(struct se_cmd *se_cmd, u64 unpacked_lun, unsigned 
char *cdb)
 {
struct se_lun *se_lun = NULL;
struct se_session *se_sess = se_cmd->se_sess;
struct se_node_acl *nacl = se_sess->se_node_acl;
struct se_dev_entry *deve;
 
+   /* Temporarily set t_task_cdb to the se_cmd buffer and save a portion
+* of cdb in there (fabrics must provide at least 6 bytes). t_task_cdb
+* will be correctly replaced in target_setup_cmd_from_cdb. Until then
+* tracing and printks can access t_task_cdb without causing a crash. */
+   se_cmd->t_task_cdb = se_cmd->__t_task_cdb;
+   memcpy(se_cmd->t_task_cdb, cdb, 6);
+
rcu_read_lock();
deve = target_nacl_find_deve(nacl, unpacked_lun);
if (deve) {
@@ -142,7 +149,7 @@ transport_lookup_cmd_lun(struct se_cmd *se_cmd, u64 
unpacked_lun)
 
return 0;
 }
-EXPORT_SYMBOL(transport_lookup_cmd_lun);
+EXPORT_SYMBOL(transport_lookup_cmd_lun_cdb);
 
 int transport_lookup_tmr_lun(struct se_cmd *se_cmd, u64 unpacked_lun)
 {
diff --git a/drivers/target/target_core_transport.c 
b/drivers/target/target_core_transport.c
index f6626bb..1f761a3 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -1210,15 +1210,16 @@ target_setup_cmd_from_cdb(struct se_cmd *cmd, unsigned 
char *cdb)
 * setup the pointer from __t_task_cdb to t_task_cdb.
 */
if (scsi_command_size(cdb) > sizeof(cmd->__t_task_cdb)) {
-   cmd->t_task_cdb = kzalloc(scsi_command_size(cdb),
-   GFP_KERNEL);
-   if (!cmd->t_task_cdb) {
+   unsigned char *ptr = kzalloc(scsi_command_size(cdb),
+GFP_KERNEL);
+   if (!ptr) {
pr_err("Unable to allocate cmd->t_task_cdb"
" %u > sizeof(cmd->__t_task_cdb): %lu ops\n",
scsi_command_size(cdb),
(unsigned long)sizeof(cmd->__t_task_cdb));
return TCM_OUT_OF_RESOURCES;
}
+   cmd->t_task_cdb = ptr;
} else
cmd->t_task_cdb = &cmd->__t_task_cdb[0];
/*
@@ -1404,7 +1405,7 @@ int target_submit_cmd_map_sgls(struct se_cmd *se_cmd, 
struct se_session *se_sess
/*
 * Locate se_lun pointer and attach it to struct se_cmd
 */
-   rc = transport_lookup_cmd_lun(se_cmd, unpacked_lun);
+   rc = transport_lookup_cmd_lun_cdb(se_cmd, unpacked_lun, cdb);
if (rc) {
transport_send_check_condition_and_sense(se_cmd, rc, 0);
target_put_sess_cmd(se_cmd);
diff --git a/include/target/target_core_fabric.h 
b/include/target/target_core_fabric.h
index 18afef9..bfa6368 100644
--- a/include/target/target_core_fabric.h
+++ b/include/target/target_core_fabric.h
@@ -116,7 +116,7 @@ voidtransport_deregister_session(struct se_session 
*);
 void   transport_init_se_cmd(struct se_cmd *,
const struct target_core_fabric_ops *,

Re: [PATCH v9 6/7] staging: add simple-fpga-bus

2015-07-23 Thread Jason Gunthorpe

On Thu, Jul 23, 2015 at 02:55:52PM -0700, Moritz Fischer wrote:
> Hi Alan,
> 
> I saw that your socfpga driver doesn't support the partial reconfig
> use case (not a big deal).
> What I currently do for Zynq is if I'm doing a non-partial reconfig is
> that I disable input
> level shifters and assert *all* resets while reprogramming in my FPGA
> manager .write_init() and .write_complete().

I do this as well, but it is a bit more complex.. FPGA specific code
has to run around and ensure all DMA is shut off, then we need to make
sure no CPU issued AXI transactions can happen, then we can tear down
the FPGA side.

If the FPGA is torn down while an AXI op is in progress, things go
sideways, we have to work to prevent that :)

This happens almost for free, I use DT and the device model to
disconnect the drivers. The drivers are careful to synchronously fence
off in-progress DMA. Then drop the DT nodes associated with the
FPGA, finally the actual FPGA cells can be reset.

> In a partial reconfiguration situation, would I have separate
> simple-fpga buses for each of the parts that I swap out, each with
> its own reset and bitfile attached?

I'd think of partial reconfiguration as another nested FPGA. The
resets and so forth could be attached to soft controllers in the
unswappable part of the FPGA.

DT nodes have to surround it in some way...

Jason

Re: [PATCH] tty: serial: Fix typo in the comment

2015-07-23 Thread Greg KH

On Tue, Jun 09, 2015 at 03:11:36PM +0900, Hyuk Myeong wrote:
> This patch fixes a spelling typo in the comment in synclink.c and
> synclinkmp.c.
> 
> Signed-off-by: Hyuk Myeong 
> ---
>  drivers/tty/synclink.c   | 4 ++--
>  drivers/tty/synclinkmp.c | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/tty/synclink.c b/drivers/tty/synclink.c
> index b799170..3bbb07c 100644
> --- a/drivers/tty/synclink.c
> +++ b/drivers/tty/synclink.c
> @@ -7468,9 +7468,9 @@ static bool mgsl_memory_test( struct mgsl_struct *info
> )

Patch is line-wrapped and does not apply :(


Re: [PATCH 1/1] Avoid usb reset crashes by making tty_io cdevs truly dynamic

2015-07-23 Thread Greg KH

On Tue, May 19, 2015 at 04:06:53PM +0100, Richard Watts wrote:
> Avoid usb reset crashes by making tty_io cdevs truly dynamic

What USB reset crashes are you referring to here?

> 
> Signed-off-by: Richard Watts 
> Reported-by: Duncan Mackintosh 
> ---
>  drivers/tty/tty_io.c   | 24 
>  include/linux/tty_driver.h |  2 +-
>  2 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> index e569546..699cf20 100644
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -3168,9 +3168,12 @@ static int tty_cdev_add(struct tty_driver *driver,
> dev_t dev,
>   unsigned int index, unsigned int count)
>  {
>   /* init here, since reused cdevs cause crashes */
> - cdev_init(&driver->cdevs[index], &tty_fops);
> - driver->cdevs[index].owner = driver->owner;
> - return cdev_add(&driver->cdevs[index], dev, count);
> + driver->cdevs[index] = cdev_alloc();
> + if (!driver->cdevs[index])
> + return -ENOMEM;
> + cdev_init(driver->cdevs[index], &tty_fops);
> + driver->cdevs[index]->owner = driver->owner;
> + return cdev_add(driver->cdevs[index], dev, count);
>  }
> 
>  /**
> @@ -3276,8 +3279,10 @@ struct device *tty_register_device_attr(struct
> tty_driver *driver,
> 
>  error:
>   put_device(dev);
> - if (cdev)
> - cdev_del(&driver->cdevs[index]);
> + if (cdev) {
> + cdev_del(driver->cdevs[index]);
> + driver->cdevs[index] = NULL;
> + }
>   return ERR_PTR(retval);
>  }
>  EXPORT_SYMBOL_GPL(tty_register_device_attr);
> @@ -3297,8 +3302,10 @@ void tty_unregister_device(struct tty_driver *driver,
> unsigned index)
>  {
>   device_destroy(tty_class,
>   MKDEV(driver->major, driver->minor_start) + index);
> - if (!(driver->flags & TTY_DRIVER_DYNAMIC_ALLOC))
> - cdev_del(>cdevs[index]);
> + if (!(driver->flags & TTY_DRIVER_DYNAMIC_ALLOC)) {
> + cdev_del(driver->cdevs[index]);
> + driver->cdevs[index] = NULL;
> + }
>  }
>  EXPORT_SYMBOL(tty_unregister_device);
> 
> @@ -3363,6 +3370,7 @@ err_free_all:
>   kfree(driver->ports);
>   kfree(driver->ttys);
>   kfree(driver->termios);
> + kfree(driver->cdevs);
>   kfree(driver);
>   return ERR_PTR(err);
>  }
> @@ -3391,7 +3399,7 @@ static void destruct_tty_driver(struct kref *kref)
>   }
>   proc_tty_unregister_driver(driver);
>   if (driver->flags & TTY_DRIVER_DYNAMIC_ALLOC)
> - cdev_del(>cdevs[0]);
> + cdev_del(driver->cdevs[0]);
>   }
>   kfree(driver->cdevs);
>   kfree(driver->ports);
> diff --git a/include/linux/tty_driver.h b/include/linux/tty_driver.h
> index 92e337c..1610524 100644
> --- a/include/linux/tty_driver.h
> +++ b/include/linux/tty_driver.h
> @@ -296,7 +296,7 @@ struct tty_operations {
>  struct tty_driver {
>   int magic;  /* magic number for this structure */
>   struct kref kref;   /* Reference management */
> - struct cdev *cdevs;
> + struct cdev **cdevs;
>   struct module   *owner;
>   const char  *driver_name;
>   const char  *name;

I don't understand what bug this patch is trying to solve, care to help
describe it better?

thanks,

greg k-h

Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-23 Thread David Rientjes

On Thu, 23 Jul 2015, Spencer Baugh wrote:

> From: Joern Engel 
> 
> ~150ms scheduler latency for both observed in the wild.
> 
> Signed-off-by: Joern Engel 
> Signed-off-by: Spencer Baugh 
> ---
>  mm/hugetlb.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a8c3087..2eb6919 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1836,6 +1836,7 @@ static unsigned long set_max_huge_pages(struct hstate 
> *h, unsigned long count,
>   ret = alloc_fresh_gigantic_page(h, nodes_allowed);
>   else
>   ret = alloc_fresh_huge_page(h, nodes_allowed);
> + cond_resched();
>   spin_lock(_lock);
>   if (!ret)
>   goto out;

This is wrong, you'd want to do any cond_resched() before the page 
allocation to avoid racing with an update to h->nr_huge_pages or 
h->surplus_huge_pages while hugetlb_lock was dropped that would result in 
the page having been uselessly allocated.

> @@ -3521,6 +3522,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   spin_unlock(ptl);
>   ret = hugetlb_fault(mm, vma, vaddr,
>   (flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> + cond_resched();
>   if (!(ret & VM_FAULT_ERROR))
>   continue;
>  

This is almost certainly the wrong placement as well since it's inserted 
inside a conditional inside a while loop and there's no reason to 
hugetlb_fault(), schedule, and then check the return value.  You need to 
insert your cond_resched()'s in legitimate places.
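
The ordering asked for in the first review comment can be sketched in userspace. This is only a model under stated assumptions: pool_model, toy_alloc and set_max_pages_model are hypothetical names, a pthread mutex stands in for hugetlb_lock, and sched_yield() stands in for cond_resched(). The reschedule point sits right after the lock is dropped and before the allocation, so nothing but the allocation itself separates "page obtained" from "lock retaken and counters rechecked".

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical stand-in for the hstate counters guarded by hugetlb_lock. */
struct pool_model {
	pthread_mutex_t lock;
	unsigned long nr_pages;
	bool (*alloc)(void);	/* fake page allocator, called unlocked */
};

/* Toy allocator that can hand out three pages and then fails. */
static unsigned long toy_budget = 3;

bool toy_alloc(void)
{
	if (toy_budget == 0)
		return false;
	toy_budget--;
	return true;
}

/* Grow the pool towards count; returns the resulting pool size. */
unsigned long set_max_pages_model(struct pool_model *p, unsigned long count)
{
	unsigned long ret;

	pthread_mutex_lock(&p->lock);
	while (count > p->nr_pages) {
		bool ok;

		pthread_mutex_unlock(&p->lock);
		sched_yield();		/* reschedule point first ... */
		ok = p->alloc();	/* ... then the allocation */
		pthread_mutex_lock(&p->lock);
		if (!ok)
			break;
		p->nr_pages++;		/* the loop condition is rechecked under the lock */
	}
	ret = p->nr_pages;
	pthread_mutex_unlock(&p->lock);
	return ret;
}
```

Placing the yield after the allocation instead would lengthen the time the freshly allocated page is held while the counters may still change, which is the window the review comment objects to.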

Re: [RFC PATCH] tools lib traceevent: Allow setting an alternative symbol resolver

2015-07-23 Thread Steven Rostedt

On Thu, 23 Jul 2015 18:58:36 -0300
Arnaldo Carvalho de Melo  wrote:

> Em Thu, Jul 23, 2015 at 06:52:46PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Thu, Jul 23, 2015 at 05:35:24PM -0400, Steven Rostedt escreveu:
> > > On Thu, 23 Jul 2015 18:25:36 -0300
> > > Arnaldo Carvalho de Melo  wrote:
> > > > +   if (resolver == NULL) {
> > > > +   errno = ENOMEM;
> > > 
> > > > Why set errno, won't a failed malloc set it for us?
> > 
> > Humm, I thought about that, I've read malloc's man page and it doesn't
> > mention that... in the return section, but later, in NOTES, it says
> > UNIX98 requires that and glibc does it, so I'm ditching this...
> 
> > > Also I wonder if we should add a way to clear the resolver. That is,
> > > you want to use the default resolver?
> > 
> > I am adding a reset_function_resolver(pevent);
> >  
> > > Not really a necessity, as I don't see any current programs using it,
> > > but it would complete the interface.
> 
> One more try:

Third time's a charm, or was this the fourth?

Reviewed-by: Steven Rostedt 

-- Steve

> 
> commit 212a2417baaa89168cbe3112fe7c8efaddee28b8
> Author: Arnaldo Carvalho de Melo 
> Date:   Wed Jul 22 12:36:55 2015 -0300
> 
> tools lib traceevent: Allow setting an alternative symbol resolver
> 
> The perf tools have a symbol resolver that includes solving kernel
> symbols using either kallsyms or ELF symtabs, and it also is using
> libtraceevent to format the trace events fields, including via
> subsystem specific plugins, like the "timer" one.
> 
> To solve fields like "timer:hrtimer_start"'s "function", libtraceevent
> needs a way to map from its value to a function name and addr.
> 
> This patch provides a way for tools that already have symbol resolving
> facilities to ask libtraceevent to use it when needing to resolve
> kernel symbols.
> 
> Acked-by: David Ahern 
> Cc: Adrian Hunter 
> Cc: Borislav Petkov 
> Cc: Frederic Weisbecker 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Stephane Eranian 
> Cc: Steven Rostedt 
> Link: http://lkml.kernel.org/n/tip-fdx1fazols17w5py26ia3...@git.kernel.org
> Signed-off-by: Arnaldo Carvalho de Melo 
> 

Re: Dealing with the NMI mess

2015-07-23 Thread Linus Torvalds

On Thu, Jul 23, 2015 at 2:59 PM, Andy Lutomirski  wrote:
> OK, new proposal:
>
> In do_debug, if we trip an instruction breakpoint while
> !user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
> *that breakpoint*.

Ack.  The more targeted we can make this while still guaranteeing
forward progress, the better. So that sounds really good.

  Linus

Re: Dealing with the NMI mess

2015-07-23 Thread Andy Lutomirski

On Thu, Jul 23, 2015 at 2:54 PM, Linus Torvalds
 wrote:
> On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski  wrote:
>>
>> Or we just re-enable them on the way out of NMI (i.e. the very last
>> thing we do in the NMI handler).  I don't want to break regular
>> userspace gdb when perf is running.
>
> I'd really prefer it if we don't touch NMI code in those kinds of
> ways. The NMI code is fragile as hell. All the problems we have with
> it is exactly due to "where is the boundary" issues.
>
> That's why I *don't* want NMI code to do magic crap. Anything that
> says "disable this during this magic window" is broken. The problems
> we've had are exactly about atomicity of the entry/exit conditions,
> and there is no really good way to get them right.
>
> I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
> because it's so *obvious* that it's not a boundary condition.
>
> I dislike the "disable and re-enable dr7 in the NMI handler" exactly
> because it smells like "we can only handle faults in _this_ region".
> It may be true, but it's also what I want us to get away from. I'd
> much rather have the "big picture" be that we can take faults anywhere
> at all (*), and that none of the core code really cares. Then we "fix
> up" user space.

OK, new proposal:

In do_debug, if we trip an instruction breakpoint while
!user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
*that breakpoint*.

Why?  It only looks at hardware state (dr6 and dr7), and it can't
break gdb, because gdb can't set a breakpoint that will cause this
problem.

All the other variants of this either need cached state or break gdb
watchpoints on stack variables with perf running.
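
A minimal userspace model of that rule, assuming the standard x86 debug register layout (DR6 bits 0-3 report which breakpoint fired, DR7 bits 2n and 2n+1 are the local/global enables for breakpoint n, and the R/W field at bits 16+4n is 00 for instruction execution). The function name and the MODEL_EFLAGS_IF constant are made up for illustration; this is not the do_debug code, just the decision it would have to make:

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_EFLAGS_IF	(UINT64_C(1) << 9)	/* same bit as X86_EFLAGS_IF */

/* R/W field of breakpoint n in DR7; 0 means "break on instruction execution". */
static unsigned int dr7_rw(uint64_t dr7, int n)
{
	return (dr7 >> (16 + 4 * n)) & 0x3;
}

/*
 * Return the DR7 value to load after a debug exception.  Only the
 * instruction breakpoints that actually fired are disarmed, and only
 * when the trap hit kernel mode with interrupts off.  Data watchpoints
 * and breakpoints that did not fire are left alone.
 */
uint64_t disarm_tripped_insn_breakpoints(uint64_t dr6, uint64_t dr7,
					 uint64_t flags, bool user_mode)
{
	int n;

	if (user_mode || (flags & MODEL_EFLAGS_IF))
		return dr7;

	for (n = 0; n < 4; n++) {
		bool tripped = (dr6 >> n) & 1;

		if (tripped && dr7_rw(dr7, n) == 0)
			dr7 &= ~(UINT64_C(3) << (2 * n));	/* clear Ln and Gn */
	}
	return dr7;
}
```

Because the inputs are only dr6, dr7 and the saved flags, no cached software state is needed, which matches the reasoning given above.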

--Andy

>
>Linus
>
> (*) And yes, sysenter and not having a stack at all is very special,
> and I think we will *always* have to have that magical special case of
> the first few instructions there. But that's a separate hardware
> limitation we can't get around.



-- 
Andy Lutomirski
AMA Capital Management, LLC

Re: Dealing with the NMI mess

2015-07-23 Thread Linus Torvalds

On Thu, Jul 23, 2015 at 2:50 PM, Andy Lutomirski  wrote:
>
> What if we relax it slightly: "if the breakpoint happened during that
> interrupts-off region, I will clear all *kernel breakpoints* in %dr7
> to guarantee forward progress"?
>
> Watchpoints don't need RF to make forward progress, and, by leaving
> watchpoints alone, we avoid breaking gdb.

Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.

You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.

And that makes it really easy.

So yes, I agree. We only need to clear all kernel breakpoints.

So we don't even need that _TIF_USER_WORK_MASK thing, because user
space isn't setting kernel code breakpoints, it's just kgdb.

Sounds good to me.

Linus

Re: [RFC][PATCH] ipc: Use private shmem or hugetlbfs inodes for shm segments.

2015-07-23 Thread Paul Moore

On Thu, Jul 23, 2015 at 12:28 PM, Stephen Smalley  wrote:
> The shm implementation internally uses shmem or hugetlbfs inodes
> for shm segments.  As these inodes are never directly exposed to
> userspace and only accessed through the shm operations which are
> already hooked by security modules, mark the inodes with the
> S_PRIVATE flag so that inode security initialization and permission
> checking is skipped.
>
> This was motivated by the following lockdep warning:
> ===
> [ INFO: possible circular locking dependency detected ]
> 4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: GW
> ---
> httpd/1597 is trying to acquire lock:
> (>rwsem){+.}, at: [] shm_close+0x34/0x130
> (>mmap_sem){++}, at: [] SyS_shmdt+0x4b/0x180
>   [] lock_acquire+0xc7/0x270
>   [] __might_fault+0x7a/0xa0
>   [] filldir+0x9e/0x130
>   [] xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
>   [] xfs_readdir+0x1b4/0x330 [xfs]
>   [] xfs_file_readdir+0x2b/0x30 [xfs]
>   [] iterate_dir+0x97/0x130
>   [] SyS_getdents+0x91/0x120
>   [] entry_SYSCALL_64_fastpath+0x12/0x76
>   [] lock_acquire+0xc7/0x270
>   [] down_read_nested+0x57/0xa0
>   [] xfs_ilock+0x167/0x350 [xfs]
>   [] xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
>   [] xfs_attr_get+0xbd/0x190 [xfs]
>   [] xfs_xattr_get+0x3d/0x70 [xfs]
>   [] generic_getxattr+0x4f/0x70
>   [] inode_doinit_with_dentry+0x162/0x670
>   [] sb_finish_set_opts+0xd9/0x230
>   [] selinux_set_mnt_opts+0x35c/0x660
>   [] superblock_doinit+0x77/0xf0
>   [] delayed_superblock_init+0x10/0x20
>   [] iterate_supers+0xb3/0x110
>   [] selinux_complete_init+0x2f/0x40
>   [] security_load_policy+0x103/0x600
>   [] sel_write_load+0xc1/0x750
>   [] __vfs_write+0x37/0x100
>   [] vfs_write+0xa9/0x1a0
>   [] SyS_write+0x58/0xd0
>   [] entry_SYSCALL_64_fastpath+0x12/0x76
>   [] lock_acquire+0xc7/0x270
>   [] mutex_lock_nested+0x7f/0x3e0
>   [] inode_doinit_with_dentry+0xb9/0x670
>   [] selinux_d_instantiate+0x1c/0x20
>   [] security_d_instantiate+0x36/0x60
>   [] d_instantiate+0x54/0x70
>   [] __shmem_file_setup+0xdc/0x240
>   [] shmem_file_setup+0x10/0x20
>   [] newseg+0x290/0x3a0
>   [] ipcget+0x208/0x2d0
>   [] SyS_shmget+0x54/0x70
>   [] entry_SYSCALL_64_fastpath+0x12/0x76
>   [] __lock_acquire+0x1a78/0x1d00
>   [] lock_acquire+0xc7/0x270
>   [] down_write+0x5a/0xc0
>   [] shm_close+0x34/0x130
>   [] remove_vma+0x45/0x80
>   [] do_munmap+0x2b0/0x460
>   [] SyS_shmdt+0xb5/0x180
>   [] entry_SYSCALL_64_fastpath+0x12/0x76
> Chain exists of:
>   >rwsem --> _dir_ilock_class --> >mmap_sem
> Possible unsafe locking scenario:
>   CPU0CPU1
>   
>  lock(>mmap_sem);
>  lock(_dir_ilock_class);
>   lock(>mmap_sem);
>  lock(>rwsem);
> 1 lock held by httpd/1597:
> CPU: 7 PID: 1597 Comm: httpd Tainted: G W 4.2.0-0.rc3.git0.1.fc24.x86_64+
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Pla
>  6cb6fe9d 88019ff07c58 81868175
>  82aea390 88019ff07ca8 81105903
>  88019ff07c78 88019ff07d08 0001 8800b75108f0
> Call Trace:
> [] dump_stack+0x4c/0x65
> [] print_circular_bug+0x1e3/0x250
> [] __lock_acquire+0x1a78/0x1d00
> [] ? unlink_file_vma+0x33/0x60
> [] lock_acquire+0xc7/0x270
> [] ? shm_close+0x34/0x130
> [] down_write+0x5a/0xc0
> [] ? shm_close+0x34/0x130
> [] shm_close+0x34/0x130
> [] remove_vma+0x45/0x80
> [] do_munmap+0x2b0/0x460
> [] ? SyS_shmdt+0x4b/0x180
> [] SyS_shmdt+0xb5/0x180
> [] entry_SYSCALL_64_fastpath+0x12/0x76
>
> Reported-by: Morten Stevens 
> Signed-off-by: Stephen Smalley 
> ---
>  fs/hugetlbfs/inode.c | 2 ++
>  ipc/shm.c| 2 +-
>  mm/shmem.c   | 4 ++--
>  3 files changed, 5 insertions(+), 3 deletions(-)

Seems reasonable and fits with what we've been doing.

Acked-by: Paul Moore 

> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 0cf74df..973c24c 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -1010,6 +1010,8 @@ struct file *hugetlb_file_setup(const char *name, 
> size_t size,
> inode = hugetlbfs_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0);
> if (!inode)
> goto out_dentry;
> +   if (creat_flags == HUGETLB_SHMFS_INODE)
> +   inode->i_flags |= S_PRIVATE;
>
> file = ERR_PTR(-ENOMEM);
> if (hugetlb_reserve_pages(inode, 0,
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 06e5cf2..4aef24d 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -545,7 +545,7 @@ static int newseg(struct ipc_namespace *ns, struct 
> ipc_params *params)
> if  ((shmflg & SHM_NORESERVE) &&
>

Re: [PATCH v4 6/6] serial: 8250_dw: allow lower reference frequencies

2015-07-23 Thread Greg Kroah-Hartman

On Tue, Jul 14, 2015 at 06:12:03PM +0300, Andy Shevchenko wrote:
> We have a couple of standard but rarely used baud rates which are not supported
> by the 1.8432MHz reference frequency. Besides that, the user can potentially ask
> for any baud rate (via the BOTHER flag) and we currently don't fully support
> that. Since clk-fractional-divider has moved to using rational best
> approximation for the reference frequency, we may amend the driver to support
> whatever the user wants.
> 
> Signed-off-by: Andy Shevchenko 
> ---
>  drivers/tty/serial/8250/8250_dw.c | 4 
>  1 file changed, 4 deletions(-)

Acked-by: Greg Kroah-Hartman 

Re: [RFC PATCH] tools lib traceevent: Allow setting an alternative symbol resolver

2015-07-23 Thread Arnaldo Carvalho de Melo

Em Thu, Jul 23, 2015 at 06:52:46PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Thu, Jul 23, 2015 at 05:35:24PM -0400, Steven Rostedt escreveu:
> > On Thu, 23 Jul 2015 18:25:36 -0300
> > Arnaldo Carvalho de Melo  wrote:
> > > + if (resolver == NULL) {
> > > + errno = ENOMEM;
> > 
> > Why set errno, won't a failed malloc set it for us?
> 
> Humm, I thought about that, I've read malloc's man page and it doesn't
> mention that... in the return section, but later, in NOTES, it says
> UNIX98 requires that and glibc does it, so I'm ditching this...

> > Also I wonder if we should add a way to clear the resolver. That is,
> > you want to use the default resolver?
> 
> I am adding a reset_function_resolver(pevent);
>  
> > Not really a necessity, as I don't see any current programs using it,
> > but it would complete the interface.

One more try:

commit 212a2417baaa89168cbe3112fe7c8efaddee28b8
Author: Arnaldo Carvalho de Melo 
Date:   Wed Jul 22 12:36:55 2015 -0300

tools lib traceevent: Allow setting an alternative symbol resolver

The perf tools have a symbol resolver that includes solving kernel
symbols using either kallsyms or ELF symtabs, and it also is using
libtraceevent to format the trace events fields, including via
subsystem specific plugins, like the "timer" one.

To solve fields like "timer:hrtimer_start"'s "function", libtraceevent
needs a way to map from its value to a function name and addr.

This patch provides a way for tools that already have symbol resolving
facilities to ask libtraceevent to use it when needing to resolve
kernel symbols.

Acked-by: David Ahern 
Cc: Adrian Hunter 
Cc: Borislav Petkov 
Cc: Frederic Weisbecker 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: Steven Rostedt 
Link: http://lkml.kernel.org/n/tip-fdx1fazols17w5py26ia3...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 

diff --git a/tools/lib/traceevent/event-parse.c 
b/tools/lib/traceevent/event-parse.c
index cc25f059ab3d..fcd8a9e3d2e1 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -418,7 +418,7 @@ static int func_map_init(struct pevent *pevent)
 }
 
 static struct func_map *
-find_func(struct pevent *pevent, unsigned long long addr)
+__find_func(struct pevent *pevent, unsigned long long addr)
 {
struct func_map *func;
struct func_map key;
@@ -434,6 +434,71 @@ find_func(struct pevent *pevent, unsigned long long addr)
return func;
 }
 
+struct func_resolver {
+   pevent_func_resolver_t *func;
+   void   *priv;
+   struct func_mapmap;
+};
+
+/**
+ * pevent_set_function_resolver - set an alternative function resolver
+ * @pevent: handle for the pevent
+ * @resolver: function to be used
+ * @priv: resolver function private state.
+ *
+ * Some tools may have already a way to resolve kernel functions, allow them to
+ * keep using it instead of duplicating all the entries inside
+ * pevent->funclist.
+ */
+int pevent_set_function_resolver(struct pevent *pevent,
+pevent_func_resolver_t *func, void *priv)
+{
+   struct func_resolver *resolver = malloc(sizeof(*resolver));
+
+   if (resolver == NULL)
+   return -1;
+
+   resolver->func = func;
+   resolver->priv = priv;
+
+   free(pevent->func_resolver);
+   pevent->func_resolver = resolver;
+
+   return 0;
+}
+
+/**
+ * pevent_reset_function_resolver - reset alternative function resolver
+ * @pevent: handle for the pevent
+ *
+ * Stop using whatever alternative resolver was set, use the default
+ * one instead.
+ */
+void pevent_reset_function_resolver(struct pevent *pevent)
+{
+   free(pevent->func_resolver);
+   pevent->func_resolver = NULL;
+}
+
+static struct func_map *
+find_func(struct pevent *pevent, unsigned long long addr)
+{
+   struct func_map *map;
+
+   if (!pevent->func_resolver)
+   return __find_func(pevent, addr);
+
+   map = >func_resolver->map;
+   map->mod  = NULL;
+   map->addr = addr;
+   map->func = pevent->func_resolver->func(pevent->func_resolver->priv,
+   >addr, >mod);
+   if (map->func == NULL)
+   return NULL;
+
+   return map;
+}
+
 /**
  * pevent_find_function - find a function by a given address
  * @pevent: handle for the pevent
@@ -6564,6 +6629,7 @@ void pevent_free(struct pevent *pevent)
free(pevent->trace_clock);
free(pevent->events);
free(pevent->sort_events);
+   free(pevent->func_resolver);
 
free(pevent);
 }
diff --git a/tools/lib/traceevent/event-parse.h 
b/tools/lib/traceevent/event-parse.h
index 063b1971eb35..204befb05a17 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -453,6 +453,10 @@ struct cmdline_list;
 struct func_map;
 struct func_list;
 struct
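
How a tool might plug its own symbol lookup into this kind of interface can be sketched standalone. Everything below is a hypothetical model, not the libtraceevent API: the *_model and toy_* names are invented, and the callback shape is only inferred from the call site in find_func() above (private pointer in, address and module passed by pointer, function name returned).

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed callback shape: returns the name, may adjust *addrp and *modp. */
typedef const char *(resolver_fn)(void *priv, unsigned long long *addrp,
				  const char **modp);

struct func_map_model {
	unsigned long long addr;
	const char *func;
	const char *mod;
};

struct resolver_model {
	resolver_fn *func;
	void *priv;
	struct func_map_model map;	/* scratch slot reused on every lookup */
};

struct handle_model {
	struct resolver_model *resolver;
};

int set_resolver(struct handle_model *h, resolver_fn *func, void *priv)
{
	struct resolver_model *r = malloc(sizeof(*r));

	if (r == NULL)
		return -1;
	r->func = func;
	r->priv = priv;
	free(h->resolver);	/* replacing an earlier resolver is allowed */
	h->resolver = r;
	return 0;
}

void reset_resolver(struct handle_model *h)
{
	free(h->resolver);
	h->resolver = NULL;
}

const struct func_map_model *find_func_model(struct handle_model *h,
					     unsigned long long addr)
{
	struct func_map_model *map;

	if (!h->resolver)
		return NULL;	/* the patch falls back to its built-in table here */
	map = &h->resolver->map;
	map->mod = NULL;
	map->addr = addr;
	map->func = h->resolver->func(h->resolver->priv, &map->addr, &map->mod);
	return map->func ? map : NULL;
}

/* Toy resolver: priv points at a small table of symbol start addresses. */
struct toy_sym { unsigned long long start; const char *name; };
struct toy_table { const struct toy_sym *syms; size_t n; };

const char *toy_resolve(void *priv, unsigned long long *addrp,
			const char **modp)
{
	const struct toy_table *t = priv;
	const struct toy_sym *best = NULL;
	size_t i;

	(void)modp;	/* the toy table has no modules */
	for (i = 0; i < t->n; i++)
		if (t->syms[i].start <= *addrp &&
		    (!best || t->syms[i].start > best->start))
			best = &t->syms[i];
	if (!best)
		return NULL;
	*addrp = best->start;	/* report the symbol's start address */
	return best->name;
}
```

The scratch map inside the resolver means each lookup overwrites the previous result, so a caller of this sketch would need to copy anything it wants to keep across lookups.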

[PATCH] e1000: make eeprom read/write scheduler friendly

2015-07-23 Thread Spencer Baugh

From: Joern Engel 

Code was responsible for ~150ms scheduler latencies.

Signed-off-by: Joern Engel 
Signed-off-by: Spencer Baugh 
---
 drivers/net/ethernet/intel/e1000/e1000_hw.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_hw.c 
b/drivers/net/ethernet/intel/e1000/e1000_hw.c
index 45c8c864..e74e9dd 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_hw.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_hw.c
@@ -106,7 +106,7 @@ u16 
e1000_igp_cable_length_table[IGP01E1000_AGC_LENGTH_TABLE_SIZE] = {
120, 120
 };
 
-static DEFINE_SPINLOCK(e1000_eeprom_lock);
+static DEFINE_MUTEX(e1000_eeprom_lock);
 static DEFINE_SPINLOCK(e1000_phy_lock);
 
 /**
@@ -3882,9 +3882,9 @@ static s32 e1000_spi_eeprom_ready(struct e1000_hw *hw)
 s32 e1000_read_eeprom(struct e1000_hw *hw, u16 offset, u16 words, u16 *data)
 {
s32 ret;
-   spin_lock(_eeprom_lock);
+   mutex_lock(_eeprom_lock);
ret = e1000_do_read_eeprom(hw, offset, words, data);
-   spin_unlock(_eeprom_lock);
+   mutex_unlock(_eeprom_lock);
return ret;
 }
 
@@ -3972,6 +3972,7 @@ static s32 e1000_do_read_eeprom(struct e1000_hw *hw, u16 
offset, u16 words,
 */
data[i] = e1000_shift_in_ee_bits(hw, 16);
e1000_standby_eeprom(hw);
+   cond_resched();
}
}
 
@@ -4056,9 +4057,9 @@ s32 e1000_update_eeprom_checksum(struct e1000_hw *hw)
 s32 e1000_write_eeprom(struct e1000_hw *hw, u16 offset, u16 words, u16 *data)
 {
s32 ret;
-   spin_lock(_eeprom_lock);
+   mutex_lock(_eeprom_lock);
ret = e1000_do_write_eeprom(hw, offset, words, data);
-   spin_unlock(_eeprom_lock);
+   mutex_unlock(_eeprom_lock);
return ret;
 }
 
@@ -4124,6 +4125,7 @@ static s32 e1000_write_eeprom_spi(struct e1000_hw *hw, 
u16 offset, u16 words,
return -E1000_ERR_EEPROM;
 
e1000_standby_eeprom(hw);
+   cond_resched();
 
/*  Send the WRITE ENABLE command (8 bit opcode )  */
e1000_shift_out_ee_bits(hw, EEPROM_WREN_OPCODE_SPI,
@@ -4232,6 +4234,7 @@ static s32 e1000_write_eeprom_microwire(struct e1000_hw 
*hw, u16 offset,
 
/* Recover from write */
e1000_standby_eeprom(hw);
+   cond_resched();
 
words_written++;
}
-- 
2.5.0.rc3
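
The pattern in this patch can be shown with a userspace analogy. The assumptions are stated up front: fake_eeprom, do_read_words_model and read_words_model are invented names, a pthread mutex plays the role of the new mutex, and sched_yield() plays the role of cond_resched(). The switch from spinlock to mutex matters because a reschedule point is only legal while holding a lock that permits sleeping; the yield then goes after each word so the long loop is broken into short pieces.

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_EEPROM_WORDS 8

static pthread_mutex_t eeprom_lock_model = PTHREAD_MUTEX_INITIALIZER;

/* Pretend device contents; a real read is slow bit-banging per word. */
static const uint16_t fake_eeprom[FAKE_EEPROM_WORDS] = {
	0x0000, 0x1111, 0x2222, 0x3333, 0x4444, 0x5555, 0x6666, 0x7777
};

static int do_read_words_model(size_t offset, size_t words, uint16_t *data)
{
	size_t i;

	if (offset > FAKE_EEPROM_WORDS || words > FAKE_EEPROM_WORDS - offset)
		return -1;

	for (i = 0; i < words; i++) {
		data[i] = fake_eeprom[offset + i];
		sched_yield();	/* give other threads a chance after each word */
	}
	return 0;
}

/* Serialised entry point: the lock is held across the whole transfer. */
int read_words_model(size_t offset, size_t words, uint16_t *data)
{
	int ret;

	pthread_mutex_lock(&eeprom_lock_model);
	ret = do_read_words_model(offset, words, data);
	pthread_mutex_unlock(&eeprom_lock_model);
	return ret;
}
```

One consequence of the real change is that callers must now be in a context that may sleep, which this sketch cannot demonstrate.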


Re: [PATCH v2] Input: goodix - Fix touch coords on WinBook TW100 and TW700

2015-07-23 Thread Dmitry Torokhov

Hi Bastien,

On Thu, Jul 23, 2015 at 04:13:28PM +0200, Bastien Nocera wrote:
> The touchscreens on the WinBook TW100 and TW700 don't match the default
> display, with 0,0 touches being reported when touching at the bottom
> right of the screen.
> 
>   1280,800 0,800
>  +-+
>  | |
>  | |
>  | |
>  +-+
> 1280,0 0,0
> 
> It's unfortunately impossible to detect this problem with data from the
> DSDT, or other auxiliary metadata, so fall back to quirking this specific
> model of tablet instead.
> 
> Signed-off-by: Bastien Nocera 
> Reviewed-by: Benjamin Tissoires 
> ---
>  drivers/input/touchscreen/goodix.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/drivers/input/touchscreen/goodix.c 
> b/drivers/input/touchscreen/goodix.c
> index b4d12e2..3722806 100644
> --- a/drivers/input/touchscreen/goodix.c
> +++ b/drivers/input/touchscreen/goodix.c
> @@ -15,6 +15,7 @@
>   */
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -34,6 +35,7 @@ struct goodix_ts_data {
>   int abs_y_max;
>   unsigned int max_touch_num;
>   unsigned int int_trigger_type;
> + bool rotated_screen;
>  };
>  
>  #define GOODIX_MAX_HEIGHT4096
> @@ -60,6 +62,24 @@ static const unsigned long goodix_irq_flags[] = {
>   IRQ_TYPE_LEVEL_HIGH,
>  };
>  
> +/* Those tablets have their coords origin at the bottom right
> + * of the tablet, as if rotated 180 degrees */

/*
 * Multi
 * line
 * comment
 */

please.

> +static const struct dmi_system_id rotated_screen[] = {

#if defined(CONFIG_DMI) && defined(CONFIG_X86)

> + {
> + .ident = "WinBook TW100",
> + .matches = {
> + DMI_MATCH(DMI_SYS_VENDOR, "WinBook"),
> + DMI_MATCH(DMI_PRODUCT_NAME, "TW100")
> + },
> + .ident = "WinBook TW700",
> + .matches = {
> + DMI_MATCH(DMI_SYS_VENDOR, "WinBook"),
> + DMI_MATCH(DMI_PRODUCT_NAME, "TW100")
> + },

This does not do what you want it to do... First, you probably wanted
TW700 in the second entry; second, you need to separate them into 2
entries in the array (right now it is still a single entry).
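
The table shape being asked for can be illustrated with a small standalone model. None of this is the DMI API: quirk_entry_model and the helper names are hypothetical, and the coordinate flip is only an assumption based on the "rotated 180 degrees" description in the commit message. The point is that each machine gets its own brace-enclosed element, and an empty element terminates the list. When two .ident/.matches pairs share one pair of braces, the later designators override the earlier ones and the array holds a single entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DMI match entry. */
struct quirk_entry_model {
	const char *ident;
	const char *vendor;
	const char *product;
};

static const struct quirk_entry_model rotated_screen_model[] = {
	{
		.ident = "WinBook TW100",
		.vendor = "WinBook",
		.product = "TW100",
	},
	{
		.ident = "WinBook TW700",
		.vendor = "WinBook",
		.product = "TW700",
	},
	{ .ident = NULL }	/* terminating entry */
};

bool is_rotated_model(const char *vendor, const char *product)
{
	const struct quirk_entry_model *e;

	for (e = rotated_screen_model; e->ident != NULL; e++)
		if (strcmp(e->vendor, vendor) == 0 &&
		    strcmp(e->product, product) == 0)
			return true;
	return false;
}

/* Assumed fixup: mirror both axes so 0,0 moves to the opposite corner. */
void rotate_180_model(int max_x, int max_y, int *x, int *y)
{
	*x = max_x - *x;
	*y = max_y - *y;
}
```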

Thank you.

-- 
Dmitry

[PATCH] block: round timeouts to 100ms instead of 1s

2015-07-23 Thread Spencer Baugh

From: Joern Engel 

Users can request timeouts as low as 1s.  However, whatever the request
timeout happens to be, we always round it up by up to 1s.  So at the
lower end the rounding doubles the user-requested timeout.

Reduce the impact of rounding for small timeout values.

Curious side note: The staggering done in round_jiffies_common() has the
effect of firing timers at slightly different times on different cpus.
The intended result seems to be that not all cpus handle timers at the
same time.

However, this trick only works if the timeout calculation and the firing
of the timer happen on the same cpu.  For block queues the effect is
that instead of bunching timers to trigger just once per second, they
trigger about once per second _per cpu_.  Or rather they used to before
this patch.  So on reasonably-sized systems the timers can actually
trigger less frequently, in spite of better precision.

Signed-off-by: Joern Engel 
Signed-off-by: Spencer Baugh 
---
 block/blk-timeout.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 246dfb1..0d06162 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -127,6 +127,17 @@ static void blk_rq_check_expired(struct request *rq, 
unsigned long *next_timeout
}
 }
 
+/*
+ * With SSDs it gets realistic to set a short timeout of 1s.  But if
+ * every timeout gets rounded up by as much as a second, the effective
+ * limit is 2s.  Round jiffies a bit more precisely to about 100ms
+ * instead.
+ */
+static unsigned long round_jiffies_up_100ms(unsigned long j)
+{
+   return round_up(j, rounddown_pow_of_two(HZ / 10));
+}
+
 void blk_rq_timed_out_timer(unsigned long data)
 {
struct request_queue *q = (struct request_queue *) data;
@@ -140,7 +151,7 @@ void blk_rq_timed_out_timer(unsigned long data)
blk_rq_check_expired(rq, , _set);
 
if (next_set)
-   mod_timer(>timeout, round_jiffies_up(next));
+   mod_timer(>timeout, round_jiffies_up_100ms(next));
 
spin_unlock_irqrestore(q->queue_lock, flags);
 }
@@ -170,7 +181,7 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 {
unsigned long maxt;
 
-   maxt = round_jiffies_up(jiffies + BLK_MAX_TIMEOUT);
+   maxt = round_jiffies_up_100ms(jiffies + BLK_MAX_TIMEOUT);
if (time_after(timeout, maxt))
timeout = maxt;
 
@@ -215,7 +226,7 @@ void blk_add_timer(struct request *req)
 * than an existing one, modify the timer. Round up to next nearest
 * second.
 */
-   expiry = blk_rq_timeout(round_jiffies_up(req->deadline));
+   expiry = blk_rq_timeout(round_jiffies_up_100ms(req->deadline));
 
if (!timer_pending(>timeout) ||
time_before(expiry, q->timeout.expires)) {
-- 
2.5.0.rc3
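
The arithmetic behind round_jiffies_up_100ms() can be worked through in userspace. The helpers below are hypothetical reimplementations written for illustration (the kernel macros are not used), and hz is passed as a parameter instead of being the build-time HZ constant. Because the granularity is rounded down to a power of two, the step is "about" 100ms: 64 jiffies (64ms) at HZ=1000, 16 jiffies (64ms) at HZ=250, and 8 jiffies (80ms) at HZ=100.

```c
/* Largest power of two that is <= n, for n > 0. */
unsigned long rounddown_pow2_model(unsigned long n)
{
	unsigned long p = 1;

	while (p <= n / 2)
		p *= 2;
	return p;
}

/* Round x up to a multiple of y, where y must be a power of two. */
unsigned long round_up_pow2_model(unsigned long x, unsigned long y)
{
	return ((x - 1) | (y - 1)) + 1;
}

/* Same expression as in the patch, with HZ made an explicit argument. */
unsigned long round_jiffies_up_100ms_model(unsigned long j, unsigned long hz)
{
	return round_up_pow2_model(j, rounddown_pow2_model(hz / 10));
}
```

For example, with hz = 1000 a deadline of 1000 jiffies becomes 1024, so a 1s timeout is stretched by at most 63ms instead of by up to a full second.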


Re: [PATCH v9 6/7] staging: add simple-fpga-bus

2015-07-23 Thread Moritz Fischer

Hi Alan,

I saw that your socfpga driver doesn't support the partial reconfig
use case (not a big deal).

What I currently do for Zynq when doing a non-partial reconfig is to
disable the input level shifters and assert *all* resets while
reprogramming, in my FPGA manager's .write_init() and .write_complete().
For the partial reconfig use case that obviously doesn't work, since I
don't want to bring down the entire interconnect.

In a partial reconfiguration situation, would I have separate
simple-fpga buses for each of the parts that I swap out, each with its
own reset and bitfile attached?

On Fri, Jul 17, 2015 at 8:51 AM,   wrote:
> From: Alan Tull 
>
> Add simple fpga bus.  This is a bus that configures an fpga and its
> bridges before populating the devices below it.  This is intended
> for use with device tree overlays.
>
> Note that FPGA bridges are seen as reset controllers so no special
> framework for FPGA bridges will need to be added.
>
> This supports fpga use where hardware blocks on a fpga will need
> drivers (as opposed to fpga used as an acceleration without drivers.)
>
> Signed-off-by: Alan Tull 
> ---
>  drivers/staging/fpga/Kconfig   |   11 ++
>  drivers/staging/fpga/Makefile  |1 +
>  drivers/staging/fpga/simple-fpga-bus.c |  323 
> 
>  3 files changed, 335 insertions(+)
>  create mode 100644 drivers/staging/fpga/simple-fpga-bus.c
>
> diff --git a/drivers/staging/fpga/Kconfig b/drivers/staging/fpga/Kconfig
> index 8254ca0..8d003e3 100644
> --- a/drivers/staging/fpga/Kconfig
> +++ b/drivers/staging/fpga/Kconfig
> @@ -11,4 +11,15 @@ config FPGA
>   kernel.  The FPGA framework adds a FPGA manager class and FPGA
>   manager drivers.
>
> +if FPGA
> +
> +config SIMPLE_FPGA_BUS
> +   bool "Simple FPGA Bus"
> +   depends on OF
> +   help
> + Simple FPGA Bus allows loading FPGA images under control of
> +Device Tree.
> +
> +endif # FPGA
> +
>  endmenu
> diff --git a/drivers/staging/fpga/Makefile b/drivers/staging/fpga/Makefile
> index 3313c52..6115213 100644
> --- a/drivers/staging/fpga/Makefile
> +++ b/drivers/staging/fpga/Makefile
> @@ -4,5 +4,6 @@
>
>  # Core FPGA Manager Framework
>  obj-$(CONFIG_FPGA) += fpga-mgr.o
> +obj-$(CONFIG_SIMPLE_FPGA_BUS)  += simple-fpga-bus.o
>
>  # FPGA Manager Drivers
> diff --git a/drivers/staging/fpga/simple-fpga-bus.c 
> b/drivers/staging/fpga/simple-fpga-bus.c
> new file mode 100644
> index 000..bf178d8
> --- /dev/null
> +++ b/drivers/staging/fpga/simple-fpga-bus.c
> @@ -0,0 +1,323 @@
> +/*
> + * Simple FPGA Bus
> + *
> + *  Copyright (C) 2013-2015 Altera Corporation
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along 
> with
> + * this program.  If not, see .
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * struct simple_fpga_bus - simple fpga bus private data
> + * @dev:   device from pdev
> + * @mgr:   the fpga manager associated with this bus
> + * @bridges:   an array of reset controls for controlling FPGA bridges
> + * associated with this bus
> + * @num_bridges: size of the bridges array
> + */
> +struct simple_fpga_bus {
> +   struct device *dev;
> +   struct fpga_manager *mgr;
> +   struct reset_control **bridges;
> +   int num_bridges;
> +};
> +
> +/**
> + * simple_fpga_bus_get_mgr - get associated fpga manager
> + * @priv: simple fpga bus private data
> + * pointer to fpga manager in priv->mgr on success
> + *
> + * Given a simple fpga bus, get a reference to the fpga manager specified
> + * by its "fpga-mgr" device tree property.
> + *
> + * Return: 0 if success or if the fpga manager is not specified.
> + * Negative error code otherwise.
> + */
> +static int simple_fpga_bus_get_mgr(struct simple_fpga_bus *priv)
> +{
> +   struct device *dev = priv->dev;
> +   struct device_node *np = dev->of_node;
> +   struct fpga_manager *mgr;
> +   struct device_node *mgr_node;
> +
> +   /*
> +* Return 0 (not an error) if fpga manager is not specified.
> +* This supports the case where fpga was already configured.
> +*/
> +   mgr_node = of_parse_phandle(np, "fpga-mgr", 0);
> +   if (!mgr_node) {
> +   dev_dbg(dev, "could not find fpga-mgr DT property\n");
> +   return 0;
> +   }
> +
> +   mgr =

[PATCH] mm: add resched points to remap_pmd_range/ioremap_pmd_range

2015-07-23 Thread Spencer Baugh

From: Joern Engel 

Mapping large memory spaces can be slow and prevent high-priority
realtime threads from preempting lower-priority threads for a long time.
In my case it was a 256GB mapping causing at least 950ms scheduler
delay.  Problem detection is ratelimited and depends on interrupts
happening at the right time, so actual delay is likely worse.
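A rough back-of-the-envelope sketch of why a per-PMD resched point is fine-grained enough. This is not part of the patch. It assumes x86-64 with 4 KiB pages (so one PMD entry spans 2 MiB), reads "256GB" as 256 GiB, and assumes the ~950 ms was spread evenly over the walk. The names `PMD_SPAN_BYTES`, `pmd_iterations` and `ns_per_pmd` are made up for the illustration.

```c
#include <stdint.h>

/* Assumption: x86-64 with 4 KiB pages, where one PMD entry spans 2 MiB. */
#define PMD_SPAN_BYTES (2ULL << 20)

/* Number of PMD-level loop iterations needed to walk a mapping of this size. */
static uint64_t pmd_iterations(uint64_t mapping_bytes)
{
	return (mapping_bytes + PMD_SPAN_BYTES - 1) / PMD_SPAN_BYTES;
}

/*
 * Average time per PMD iteration in nanoseconds, if the whole walk took
 * total_ns and the cost was spread evenly (a simplifying assumption).
 */
static uint64_t ns_per_pmd(uint64_t total_ns, uint64_t mapping_bytes)
{
	return total_ns / pmd_iterations(mapping_bytes);
}
```

Under those assumptions a 256 GiB mapping is 131072 PMD iterations, so a resched point per iteration bounds the non-preemptible stretch to a few microseconds rather than most of a second.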

[ cut here ]
WARNING: at arch/x86/kernel/irq.c:182 do_IRQ+0x126/0x140()
Thread not rescheduled for 36 jiffies
CPU: 14 PID: 6684 Comm: foo Tainted: G   O 3.10.59+
 0009 883f7fbc3ee0 8163a12c 883f7fbc3f18
 8103f131 887f48275ac0 0012 007c
  887f5bc11fd8 883f7fbc3f78 8103f19c
Call Trace:
   [] dump_stack+0x19/0x1b
 [] warn_slowpath_common+0x61/0x80
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? rcu_irq_exit+0x77/0xc0
 [] do_IRQ+0x126/0x140
 [] common_interrupt+0x6f/0x6f
   [] ? set_pageblock_migratetype+0x28/0x30
 [] ? clear_page_c_e+0x7/0x10
 [] ? get_page_from_freelist+0x5b3/0x880
 [] __alloc_pages_nodemask+0xe3/0x810
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] alloc_pages_current+0x86/0x120
 [] __get_free_pages+0xe/0x50
 [] pte_alloc_one_kernel+0x15/0x20
 [] __pte_alloc_kernel+0x1d/0xf0
 [] ioremap_page_range+0x2cc/0x320
 [] __ioremap_caller+0x1e9/0x2b0
 [] ioremap_nocache+0x17/0x20
 [] pci_iomap+0x55/0xb0
 [] vfio_pci_mmap+0x1ea/0x210 [vfio_pci]
 [] vfio_device_fops_mmap+0x23/0x30 [vfio]
 [] mmap_region+0x3d8/0x5e0
 [] do_mmap_pgoff+0x305/0x3c0
 [] ? call_rwsem_down_write_failed+0x13/0x20
 [] vm_mmap_pgoff+0x67/0xa0
 [] SyS_mmap_pgoff+0x272/0x2e0
 [] SyS_mmap+0x22/0x30
 [] system_call_fastpath+0x16/0x1b
---[ end trace 6b0a8d2341444bdd ]---
[ cut here ]
WARNING: at arch/x86/kernel/irq.c:182 do_IRQ+0x126/0x140()
Thread not rescheduled for 95 jiffies
CPU: 14 PID: 6684 Comm: foo Tainted: GW  O 3.10.59+
 0009 883f7fbc3ee0 8163a12c 883f7fbc3f18
 8103f131 887f48275ac0 002f 007c
  7fadd1e0 883f7fbc3f78 8103f19c
Call Trace:
   [] dump_stack+0x19/0x1b
 [] warn_slowpath_common+0x61/0x80
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? rcu_irq_exit+0x77/0xc0
 [] do_IRQ+0x126/0x140
 [] common_interrupt+0x6f/0x6f
   [] ? _raw_spin_lock+0x13/0x30
 [] __pte_alloc+0x31/0xc0
 [] remap_pfn_range+0x45c/0x470
 [] vfio_pci_mmap+0x148/0x210 [vfio_pci]
 [] vfio_device_fops_mmap+0x23/0x30 [vfio]
 [] mmap_region+0x3d8/0x5e0
 [] do_mmap_pgoff+0x305/0x3c0
 [] ? call_rwsem_down_write_failed+0x13/0x20
 [] vm_mmap_pgoff+0x67/0xa0
 [] SyS_mmap_pgoff+0x272/0x2e0
 [] SyS_mmap+0x22/0x30
 [] system_call_fastpath+0x16/0x1b
---[ end trace 6b0a8d2341444bde ]---
[ cut here ]
WARNING: at arch/x86/kernel/irq.c:182 do_IRQ+0x126/0x140()
Thread not rescheduled for 45 jiffies
CPU: 18 PID: 21726 Comm: foo Tainted: G   O 3.10.59+
 0009 88203f203ee0 8163a13c 88203f203f18
 8103f131 881ec5f1ad60 0016 006e
  c939a6dd8000 88203f203f78 8103f19c
Call Trace:
   [] dump_stack+0x19/0x1b
 [] warn_slowpath_common+0x61/0x80
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? rcu_irq_exit+0x77/0xc0
 [] do_IRQ+0x126/0x140
 [] common_interrupt+0x6f/0x6f
   [] ? retint_restore_args+0x13/0x13
 [] ? free_memtype+0x87/0x150
 [] ? vunmap_page_range+0x1e6/0x2a0
 [] remove_vm_area+0x51/0x70
 [] iounmap+0x67/0xa0
 [] pci_iounmap+0x35/0x40
 [] vfio_pci_release+0x9a/0x150 [vfio_pci]
 [] vfio_device_fops_release+0x1c/0x40 [vfio]
 [] __fput+0xdb/0x220
 [] fput+0xe/0x10
 [] task_work_run+0xbc/0xe0
 [] do_exit+0x3ce/0xe50
 [] do_group_exit+0x3f/0xa0
 [] get_signal_to_deliver+0x1a9/0x5b0
 [] do_signal+0x48/0x5e0
 [] ? k_getrusage+0x368/0x3d0
 [] ? default_wake_function+0x12/0x20
 [] ? kprobe_flush_task+0xc0/0x150
 [] ? finish_task_switch+0xc4/0xe0
 [] do_notify_resume+0x65/0x80
 [] retint_signal+0x4d/0x9f
---[ end trace 3506c05e4a0af3e5 ]---

Signed-off-by: Joern Engel 
Signed-off-by: Spencer Baugh 
---
 lib/ioremap.c | 1 +
 mm/memory.c   | 1 +
 mm/vmalloc.c  | 1 +
 3 files changed, 3 insertions(+)

diff --git a/lib/ioremap.c b/lib/ioremap.c
index 86c8911..d38e46d 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -90,6 +90,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long 
addr,
 
if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
return -ENOMEM;
+   cond_resched();
} while (pmd++, addr = next, addr != end);
return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 388dcf9..1541880 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1656,6 +1656,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, 
pud_t *pud,
if (remap_pte_range(mm, pmd, addr, next,
pfn + (addr >> PAGE_SHIFT), prot))
return -ENOMEM;
+

[PATCH] aer: add cond_resched to aer_isr

2015-07-23 Thread Spencer Baugh

From: Joern Engel 

Multiple nested loops.  I have observed 590ms scheduler latency caused
by this loop and interrupts.  Interrupts were responsible for 190ms, the
rest could have been avoided with a cond_resched.

Signed-off-by: Joern Engel 
Signed-off-by: Spencer Baugh 
---
 drivers/pci/pcie/aer/aerdrv_core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
b/drivers/pci/pcie/aer/aerdrv_core.c
index 9803e3d..32b1b5c 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -780,8 +780,10 @@ void aer_isr(struct work_struct *work)
struct aer_err_source uninitialized_var(e_src);
 
mutex_lock(&rpc->rpc_mutex);
-   while (get_e_source(rpc, &e_src))
+   while (get_e_source(rpc, &e_src)) {
aer_isr_one_error(p_device, &e_src);
+   cond_resched();
+   }
mutex_unlock(&rpc->rpc_mutex);
 
wake_up(&rpc->wait_release);
-- 
2.5.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

2015-07-23 Thread Spencer Baugh

From: Joern Engel 

~150ms scheduler latency for both observed in the wild.

Signed-off-by: Joern Engel 
Signed-off-by: Spencer Baugh 
---
 mm/hugetlb.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..2eb6919 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1836,6 +1836,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, 
unsigned long count,
ret = alloc_fresh_gigantic_page(h, nodes_allowed);
else
ret = alloc_fresh_huge_page(h, nodes_allowed);
+   cond_resched();
spin_lock(&hugetlb_lock);
if (!ret)
goto out;
@@ -3521,6 +3522,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
spin_unlock(ptl);
ret = hugetlb_fault(mm, vma, vaddr,
(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
+   cond_resched();
if (!(ret & VM_FAULT_ERROR))
continue;
 
-- 
2.5.0.rc3


Re: Dealing with the NMI mess

2015-07-23 Thread Linus Torvalds

On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski  wrote:
>
> Or we just re-enable them on the way out of NMI (i.e. the very last
> thing we do in the NMI handler).  I don't want to break regular
> userspace gdb when perf is running.

I'd really prefer it if we don't touch NMI code in those kinds of
ways. The NMI code is fragile as hell. All the problems we have with
it are exactly due to "where is the boundary" issues.

That's why I *don't* want NMI code to do magic crap. Anything that
says "disable this during this magic window" is broken. The problems
we've had are exactly about atomicity of the entry/exit conditions,
and there is no really good way to get them right.

I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
because it's so *obvious* that it's not a boundary condition.

I dislike the "disable and re-enable dr7 in the NMI handler" exactly
because it smells like "we can only handle faults in _this_ region".
It may be true, but it's also what I want us to get away from. I'd
much rather have the "big picture" be that we can take faults anywhere
at all (*), and that none of the core code really cares. Then we "fix
up" user space.

   Linus

(*) And yes, sysenter and not having a stack at all is very special,
and I think we will *always* have to have that magical special case of
the first few instructions there. But that's a separate hardware
limitation we can't get around.

Re: [PATCH 1/5] Add blank line under variable declaration.

2015-07-23 Thread Greg KH

For all of these, you need a better subject line that shows what part of
the kernel you are modifying.

For example, this one would be:
Subject: [PATCH 1/5] staging: lustre: cl_page.c: add blank line after variable definition

On Thu, Jul 23, 2015 at 02:21:10PM +0800, Incarnation P. Lee wrote:
> Signed-off-by: Incarnation P. Lee 
> 

We need a changelog entry, it can't be blank.

And I need a hint of "real" name here, is "Incarnation P. Lee" how you
sign legal documents?

thanks,

greg k-h

Re: [PATCH v3] target: Drop iSCSI use of mutex around max_cmd_sn increment

2015-07-23 Thread Spencer Baugh


From: Roland Dreier 

In a performance profile, taking a mutex in iscsit_increment_maxcmdsn()
shows up very high.  However taking a mutex around "sess->max_cmd_sn += 1"
seems pretty silly: we're not serializing against other contexts in
any useful way.

I did a quick audit and there don't appear to be any other places that
use max_cmd_sn within the mutex more than once, so this lock can't be
providing any useful serialization.
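To illustrate the idea outside the kernel: a minimal userspace sketch using C11 atomics, where a lock-free increment replaces "lock; counter += 1; unlock". The struct and function names here are hypothetical and this is not the code from the patch, which uses the kernel's atomic_t API on sess->max_cmd_sn.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the session; only the counter matters here. */
struct sess_sketch {
	atomic_uint max_cmd_sn;
};

/*
 * Bump the counter without taking a lock and return the new value, so a
 * caller that wants to log it sees the value produced by its own increment.
 */
static uint32_t sketch_increment_maxcmdsn(struct sess_sketch *sess)
{
	return (uint32_t)atomic_fetch_add(&sess->max_cmd_sn, 1) + 1;
}

/* Plain atomic read, e.g. when filling in a response header. */
static uint32_t sketch_read_maxcmdsn(struct sess_sketch *sess)
{
	return (uint32_t)atomic_load(&sess->max_cmd_sn);
}
```

Returning the post-increment value from the same atomic operation avoids a separate read that could observe another context's increment.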

v2: Get correct values for logging
v3: Fix whitespace damage

Signed-off-by: Roland Dreier 
Signed-off-by: Spencer Baugh 
---
 drivers/target/iscsi/iscsi_target.c  | 18 +-
 drivers/target/iscsi/iscsi_target_configfs.c |  6 --
 drivers/target/iscsi/iscsi_target_device.c   |  7 ++-
 drivers/target/iscsi/iscsi_target_login.c|  2 +-
 drivers/target/iscsi/iscsi_target_nego.c |  9 +++--
 drivers/target/iscsi/iscsi_target_tmr.c  |  2 +-
 drivers/target/iscsi/iscsi_target_util.c |  7 ---
 include/target/iscsi/iscsi_target_core.h |  2 +-
 8 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target.c 
b/drivers/target/iscsi/iscsi_target.c
index ebb1ece..f615d75 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -2555,7 +2555,7 @@ static int iscsit_send_conn_drop_async_message(
cmd->stat_sn= conn->stat_sn++;
hdr->statsn = cpu_to_be32(cmd->stat_sn);
hdr->exp_cmdsn  = cpu_to_be32(conn->sess->exp_cmd_sn);
-   hdr->max_cmdsn  = cpu_to_be32(conn->sess->max_cmd_sn);
+   hdr->max_cmdsn  = cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
hdr->async_event= ISCSI_ASYNC_MSG_DROPPING_CONNECTION;
hdr->param1 = cpu_to_be16(cmd->logout_cid);
hdr->param2 = 
cpu_to_be16(conn->sess->sess_ops->DefaultTime2Wait);
@@ -2627,7 +2627,7 @@ iscsit_build_datain_pdu(struct iscsi_cmd *cmd, struct 
iscsi_conn *conn,
hdr->statsn = cpu_to_be32(0x);
 
hdr->exp_cmdsn  = cpu_to_be32(conn->sess->exp_cmd_sn);
-   hdr->max_cmdsn  = cpu_to_be32(conn->sess->max_cmd_sn);
+   hdr->max_cmdsn  = cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
hdr->datasn = cpu_to_be32(datain->data_sn);
hdr->offset = cpu_to_be32(datain->offset);
 
@@ -2838,7 +2838,7 @@ iscsit_build_logout_rsp(struct iscsi_cmd *cmd, struct 
iscsi_conn *conn,
 
iscsit_increment_maxcmdsn(cmd, conn->sess);
hdr->exp_cmdsn  = cpu_to_be32(conn->sess->exp_cmd_sn);
-   hdr->max_cmdsn  = cpu_to_be32(conn->sess->max_cmd_sn);
+   hdr->max_cmdsn  = cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
 
pr_debug("Built Logout Response ITT: 0x%08x StatSN:"
" 0x%08x Response: 0x%02x CID: %hu on CID: %hu\n",
@@ -2901,7 +2901,7 @@ iscsit_build_nopin_rsp(struct iscsi_cmd *cmd, struct 
iscsi_conn *conn,
iscsit_increment_maxcmdsn(cmd, conn->sess);
 
hdr->exp_cmdsn  = cpu_to_be32(conn->sess->exp_cmd_sn);
-   hdr->max_cmdsn  = cpu_to_be32(conn->sess->max_cmd_sn);
+   hdr->max_cmdsn  = cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
 
pr_debug("Built NOPIN %s Response ITT: 0x%08x, TTT: 0x%08x,"
" StatSN: 0x%08x, Length %u\n", (nopout_response) ?
@@ -3048,7 +3048,7 @@ static int iscsit_send_r2t(
hdr->ttt= cpu_to_be32(r2t->targ_xfer_tag);
hdr->statsn = cpu_to_be32(conn->stat_sn);
hdr->exp_cmdsn  = cpu_to_be32(conn->sess->exp_cmd_sn);
-   hdr->max_cmdsn  = cpu_to_be32(conn->sess->max_cmd_sn);
+   hdr->max_cmdsn  = cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
hdr->r2tsn  = cpu_to_be32(r2t->r2t_sn);
hdr->data_offset= cpu_to_be32(r2t->offset);
hdr->data_length= cpu_to_be32(r2t->xfer_len);
@@ -3201,7 +3201,7 @@ void iscsit_build_rsp_pdu(struct iscsi_cmd *cmd, struct 
iscsi_conn *conn,
 
iscsit_increment_maxcmdsn(cmd, conn->sess);
hdr->exp_cmdsn  = cpu_to_be32(conn->sess->exp_cmd_sn);
-   hdr->max_cmdsn  = cpu_to_be32(conn->sess->max_cmd_sn);
+   hdr->max_cmdsn  = cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
 
pr_debug("Built SCSI Response, ITT: 0x%08x, StatSN: 0x%08x,"
" Response: 0x%02x, SAM Status: 0x%02x, CID: %hu\n",
@@ -3320,7 +3320,7 @@ iscsit_build_task_mgt_rsp(struct iscsi_cmd *cmd, struct 
iscsi_conn *conn,
 
iscsit_increment_maxcmdsn(cmd, conn->sess);
hdr->exp_cmdsn  = cpu_to_be32(conn->sess->exp_cmd_sn);
-   hdr->max_cmdsn  = cpu_to_be32(conn->sess->max_cmd_sn);
+   hdr->max_cmdsn  = cpu_to_be32((u32) atomic_read(&conn->sess->max_cmd_sn));
 
pr_debug("Built Task Management

Re: [RFC PATCH] tools lib traceevent: Allow setting an alternative symbol resolver

2015-07-23 Thread Arnaldo Carvalho de Melo

Em Thu, Jul 23, 2015 at 05:35:24PM -0400, Steven Rostedt escreveu:
> On Thu, 23 Jul 2015 18:25:36 -0300
> Arnaldo Carvalho de Melo  wrote:
> 
> > Like this?
> 
> Yep, but some comments.
> 
> > +int pevent_set_function_resolver(struct pevent *pevent,
> > +pevent_func_resolver_t *func, void *priv)
> > +{
> > +   struct func_resolver *resolver = malloc(sizeof(*resolver));
> > +
> > +   if (resolver == NULL) {
> > +   errno = ENOMEM;
> 
> > Why set errno, won't a failed malloc set it for us?

Humm, I thought about that, I've read malloc's man page and it doesn't
mention that... in the return section, but later, in NOTES, it says
UNIX98 requires that and glibc does it, so I'm ditching this...
 
> > +   return -1;
> > +   }
> > +
> > +   resolver->func = func;
> > +   resolver->priv = priv;
> > +
> > +   free(pevent->func_resolver);
> > +   pevent->func_resolver = resolver;
> 
> Also I wonder if we should add a way to clear the resolver. That is,
> you want to use the default resolver?

I am adding a reset_function_resolver(pevent);
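Roughly what the two changes amount to, as a standalone sketch rather than the real patch: the setter leaves errno to malloc() (glibc sets ENOMEM on failure), and a reset helper drops the resolver so lookups fall back to the default path. All names ending in _sketch are hypothetical stand-ins, not the libtraceevent types.

```c
#include <stdlib.h>

/* Hypothetical resolver callback, shaped like the one in the patch. */
typedef char *(resolver_fn_sketch)(void *priv,
				   unsigned long long *addrp, char **modp);

struct resolver_sketch {
	resolver_fn_sketch *func;
	void *priv;
};

/* Hypothetical stand-in for the handle that owns the resolver. */
struct handle_sketch {
	struct resolver_sketch *resolver;
};

/* Install a resolver, replacing any previous one. Returns 0 or -1. */
static int sketch_set_resolver(struct handle_sketch *h,
			       resolver_fn_sketch *func, void *priv)
{
	struct resolver_sketch *r = malloc(sizeof(*r));

	if (r == NULL)
		return -1;	/* errno is left as malloc() set it */

	r->func = func;
	r->priv = priv;

	free(h->resolver);
	h->resolver = r;
	return 0;
}

/* Drop the resolver; a NULL resolver means "use the default lookup". */
static void sketch_reset_resolver(struct handle_sketch *h)
{
	free(h->resolver);
	h->resolver = NULL;
}
```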
 
> Not really a necessity, as I don't see any current programs using it,
> but it would complete the interface.
> 
> -- Steve
> 
> > +
> > +   return 0;
> > +}
> > +
> > +static struct func_map *
> > +find_func(struct pevent *pevent, unsigned long long addr)
> > +{
> > +   struct func_map *map;
> > +
> > +   if (!pevent->func_resolver)
> > +   return __find_func(pevent, addr);
> > +
> > +   map = &pevent->func_resolver->map;
> > +   map->mod  = NULL;
> > +   map->addr = addr;
> > +   map->func = pevent->func_resolver->func(pevent->func_resolver->priv,
> > +   &map->addr, &map->mod);
> > +   if (map->func == NULL)
> > +   return NULL;
> > +
> > +   return map;
> > +}
> > +
> >  /**
> >   * pevent_find_function - find a function by a given address
> >   * @pevent: handle for the pevent
> > @@ -6564,6 +6618,7 @@ void pevent_free(struct pevent *pevent)
> > free(pevent->trace_clock);
> > free(pevent->events);
> > free(pevent->sort_events);
> > +   free(pevent->func_resolver);
> >  
> > free(pevent);
> >  }
> > diff --git a/tools/lib/traceevent/event-parse.h 
> > b/tools/lib/traceevent/event-parse.h
> > index 063b1971eb35..416e1bd9fe33 100644
> > --- a/tools/lib/traceevent/event-parse.h
> > +++ b/tools/lib/traceevent/event-parse.h
> > @@ -453,6 +453,10 @@ struct cmdline_list;
> >  struct func_map;
> >  struct func_list;
> >  struct event_handler;
> > +struct func_resolver;
> > +
> > +typedef char *(pevent_func_resolver_t)(void *priv,
> > +  unsigned long long *addrp, char **modp);
> >  
> >  struct pevent {
> > int ref_count;
> > @@ -481,6 +485,7 @@ struct pevent {
> > int cmdline_count;
> >  
> > struct func_map *func_map;
> > +   struct func_resolver *func_resolver;
> > struct func_list *funclist;
> > unsigned int func_count;
> >  
> > @@ -611,6 +616,8 @@ enum trace_flag_type {
> > TRACE_FLAG_SOFTIRQ  = 0x10,
> >  };
> >  
> > +int pevent_set_function_resolver(struct pevent *pevent,
> > +pevent_func_resolver_t *func, void *priv);
> >  int pevent_register_comm(struct pevent *pevent, const char *comm, int pid);
> >  int pevent_register_trace_clock(struct pevent *pevent, const char 
> > *trace_clock);
> >  int pevent_register_function(struct pevent *pevent, char *name,

Re: Dealing with the NMI mess

2015-07-23 Thread Willy Tarreau

On Thu, Jul 23, 2015 at 02:46:49PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 23, 2015 at 2:46 PM, Willy Tarreau  wrote:
> > Can't the back link of the TSS tell us where we come from ? At least
> > it should not be manipulable from user-space.
> 
> Not on 64-bit -- there are no tasks :)

Ah crap, sorry for the noise then!

Willy


Re: Dealing with the NMI mess

2015-07-23 Thread Andy Lutomirski

On Thu, Jul 23, 2015 at 2:48 PM, Linus Torvalds
 wrote:
> On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt  wrote:
>>
>> Let me get this straight. The idea is in the #DB handler to detect that
>> it was triggered in NMI context, and if so, simply disarm that
>> breakpoint permanently, right?
>
> No, for simplicity, I'd make it cover not just NMI code, but any
> "kernel code with interrupts disabled".
>
> Because that's the test we'd use for "use ret instead of iret".
>
> And that wider test is exactly because it's so damn hard to get the
> exact instruction boundaries right. Let's *not* go down the path
> (again) of having to get the whole %rip range and "magic stack pointer
> values" etc.
>
> Make it simple and completely unambiguous. The rule really would be:
>
>  - if we return to kernel space and interrupts are disabled, we will
> use "ret" rather than "iret"
>
>Hard rule. Simple. Straightforward. No random %rip values. No
> random %rsp values. NO CRAP.
>
>  - but because we use "ret" rather than "iret" we can't get RF
> semantics, it means that #DB is special. RF is supposed to make us
> make forward progress
>
>So for that reason, #DB just says "if the breakpoint happened
> during that interrupts-ff reghion, I will clear %dr7 to guarantee
> forward progress"

What if we relax it slightly: "if the breakpoint happened during that
interrupts-off region, I will clear all *kernel breakpoints* in %dr7
to guarantee forward progress"?

Watchpoints don't need RF to make forward progress, and, by leaving
watchpoints alone, we avoid breaking gdb.

>
> So those would be the two main rules. Very simple, and avoiding all nasty 
> cases.
>
> Now, I'd be willing to then hide the "oops, we clear dr7 very
> aggressively" issue by having a few additional _heuristics_. But I
> call them "heuristics" because unlike the current NMI nesting games,
> they aren't about core stability. They are about "ok, maybe somebody
> wants to trigger those faults, and we'll be _nice_ and try to make it
> easy for them", but nothing more.
>
> So for example, if that "#DB clears %dr7" happened, it sounds easy to
> set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
> cached value, so that if we disabled things because of some user stack
> trace access, it will be re-enabled by the time we return to user
> space. I think that sounds reasonable, but it's not something the core
> low-level entry x86 assembly code needs to even care about. It's not
> that level of "core", it's just being polite.

Once we limit it to instruction breakpoints, I don't think re-enabling
before returning to userspace matters.

--Andy

Re: [PATCH 0/7] Initial support for user namespace owned mounts

2015-07-23 Thread Casey Schaufler

On 7/22/2015 5:15 PM, Eric W. Biederman wrote:
> Casey Schaufler  writes:
>
>> On 7/22/2015 12:32 PM, Seth Forshee wrote:
>>> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
>>>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
>>>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>>>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler 
>>>>>>>> wrote:
>>>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>>>>>> with backing store.  They wouldn't make sense for filesystems without
>>>>>>>>>> backing store.
>>>>>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>>>>>> created with, which would be the label of the process creating
>>>>>>>>> the memory based filesystem. For SELinux the rules are more a
>>>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>>>>>> come up with how to determine it.
>>>>>>>>>
>>>>>>>>> The point, looping all the way back to the beginning, where we
>>>>>>>>> were talking about just ignoring the labels on the filesystem,
>>>>>>>>> is that if you use the same Smack label on the files in the
>>>>>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>>>>>> If that label isn't something user can write to, he won't be
>>>>>>>>> able to write to the mounted objects, either. If there is no
>>>>>>>>> backing store then use the label of the process creating the
>>>>>>>>> filesystem, which will be the user, which will mean everything
>>>>>>>>> will work hunky dory.
>>>>>>>>>
>>>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>>>>>> the label from the backing store or the creating process is
>>>>>>>>> simple enough.
>>>>>>>>>
>>>>>>> So something like the diff below (untested)?
>>>>>> I think that this is close, and quite good for someone
>>>>>> who isn't very familiar with Smack. It's definitely headed
>>>>>> in the right direction.
>>>>>>
>>>>>>> All I'm really doing is setting smk_default as you describe above and
>>>>>>> then using it instead of smk_of_current() in
>>>>>>> smack_inode_alloc_security() and instead of the label from the disk in
>>>>>>> smack_d_instantiate().
>>>>>> Let's say your backing store is a file labeled Rubble.
>>>>>>
>>>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>>>>>
>>>>>> It is completely reasonable for a process labeled Flintstone to
>>>>>> have rwxa access to a file labeled Rubble.
>>>>>>
>>>>>> Smack rule: Flintstone Rubble rwxa
>>>>>>
>>>>>> In the case of writing to an existing Rubble file, what you
>>>>>> have looks fine. What's not so great is that if the Flintstone
>>>>>> process creates a file, it should be labeled Flintstone. Your
>>>>>> use of the smk_default, which is going to violate the principle
>>>>>> of least astonishment, and break the Smack policy as well.
>>>>>>
>>>>>> Let's make a minor change. Instead of using smackfsroot let's
>>>>>> use smackfstransmute and a slightly different access rule:
>>>>>>
>>>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>>>>>
>>>>>> Smack rule: Flintstone Rubble rwxat
>>>>>>
>>>>>> Now the only change we have to make to the Smack code is
>>>>>> that we don't want to create any files unless either the
>>>>>> process is labeled Rubble or the rule allowing the creation
>>>>>> has the "t" for transmute access. That should ensure that
>>>>>> everything is labeled Rubble. If it isn't, someone has mucked
>>>>>> with the metadata in a detectable way.
>>>>> All right, that kind of makes sense, but I'm still missing some pieces.
>>>>> Questions follow.
>>>>>
>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>>> index 32f598db0b0d..4597420ab933 100644
>>>>>>> --- a/include/linux/fs.h
>>>>>>> +++ b/include/linux/fs.h
>>>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct 
>>>>>>> super_block *sb)
>>>>>>> __sb_start_write(sb, SB_FREEZE_FS, true);
>>>>>>>  }
>>>>>>>  
>>>>>>> +static inline bool sb_in_userns(struct super_block *sb)
>>>>>>> +{
>>>>>>> +   return sb->s_user_ns != &init_user_ns;
>>>>>>> +}
>>>>>>>  
>>>>>>>  extern bool inode_owner_or_capable(const struct inode *inode);
>>>>>>>  
>>>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>>>>>> index a143328f75eb..591fd19294e7 100644
>>>>>>> --- a/security/smack/smack_lsm.c
>>>>>>> +++ b/security/smack/smack_lsm.c
>>>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char 
>>>>>>> *name, struct inode *ip,
>>>>>>> char *buffer;
>>>>>>> struct smack_known *skp = NULL;
>>>>>>>  
>>>>>>> +   /* Should never fetch xattrs from untrusted mounts */
>>>>>>> +   if (WARN_ON(sb_in_userns(ip->i_sb)))
>>>>>>> +   return ERR_PTR(-EPERM);

Re: Dealing with the NMI mess

2015-07-23 Thread Linus Torvalds

On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt  wrote:
>
> Let me get this straight. The idea is in the #DB handler to detect that
> it was triggered in NMI context, and if so, simply disarm that
> breakpoint permanently, right?

No, for simplicity, I'd make it cover not just NMI code, but any
"kernel code with interrupts disabled".

Because that's the test we'd use for "use ret instead of iret".

And that wider test is exactly because it's so damn hard to get the
exact instruction boundaries right. Let's *not* go down the path
(again) of having to get the whole %rip range and "magic stack pointer
values" etc.

Make it simple and completely unambiguous. The rule really would be:

 - if we return to kernel space and interrupts are disabled, we will
use "ret" rather than "iret"

   Hard rule. Simple. Straightforward. No random %rip values. No
random %rsp values. NO CRAP.

 - but because we use "ret" rather than "iret" we can't get RF
semantics, it means that #DB is special. RF is supposed to make us
make forward progress

   So for that reason, #DB just says "if the breakpoint happened
during that interrupts-off region, I will clear %dr7 to guarantee
forward progress"

So those would be the two main rules. Very simple, and avoiding all nasty cases.
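The two rules can be modelled as a pair of pure functions sharing one predicate. This is only a sketch of the decision logic: the names are invented for the illustration and it says nothing about how the entry assembly would actually implement it.

```c
#include <stdbool.h>

enum exit_insn_sketch { EXIT_WITH_IRET, EXIT_WITH_RET };

/* The single test both rules hinge on: kernel code with interrupts off. */
static bool irqs_off_kernel_region(bool in_kernel, bool irqs_disabled)
{
	return in_kernel && irqs_disabled;
}

/* Rule 1: returning to kernel space with interrupts disabled uses "ret". */
static enum exit_insn_sketch pick_exit_insn(bool returning_to_kernel,
					    bool irqs_disabled)
{
	return irqs_off_kernel_region(returning_to_kernel, irqs_disabled)
		? EXIT_WITH_RET : EXIT_WITH_IRET;
}

/*
 * Rule 2: without iret there is no RF, so a #DB taken in that same region
 * clears dr7 to guarantee forward progress. Returns the dr7 value to load.
 */
static unsigned long db_fixup_dr7(unsigned long dr7, bool hit_in_kernel,
				  bool irqs_disabled)
{
	return irqs_off_kernel_region(hit_in_kernel, irqs_disabled) ? 0 : dr7;
}
```

Neither function looks at %rip or %rsp ranges, so there are no instruction boundaries to get wrong.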

Now, I'd be willing to then hide the "oops, we clear dr7 very
aggressively" issue by having a few additional _heuristics_. But I
call them "heuristics" because unlike the current NMI nesting games,
they aren't about core stability. They are about "ok, maybe somebody
wants to trigger those faults, and we'll be _nice_ and try to make it
easy for them", but nothing more.

So for example, if that "#DB clears %dr7" happened, it sounds easy to
set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
cached value, so that if we disabled things because of some user stack
trace access, it will be re-enabled by the time we return to user
space. I think that sounds reasonable, but it's not something the core
low-level entry x86 assembly code needs to even care about. It's not
that level of "core", it's just being polite.

 Linus

Re: Dealing with the NMI mess

2015-07-23 Thread Andy Lutomirski

On Thu, Jul 23, 2015 at 2:20 PM, Steven Rostedt  wrote:
> On Thu, 23 Jul 2015 13:21:16 -0700
> Andy Lutomirski  wrote:
>
>> 3. Forbid faults (other than MCE) inside NMI.
>>
>> Option 3 is almost easy.  There are really only two kinds of faults
>> that can legitimately nest inside NMI: #PF and #DB.  #DB is easy to
>> fix (e.g. with my patches or Peter's patches).
>
> What about int3? Which is needed to make ftrace work. This was a
> requirement to get rid of stomp-machine when updating ftrace functions,
> as well as the rationale for doing the whole NMI nesting work in the
> first place.

OK, I'm convinced.

So I'll keep working on fixing up int3 to be less magical.  Patches
coming eventually.

--Andy
