[PATCH v4 0/6] Add "func_no_repeats" tracing option

2021-04-15 Thread Yordan Karadzhov (VMware)
The new option for function tracing aims to save space on the ring
buffer and to make it more readable in the case when a single function
is called a number of times consecutively:

while (cond)
do_func();

Instead of having an identical record for each call of the function,
we will record only the first call, followed by an event showing the
number of repeats.
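
For illustration, below is a minimal user-space sketch of the consolidation
idea. It is not the kernel implementation and the names are made up, but the
fields mirror "struct trace_func_repeats" introduced by this series:

/* Minimal model: fold consecutive identical calls into one "function"
 * record plus a single "func_repeats" record. */
#include <stdio.h>

struct func_repeats {			/* mirrors struct trace_func_repeats */
	unsigned long ip;
	unsigned long parent_ip;
	unsigned long count;
};

static void trace_call(struct func_repeats *last, unsigned long ip,
		       unsigned long parent_ip)
{
	if (last->ip == ip && last->parent_ip == parent_ip) {
		last->count++;			/* suppress the duplicate record */
		return;
	}
	if (last->count) {			/* flush the pending repeats */
		printf("func_repeats: %#lx <-%#lx (repeats: %lu)\n",
		       last->ip, last->parent_ip, last->count);
		last->count = 0;
	}
	printf("function: %#lx <-%#lx\n", ip, parent_ip);
	last->ip = ip;
	last->parent_ip = parent_ip;
}

int main(void)
{
	struct func_repeats last = { 0 };
	int i;

	for (i = 0; i < 5; i++)			/* while (cond) do_func(); */
		trace_call(&last, 0x1000, 0x2000);
	trace_call(&last, 0x3000, 0x2000);	/* a different call flushes */
	return 0;
}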

v4 changes:
  * "[PATCH 1/5] Define static void trace_print_time()" is added to
the patch-set. This patch simply applies the modification suggested
by Steven Rostedt in his review of v3. My contribution to the code
of this patch is really negligible.

  * FUNC_REPEATS_GET_DELTA_TS macro is improved.

  * The way we print the time of the last function repeat is improved.

  * "tracer_alloc_func_repeats(struct trace_array *tr)" is removed from
trace.h.

  * The FUNC_REPEATS_GET_DELTA_TS macro is replaced by a static inline
function.

v3 changes:
  * FUNC_REPEATS_SET_DELTA_TS macro has been optimised.

  * Fixed bug in func_set_flag(): In the previous version the value
of the "new_flags" variable was not calculated properly because
I misinterpreted the meaning of the "bit" argument of the function.
This bug had no real effect in v2, because for the moment we have
only two "function options", so the value of "new_flags" happened
to be correct even though the way it was calculated was wrong.

v2 changes:
  * As suggested by Steven in his review, we now record not only the
repetition count, but also the time elapsed between the last
repetition of the function and the actual generation of the
"func_repeats" event. 16 bits are used to record the repetition
count. In the case of an overflow of the counter, a second pair of
"function" and "func_repeats" events will be generated. The time
interval gets encoded using up to 48 (32 + 16) bits (see the sketch
below).
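
A minimal sketch of the 48-bit split mentioned above (a user-space model of
the series' delta handling, with fixed-width standard types standing in for
the kernel's u16/u32/u64):

/* The time interval is stored as a 32-bit bottom part plus a 16-bit top
 * part, i.e. up to 48 bits in total. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct delta48 {
	uint16_t top_delta_ts;		/* upper 16 bits of the delta */
	uint32_t bottom_delta_ts;	/* lower 32 bits of the delta */
};

static void set_delta_ts(struct delta48 *e, uint64_t delta)
{
	e->bottom_delta_ts = (uint32_t)delta;
	e->top_delta_ts = (uint16_t)(delta >> 32);	/* bits above 48 are lost */
}

static uint64_t get_delta_ts(const struct delta48 *e)
{
	return ((uint64_t)e->top_delta_ts << 32) | e->bottom_delta_ts;
}

int main(void)
{
	struct delta48 e;

	set_delta_ts(&e, 0x123456789abcULL);
	printf("delta = %#" PRIx64 "\n", get_delta_ts(&e));	/* 0x123456789abc */
	return 0;
}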


Yordan Karadzhov (VMware) (6):
  tracing: Define static void trace_print_time()
  tracing: Define new ftrace event "func_repeats"
  tracing: Add "last_func_repeats" to struct trace_array
  tracing: Add method for recording "func_repeats" events
  tracing: Unify the logic for function tracing options
  tracing: Add "func_no_repeats" option for function tracing

 kernel/trace/trace.c   |  35 ++
 kernel/trace/trace.h   |  19 +++
 kernel/trace/trace_entries.h   |  22 
 kernel/trace/trace_functions.c | 223 -
 kernel/trace/trace_output.c|  74 +--
 5 files changed, 336 insertions(+), 37 deletions(-)

-- 
2.25.1



[PATCH v4 6/6] tracing: Add "func_no_repeats" option for function tracing

2021-04-15 Thread Yordan Karadzhov (VMware)
If the option is activated, the function tracing records get
consolidated in the cases when a single function is called a number
of times consecutively. Instead of having an identical record for
each call of the function, we will record only the first call,
followed by an event showing the number of repeats.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace_functions.c | 162 -
 1 file changed, 159 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index f37f73a9b1b8..1f0e63f5d1f9 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -27,15 +27,27 @@ function_trace_call(unsigned long ip, unsigned long 
parent_ip,
 static void
 function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static void
+function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static void
+function_stack_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+struct ftrace_ops *op,
+struct ftrace_regs *fregs);
 static struct tracer_flags func_flags;
 
 /* Our option */
 enum {
-   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
-   TRACE_FUNC_OPT_STACK= 0x1,
+
+   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
+   TRACE_FUNC_OPT_STACK= 0x1,
+   TRACE_FUNC_OPT_NO_REPEATS   = 0x2,
+
+   /* Update this to next highest bit. */
+   TRACE_FUNC_OPT_HIGHEST_BIT  = 0x4
 };
 
-#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_STACK)
+#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_HIGHEST_BIT - 1)
 
 int ftrace_allocate_ftrace_ops(struct trace_array *tr)
 {
@@ -96,11 +108,27 @@ static ftrace_func_t select_trace_function(u32 flags_val)
return function_trace_call;
case TRACE_FUNC_OPT_STACK:
return function_stack_trace_call;
+   case TRACE_FUNC_OPT_NO_REPEATS:
+   return function_no_repeats_trace_call;
+   case TRACE_FUNC_OPT_STACK | TRACE_FUNC_OPT_NO_REPEATS:
+   return function_stack_no_repeats_trace_call;
default:
return NULL;
}
 }
 
+static bool handle_func_repeats(struct trace_array *tr, u32 flags_val)
+{
+   if (!tr->last_func_repeats &&
+   (flags_val & TRACE_FUNC_OPT_NO_REPEATS)) {
+   tr->last_func_repeats = alloc_percpu(struct trace_func_repeats);
+   if (!tr->last_func_repeats)
+   return false;
+   }
+
+   return true;
+}
+
 static int function_trace_init(struct trace_array *tr)
 {
ftrace_func_t func;
@@ -116,6 +144,9 @@ static int function_trace_init(struct trace_array *tr)
if (!func)
return -EINVAL;
 
+   if (!handle_func_repeats(tr, func_flags.val))
+   return -ENOMEM;
+
ftrace_init_array_ops(tr, func);
 
tr->array_buffer.cpu = raw_smp_processor_id();
@@ -217,10 +248,132 @@ function_stack_trace_call(unsigned long ip, unsigned 
long parent_ip,
local_irq_restore(flags);
 }
 
+static inline bool is_repeat_check(struct trace_array *tr,
+  struct trace_func_repeats *last_info,
+  unsigned long ip, unsigned long parent_ip)
+{
+   if (last_info->ip == ip &&
+   last_info->parent_ip == parent_ip &&
+   last_info->count < U16_MAX) {
+   last_info->ts_last_call =
+   ring_buffer_time_stamp(tr->array_buffer.buffer);
+   last_info->count++;
+   return true;
+   }
+
+   return false;
+}
+
+static inline void process_repeats(struct trace_array *tr,
+  unsigned long ip, unsigned long parent_ip,
+  struct trace_func_repeats *last_info,
+  unsigned int trace_ctx)
+{
+   if (last_info->count) {
+   trace_last_func_repeats(tr, last_info, trace_ctx);
+   last_info->count = 0;
+   }
+
+   last_info->ip = ip;
+   last_info->parent_ip = parent_ip;
+}
+
+static void
+function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op,
+  struct ftrace_regs *fregs)
+{
+   struct trace_func_repeats *last_info;
+   struct trace_array *tr = op->private;
+   struct trace_array_cpu *data;
+   unsigned int trace_ctx;
+   unsigned long flags;
+   int bit;
+   int cpu;
+
+   if (unlikely(!tr->function_enabled))
+   return;
+
+   bit = ftrace_test_recursion_trylock(ip, parent_ip);

[PATCH v4 3/6] tracing: Add "last_func_repeats" to struct trace_array

2021-04-15 Thread Yordan Karadzhov (VMware)
The field is used to keep track of the consecutive (on the same CPU) calls
of a single function. This information is needed in order to consolidate
the function tracing record in the cases when a single function is called
a number of times consecutively.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.c |  1 +
 kernel/trace/trace.h | 12 
 2 files changed, 13 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 507a30bf26e4..82833be07c1e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9104,6 +9104,7 @@ static int __remove_instance(struct trace_array *tr)
ftrace_clear_pids(tr);
ftrace_destroy_function_files(tr);
tracefs_remove(tr->dir);
+   free_percpu(tr->last_func_repeats);
free_trace_buffers(tr);
 
for (i = 0; i < tr->nr_topts; i++) {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 6a5b4c2a0fa7..a4f1b66049fd 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -262,6 +262,17 @@ struct cond_snapshot {
cond_update_fn_t	update;
 };
 
+/*
+ * struct trace_func_repeats - used to keep track of the consecutive
+ * (on the same CPU) calls of a single function.
+ */
+struct trace_func_repeats {
+   unsigned long   ip;
+   unsigned long   parent_ip;
+   unsigned long   count;
+   u64 ts_last_call;
+};
+
 /*
  * The trace array - an array of per-CPU trace arrays. This is the
  * highest level data structure that individual tracers deal with.
@@ -358,6 +369,7 @@ struct trace_array {
 #ifdef CONFIG_TRACER_SNAPSHOT
struct cond_snapshot*cond_snapshot;
 #endif
+   struct trace_func_repeats   __percpu *last_func_repeats;
 };
 
 enum {
-- 
2.25.1



[PATCH v4 4/6] tracing: Add method for recording "func_repeats" events

2021-04-15 Thread Yordan Karadzhov (VMware)
This patch only provides the implementation of the method.
Later we will use it in combination with a new option for
function tracing.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.c | 34 ++
 kernel/trace/trace.h |  4 
 2 files changed, 38 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 82833be07c1e..66a4ad93b5e9 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3117,6 +3117,40 @@ static void ftrace_trace_userstack(struct trace_array 
*tr,
 
 #endif /* CONFIG_STACKTRACE */
 
+static inline void
+func_repeats_set_delta_ts(struct func_repeats_entry *entry,
+ unsigned long long delta)
+{
+   entry->bottom_delta_ts = delta & U32_MAX;
+   entry->top_delta_ts = (delta >> 32);
+}
+
+void trace_last_func_repeats(struct trace_array *tr,
+struct trace_func_repeats *last_info,
+unsigned int trace_ctx)
+{
+   struct trace_buffer *buffer = tr->array_buffer.buffer;
+   struct func_repeats_entry *entry;
+   struct ring_buffer_event *event;
+   u64 delta;
+
+   event = __trace_buffer_lock_reserve(buffer, TRACE_FUNC_REPEATS,
+   sizeof(*entry), trace_ctx);
+   if (!event)
+   return;
+
+   delta = ring_buffer_event_time_stamp(buffer, event) -
+   last_info->ts_last_call;
+
+   entry = ring_buffer_event_data(event);
+   entry->ip = last_info->ip;
+   entry->parent_ip = last_info->parent_ip;
+   entry->count = last_info->count;
+   func_repeats_set_delta_ts(entry, delta);
+
+   __buffer_unlock_commit(buffer, event);
+}
+
 /* created for use with alloc_percpu */
 struct trace_buffer_struct {
int nesting;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index a4f1b66049fd..cd80d046c7a5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -695,6 +695,10 @@ static inline void __trace_stack(struct trace_array *tr, 
unsigned int trace_ctx,
 }
 #endif /* CONFIG_STACKTRACE */
 
+void trace_last_func_repeats(struct trace_array *tr,
+struct trace_func_repeats *last_info,
+unsigned int trace_ctx);
+
 extern u64 ftrace_now(int cpu);
 
 extern void trace_find_cmdline(int pid, char comm[]);
-- 
2.25.1



[PATCH v4 1/6] tracing: Define static void trace_print_time()

2021-04-15 Thread Yordan Karadzhov (VMware)
The part of the code that prints the time of the trace record in
"int trace_print_context()" gets extracted into a static function. This
is done as preparation for a following patch, in which we will define
a new ftrace event called "func_repeats". The new static method,
defined here, will be used by this new event to print the time of the
last repeat of a function that is called a number of times consecutively.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace_output.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a0146e1fffdf..333233d45596 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -587,13 +587,26 @@ lat_print_timestamp(struct trace_iterator *iter, u64 
next_ts)
return !trace_seq_has_overflowed(s);
 }
 
+static void trace_print_time(struct trace_seq *s, struct trace_iterator *iter,
+unsigned long long ts)
+{
+   unsigned long secs, usec_rem;
+   unsigned long long t;
+
+   if (iter->iter_flags & TRACE_FILE_TIME_IN_NS) {
+   t = ns2usecs(ts);
+   usec_rem = do_div(t, USEC_PER_SEC);
+   secs = (unsigned long)t;
+   trace_seq_printf(s, " %5lu.%06lu", secs, usec_rem);
+   } else
+   trace_seq_printf(s, " %12llu", ts);
+}
+
 int trace_print_context(struct trace_iterator *iter)
 {
struct trace_array *tr = iter->tr;
struct trace_seq *s = &iter->seq;
struct trace_entry *entry = iter->ent;
-   unsigned long long t;
-   unsigned long secs, usec_rem;
char comm[TASK_COMM_LEN];
 
trace_find_cmdline(entry->pid, comm);
@@ -614,13 +627,8 @@ int trace_print_context(struct trace_iterator *iter)
if (tr->trace_flags & TRACE_ITER_IRQ_INFO)
trace_print_lat_fmt(s, entry);
 
-   if (iter->iter_flags & TRACE_FILE_TIME_IN_NS) {
-   t = ns2usecs(iter->ts);
-   usec_rem = do_div(t, USEC_PER_SEC);
-   secs = (unsigned long)t;
-   trace_seq_printf(s, " %5lu.%06lu: ", secs, usec_rem);
-   } else
-   trace_seq_printf(s, " %12llu: ", iter->ts);
+   trace_print_time(s, iter, iter->ts);
+   trace_seq_puts(s, ": ");
 
return !trace_seq_has_overflowed(s);
 }
-- 
2.25.1
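
For reference, a small user-space sketch of the formatting performed by the
trace_print_time() helper above in the TRACE_FILE_TIME_IN_NS case (plain C
division standing in for ns2usecs()/do_div()):

/* Print a nanosecond timestamp as "seconds.microseconds", the layout
 * produced by trace_print_time() for nanosecond trace clocks. */
#include <stdint.h>
#include <stdio.h>

static void print_time_ns(uint64_t ts_ns)
{
	uint64_t usecs = ts_ns / 1000;			/* ns2usecs() */
	unsigned long secs = (unsigned long)(usecs / 1000000);
	unsigned long usec_rem = (unsigned long)(usecs % 1000000);

	printf(" %5lu.%06lu", secs, usec_rem);
}

int main(void)
{
	print_time_ns(100026743210ULL);		/* prints "   100.026743" */
	putchar('\n');
	return 0;
}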



[PATCH v4 5/6] tracing: Unify the logic for function tracing options

2021-04-15 Thread Yordan Karadzhov (VMware)
Currently the logic for dealing with the options for function tracing
has two different implementations. One is used when we set the flags
(in "static int func_set_flag()") and another is used when we initialize
the tracer (in "static int function_trace_init()"). Those two
implementations are meant to do essentially the same thing, and neither
of them is very convenient for adding new options. In this patch we add
a helper function that provides a single implementation of the logic for
dealing with the options, and we make it such that new options can be
easily added.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace_functions.c | 65 --
 1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index f93723ca66bc..f37f73a9b1b8 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -31,9 +31,12 @@ static struct tracer_flags func_flags;
 
 /* Our option */
 enum {
+   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
TRACE_FUNC_OPT_STACK= 0x1,
 };
 
+#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_STACK)
+
 int ftrace_allocate_ftrace_ops(struct trace_array *tr)
 {
struct ftrace_ops *ops;
@@ -86,6 +89,18 @@ void ftrace_destroy_function_files(struct trace_array *tr)
ftrace_free_ftrace_ops(tr);
 }
 
+static ftrace_func_t select_trace_function(u32 flags_val)
+{
+   switch (flags_val & TRACE_FUNC_OPT_MASK) {
+   case TRACE_FUNC_NO_OPTS:
+   return function_trace_call;
+   case TRACE_FUNC_OPT_STACK:
+   return function_stack_trace_call;
+   default:
+   return NULL;
+   }
+}
+
 static int function_trace_init(struct trace_array *tr)
 {
ftrace_func_t func;
@@ -97,12 +112,9 @@ static int function_trace_init(struct trace_array *tr)
if (!tr->ops)
return -ENOMEM;
 
-   /* Currently only the global instance can do stack tracing */
-   if (tr->flags & TRACE_ARRAY_FL_GLOBAL &&
-   func_flags.val & TRACE_FUNC_OPT_STACK)
-   func = function_stack_trace_call;
-   else
-   func = function_trace_call;
+   func = select_trace_function(func_flags.val);
+   if (!func)
+   return -EINVAL;
 
ftrace_init_array_ops(tr, func);
 
@@ -213,7 +225,7 @@ static struct tracer_opt func_opts[] = {
 };
 
 static struct tracer_flags func_flags = {
-   .val = 0, /* By default: all flags disabled */
+   .val = TRACE_FUNC_NO_OPTS, /* By default: all flags disabled */
.opts = func_opts
 };
 
@@ -235,30 +247,29 @@ static struct tracer function_trace;
 static int
 func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
 {
-   switch (bit) {
-   case TRACE_FUNC_OPT_STACK:
-   /* do nothing if already set */
-   if (!!set == !!(func_flags.val & TRACE_FUNC_OPT_STACK))
-   break;
-
-   /* We can change this flag when not running. */
-   if (tr->current_trace != &function_trace)
-   break;
+   ftrace_func_t func;
+   u32 new_flags;
 
-   unregister_ftrace_function(tr->ops);
+   /* Do nothing if already set. */
+   if (!!set == !!(func_flags.val & bit))
+   return 0;
 
-   if (set) {
-   tr->ops->func = function_stack_trace_call;
-   register_ftrace_function(tr->ops);
-   } else {
-   tr->ops->func = function_trace_call;
-   register_ftrace_function(tr->ops);
-   }
+   /* We can change this flag only when not running. */
+   if (tr->current_trace != &function_trace)
+   return 0;
 
-   break;
-   default:
+   new_flags = (func_flags.val & ~bit) | (set ? bit : 0);
+   func = select_trace_function(new_flags);
+   if (!func)
return -EINVAL;
-   }
+
+   /* Check if there's anything to change. */
+   if (tr->ops->func == func)
+   return 0;
+
+   unregister_ftrace_function(tr->ops);
+   tr->ops->func = func;
+   register_ftrace_function(tr->ops);
 
return 0;
 }
-- 
2.25.1



[PATCH v4 2/6] tracing: Define new ftrace event "func_repeats"

2021-04-15 Thread Yordan Karadzhov (VMware)
The event aims to consolidate the function tracing record in the cases
when a single function is called a number of times consecutively.

while (cond)
do_func();

This may happen in various scenarios (busy waiting, for example).
The new ftrace event can be used to show repeated function events with
a single event and save space on the ring buffer.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.h |  3 +++
 kernel/trace/trace_entries.h | 22 +
 kernel/trace/trace_output.c  | 48 
 3 files changed, 73 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 5506424eae2a..6a5b4c2a0fa7 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -45,6 +45,7 @@ enum trace_type {
TRACE_BPUTS,
TRACE_HWLAT,
TRACE_RAW_DATA,
+   TRACE_FUNC_REPEATS,
 
__TRACE_LAST_TYPE,
 };
@@ -442,6 +443,8 @@ extern void __ftrace_bad_type(void);
  TRACE_GRAPH_ENT); \
IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry,  \
  TRACE_GRAPH_RET); \
+   IF_ASSIGN(var, ent, struct func_repeats_entry,  \
+ TRACE_FUNC_REPEATS);  \
__ftrace_bad_type();\
} while (0)
 
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 4547ac59da61..251c819cf0c5 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -338,3 +338,25 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
 __entry->nmi_total_ts,
 __entry->nmi_count)
 );
+
+#define FUNC_REPEATS_GET_DELTA_TS(entry)   \
+   (((u64)(entry)->top_delta_ts << 32) | (entry)->bottom_delta_ts) \
+
+FTRACE_ENTRY(func_repeats, func_repeats_entry,
+
+   TRACE_FUNC_REPEATS,
+
+   F_STRUCT(
+   __field(unsigned long,  ip  )
+   __field(unsigned long,  parent_ip   )
+   __field(u16 ,   count   )
+   __field(u16 ,   top_delta_ts)
+   __field(u32 ,   bottom_delta_ts )
+   ),
+
+   F_printk(" %ps <-%ps\t(repeats:%u  delta: -%llu)",
+(void *)__entry->ip,
+(void *)__entry->parent_ip,
+__entry->count,
+FUNC_REPEATS_GET_DELTA_TS(__entry))
+);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 333233d45596..3037f0c88f90 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1381,6 +1381,53 @@ static struct trace_event trace_raw_data_event = {
.funcs  = &trace_raw_data_funcs,
 };
 
+static enum print_line_t
+trace_func_repeats_raw(struct trace_iterator *iter, int flags,
+struct trace_event *event)
+{
+   struct func_repeats_entry *field;
+   struct trace_seq *s = &iter->seq;
+
+   trace_assign_type(field, iter->ent);
+
+   trace_seq_printf(s, "%lu %lu %u %llu\n",
+field->ip,
+field->parent_ip,
+field->count,
+FUNC_REPEATS_GET_DELTA_TS(field));
+
+   return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_func_repeats_print(struct trace_iterator *iter, int flags,
+struct trace_event *event)
+{
+   struct func_repeats_entry *field;
+   struct trace_seq *s = &iter->seq;
+
+   trace_assign_type(field, iter->ent);
+
+   seq_print_ip_sym(s, field->ip, flags);
+   trace_seq_puts(s, " <-");
+   seq_print_ip_sym(s, field->parent_ip, flags);
+   trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
+   trace_print_time(s, iter,
+iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
+   trace_seq_puts(s, ")\n");
+
+   return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_func_repeats_funcs = {
+   .trace  = trace_func_repeats_print,
+   .raw= trace_func_repeats_raw,
+};
+
+static struct trace_event trace_func_repeats_event = {
+   .type   = TRACE_FUNC_REPEATS,
+   .funcs  = &trace_func_repeats_funcs,
+};
 
 static struct trace_event *events[] __initdata = {
&trace_fn_event,
@@ -1393,6 +1440,7 @@ static struct trace_event *events[] __initdata = {
&trace_print_event,
&trace_hwlat_event,
&trace_raw_data_event,
+   &trace_func_repeats_event,
NULL
 };
 
-- 
2.25.1



[PATCH v3 2/5] tracing: Add "last_func_repeats" to struct trace_array

2021-04-09 Thread Yordan Karadzhov (VMware)
The field is used to keep track of the consecutive (on the same CPU) calls
of a single function. This information is needed in order to consolidate
the function tracing record in the cases when a single function is called
a number of times consecutively.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.c |  1 +
 kernel/trace/trace.h | 18 ++
 2 files changed, 19 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 507a30bf26e4..82833be07c1e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9104,6 +9104,7 @@ static int __remove_instance(struct trace_array *tr)
ftrace_clear_pids(tr);
ftrace_destroy_function_files(tr);
tracefs_remove(tr->dir);
+   free_percpu(tr->last_func_repeats);
free_trace_buffers(tr);
 
for (i = 0; i < tr->nr_topts; i++) {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 6a5b4c2a0fa7..1cd4da7ba769 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -262,6 +262,17 @@ struct cond_snapshot {
cond_update_fn_t	update;
 };
 
+/*
+ * struct trace_func_repeats - used to keep track of the consecutive
+ * (on the same CPU) calls of a single function.
+ */
+struct trace_func_repeats {
+   unsigned long   ip;
+   unsigned long   parent_ip;
+   unsigned long   count;
+   u64 ts_last_call;
+};
+
 /*
  * The trace array - an array of per-CPU trace arrays. This is the
  * highest level data structure that individual tracers deal with.
@@ -358,8 +369,15 @@ struct trace_array {
 #ifdef CONFIG_TRACER_SNAPSHOT
struct cond_snapshot*cond_snapshot;
 #endif
+   struct trace_func_repeats   __percpu *last_func_repeats;
 };
 
+static inline struct trace_func_repeats __percpu *
+tracer_alloc_func_repeats(struct trace_array *tr)
+{
+   return tr->last_func_repeats = alloc_percpu(struct trace_func_repeats);
+}
+
 enum {
TRACE_ARRAY_FL_GLOBAL   = (1 << 0)
 };
-- 
2.25.1



[PATCH v3 3/5] tracing: Add method for recording "func_repeats" events

2021-04-09 Thread Yordan Karadzhov (VMware)
This patch only provides the implementation of the method.
Later we will use it in combination with a new option for
function tracing.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.c | 26 ++
 kernel/trace/trace.h |  4 
 kernel/trace/trace_entries.h |  6 ++
 3 files changed, 36 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 82833be07c1e..bbc57cf3bda4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3117,6 +3117,32 @@ static void ftrace_trace_userstack(struct trace_array 
*tr,
 
 #endif /* CONFIG_STACKTRACE */
 
+void trace_last_func_repeats(struct trace_array *tr,
+struct trace_func_repeats *last_info,
+unsigned int trace_ctx)
+{
+   struct trace_buffer *buffer = tr->array_buffer.buffer;
+   struct func_repeats_entry *entry;
+   struct ring_buffer_event *event;
+   u64 delta;
+
+   event = __trace_buffer_lock_reserve(buffer, TRACE_FUNC_REPEATS,
+   sizeof(*entry), trace_ctx);
+   if (!event)
+   return;
+
+   delta = ring_buffer_event_time_stamp(buffer, event) -
+   last_info->ts_last_call;
+
+   entry = ring_buffer_event_data(event);
+   entry->ip = last_info->ip;
+   entry->parent_ip = last_info->parent_ip;
+   entry->count = last_info->count;
+   FUNC_REPEATS_SET_DELTA_TS(entry, delta)
+
+   __buffer_unlock_commit(buffer, event);
+}
+
 /* created for use with alloc_percpu */
 struct trace_buffer_struct {
int nesting;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 1cd4da7ba769..e1f34119c036 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -701,6 +701,10 @@ static inline void __trace_stack(struct trace_array *tr, 
unsigned int trace_ctx,
 }
 #endif /* CONFIG_STACKTRACE */
 
+void trace_last_func_repeats(struct trace_array *tr,
+struct trace_func_repeats *last_info,
+unsigned int trace_ctx);
+
 extern u64 ftrace_now(int cpu);
 
 extern void trace_find_cmdline(int pid, char comm[]);
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index fdd022a7aecf..5e9dc56af4b1 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -342,6 +342,12 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
 #define FUNC_REPEATS_GET_DELTA_TS(entry)   \
 (((u64)entry->top_delta_ts << 32) | entry->bottom_delta_ts)\
 
+#define FUNC_REPEATS_SET_DELTA_TS(entry, delta)\
+   do {\
+   entry->bottom_delta_ts = delta & U32_MAX;   \
+   entry->top_delta_ts = (delta >> 32);\
+   } while (0);\
+
 FTRACE_ENTRY(func_repeats, func_repeats_entry,
 
TRACE_FUNC_REPEATS,
-- 
2.25.1



[PATCH v3 1/5] tracing: Define new ftrace event "func_repeats"

2021-04-09 Thread Yordan Karadzhov (VMware)
The event aims to consolidate the function tracing record in the cases
when a single function is called a number of times consecutively.

while (cond)
do_func();

This may happen in various scenarios (busy waiting, for example).
The new ftrace event can be used to show repeated function events with
a single event and save space on the ring buffer.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.h |  3 +++
 kernel/trace/trace_entries.h | 22 +
 kernel/trace/trace_output.c  | 47 
 3 files changed, 72 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 5506424eae2a..6a5b4c2a0fa7 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -45,6 +45,7 @@ enum trace_type {
TRACE_BPUTS,
TRACE_HWLAT,
TRACE_RAW_DATA,
+   TRACE_FUNC_REPEATS,
 
__TRACE_LAST_TYPE,
 };
@@ -442,6 +443,8 @@ extern void __ftrace_bad_type(void);
  TRACE_GRAPH_ENT); \
IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry,  \
  TRACE_GRAPH_RET); \
+   IF_ASSIGN(var, ent, struct func_repeats_entry,  \
+ TRACE_FUNC_REPEATS);  \
__ftrace_bad_type();\
} while (0)
 
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 4547ac59da61..fdd022a7aecf 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -338,3 +338,25 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
 __entry->nmi_total_ts,
 __entry->nmi_count)
 );
+
+#define FUNC_REPEATS_GET_DELTA_TS(entry)   \
+(((u64)entry->top_delta_ts << 32) | entry->bottom_delta_ts)\
+
+FTRACE_ENTRY(func_repeats, func_repeats_entry,
+
+   TRACE_FUNC_REPEATS,
+
+   F_STRUCT(
+   __field(unsigned long,  ip  )
+   __field(unsigned long,  parent_ip   )
+   __field(u16 ,   count   )
+   __field(u16 ,   top_delta_ts)
+   __field(u32 ,   bottom_delta_ts )
+   ),
+
+   F_printk(" %ps <-%ps\t(repeats:%u  delta_ts: -%llu)",
+(void *)__entry->ip,
+(void *)__entry->parent_ip,
+__entry->count,
+FUNC_REPEATS_GET_DELTA_TS(__entry))
+);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a0146e1fffdf..55b08e146afc 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1373,6 +1373,52 @@ static struct trace_event trace_raw_data_event = {
.funcs  = &trace_raw_data_funcs,
 };
 
+static enum print_line_t
+trace_func_repeats_raw(struct trace_iterator *iter, int flags,
+struct trace_event *event)
+{
+   struct func_repeats_entry *field;
+   struct trace_seq *s = &iter->seq;
+
+   trace_assign_type(field, iter->ent);
+
+   trace_seq_printf(s, "%lu %lu %u %llu\n",
+field->ip,
+field->parent_ip,
+field->count,
+FUNC_REPEATS_GET_DELTA_TS(field));
+
+   return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_func_repeats_print(struct trace_iterator *iter, int flags,
+struct trace_event *event)
+{
+   struct func_repeats_entry *field;
+   struct trace_seq *s = &iter->seq;
+
+   trace_assign_type(field, iter->ent);
+
+   seq_print_ip_sym(s, field->ip, flags);
+   trace_seq_puts(s, " <-");
+   seq_print_ip_sym(s, field->parent_ip, flags);
+   trace_seq_printf(s, " (repeats: %u, delta_ts: -%llu)\n",
+field->count,
+FUNC_REPEATS_GET_DELTA_TS(field));
+
+   return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_func_repeats_funcs = {
+   .trace  = trace_func_repeats_print,
+   .raw= trace_func_repeats_raw,
+};
+
+static struct trace_event trace_func_repeats_event = {
+   .type   = TRACE_FUNC_REPEATS,
+   .funcs  = &trace_func_repeats_funcs,
+};
 
 static struct trace_event *events[] __initdata = {
&trace_fn_event,
@@ -1385,6 +1431,7 @@ static struct trace_event *events[] __initdata = {
&trace_print_event,
&trace_hwlat_event,
&trace_raw_data_event,
+   &trace_func_repeats_event,
NULL
 };
 
-- 
2.25.1



[PATCH v3 5/5] tracing: Add "func_no_repeats" option for function tracing

2021-04-09 Thread Yordan Karadzhov (VMware)
If the option is activated, the function tracing records get
consolidated in the cases when a single function is called a number
of times consecutively. Instead of having an identical record for
each call of the function, we will record only the first call,
followed by an event showing the number of repeats.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace_functions.c | 161 -
 1 file changed, 158 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index f37f73a9b1b8..9a3cbdbfd1f7 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -27,15 +27,27 @@ function_trace_call(unsigned long ip, unsigned long 
parent_ip,
 static void
 function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static void
+function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static void
+function_stack_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+struct ftrace_ops *op,
+struct ftrace_regs *fregs);
 static struct tracer_flags func_flags;
 
 /* Our option */
 enum {
-   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
-   TRACE_FUNC_OPT_STACK= 0x1,
+
+   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
+   TRACE_FUNC_OPT_STACK= 0x1,
+   TRACE_FUNC_OPT_NO_REPEATS   = 0x2,
+
+   /* Update this to next highest bit. */
+   TRACE_FUNC_OPT_HIGHEST_BIT  = 0x4
 };
 
-#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_STACK)
+#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_HIGHEST_BIT - 1)
 
 int ftrace_allocate_ftrace_ops(struct trace_array *tr)
 {
@@ -96,11 +108,26 @@ static ftrace_func_t select_trace_function(u32 flags_val)
return function_trace_call;
case TRACE_FUNC_OPT_STACK:
return function_stack_trace_call;
+   case TRACE_FUNC_OPT_NO_REPEATS:
+   return function_no_repeats_trace_call;
+   case TRACE_FUNC_OPT_STACK | TRACE_FUNC_OPT_NO_REPEATS:
+   return function_stack_no_repeats_trace_call;
default:
return NULL;
}
 }
 
+static bool handle_func_repeats(struct trace_array *tr, u32 flags_val)
+{
+   if (!tr->last_func_repeats &&
+   (flags_val & TRACE_FUNC_OPT_NO_REPEATS)) {
+   if (!tracer_alloc_func_repeats(tr))
+   return false;
+   }
+
+   return true;
+}
+
 static int function_trace_init(struct trace_array *tr)
 {
ftrace_func_t func;
@@ -116,6 +143,9 @@ static int function_trace_init(struct trace_array *tr)
if (!func)
return -EINVAL;
 
+   if (!handle_func_repeats(tr, func_flags.val))
+   return -ENOMEM;
+
ftrace_init_array_ops(tr, func);
 
tr->array_buffer.cpu = raw_smp_processor_id();
@@ -217,10 +247,132 @@ function_stack_trace_call(unsigned long ip, unsigned 
long parent_ip,
local_irq_restore(flags);
 }
 
+static inline bool is_repeat_check(struct trace_array *tr,
+  struct trace_func_repeats *last_info,
+  unsigned long ip, unsigned long parent_ip)
+{
+   if (last_info->ip == ip &&
+   last_info->parent_ip == parent_ip &&
+   last_info->count < U16_MAX) {
+   last_info->ts_last_call =
+   ring_buffer_time_stamp(tr->array_buffer.buffer);
+   last_info->count++;
+   return true;
+   }
+
+   return false;
+}
+
+static inline void process_repeats(struct trace_array *tr,
+  unsigned long ip, unsigned long parent_ip,
+  struct trace_func_repeats *last_info,
+  unsigned int trace_ctx)
+{
+   if (last_info->count) {
+   trace_last_func_repeats(tr, last_info, trace_ctx);
+   last_info->count = 0;
+   }
+
+   last_info->ip = ip;
+   last_info->parent_ip = parent_ip;
+}
+
+static void
+function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op,
+  struct ftrace_regs *fregs)
+{
+   struct trace_func_repeats *last_info;
+   struct trace_array *tr = op->private;
+   struct trace_array_cpu *data;
+   unsigned int trace_ctx;
+   unsigned long flags;
+   int bit;
+   int cpu;
+
+   if (unlikely(!tr->function_enabled))
+   return;
+
+   bit = ftrace_test_recursion_trylock(ip, parent_ip);
+   if (bit < 0)
+   return;
+
+   preempt_disable_

[PATCH v3 4/5] tracing: Unify the logic for function tracing options

2021-04-09 Thread Yordan Karadzhov (VMware)
Currently the logic for dealing with the options for function tracing
has two different implementations. One is used when we set the flags
(in "static int func_set_flag()") and another is used when we initialize
the tracer (in "static int function_trace_init()"). Those two
implementations are meant to do essentially the same thing, and neither
of them is very convenient for adding new options. In this patch we add
a helper function that provides a single implementation of the logic for
dealing with the options, and we make it such that new options can be
easily added.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace_functions.c | 65 --
 1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index f93723ca66bc..f37f73a9b1b8 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -31,9 +31,12 @@ static struct tracer_flags func_flags;
 
 /* Our option */
 enum {
+   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
TRACE_FUNC_OPT_STACK= 0x1,
 };
 
+#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_STACK)
+
 int ftrace_allocate_ftrace_ops(struct trace_array *tr)
 {
struct ftrace_ops *ops;
@@ -86,6 +89,18 @@ void ftrace_destroy_function_files(struct trace_array *tr)
ftrace_free_ftrace_ops(tr);
 }
 
+static ftrace_func_t select_trace_function(u32 flags_val)
+{
+   switch (flags_val & TRACE_FUNC_OPT_MASK) {
+   case TRACE_FUNC_NO_OPTS:
+   return function_trace_call;
+   case TRACE_FUNC_OPT_STACK:
+   return function_stack_trace_call;
+   default:
+   return NULL;
+   }
+}
+
 static int function_trace_init(struct trace_array *tr)
 {
ftrace_func_t func;
@@ -97,12 +112,9 @@ static int function_trace_init(struct trace_array *tr)
if (!tr->ops)
return -ENOMEM;
 
-   /* Currently only the global instance can do stack tracing */
-   if (tr->flags & TRACE_ARRAY_FL_GLOBAL &&
-   func_flags.val & TRACE_FUNC_OPT_STACK)
-   func = function_stack_trace_call;
-   else
-   func = function_trace_call;
+   func = select_trace_function(func_flags.val);
+   if (!func)
+   return -EINVAL;
 
ftrace_init_array_ops(tr, func);
 
@@ -213,7 +225,7 @@ static struct tracer_opt func_opts[] = {
 };
 
 static struct tracer_flags func_flags = {
-   .val = 0, /* By default: all flags disabled */
+   .val = TRACE_FUNC_NO_OPTS, /* By default: all flags disabled */
.opts = func_opts
 };
 
@@ -235,30 +247,29 @@ static struct tracer function_trace;
 static int
 func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
 {
-   switch (bit) {
-   case TRACE_FUNC_OPT_STACK:
-   /* do nothing if already set */
-   if (!!set == !!(func_flags.val & TRACE_FUNC_OPT_STACK))
-   break;
-
-   /* We can change this flag when not running. */
-   if (tr->current_trace != &function_trace)
-   break;
+   ftrace_func_t func;
+   u32 new_flags;
 
-   unregister_ftrace_function(tr->ops);
+   /* Do nothing if already set. */
+   if (!!set == !!(func_flags.val & bit))
+   return 0;
 
-   if (set) {
-   tr->ops->func = function_stack_trace_call;
-   register_ftrace_function(tr->ops);
-   } else {
-   tr->ops->func = function_trace_call;
-   register_ftrace_function(tr->ops);
-   }
+   /* We can change this flag only when not running. */
+   if (tr->current_trace != &function_trace)
+   return 0;
 
-   break;
-   default:
+   new_flags = (func_flags.val & ~bit) | (set ? bit : 0);
+   func = select_trace_function(new_flags);
+   if (!func)
return -EINVAL;
-   }
+
+   /* Check if there's anything to change. */
+   if (tr->ops->func == func)
+   return 0;
+
+   unregister_ftrace_function(tr->ops);
+   tr->ops->func = func;
+   register_ftrace_function(tr->ops);
 
return 0;
 }
-- 
2.25.1



[PATCH v3 0/5] Add "func_no_repeats" tracing option

2021-04-09 Thread Yordan Karadzhov (VMware)
The new option for function tracing aims to save space on the ring
buffer and to make it more readable in the case when a single function
is called a number of times consecutively:

while (cond)
do_func();

Instead of having an identical record for each call of the function,
we will record only the first call, followed by an event showing the
number of repeats.

v3 changes:
  * FUNC_REPEATS_SET_DELTA_TS macro has been optimised.

  * Fixed bug in func_set_flag(): In the previous version the value
of the "new_flags" variable was not calculated properly because
I misinterpreted the meaning of the "bit" argument of the function.
This bug had no real effect in v2, because for the moment we have
only two "function options", so the value of "new_flags" happened
to be correct even though the way it was calculated was wrong.

v2 changes:
  * As suggested by Steven in his review, we now record not only the
repetition count, but also the time elapsed between the last
repetition of the function and the actual generation of the
"func_repeats" event. 16 bits are used to record the repetition
count. In the case of an overflow of the counter, a second pair of
"function" and "func_repeats" events will be generated. The time
interval gets encoded using up to 48 (32 + 16) bits.


Yordan Karadzhov (VMware) (5):
  tracing: Define new ftrace event "func_repeats"
  tracing: Add "last_func_repeats" to struct trace_array
  tracing: Add method for recording "func_repeats" events
  tracing: Unify the logic for function tracing options
  tracing: Add "func_no_repeats" option for function tracing

 kernel/trace/trace.c   |  27 
 kernel/trace/trace.h   |  25 
 kernel/trace/trace_entries.h   |  28 +
 kernel/trace/trace_functions.c | 222 -
 kernel/trace/trace_output.c|  47 +++
 5 files changed, 321 insertions(+), 28 deletions(-)

-- 
2.25.1



Re: [PATCH v2 4/5] tracing: Unify the logic for function tracing options

2021-04-07 Thread Yordan Karadzhov (VMware)

Hi Steven,

On 6.04.21, at 1:15, Steven Rostedt wrote:
  
@@ -235,30 +248,31 @@ static struct tracer function_trace;

  static int
  func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
  {
-   switch (bit) {
-   case TRACE_FUNC_OPT_STACK:
-   /* do nothing if already set */
-   if (!!set == !!(func_flags.val & TRACE_FUNC_OPT_STACK))
-   break;
+   ftrace_func_t func;
+   u32 new_flags_val;

Nit, but the variable should just be "new_flags", which is consistent with
old_flags. In the kernel we don't need the variable names to be so
verbose.

  
-		/* We can change this flag when not running. */

-   if (tr->current_trace != &function_trace)
-   break;
+   /* Do nothing if already set. */
+   if (!!set == !!(func_flags.val & bit))
+   return 0;
  
-		unregister_ftrace_function(tr->ops);

+   /* We can change this flag only when not running. */
+   if (tr->current_trace != &function_trace)
+   return 0;
  
-		if (set) {

-   tr->ops->func = function_stack_trace_call;
-   register_ftrace_function(tr->ops);
-   } else {
-   tr->ops->func = function_trace_call;
-   register_ftrace_function(tr->ops);
-   }
+   new_flags_val = (func_flags.val & ~(1UL << (bit - 1)));
+   new_flags_val |= (set << (bit - 1));

bit is already the mask, no need to shift it, nor is there a reason for the
extra set of parentheses. And the above can be done in one line.

new_flags = (func_flags.val & ~bit) | (set ? bit : 0);



OK, I totally misinterpreted the meaning of the "bit" argument of the
function. I did not realize it is a mask. I was thinking the argument
gives only the number of the bit that changes (like 5 for the 5th bit
inside the mask).


Thanks!
Yordan
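
For illustration, a small sketch of the two interpretations discussed in this
exchange. The "bit" argument is already a mask, so shifting by (bit - 1) only
happens to give the right result while there are just two options; the option
values below are the ones used by this series, plus a hypothetical third one:

/* Contrast "bit is a mask" (correct) with "bit is a bit index" (the v2
 * mistake): the two coincide for 0x1 and 0x2, but diverge from 0x4 on. */
#include <stdio.h>

#define OPT_STACK	0x1
#define OPT_NO_REPEATS	0x2
#define OPT_THIRD	0x4	/* hypothetical future option */

static unsigned int with_mask(unsigned int flags, unsigned int bit, int set)
{
	return (flags & ~bit) | (set ? bit : 0);	/* Steven's one-liner */
}

static unsigned int with_index(unsigned int flags, unsigned int bit, int set)
{
	/* Wrong: treats the mask value as a bit number. */
	return (flags & ~(1UL << (bit - 1))) | ((unsigned int)set << (bit - 1));
}

int main(void)
{
	unsigned int flags = OPT_STACK;

	/* Setting OPT_NO_REPEATS: both give 0x3, purely by coincidence. */
	printf("mask: %#x  index: %#x\n",
	       with_mask(flags, OPT_NO_REPEATS, 1),
	       with_index(flags, OPT_NO_REPEATS, 1));

	/* Setting a third option: 0x5 vs 0x9, the index version is wrong. */
	printf("mask: %#x  index: %#x\n",
	       with_mask(flags, OPT_THIRD, 1),
	       with_index(flags, OPT_THIRD, 1));
	return 0;
}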




[PATCH v2 5/5] tracing: Add "func_no_repeats" option for function tracing

2021-03-29 Thread Yordan Karadzhov (VMware)
If the option is activated, the function tracing records get
consolidated in the cases when a single function is called a number
of times consecutively. Instead of having an identical record for
each call of the function, we will record only the first call,
followed by an event showing the number of repeats.

Signed-off-by: Yordan Karadzhov (VMware) 

fix last
---
 kernel/trace/trace_functions.c | 161 -
 1 file changed, 158 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 6c912eb0508a..72d2e07dc103 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -27,16 +27,28 @@ function_trace_call(unsigned long ip, unsigned long 
parent_ip,
 static void
 function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static void
+function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static void
+function_stack_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+struct ftrace_ops *op,
+struct ftrace_regs *fregs);
 static ftrace_func_t select_trace_function(u32 flags_val);
 static struct tracer_flags func_flags;
 
 /* Our option */
 enum {
-   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
-   TRACE_FUNC_OPT_STACK= 0x1,
+
+   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
+   TRACE_FUNC_OPT_STACK= 0x1,
+   TRACE_FUNC_OPT_NO_REPEATS   = 0x2,
+
+   /* Update this to next highest bit. */
+   TRACE_FUNC_OPT_HIGHEST_BIT  = 0x4
 };
 
-#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_STACK)
+#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_HIGHEST_BIT - 1)
 
 int ftrace_allocate_ftrace_ops(struct trace_array *tr)
 {
@@ -90,6 +102,17 @@ void ftrace_destroy_function_files(struct trace_array *tr)
ftrace_free_ftrace_ops(tr);
 }
 
+static bool handle_func_repeats(struct trace_array *tr, u32 flags_val)
+{
+   if (!tr->last_func_repeats &&
+   (flags_val & TRACE_FUNC_OPT_NO_REPEATS)) {
+   if (!tracer_alloc_func_repeats(tr))
+   return false;
+   }
+
+   return true;
+}
+
 static int function_trace_init(struct trace_array *tr)
 {
ftrace_func_t func;
@@ -105,6 +128,9 @@ static int function_trace_init(struct trace_array *tr)
if (!func)
return -EINVAL;
 
+   if (!handle_func_repeats(tr, func_flags.val))
+   return -ENOMEM;
+
ftrace_init_array_ops(tr, func);
 
tr->array_buffer.cpu = raw_smp_processor_id();
@@ -206,6 +232,127 @@ function_stack_trace_call(unsigned long ip, unsigned long 
parent_ip,
local_irq_restore(flags);
 }
 
+static inline bool is_repeat_check(struct trace_array *tr,
+  struct trace_func_repeats *last_info,
+  unsigned long ip, unsigned long parent_ip)
+{
+   if (last_info->ip == ip &&
+   last_info->parent_ip == parent_ip &&
+   last_info->count < U16_MAX) {
+   last_info->ts_last_call =
+   ring_buffer_time_stamp(tr->array_buffer.buffer);
+   last_info->count++;
+   return true;
+   }
+
+   return false;
+}
+
+static inline void process_repeats(struct trace_array *tr,
+  unsigned long ip, unsigned long parent_ip,
+  struct trace_func_repeats *last_info,
+  unsigned int trace_ctx)
+{
+   if (last_info->count) {
+   trace_last_func_repeats(tr, last_info, trace_ctx);
+   last_info->count = 0;
+   }
+
+   last_info->ip = ip;
+   last_info->parent_ip = parent_ip;
+}
+
+static void
+function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op,
+  struct ftrace_regs *fregs)
+{
+   struct trace_func_repeats *last_info;
+   struct trace_array *tr = op->private;
+   struct trace_array_cpu *data;
+   unsigned int trace_ctx;
+   unsigned long flags;
+   int bit;
+   int cpu;
+
+   if (unlikely(!tr->function_enabled))
+   return;
+
+   bit = ftrace_test_recursion_trylock(ip, parent_ip);
+   if (bit < 0)
+   return;
+
+   preempt_disable_notrace();
+
+   cpu = smp_processor_id();
+   data = per_cpu_ptr(tr->array_buffer.data, cpu);
+   if (atomic_read(&data->disabled))
+   goto out;
+
+   /*
+* An interrupt may happen at any place here. But as far as I can see,
+* the only damage that this can

[PATCH v2 4/5] tracing: Unify the logic for function tracing options

2021-03-29 Thread Yordan Karadzhov (VMware)
Currently the logic for dealing with the options for function tracing
has two different implementations. One is used when we set the flags
(in "static int func_set_flag()") and another is used when we initialize
the tracer (in "static int function_trace_init()"). Those two
implementations are meant to do essentially the same thing, and neither
of them is very convenient for adding new options. In this patch we add
a helper function that provides a single implementation of the logic for
dealing with the options, and we make it such that new options can be
easily added.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace_functions.c | 66 --
 1 file changed, 40 insertions(+), 26 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index f93723ca66bc..6c912eb0508a 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -27,13 +27,17 @@ function_trace_call(unsigned long ip, unsigned long 
parent_ip,
 static void
 function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static ftrace_func_t select_trace_function(u32 flags_val);
 static struct tracer_flags func_flags;
 
 /* Our option */
 enum {
+   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
TRACE_FUNC_OPT_STACK= 0x1,
 };
 
+#define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_STACK)
+
 int ftrace_allocate_ftrace_ops(struct trace_array *tr)
 {
struct ftrace_ops *ops;
@@ -97,12 +101,9 @@ static int function_trace_init(struct trace_array *tr)
if (!tr->ops)
return -ENOMEM;
 
-   /* Currently only the global instance can do stack tracing */
-   if (tr->flags & TRACE_ARRAY_FL_GLOBAL &&
-   func_flags.val & TRACE_FUNC_OPT_STACK)
-   func = function_stack_trace_call;
-   else
-   func = function_trace_call;
+   func = select_trace_function(func_flags.val);
+   if (!func)
+   return -EINVAL;
 
ftrace_init_array_ops(tr, func);
 
@@ -205,6 +206,18 @@ function_stack_trace_call(unsigned long ip, unsigned long 
parent_ip,
local_irq_restore(flags);
 }
 
+static ftrace_func_t select_trace_function(u32 flags_val)
+{
+   switch (flags_val & TRACE_FUNC_OPT_MASK) {
+   case TRACE_FUNC_NO_OPTS:
+   return function_trace_call;
+   case TRACE_FUNC_OPT_STACK:
+   return function_stack_trace_call;
+   default:
+   return NULL;
+   }
+}
+
 static struct tracer_opt func_opts[] = {
 #ifdef CONFIG_STACKTRACE
{ TRACER_OPT(func_stack_trace, TRACE_FUNC_OPT_STACK) },
@@ -213,7 +226,7 @@ static struct tracer_opt func_opts[] = {
 };
 
 static struct tracer_flags func_flags = {
-   .val = 0, /* By default: all flags disabled */
+   .val = TRACE_FUNC_NO_OPTS, /* By default: all flags disabled */
.opts = func_opts
 };
 
@@ -235,30 +248,31 @@ static struct tracer function_trace;
 static int
 func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
 {
-   switch (bit) {
-   case TRACE_FUNC_OPT_STACK:
-   /* do nothing if already set */
-   if (!!set == !!(func_flags.val & TRACE_FUNC_OPT_STACK))
-   break;
+   ftrace_func_t func;
+   u32 new_flags_val;
 
-   /* We can change this flag when not running. */
-   if (tr->current_trace != &function_trace)
-   break;
+   /* Do nothing if already set. */
+   if (!!set == !!(func_flags.val & bit))
+   return 0;
 
-   unregister_ftrace_function(tr->ops);
+   /* We can change this flag only when not running. */
+   if (tr->current_trace != &function_trace)
+   return 0;
 
-   if (set) {
-   tr->ops->func = function_stack_trace_call;
-   register_ftrace_function(tr->ops);
-   } else {
-   tr->ops->func = function_trace_call;
-   register_ftrace_function(tr->ops);
-   }
+   new_flags_val = (func_flags.val & ~(1UL << (bit - 1)));
+   new_flags_val |= (set << (bit - 1));
 
-   break;
-   default:
+   func = select_trace_function(new_flags_val);
+   if (!func)
return -EINVAL;
-   }
+
+   /* Check if there's anything to change. */
+   if (tr->ops->func == func)
+   return 0;
+
+   unregister_ftrace_function(tr->ops);
+   tr->ops->func = func;
+   register_ftrace_function(tr->ops);
 
return 0;
 }
-- 
2.25.1



[PATCH v2 2/5] tracing: Add "last_func_repeats" to struct trace_array

2021-03-29 Thread Yordan Karadzhov (VMware)
The field is used to keep track of the consecutive (on the same CPU) calls
of a single function. This information is needed in order to consolidate
the function tracing record in the cases when a single function is called
a number of times consecutively.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.c |  1 +
 kernel/trace/trace.h | 18 ++
 2 files changed, 19 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 3c605957bb5c..6fcc159c34a8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9103,6 +9103,7 @@ static int __remove_instance(struct trace_array *tr)
ftrace_clear_pids(tr);
ftrace_destroy_function_files(tr);
tracefs_remove(tr->dir);
+   free_percpu(tr->last_func_repeats);
free_trace_buffers(tr);
 
for (i = 0; i < tr->nr_topts; i++) {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 6a5b4c2a0fa7..1cd4da7ba769 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -262,6 +262,17 @@ struct cond_snapshot {
cond_update_fn_t	update;
 };
 
+/*
+ * struct trace_func_repeats - used to keep track of the consecutive
+ * (on the same CPU) calls of a single function.
+ */
+struct trace_func_repeats {
+   unsigned long   ip;
+   unsigned long   parent_ip;
+   unsigned long   count;
+   u64 ts_last_call;
+};
+
 /*
  * The trace array - an array of per-CPU trace arrays. This is the
  * highest level data structure that individual tracers deal with.
@@ -358,8 +369,15 @@ struct trace_array {
 #ifdef CONFIG_TRACER_SNAPSHOT
struct cond_snapshot*cond_snapshot;
 #endif
+   struct trace_func_repeats   __percpu *last_func_repeats;
 };
 
+static inline struct trace_func_repeats __percpu *
+tracer_alloc_func_repeats(struct trace_array *tr)
+{
+   return tr->last_func_repeats = alloc_percpu(struct trace_func_repeats);
+}
+
 enum {
TRACE_ARRAY_FL_GLOBAL   = (1 << 0)
 };
-- 
2.25.1



[PATCH v2 1/5] tracing: Define new ftrace event "func_repeats"

2021-03-29 Thread Yordan Karadzhov (VMware)
The event aims to consolidate the function tracing record in the cases
when a single function is called a number of times consecutively.

while (cond)
do_func();

This may happen in various scenarios (busy waiting, for example).
The new ftrace event can be used to show repeated function events with
a single event and save space on the ring buffer.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.h |  3 +++
 kernel/trace/trace_entries.h | 39 ++
 kernel/trace/trace_output.c  | 47 
 3 files changed, 89 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 5506424eae2a..6a5b4c2a0fa7 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -45,6 +45,7 @@ enum trace_type {
TRACE_BPUTS,
TRACE_HWLAT,
TRACE_RAW_DATA,
+   TRACE_FUNC_REPEATS,
 
__TRACE_LAST_TYPE,
 };
@@ -442,6 +443,8 @@ extern void __ftrace_bad_type(void);
  TRACE_GRAPH_ENT); \
IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry,  \
  TRACE_GRAPH_RET); \
+   IF_ASSIGN(var, ent, struct func_repeats_entry,  \
+ TRACE_FUNC_REPEATS);  \
__ftrace_bad_type();\
} while (0)
 
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 4547ac59da61..6f98c3b4e4fa 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -338,3 +338,42 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
 __entry->nmi_total_ts,
 __entry->nmi_count)
 );
+
+#define FUNC_REPEATS_GET_DELTA_TS(entry)				\
+(((u64)entry->top_delta_ts << 32) | entry->bottom_delta_ts)		\
+
+#define FUNC_REPEATS_SET_DELTA_TS(entry, delta)				\
+	do {								\
+		if (likely(!((u64)delta >> 32))) {			\
+			entry->bottom_delta_ts = delta;			\
+			entry->top_delta_ts = 0;			\
+		} else {						\
+			if (likely(!((u64)delta >> 48))) {		\
+				entry->bottom_delta_ts = delta & U32_MAX; \
+				entry->top_delta_ts = (delta >> 32);	\
+			} else {					\
+				/* Timestamp overflow. Set to max. */	\
+				entry->bottom_delta_ts = U32_MAX;	\
+				entry->top_delta_ts = U16_MAX;		\
+			}						\
+		}							\
+	} while (0);							\
+
+FTRACE_ENTRY(func_repeats, func_repeats_entry,
+
+   TRACE_FUNC_REPEATS,
+
+   F_STRUCT(
+   __field(unsigned long,  ip  )
+   __field(unsigned long,  parent_ip   )
+   __field(u16 ,   count   )
+   __field(u16 ,   top_delta_ts)
+   __field(u32 ,   bottom_delta_ts )
+   ),
+
+   F_printk(" %ps <-%ps\t(repeats:%u  delta_ts: -%llu)",
+(void *)__entry->ip,
+(void *)__entry->parent_ip,
+__entry->count,
+FUNC_REPEATS_GET_DELTA_TS(__entry))
+);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a0146e1fffdf..55b08e146afc 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1373,6 +1373,52 @@ static struct trace_event trace_raw_data_event = {
.funcs  = &trace_raw_data_funcs,
 };
 
+static enum print_line_t
+trace_func_repeats_raw(struct trace_iterator *iter, int flags,
+struct trace_event *event)
+{
+   struct func_repeats_entry *field;
+   struct trace_seq *s = &iter->seq;
+
+   trace_assign_type(field, iter->ent);
+
+   trace_seq_printf(s, "%lu %lu %u %llu\n",
+field->ip,
+field->parent_ip,
+field->count,
+FUNC_REPEATS_GET_DELTA_TS(field));
+
+   return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_func_repeats_print(struct trace_iterator *iter, int flags,
+   

[PATCH v2 3/5] tracing: Add method for recording "func_repeats" events

2021-03-29 Thread Yordan Karadzhov (VMware)
This patch only provides the implementation of the method.
Later we will use it in combination with a new option for
function tracing.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.c | 26 ++
 kernel/trace/trace.h |  4 
 2 files changed, 30 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6fcc159c34a8..0d3d14112f50 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3116,6 +3116,32 @@ static void ftrace_trace_userstack(struct trace_array 
*tr,
 
 #endif /* CONFIG_STACKTRACE */
 
+void trace_last_func_repeats(struct trace_array *tr,
+struct trace_func_repeats *last_info,
+unsigned int trace_ctx)
+{
+   struct trace_buffer *buffer = tr->array_buffer.buffer;
+   struct func_repeats_entry *entry;
+   struct ring_buffer_event *event;
+   u64 delta;
+
+   event = __trace_buffer_lock_reserve(buffer, TRACE_FUNC_REPEATS,
+   sizeof(*entry), trace_ctx);
+   if (!event)
+   return;
+
+   delta = ring_buffer_event_time_stamp(buffer, event) -
+   last_info->ts_last_call;
+
+   entry = ring_buffer_event_data(event);
+   entry->ip = last_info->ip;
+   entry->parent_ip = last_info->parent_ip;
+   entry->count = last_info->count;
+   FUNC_REPEATS_SET_DELTA_TS(entry, delta)
+
+   __buffer_unlock_commit(buffer, event);
+}
+
 /* created for use with alloc_percpu */
 struct trace_buffer_struct {
int nesting;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 1cd4da7ba769..e1f34119c036 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -701,6 +701,10 @@ static inline void __trace_stack(struct trace_array *tr, 
unsigned int trace_ctx,
 }
 #endif /* CONFIG_STACKTRACE */
 
+void trace_last_func_repeats(struct trace_array *tr,
+struct trace_func_repeats *last_info,
+unsigned int trace_ctx);
+
 extern u64 ftrace_now(int cpu);
 
 extern void trace_find_cmdline(int pid, char comm[]);
-- 
2.25.1



[PATCH v2 0/5] Add "func_no_repeats" tracing option

2021-03-29 Thread Yordan Karadzhov (VMware)
The new option for function tracing aims to save space on the ring
buffer and to make it more readable in the case when a single function
is called a number of times consecutively:

while (cond)
do_func();

Instead of having an identical record for each call of the function,
we will record only the first call, followed by an event showing the
number of repeats.

v2 changes:
  * As suggested by Steven in his review, we now record not only the
repetition count, but also the time elapsed between the last
repetition of the function and the actual generation of the
"func_repeats" event. 16 bits are used to record the repetition
count. In the case of an overflow of the counter, a second pair of
"function" and "func_repeats" events will be generated. The time
interval gets encoded using up to 48 (32 + 16) bits, as sketched below.
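
For reference, a minimal sketch of how such a 32 + 16 bit split can be
implemented (illustrative only; the struct, field and helper names below
are assumptions, not necessarily the ones used in the patch; the kernel
types u16/u32/u64 and the U32_MAX/U16_MAX constants are assumed):

/* Illustrative packing of a 48-bit delta into a u32 + u16 pair. */
struct func_repeats_delta {
	u16	top_delta_ts;		/* bits 32..47 of the delta */
	u32	bottom_delta_ts;	/* bits  0..31 of the delta */
};

static inline void delta_ts_set(struct func_repeats_delta *e, u64 delta)
{
	e->bottom_delta_ts = delta & U32_MAX;
	e->top_delta_ts = (delta >> 32) & U16_MAX;	/* bits 48..63 are dropped */
}

static inline u64 delta_ts_get(struct func_repeats_delta *e)
{
	return ((u64)e->top_delta_ts << 32) | e->bottom_delta_ts;
}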

Yordan Karadzhov (VMware) (5):
  tracing: Define new ftrace event "func_repeats"
  tracing: Add "last_func_repeats" to struct trace_array
  tracing: Add method for recording "func_repeats" events
  tracing: Unify the logic for function tracing options
  tracing: Add "func_no_repeats" option for function tracing

 kernel/trace/trace.c   |  29 +
 kernel/trace/trace.h   |  25 
 kernel/trace/trace_entries.h   |  28 +
 kernel/trace/trace_functions.c | 222 +
 kernel/trace/trace_output.c|  47 +++
 5 files changed, 324 insertions(+), 27 deletions(-)

-- 
2.25.1



[PATCH] tracing: Remove unused argument from "ring_buffer_time_stamp()

2021-03-29 Thread Yordan Karadzhov (VMware)
The "cpu" parameter is not being used by the function.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 include/linux/ring_buffer.h | 2 +-
 kernel/trace/ring_buffer.c  | 2 +-
 kernel/trace/trace.c| 8 
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 057b7ed4fe24..dac53fd3afea 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -181,7 +181,7 @@ unsigned long ring_buffer_commit_overrun_cpu(struct trace_buffer *buffer, int cp
 unsigned long ring_buffer_dropped_events_cpu(struct trace_buffer *buffer, int cpu);
 unsigned long ring_buffer_read_events_cpu(struct trace_buffer *buffer, int cpu);
 
-u64 ring_buffer_time_stamp(struct trace_buffer *buffer, int cpu);
+u64 ring_buffer_time_stamp(struct trace_buffer *buffer);
 void ring_buffer_normalize_time_stamp(struct trace_buffer *buffer,
  int cpu, u64 *ts);
 void ring_buffer_set_clock(struct trace_buffer *buffer,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4216df58e31..2c0ee6484990 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1080,7 +1080,7 @@ static inline u64 rb_time_stamp(struct trace_buffer *buffer)
return ts << DEBUG_SHIFT;
 }
 
-u64 ring_buffer_time_stamp(struct trace_buffer *buffer, int cpu)
+u64 ring_buffer_time_stamp(struct trace_buffer *buffer)
 {
u64 time;
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c8e54b674d3e..3c605957bb5c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -771,7 +771,7 @@ static u64 buffer_ftrace_now(struct array_buffer *buf, int cpu)
if (!buf->buffer)
return trace_clock_local();
 
-   ts = ring_buffer_time_stamp(buf->buffer, cpu);
+   ts = ring_buffer_time_stamp(buf->buffer);
ring_buffer_normalize_time_stamp(buf->buffer, cpu, &ts);
 
return ts;
@@ -7173,7 +7173,7 @@ static int tracing_time_stamp_mode_open(struct inode *inode, struct file *file)
 u64 tracing_event_time_stamp(struct trace_buffer *buffer, struct ring_buffer_event *rbe)
 {
if (rbe == this_cpu_read(trace_buffered_event))
-   return ring_buffer_time_stamp(buffer, smp_processor_id());
+   return ring_buffer_time_stamp(buffer);
 
return ring_buffer_event_time_stamp(buffer, rbe);
 }
@@ -8087,7 +8087,7 @@ tracing_stats_read(struct file *filp, char __user *ubuf,
trace_seq_printf(s, "oldest event ts: %5llu.%06lu\n",
t, usec_rem);
 
-   t = ns2usecs(ring_buffer_time_stamp(trace_buf->buffer, cpu));
+   t = ns2usecs(ring_buffer_time_stamp(trace_buf->buffer));
usec_rem = do_div(t, USEC_PER_SEC);
trace_seq_printf(s, "now ts: %5llu.%06lu\n", t, usec_rem);
} else {
@@ -8096,7 +8096,7 @@ tracing_stats_read(struct file *filp, char __user *ubuf,
ring_buffer_oldest_event_ts(trace_buf->buffer, 
cpu));
 
trace_seq_printf(s, "now ts: %llu\n",
-   ring_buffer_time_stamp(trace_buf->buffer, cpu));
+   ring_buffer_time_stamp(trace_buf->buffer));
}
 
cnt = ring_buffer_dropped_events_cpu(trace_buf->buffer, cpu);
-- 
2.25.1



Re: [RFC PATCH 1/5] tracing: Define new ftrace event "func_repeats"

2021-03-08 Thread Yordan Karadzhov (VMware)



On 4.03.21 18:38, Steven Rostedt wrote:

On Thu,  4 Mar 2021 11:01:37 +0200
"Yordan Karadzhov (VMware)"  wrote:

Thanks Yordan for doing this!

I have some comments below.



Hi Steven,

Thank you very much for looking into this!

Your suggestion makes perfect sense. I only have one clarifying question 
below.



diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 4547ac59da61..8007f9b6417f 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -338,3 +338,19 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
 __entry->nmi_total_ts,
 __entry->nmi_count)
  );
+
+FTRACE_ENTRY(func_repeats, func_repeats_entry,
+
+   TRACE_FUNC_REPEATS,
+
+   F_STRUCT(
+   __field(unsigned long,  ip  )
+   __field(unsigned long,  pip )
+   __field(unsigned long,  count   )
+   ),
+
+   F_printk(" %ps <-%ps\t(repeats:%lu)",
+(void *)__entry->ip,
+(void *)__entry->pip,
+__entry->count)


After playing with this a little, I realized that we should also store the
last timestamp as well. I think that would be interesting information.

<...>-37  [004] ...1  2022.303820: gc_worker <-process_one_work
<...>-37  [004] ...1  2022.303820: ___might_sleep <-gc_worker
<...>-37  [004] ...1  2022.303831: ___might_sleep <-gc_worker (repeats: 127)
<...>-37  [004] ...1  2022.303831: queue_delayed_work_on <-process_one_work

The above shows that __might_sleep() was called 128 times, but what I don't
get from the above, is when that last call was made. You'll see that the
timestamp for the repeat output is the same as the next function shown
(queue_delayed_work_on()). But the timestamp for the last call to
__might_sleep() is lost, and the repeat event ends up being written when
it is detected that there are no more repeats.

If we had:

<...>-37  [004] ...1  2022.303820: gc_worker <-process_one_work
<...>-37  [004] ...1  2022.303820: ___might_sleep <-gc_worker
<...>-37  [004] ...1  2022.303831: ___might_sleep <-gc_worker (last ts: 2022.303828 repeats: 127)
<...>-37  [004] ...1  2022.303831: queue_delayed_work_on <-process_one_work

We would know the last time __might_sleep was called.

That is, not only should we save the ip and pip in the trace_func_repeats
structure, but we should also be storing the last time stamp of the last
function event that repeated. Otherwise the above looks like the last
__might_sleep called above happened when the queue_delayed_work_on
happened, where that may not be the case.


If we store the last timestamp, this means we will need an additional
64 bits on the buffer every time we record the "func_repeats" event.
This looks like overkill to me.
Can we store only the duration of the repeats (the difference between
the timestamps)? This way we can use less memory, at the price of one
extra arithmetic operation.
An alternative approach would be to store only the least-significant
bits of the timestamp.
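
(A rough sketch of what the delta-based approach implies; the helper below
is purely illustrative and not part of any patch in this thread:)

/* The absolute time of the last repeat can be recovered at print time. */
static inline u64 last_repeat_ts(u64 event_ts, u64 delta)
{
	/* delta is computed as event_ts - ts_last_call when the event is written */
	return event_ts - delta;
}

/* For scale: a 48-bit delta in nanoseconds covers 2^48 ns, roughly 3.26 days. */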


What do you think?

Best regards,
Yordan



-- Steve




[PATCH] tracing: Remove duplicate declaration from trace.h

2021-03-04 Thread Yordan Karadzhov (VMware)
A declaration of function "int trace_empty(struct trace_iterator *iter)"
shows up twice in the header file kernel/trace/trace.h

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index dec13ff66077..a6446c03cfbc 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -605,7 +605,6 @@ void trace_graph_function(struct trace_array *tr,
 void trace_latency_header(struct seq_file *m);
 void trace_default_header(struct seq_file *m);
 void print_trace_header(struct seq_file *m, struct trace_iterator *iter);
-int trace_empty(struct trace_iterator *iter);
 
 void trace_graph_return(struct ftrace_graph_ret *trace);
 int trace_graph_entry(struct ftrace_graph_ent *trace);
-- 
2.25.1



[RFC PATCH 3/5] tracing: Add method for recording "func_repeats" events

2021-03-04 Thread Yordan Karadzhov (VMware)
This patch only provides the implementation of the method.
Later we will use it in combination with a new option for
function tracing.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.c | 21 +
 kernel/trace/trace.h |  4 
 2 files changed, 25 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 5f5fa08c0644..5c62fda666af 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3109,6 +3109,27 @@ static void ftrace_trace_userstack(struct trace_array 
*tr,
 
 #endif /* CONFIG_STACKTRACE */
 
+void trace_last_func_repeats(struct trace_array *tr,
+struct trace_func_repeats *last_info,
+unsigned int trace_ctx)
+{
+   struct trace_buffer *buffer = tr->array_buffer.buffer;
+   struct func_repeats_entry *entry;
+   struct ring_buffer_event *event;
+
+   event = __trace_buffer_lock_reserve(buffer, TRACE_FUNC_REPEATS,
+   sizeof(*entry), trace_ctx);
+   if (!event)
+   return;
+
+   entry = ring_buffer_event_data(event);
+   entry->ip = last_info->ip;
+   entry->pip = last_info->parent_ip;
+   entry->count = last_info->count;
+
+   __buffer_unlock_commit(buffer, event);
+}
+
 /* created for use with alloc_percpu */
 struct trace_buffer_struct {
int nesting;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 09bf12c038f4..0ef823bb9594 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -696,6 +696,10 @@ static inline void __trace_stack(struct trace_array *tr, 
unsigned int trace_ctx,
 }
 #endif /* CONFIG_STACKTRACE */
 
+void trace_last_func_repeats(struct trace_array *tr,
+struct trace_func_repeats *last_info,
+unsigned int trace_ctx);
+
 extern u64 ftrace_now(int cpu);
 
 extern void trace_find_cmdline(int pid, char comm[]);
-- 
2.25.1



[RFC PATCH 4/5] tracing: Unify the logic for function tracing options

2021-03-04 Thread Yordan Karadzhov (VMware)
Currently the logic for dealing with the options for function tracing
has two different implementations. One is used when we set the flags
(in "static int func_set_flag()") and another is used when we initialize
the tracer (in "static int function_trace_init()"). Those two
implementations are meant to do essentially the same thing, and neither
is very convenient for adding new options. In this patch we add a
helper function that provides a single implementation of the logic for
dealing with the options, structured so that new options can be easily
added.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace_functions.c | 66 --
 1 file changed, 40 insertions(+), 26 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index f93723ca66bc..6c912eb0508a 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -27,13 +27,17 @@ function_trace_call(unsigned long ip, unsigned long parent_ip,
 static void
 function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static ftrace_func_t select_trace_function(u32 flags_val);
 static struct tracer_flags func_flags;
 
 /* Our option */
 enum {
+   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
TRACE_FUNC_OPT_STACK= 0x1,
 };
 
+#define TRACE_FUNC_OPT_MASK(TRACE_FUNC_OPT_STACK)
+
 int ftrace_allocate_ftrace_ops(struct trace_array *tr)
 {
struct ftrace_ops *ops;
@@ -97,12 +101,9 @@ static int function_trace_init(struct trace_array *tr)
if (!tr->ops)
return -ENOMEM;
 
-   /* Currently only the global instance can do stack tracing */
-   if (tr->flags & TRACE_ARRAY_FL_GLOBAL &&
-   func_flags.val & TRACE_FUNC_OPT_STACK)
-   func = function_stack_trace_call;
-   else
-   func = function_trace_call;
+   func = select_trace_function(func_flags.val);
+   if (!func)
+   return -EINVAL;
 
ftrace_init_array_ops(tr, func);
 
@@ -205,6 +206,18 @@ function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
local_irq_restore(flags);
 }
 
+static ftrace_func_t select_trace_function(u32 flags_val)
+{
+   switch (flags_val & TRACE_FUNC_OPT_MASK) {
+   case TRACE_FUNC_NO_OPTS:
+   return function_trace_call;
+   case TRACE_FUNC_OPT_STACK:
+   return function_stack_trace_call;
+   default:
+   return NULL;
+   }
+}
+
 static struct tracer_opt func_opts[] = {
 #ifdef CONFIG_STACKTRACE
{ TRACER_OPT(func_stack_trace, TRACE_FUNC_OPT_STACK) },
@@ -213,7 +226,7 @@ static struct tracer_opt func_opts[] = {
 };
 
 static struct tracer_flags func_flags = {
-   .val = 0, /* By default: all flags disabled */
+   .val = TRACE_FUNC_NO_OPTS, /* By default: all flags disabled */
.opts = func_opts
 };
 
@@ -235,30 +248,31 @@ static struct tracer function_trace;
 static int
 func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
 {
-   switch (bit) {
-   case TRACE_FUNC_OPT_STACK:
-   /* do nothing if already set */
-   if (!!set == !!(func_flags.val & TRACE_FUNC_OPT_STACK))
-   break;
+   ftrace_func_t func;
+   u32 new_flags_val;
 
-   /* We can change this flag when not running. */
-   if (tr->current_trace != &function_trace)
-   break;
+   /* Do nothing if already set. */
+   if (!!set == !!(func_flags.val & bit))
+   return 0;
 
-   unregister_ftrace_function(tr->ops);
+   /* We can change this flag only when not running. */
+   if (tr->current_trace != &function_trace)
+   return 0;
 
-   if (set) {
-   tr->ops->func = function_stack_trace_call;
-   register_ftrace_function(tr->ops);
-   } else {
-   tr->ops->func = function_trace_call;
-   register_ftrace_function(tr->ops);
-   }
+   new_flags_val = (func_flags.val & ~(1UL << (bit - 1)));
+   new_flags_val |= (set << (bit - 1));
 
-   break;
-   default:
+   func = select_trace_function(new_flags_val);
+   if (!func)
return -EINVAL;
-   }
+
+   /* Check if there's anything to change. */
+   if (tr->ops->func == func)
+   return 0;
+
+   unregister_ftrace_function(tr->ops);
+   tr->ops->func = func;
+   register_ftrace_function(tr->ops);
 
return 0;
 }
-- 
2.25.1



[RFC PATCH 5/5] tracing: Add "func_no_repeats" option for function tracing

2021-03-04 Thread Yordan Karadzhov (VMware)
If the option is activated, the function tracing records get
consolidated in the cases when a single function is called a number
of times consecutively. Instead of having an identical record for
each call of the function, we will record only the first call,
followed by an event showing the number of repeats.

Signed-off-by: Yordan Karadzhov (VMware) 

---
 kernel/trace/trace_functions.c | 157 -
 1 file changed, 154 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 6c912eb0508a..fbf60ff93ffb 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -27,16 +27,28 @@ function_trace_call(unsigned long ip, unsigned long parent_ip,
 static void
 function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static void
+function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op, struct ftrace_regs *fregs);
+static void
+function_stack_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+struct ftrace_ops *op,
+struct ftrace_regs *fregs);
 static ftrace_func_t select_trace_function(u32 flags_val);
 static struct tracer_flags func_flags;
 
 /* Our option */
 enum {
-   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
-   TRACE_FUNC_OPT_STACK= 0x1,
+
+   TRACE_FUNC_NO_OPTS  = 0x0, /* No flags set. */
+   TRACE_FUNC_OPT_STACK= 0x1,
+   TRACE_FUNC_OPT_NO_REPEATS   = 0x2,
+
+   /* Update this to next highest bit. */
+   TRACE_FUNC_OPT_HIGHEST_BIT  = 0x4
 };
 
-#define TRACE_FUNC_OPT_MASK(TRACE_FUNC_OPT_STACK)
+#define TRACE_FUNC_OPT_MASK(TRACE_FUNC_OPT_HIGHEST_BIT - 1)
 
 int ftrace_allocate_ftrace_ops(struct trace_array *tr)
 {
@@ -90,6 +102,17 @@ void ftrace_destroy_function_files(struct trace_array *tr)
ftrace_free_ftrace_ops(tr);
 }
 
+static bool handle_func_repeats(struct trace_array *tr, u32 flags_val)
+{
+   if (!tr->last_func_repeats &&
+   (flags_val & TRACE_FUNC_OPT_NO_REPEATS)) {
+   if (!tracer_alloc_func_repeats(tr))
+   return false;
+   }
+
+   return true;
+}
+
 static int function_trace_init(struct trace_array *tr)
 {
ftrace_func_t func;
@@ -105,6 +128,9 @@ static int function_trace_init(struct trace_array *tr)
if (!func)
return -EINVAL;
 
+   if (!handle_func_repeats(tr, func_flags.val))
+   return -ENOMEM;
+
ftrace_init_array_ops(tr, func);
 
tr->array_buffer.cpu = raw_smp_processor_id();
@@ -206,6 +232,123 @@ function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
local_irq_restore(flags);
 }
 
+static inline bool is_repeat(struct trace_func_repeats *last_info,
+unsigned long ip, unsigned long parent_ip)
+{
+   if (last_info->ip == ip &&
+   last_info->parent_ip == parent_ip) {
+   last_info->count++;
+   return true;
+   }
+
+   return false;
+}
+
+static inline void process_repeats(struct trace_array *tr,
+  unsigned long ip, unsigned long parent_ip,
+  struct trace_func_repeats *last_info,
+  unsigned int trace_ctx)
+{
+   if (last_info->count) {
+   trace_last_func_repeats(tr, last_info, trace_ctx);
+   last_info->count = 0;
+   }
+
+   last_info->ip = ip;
+   last_info->parent_ip = parent_ip;
+}
+
+static void
+function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
+  struct ftrace_ops *op,
+  struct ftrace_regs *fregs)
+{
+   struct trace_func_repeats *last_info;
+   struct trace_array *tr = op->private;
+   struct trace_array_cpu *data;
+   unsigned int trace_ctx;
+   unsigned long flags;
+   int bit;
+   int cpu;
+
+   if (unlikely(!tr->function_enabled))
+   return;
+
+   bit = ftrace_test_recursion_trylock(ip, parent_ip);
+   if (bit < 0)
+   return;
+
+   preempt_disable_notrace();
+
+   cpu = smp_processor_id();
+   data = per_cpu_ptr(tr->array_buffer.data, cpu);
+   if (atomic_read(&data->disabled))
+   goto out;
+
+   /*
+* An interrupt may happen at any place here. But as far as I can see,
+* the only damage that this can cause is to mess up the repetition
+* counter without valuable data being lost.
+* TODO: think about a solution that is better than just hoping to be
+* lucky.
+*/
+   last_info = per_cpu_ptr(tr->last_f

[RFC PATCH 2/5] tracing: Add "last_func_repeats" to struct trace_array

2021-03-04 Thread Yordan Karadzhov (VMware)
The field is used to keep track of the consecutive (on the same CPU) calls
of a single function. This information is needed in order to consolidate
the function tracing records in the cases when a single function is called
a number of times consecutively.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.c |  1 +
 kernel/trace/trace.h | 17 +
 2 files changed, 18 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e295c413580e..5f5fa08c0644 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8895,6 +8895,7 @@ static int __remove_instance(struct trace_array *tr)
ftrace_clear_pids(tr);
ftrace_destroy_function_files(tr);
tracefs_remove(tr->dir);
+   free_percpu(tr->last_func_repeats);
free_trace_buffers(tr);
 
for (i = 0; i < tr->nr_topts; i++) {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2be4a56879de..09bf12c038f4 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -262,6 +262,16 @@ struct cond_snapshot {
cond_update_fn_t	update;
 };
 
+/*
+ * struct trace_func_repeats - used to keep track of the consecutive
+ * (on the same CPU) calls of a single function.
+ */
+struct trace_func_repeats {
+   unsigned long ip;
+   unsigned long parent_ip;
+   unsigned long count;
+};
+
 /*
  * The trace array - an array of per-CPU trace arrays. This is the
  * highest level data structure that individual tracers deal with.
@@ -358,8 +368,15 @@ struct trace_array {
 #ifdef CONFIG_TRACER_SNAPSHOT
struct cond_snapshot*cond_snapshot;
 #endif
+   struct trace_func_repeats   __percpu *last_func_repeats;
 };
 
+static inline struct trace_func_repeats *
+tracer_alloc_func_repeats(struct trace_array *tr)
+{
+   return tr->last_func_repeats = alloc_percpu(struct trace_func_repeats);
+}
+
 enum {
TRACE_ARRAY_FL_GLOBAL   = (1 << 0)
 };
-- 
2.25.1



[RFC PATCH 1/5] tracing: Define new ftrace event "func_repeats"

2021-03-04 Thread Yordan Karadzhov (VMware)
The event aims to consolidate the function tracing records in the cases
when a single function is called a number of times consecutively.

while (cond)
do_func();

This may happen in various scenarios (busy waiting, for example).
The new ftrace event can be used to show repeated function events with
a single event and save space on the ring buffer.

Signed-off-by: Yordan Karadzhov (VMware) 
---
 kernel/trace/trace.h |  3 +++
 kernel/trace/trace_entries.h | 16 +
 kernel/trace/trace_output.c  | 44 
 3 files changed, 63 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index a6446c03cfbc..2be4a56879de 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -45,6 +45,7 @@ enum trace_type {
TRACE_BPUTS,
TRACE_HWLAT,
TRACE_RAW_DATA,
+   TRACE_FUNC_REPEATS,
 
__TRACE_LAST_TYPE,
 };
@@ -441,6 +442,8 @@ extern void __ftrace_bad_type(void);
  TRACE_GRAPH_ENT); \
IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry,  \
  TRACE_GRAPH_RET); \
+   IF_ASSIGN(var, ent, struct func_repeats_entry,  \
+ TRACE_FUNC_REPEATS);  \
__ftrace_bad_type();\
} while (0)
 
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 4547ac59da61..8007f9b6417f 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -338,3 +338,19 @@ FTRACE_ENTRY(hwlat, hwlat_entry,
 __entry->nmi_total_ts,
 __entry->nmi_count)
 );
+
+FTRACE_ENTRY(func_repeats, func_repeats_entry,
+
+   TRACE_FUNC_REPEATS,
+
+   F_STRUCT(
+   __field(unsigned long,  ip  )
+   __field(unsigned long,  pip )
+   __field(unsigned long,  count   )
+   ),
+
+   F_printk(" %ps <-%ps\t(repeats:%lu)",
+(void *)__entry->ip,
+(void *)__entry->pip,
+__entry->count)
+);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 61255bad7e01..af6b066972e9 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1373,6 +1373,49 @@ static struct trace_event trace_raw_data_event = {
.funcs  = &trace_raw_data_funcs,
 };
 
+static enum print_line_t
+trace_func_repeats_raw(struct trace_iterator *iter, int flags,
+struct trace_event *event)
+{
+   struct func_repeats_entry *field;
+   struct trace_seq *s = &iter->seq;
+
+   trace_assign_type(field, iter->ent);
+
+   trace_seq_printf(s, "%lu %lu %li\n",
+field->pip,
+field->ip,
+field->count);
+
+   return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_func_repeats_print(struct trace_iterator *iter, int flags,
+struct trace_event *event)
+{
+   struct func_repeats_entry *field;
+   struct trace_seq *s = &iter->seq;
+
+   trace_assign_type(field, iter->ent);
+
+   seq_print_ip_sym(s, field->ip, flags);
+   trace_seq_puts(s, " <-");
+   seq_print_ip_sym(s, field->pip, flags);
+   trace_seq_printf(s, " (repeats: %li)\n", field->count);
+
+   return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_func_repeats_funcs = {
+   .trace  = trace_func_repeats_print,
+   .raw= trace_func_repeats_raw,
+};
+
+static struct trace_event trace_func_repeats_event = {
+   .type   = TRACE_FUNC_REPEATS,
+   .funcs  = &trace_func_repeats_funcs,
+};
 
 static struct trace_event *events[] __initdata = {
   &trace_fn_event,
@@ -1385,6 +1428,7 @@ static struct trace_event *events[] __initdata = {
   &trace_print_event,
   &trace_hwlat_event,
   &trace_raw_data_event,
+   &trace_func_repeats_event,
NULL
 };
 
-- 
2.25.1



[RFC PATCH 0/5] Add "func_no_repete" tracing option

2021-03-04 Thread Yordan Karadzhov (VMware)
The new option for function tracing aims to save space on the ring
buffer and to make it more readable in the case when a single function
is called a number of times consecutively:

while (cond)
do_func();

Instead of having identical records for each call of the function,
we will record only the first call, followed by an event showing the
number of repeats.

Yordan Karadzhov (VMware) (5):
  tracing: Define new ftrace event "func_repeats"
  tracing: Add "last_func_repeats" to struct trace_array
  tracing: Add method for recording "func_repeats" events
  tracing: Unify the logic for function tracing options
  tracing: Add "func_no_repeats" option for function tracing

 kernel/trace/trace.c   |  22 
 kernel/trace/trace.h   |  24 
 kernel/trace/trace_entries.h   |  16 +++
 kernel/trace/trace_functions.c | 219 +
 kernel/trace/trace_output.c|  44 +++
 5 files changed, 298 insertions(+), 27 deletions(-)

-- 
2.25.1



[tip: locking/core] jump_label: Do not profile branch annotations

2021-01-22 Thread tip-bot2 for Steven Rostedt (VMware)
The following commit has been merged into the locking/core branch of tip:

Commit-ID: 2f0df49c89acaa58571d509830bc481250699885
Gitweb:
https://git.kernel.org/tip/2f0df49c89acaa58571d509830bc481250699885
Author:Steven Rostedt (VMware) 
AuthorDate:Fri, 11 Dec 2020 16:37:54 -05:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 22 Jan 2021 11:08:56 +01:00

jump_label: Do not profile branch annotations

While running my branch profiler that checks for incorrect "likely" and
"unlikely"s around the kernel, a large number of them show up as
incorrect simply because they are "static_branches".

Static branches are rather special: they are likely or unlikely for
reasons other than the ones normal annotations are used for, so there is
no reason to have them be profiled.

Expose the "unlikely_notrace" and "likely_notrace" so that the
static_branch can use them, and have them be ignored by the branch
profilers.

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201211163754.58517...@gandalf.local.home
---
 include/linux/compiler.h   |  2 ++
 include/linux/jump_label.h | 12 ++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index b8fe0c2..df5b405 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -76,6 +76,8 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int 
val,
 #else
 # define likely(x) __builtin_expect(!!(x), 1)
 # define unlikely(x)   __builtin_expect(!!(x), 0)
+# define likely_notrace(x) likely(x)
+# define unlikely_notrace(x)   unlikely(x)
 #endif
 
 /* Optimization barrier */
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 3280962..d926912 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -261,14 +261,14 @@ static __always_inline void jump_label_init(void)
 
 static __always_inline bool static_key_false(struct static_key *key)
 {
-   if (unlikely(static_key_count(key) > 0))
+   if (unlikely_notrace(static_key_count(key) > 0))
return true;
return false;
 }
 
 static __always_inline bool static_key_true(struct static_key *key)
 {
-   if (likely(static_key_count(key) > 0))
+   if (likely_notrace(static_key_count(key) > 0))
return true;
return false;
 }
@@ -460,7 +460,7 @@ extern bool wrong_branch_error(void);
branch = !arch_static_branch_jump(&(x)->key, true); 
\
else
\
branch = wrong_branch_error();  
\
-   likely(branch); 
\
+   likely_notrace(branch); 
\
 })
 
 #define static_branch_unlikely(x)  
\
@@ -472,13 +472,13 @@ extern bool wrong_branch_error(void);
branch = arch_static_branch(&(x)->key, false);  
\
else
\
branch = wrong_branch_error();  
\
-   unlikely(branch);   
\
+   unlikely_notrace(branch);   
\
 })
 
 #else /* !CONFIG_JUMP_LABEL */
 
-#define static_branch_likely(x)
likely(static_key_enabled(&(x)->key))
-#define static_branch_unlikely(x)  unlikely(static_key_enabled(&(x)->key))
+#define static_branch_likely(x)
likely_notrace(static_key_enabled(&(x)->key))
+#define static_branch_unlikely(x)  
unlikely_notrace(static_key_enabled(&(x)->key))
 
 #endif /* CONFIG_JUMP_LABEL */
 


[PATCH 2/3 v7] ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default

2020-11-13 Thread VMware
From: "Steven Rostedt (VMware)" 

Currently, the only way to get access to the registers of a function via a
ftrace callback is to set the "FL_SAVE_REGS" bit in the ftrace_ops. But as this
saves all regs as if a breakpoint were to trigger (for use with kprobes), it
is expensive.

The regs are already saved on the stack for the default ftrace callbacks, as
that is required; otherwise a function being traced would get the wrong
arguments and possibly crash. And on x86, the arguments are already stored
where they would be in a pt_regs structure, so that the same code can serve
both the regs version of a callback and the default one. It therefore makes
sense to always pass that information to all functions.

If an architecture does this (as x86_64 now does), it should set
HAVE_DYNAMIC_FTRACE_WITH_ARGS, and this will let the generic code know that
it can have access to the arguments without having to set the flags.

This also includes having the stack pointer being saved, which could be used
for accessing arguments on the stack, as well as having the function graph
tracer not require its own trampoline!

Acked-by: Peter Zijlstra (Intel) 
Signed-off-by: Steven Rostedt (VMware) 
---
 arch/x86/Kconfig  |  1 +
 arch/x86/include/asm/ftrace.h | 15 +++
 arch/x86/kernel/ftrace_64.S   | 11 +--
 include/linux/ftrace.h|  7 ++-
 kernel/trace/Kconfig  |  9 +
 5 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..478526aabe5d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -167,6 +167,7 @@ config X86
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
+   select HAVE_DYNAMIC_FTRACE_WITH_ARGSif X86_64
select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 84b9449be080..e00fe88146e0 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -41,6 +41,21 @@ static inline void arch_ftrace_set_direct_caller(struct 
pt_regs *regs, unsigned
regs->orig_ax = addr;
 }
 
+#ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
+struct ftrace_regs {
+   struct pt_regs  regs;
+};
+
+static __always_inline struct pt_regs *
+arch_ftrace_get_regs(struct ftrace_regs *fregs)
+{
+   /* Only when FL_SAVE_REGS is set, cs will be non zero */
+   if (!fregs->regs.cs)
+   return NULL;
+   return &fregs->regs;
+}
+#endif
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 
 struct dyn_arch_ftrace {
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index ac3d5f22fe64..60e3b64f5ea6 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -140,12 +140,19 @@ SYM_FUNC_START(ftrace_caller)
/* save_mcount_regs fills in first two parameters */
save_mcount_regs
 
+   /* Stack - skipping return address of ftrace_caller */
+   leaq MCOUNT_REG_SIZE+8(%rsp), %rcx
+   movq %rcx, RSP(%rsp)
+
 SYM_INNER_LABEL(ftrace_caller_op_ptr, SYM_L_GLOBAL)
/* Load the ftrace_ops into the 3rd parameter */
movq function_trace_op(%rip), %rdx
 
-   /* regs go into 4th parameter (but make it NULL) */
-   movq $0, %rcx
+   /* regs go into 4th parameter */
+   leaq (%rsp), %rcx
+
+   /* Only ops with REGS flag set should have CS register set */
+   movq $0, CS(%rsp)
 
 SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
call ftrace_stub
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 24e1fa52337d..588ea7023a7a 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -90,16 +90,21 @@ ftrace_enable_sysctl(struct ctl_table *table, int write,
 
 struct ftrace_ops;
 
+#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
+
 struct ftrace_regs {
struct pt_regs  regs;
 };
+#define arch_ftrace_get_regs(fregs) (&(fregs)->regs)
+
+#endif /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS */
 
 static __always_inline struct pt_regs *ftrace_get_regs(struct ftrace_regs 
*fregs)
 {
if (!fregs)
return NULL;
 
-   return &fregs->regs;
+   return arch_ftrace_get_regs(fregs);
 }
 
 typedef void (*ftrace_func_t)(unsigned long ip, unsigned long parent_ip,
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 6aa36ec73ccb..c9b64dea1216 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -31,6 +31,15 @@ config HAVE_DYNAMIC_FTRACE_WITH_REGS
 config HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
bool
 
+config HAVE_DYNAMIC_FTRACE_WITH_ARGS
+   bool
+   help
+If this is set, then arguments and stack can be found from
+the pt_regs passed into the function callback regs parameter
+by default, even without setting the REGS flag in the ftrace_ops.
+This allows for use of regs_get_kernel_argument() and
+  

[PATCH 1/3 v7] ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs

2020-11-13 Thread VMware
From: "Steven Rostedt (VMware)" 

In preparation for having the arguments of a function passed to callbacks
attached to functions by default, change the default callback prototype to
receive a struct ftrace_regs as the fourth parameter instead of a pt_regs.

Callbacks that set the FL_SAVE_REGS flag in their ftrace_ops will now need
to get the pt_regs via a ftrace_get_regs() helper call. If this helper is
called from a callback whose ftrace_ops did not have the FL_SAVE_REGS flag
set, it will return NULL.

This will allow the ftrace_regs to hold just enough to get the parameters
and stack pointer, without the worry that callbacks may receive a pt_regs
that is not completely filled.
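
A minimal sketch of what a converted callback looks like (the callback name
is hypothetical; the helper behaves as described above):

/* Hypothetical callback illustrating the new prototype. */
static void my_callback(unsigned long ip, unsigned long parent_ip,
			struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	struct pt_regs *regs = ftrace_get_regs(fregs);

	if (!regs)
		return;	/* ops was registered without FTRACE_OPS_FL_SAVE_REGS */

	/* full pt_regs is available here */
}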

Acked-by: Peter Zijlstra (Intel) 
Reviewed-by: Masami Hiramatsu 
Signed-off-by: Steven Rostedt (VMware) 
---
 arch/csky/kernel/probes/ftrace.c |  4 +++-
 arch/nds32/kernel/ftrace.c   |  4 ++--
 arch/parisc/kernel/ftrace.c  |  8 +---
 arch/powerpc/kernel/kprobes-ftrace.c |  4 +++-
 arch/s390/kernel/ftrace.c|  4 +++-
 arch/x86/kernel/kprobes/ftrace.c |  3 ++-
 fs/pstore/ftrace.c   |  2 +-
 include/linux/ftrace.h   | 16 ++--
 include/linux/kprobes.h  |  2 +-
 kernel/livepatch/patch.c |  3 ++-
 kernel/trace/ftrace.c| 27 +++
 kernel/trace/trace_event_perf.c  |  2 +-
 kernel/trace/trace_events.c  |  2 +-
 kernel/trace/trace_functions.c   |  9 -
 kernel/trace/trace_irqsoff.c |  2 +-
 kernel/trace/trace_sched_wakeup.c|  2 +-
 kernel/trace/trace_selftest.c| 20 +++-
 kernel/trace/trace_stack.c   |  2 +-
 18 files changed, 71 insertions(+), 45 deletions(-)

diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
index f30b179924ef..ae2b1c7b3b5c 100644
--- a/arch/csky/kernel/probes/ftrace.c
+++ b/arch/csky/kernel/probes/ftrace.c
@@ -11,17 +11,19 @@ int arch_check_ftrace_location(struct kprobe *p)
 
 /* Ftrace callback handler for kprobes -- called under preepmt disabed */
 void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
-  struct ftrace_ops *ops, struct pt_regs *regs)
+  struct ftrace_ops *ops, struct ftrace_regs *fregs)
 {
int bit;
bool lr_saver = false;
struct kprobe *p;
struct kprobe_ctlblk *kcb;
+   struct pt_regs *regs;
 
bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (bit < 0)
return;
 
+   regs = ftrace_get_regs(fregs);
preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (!p) {
diff --git a/arch/nds32/kernel/ftrace.c b/arch/nds32/kernel/ftrace.c
index 3763b3f8c3db..414f8a780cc3 100644
--- a/arch/nds32/kernel/ftrace.c
+++ b/arch/nds32/kernel/ftrace.c
@@ -10,7 +10,7 @@ extern void (*ftrace_trace_function)(unsigned long, unsigned 
long,
 extern void ftrace_graph_caller(void);
 
 noinline void __naked ftrace_stub(unsigned long ip, unsigned long parent_ip,
- struct ftrace_ops *op, struct pt_regs *regs)
+ struct ftrace_ops *op, struct ftrace_regs 
*fregs)
 {
__asm__ ("");  /* avoid to optimize as pure function */
 }
@@ -38,7 +38,7 @@ EXPORT_SYMBOL(_mcount);
 #else /* CONFIG_DYNAMIC_FTRACE */
 
 noinline void __naked ftrace_stub(unsigned long ip, unsigned long parent_ip,
- struct ftrace_ops *op, struct pt_regs *regs)
+ struct ftrace_ops *op, struct ftrace_regs 
*fregs)
 {
__asm__ ("");  /* avoid to optimize as pure function */
 }
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 1c5d3732bda2..0a1e75af5382 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -51,7 +51,7 @@ static void __hot prepare_ftrace_return(unsigned long *parent,
 void notrace __hot ftrace_function_trampoline(unsigned long parent,
unsigned long self_addr,
unsigned long org_sp_gr3,
-   struct pt_regs *regs)
+   struct ftrace_regs *fregs)
 {
 #ifndef CONFIG_DYNAMIC_FTRACE
extern ftrace_func_t ftrace_trace_function;
@@ -61,7 +61,7 @@ void notrace __hot ftrace_function_trampoline(unsigned long 
parent,
if (function_trace_op->flags & FTRACE_OPS_FL_ENABLED &&
ftrace_trace_function != ftrace_stub)
ftrace_trace_function(self_addr, parent,
-   function_trace_op, regs);
+   function_trace_op, fregs);
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
if (dereference_function_descriptor(ftrace_graph_return) !=
@@ -204,9 +204,10 @@ int ftrace_make_nop(struct module *mod, struct dyn_

[PATCH 3/3 v7] livepatch: Use the default ftrace_ops instead of REGS when ARGS is available

2020-11-13 Thread VMware
From: "Steven Rostedt (VMware)" 

When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS is available, the ftrace call
will be able to set the ip of the calling function. This will improve the
performance of live kernel patching where it does not need all the regs to
be stored just to change the instruction pointer.

If all archs that support live kernel patching also support
HAVE_DYNAMIC_FTRACE_WITH_ARGS, then the architecture specific function
klp_arch_set_pc() could be made generic.

It is possible that an arch can support HAVE_DYNAMIC_FTRACE_WITH_ARGS but
not HAVE_DYNAMIC_FTRACE_WITH_REGS and then have access to live patching.

Cc: Josh Poimboeuf 
Cc: Jiri Kosina 
Cc: live-patch...@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Miroslav Benes 
Signed-off-by: Steven Rostedt (VMware) 
---
Changes since v6:
 - Updated to use ftrace_instruction_pointer_set() macro

 arch/powerpc/include/asm/livepatch.h | 4 +++-
 arch/s390/include/asm/livepatch.h| 5 -
 arch/x86/include/asm/ftrace.h| 3 +++
 arch/x86/include/asm/livepatch.h | 4 ++--
 arch/x86/kernel/ftrace_64.S  | 4 
 include/linux/ftrace.h   | 7 +++
 kernel/livepatch/Kconfig | 2 +-
 kernel/livepatch/patch.c | 9 +
 8 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/livepatch.h 
b/arch/powerpc/include/asm/livepatch.h
index 4a3d5d25fed5..ae25e6e72997 100644
--- a/arch/powerpc/include/asm/livepatch.h
+++ b/arch/powerpc/include/asm/livepatch.h
@@ -12,8 +12,10 @@
 #include 
 
 #ifdef CONFIG_LIVEPATCH
-static inline void klp_arch_set_pc(struct pt_regs *regs, unsigned long ip)
+static inline void klp_arch_set_pc(struct ftrace_regs *fregs, unsigned long ip)
 {
+   struct pt_regs *regs = ftrace_get_regs(fregs);
+
regs->nip = ip;
 }
 
diff --git a/arch/s390/include/asm/livepatch.h 
b/arch/s390/include/asm/livepatch.h
index 818612b784cd..d578a8c76676 100644
--- a/arch/s390/include/asm/livepatch.h
+++ b/arch/s390/include/asm/livepatch.h
@@ -11,10 +11,13 @@
 #ifndef ASM_LIVEPATCH_H
 #define ASM_LIVEPATCH_H
 
+#include 
 #include 
 
-static inline void klp_arch_set_pc(struct pt_regs *regs, unsigned long ip)
+static inline void klp_arch_set_pc(struct ftrace_regs *fregs, unsigned long ip)
 {
+   struct pt_regs *regs = ftrace_get_regs(fregs);
+
regs->psw.addr = ip;
 }
 
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index e00fe88146e0..9f3130f40807 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -54,6 +54,9 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
return NULL;
return &fregs->regs;
 }
+
+#define ftrace_instruction_pointer_set(fregs, _ip) \
+   do { (fregs)->regs.ip = (_ip); } while (0)
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE
diff --git a/arch/x86/include/asm/livepatch.h b/arch/x86/include/asm/livepatch.h
index 1fde1ab6559e..7c5cc6660e4b 100644
--- a/arch/x86/include/asm/livepatch.h
+++ b/arch/x86/include/asm/livepatch.h
@@ -12,9 +12,9 @@
 #include 
 #include 
 
-static inline void klp_arch_set_pc(struct pt_regs *regs, unsigned long ip)
+static inline void klp_arch_set_pc(struct ftrace_regs *fregs, unsigned long ip)
 {
-   regs->ip = ip;
+   ftrace_instruction_pointer_set(fregs, ip);
 }
 
 #endif /* _ASM_X86_LIVEPATCH_H */
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index 60e3b64f5ea6..0d54099c2a3a 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -157,6 +157,10 @@ SYM_INNER_LABEL(ftrace_caller_op_ptr, SYM_L_GLOBAL)
 SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
call ftrace_stub
 
+   /* Handlers can change the RIP */
+   movq RIP(%rsp), %rax
+   movq %rax, MCOUNT_REG_SIZE(%rsp)
+
restore_mcount_regs
 
/*
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 588ea7023a7a..9a8ce28e4485 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -97,6 +97,13 @@ struct ftrace_regs {
 };
 #define arch_ftrace_get_regs(fregs) (&(fregs)->regs)
 
+/*
+ * ftrace_instruction_pointer_set() is to be defined by the architecture
+ * if to allow setting of the instruction pointer from the ftrace_regs
+ * when HAVE_DYNAMIC_FTRACE_WITH_ARGS is set and it supports
+ * live kernel patching.
+ */
+#define ftrace_instruction_pointer_set(fregs, ip) do { } while (0)
 #endif /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS */
 
 static __always_inline struct pt_regs *ftrace_get_regs(struct ftrace_regs 
*fregs)
diff --git a/kernel/livepatch/Kconfig b/kernel/livepatch/Kconfig
index 54102deb50ba..53d51ed619a3 100644
--- a/kernel/livepatch/Kconfig
+++ b/kernel/livepatch/Kconfig
@@ -6,7 +6,7 @@ config HAVE_LIVEPATCH
 
 config LIVEPATCH
bool "Kernel Live Patching"
-   depends on DYNAMIC_FTRACE_WITH_REGS
+   depends on DYNAMIC_FTRACE_WITH_REGS || DYNAMIC_FTRACE_WITH_ARGS
depends on MODULE

[PATCH 0/3 v7] ftrace: Add access to function arguments for all callbacks

2020-11-13 Thread VMware
This is something I wanted to implement a long time ago, but held off until
there was a good reason to do so. Now it appears that having access to the
arguments of the function by default is very useful. As a bonus, since the
arguments must be saved before calling a callback anyway (they need to be
restored before returning to the start of the traced function), there's not
much work to do to have them always be available for normal function
callbacks.

The basic idea is that if CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS is set, then
all callbacks registered to ftrace can use the regs parameter for the stack
and arguments (kernel_stack_pointer(regs), regs_get_kernel_argument(regs, n)),
without the need to set REGS, which causes overhead by saving all registers,
as REGS simulates a breakpoint.
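
As a rough illustration only (assuming an arch such as x86_64 where
ftrace_regs wraps a partially filled pt_regs, as in patch 2/3; the callback
name is made up and no generic argument helpers are added by this series):

/* Illustrative, arch-specific sketch: peek at the first argument. */
static void my_args_callback(unsigned long ip, unsigned long parent_ip,
			     struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	struct pt_regs *regs = &fregs->regs;	/* partial regs: args + stack pointer */
	unsigned long arg0 = regs_get_kernel_argument(regs, 0);
	unsigned long sp = kernel_stack_pointer(regs);

	/* ... use arg0 and sp (e.g. record them in a trace event) ... */
}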

This could be extended to move the REGS portion to kprobes itself, and
remove the SAVE_REGS flag completely, but for now, kprobes still uses the
full SAVE_REGS support.

The last patch extends the WITH_ARGS to allow default function tracing to
modify the instruction pointer, where livepatching for x86 no longer needs
to save all registers.

The idea of this approach is to give enough information to a callback that
it could retrieve all arguments, which includes the stack pointer and
instruction pointer.

This can also be extended to modify the function graph tracer to use the
function tracer instead of having a separate trampoline.


Changes since v6:
 - Use the new name ftrace_instruction_pointer_set() in the live patching code.

Steven Rostedt (VMware) (3):
  ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
  ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
  livepatch: Use the default ftrace_ops instead of REGS when ARGS is 
available


 arch/csky/kernel/probes/ftrace.c |  4 +++-
 arch/nds32/kernel/ftrace.c   |  4 ++--
 arch/parisc/kernel/ftrace.c  |  8 +---
 arch/powerpc/include/asm/livepatch.h |  4 +++-
 arch/powerpc/kernel/kprobes-ftrace.c |  4 +++-
 arch/s390/include/asm/livepatch.h|  5 -
 arch/s390/kernel/ftrace.c|  4 +++-
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/ftrace.h| 18 ++
 arch/x86/include/asm/livepatch.h |  4 ++--
 arch/x86/kernel/ftrace_64.S  | 15 +--
 arch/x86/kernel/kprobes/ftrace.c |  3 ++-
 fs/pstore/ftrace.c   |  2 +-
 include/linux/ftrace.h   | 28 ++--
 include/linux/kprobes.h  |  2 +-
 kernel/livepatch/Kconfig |  2 +-
 kernel/livepatch/patch.c | 10 ++
 kernel/trace/Kconfig |  9 +
 kernel/trace/ftrace.c| 27 +++
 kernel/trace/trace_event_perf.c  |  2 +-
 kernel/trace/trace_events.c  |  2 +-
 kernel/trace/trace_functions.c   |  9 -
 kernel/trace/trace_irqsoff.c |  2 +-
 kernel/trace/trace_sched_wakeup.c|  2 +-
 kernel/trace/trace_selftest.c| 20 +++-
 kernel/trace/trace_stack.c   |  2 +-
 26 files changed, 138 insertions(+), 55 deletions(-)


[PATCH 04/11 v3] pstore/ftrace: Add recursion protection to the ftrace callback

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

If a ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback instead of
just calling the callback directly.

The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless their ftrace_ops states
otherwise.

Link: https://lkml.kernel.org/r/20201028115612.990886...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Andrew Morton 
Cc: Thomas Meyer 
Reviewed-by: Kees Cook 
Signed-off-by: Steven Rostedt (VMware) 
---
 fs/pstore/ftrace.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/pstore/ftrace.c b/fs/pstore/ftrace.c
index 5c0450701293..816210fc5d3a 100644
--- a/fs/pstore/ftrace.c
+++ b/fs/pstore/ftrace.c
@@ -28,6 +28,7 @@ static void notrace pstore_ftrace_call(unsigned long ip,
   struct ftrace_ops *op,
   struct pt_regs *regs)
 {
+   int bit;
unsigned long flags;
struct pstore_ftrace_record rec = {};
struct pstore_record record = {
@@ -40,6 +41,10 @@ static void notrace pstore_ftrace_call(unsigned long ip,
if (unlikely(oops_in_progress))
return;
 
+   bit = ftrace_test_recursion_trylock();
+   if (bit < 0)
+   return;
+
local_irq_save(flags);
 
rec.ip = ip;
@@ -49,6 +54,7 @@ static void notrace pstore_ftrace_call(unsigned long ip,
psinfo->write(&record);
 
local_irq_restore(flags);
+   ftrace_test_recursion_unlock(bit);
 }
 
 static struct ftrace_ops pstore_ftrace_ops __read_mostly = {
-- 
2.28.0




[PATCH 06/11 v3] livepatch/ftrace: Add recursion protection to the ftrace callback

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

If a ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback instead of
just calling the callback directly.

The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless their ftrace_ops states
otherwise.

Link: https://lkml.kernel.org/r/20201028115613.291169...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Andrew Morton 
Cc: Josh Poimboeuf 
Cc: Jiri Kosina 
Cc: Miroslav Benes 
Cc: Joe Lawrence 
Cc: live-patch...@vger.kernel.org
Reviewed-by: Petr Mladek 
Signed-off-by: Steven Rostedt (VMware) 
---
 kernel/livepatch/patch.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index b552cf2d85f8..6c0164d24bbd 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -45,9 +45,13 @@ static void notrace klp_ftrace_handler(unsigned long ip,
struct klp_ops *ops;
struct klp_func *func;
int patch_state;
+   int bit;
 
ops = container_of(fops, struct klp_ops, fops);
 
+   bit = ftrace_test_recursion_trylock();
+   if (bit < 0)
+   return;
/*
 * A variant of synchronize_rcu() is used to allow patching functions
 * where RCU is not watching, see klp_synchronize_transition().
@@ -117,6 +121,7 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 
 unlock:
preempt_enable_notrace();
+   ftrace_test_recursion_unlock(bit);
 }
 
 /*
-- 
2.28.0




[PATCH 07/11 v3] livepatch: Trigger WARNING if livepatch function fails due to recursion

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

If for some reason a function is called that triggers the recursion
detection of live patching, trigger a warning. By not executing the live
patch code, it is possible that the old unpatched function will be called,
placing the system into an unknown state.

Link: https://lore.kernel.org/r/20201029145709.GD16774@alley

Cc: Josh Poimboeuf 
Cc: Jiri Kosina 
Cc: Joe Lawrence 
Cc: live-patch...@vger.kernel.org
Suggested-by: Miroslav Benes 
Reviewed-by: Petr Mladek 
Signed-off-by: Steven Rostedt (VMware) 
---
Changes since v2:

 - Blame Miroslav instead of Petr ;-)

 kernel/livepatch/patch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index 6c0164d24bbd..15480bf3ce88 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -50,7 +50,7 @@ static void notrace klp_ftrace_handler(unsigned long ip,
ops = container_of(fops, struct klp_ops, fops);
 
bit = ftrace_test_recursion_trylock();
-   if (bit < 0)
+   if (WARN_ON_ONCE(bit < 0))
return;
/*
 * A variant of synchronize_rcu() is used to allow patching functions
-- 
2.28.0




[PATCH 03/11 v3] ftrace: Optimize testing what context current is in

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

The preempt_count() is not a simple location in memory; it could be part of
per_cpu code or more. Each access to preempt_count(), or to one of its
accessor functions (like in_interrupt()), takes several cycles. Reading
preempt_count() once, and then testing the returned value to find the
context, is slightly faster than using in_nmi() and in_interrupt().

Link: https://lkml.kernel.org/r/20201028115612.780796...@goodmis.org

Signed-off-by: Steven Rostedt (VMware) 
---
 include/linux/trace_recursion.h | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index f2a949dbfec7..ac3d73484cb2 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -117,22 +117,29 @@ enum {
 
 #define TRACE_CONTEXT_MASK TRACE_LIST_MAX
 
+/*
+ * Used for setting context
+ *  NMI = 0
+ *  IRQ = 1
+ *  SOFTIRQ = 2
+ *  NORMAL  = 3
+ */
+enum {
+   TRACE_CTX_NMI,
+   TRACE_CTX_IRQ,
+   TRACE_CTX_SOFTIRQ,
+   TRACE_CTX_NORMAL,
+};
+
 static __always_inline int trace_get_context_bit(void)
 {
-   int bit;
-
-   if (in_interrupt()) {
-   if (in_nmi())
-   bit = 0;
-
-   else if (in_irq())
-   bit = 1;
-   else
-   bit = 2;
-   } else
-   bit = 3;
+   unsigned long pc = preempt_count();
 
-   return bit;
+   if (!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
+   return TRACE_CTX_NORMAL;
+   else
+   return pc & NMI_MASK ? TRACE_CTX_NMI :
+   pc & HARDIRQ_MASK ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
 }
 
 static __always_inline int trace_test_and_set_recursion(int start, int max)
-- 
2.28.0




[PATCH 00/11 v3] ftrace: Have callbacks handle their own recursion

2020-11-05 Thread VMware


I found that having the ftrace infrastructure use its own trampoline to
handle recursion and RCU by default, unless the ftrace_ops sets the
appropriate flags, had an issue: nobody set those flags, and so their
callbacks would suffer an unnecessary overhead instead of simply handling
the recursion themselves.

This series makes it mandatory that ftrace callbacks handle recursion or set
a flag asking ftrace to do it for them. It also creates helper functions to
help these callbacks implement their own recursion protection.
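
The resulting pattern in a callback is roughly the following (a sketch only;
the callback name is made up, and the real conversions are in the individual
patches):

/* Sketch of a callback using the new recursion helpers. */
static void notrace my_callback(unsigned long ip, unsigned long parent_ip,
				struct ftrace_ops *op, struct pt_regs *regs)
{
	int bit;

	bit = ftrace_test_recursion_trylock();
	if (bit < 0)
		return;		/* recursion detected, bail out */

	/* ... do the real work of the callback ... */

	ftrace_test_recursion_unlock(bit);
}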

Changes since v2:

 - Move get_kprobe() into preempt disabled sections for various archs
 - Use trace_recursion flags in current for protecting recursion of recursion 
recording
 - Make the recursion logic a little cleaner
 - Export GPL the recursion recording

Steven Rostedt (VMware) (11):
  ftrace: Move the recursion testing into global headers
  ftrace: Add ftrace_test_recursion_trylock() helper function
  ftrace: Optimize testing what context current is in
  pstore/ftrace: Add recursion protection to the ftrace callback
  kprobes/ftrace: Add recursion protection to the ftrace callback
  livepatch/ftrace: Add recursion protection to the ftrace callback
  livepatch: Trigger WARNING if livepatch function fails due to recursion
  perf/ftrace: Add recursion protection to the ftrace callback
  perf/ftrace: Check for rcu_is_watching() in callback function
  ftrace: Reverse what the RECURSION flag means in the ftrace_ops
  ftrace: Add recording of functions that caused recursion


 Documentation/trace/ftrace-uses.rst   |  84 
 arch/csky/kernel/probes/ftrace.c  |  12 +-
 arch/parisc/kernel/ftrace.c   |  16 ++-
 arch/powerpc/kernel/kprobes-ftrace.c  |  11 +-
 arch/s390/kernel/ftrace.c |  16 ++-
 arch/x86/kernel/kprobes/ftrace.c  |  12 +-
 fs/pstore/ftrace.c|   6 +
 include/linux/ftrace.h|  13 +-
 include/linux/trace_recursion.h   | 240 ++
 kernel/livepatch/patch.c  |   5 +
 kernel/trace/Kconfig  |  25 
 kernel/trace/Makefile |   1 +
 kernel/trace/fgraph.c |   3 +-
 kernel/trace/ftrace.c |  24 ++--
 kernel/trace/trace.h  | 177 -
 kernel/trace/trace_event_perf.c   |  13 +-
 kernel/trace/trace_events.c   |   1 -
 kernel/trace/trace_functions.c|  14 +-
 kernel/trace/trace_output.c   |   6 +-
 kernel/trace/trace_output.h   |   1 +
 kernel/trace/trace_recursion_record.c | 236 +
 kernel/trace/trace_selftest.c |   7 +-
 kernel/trace/trace_stack.c|   1 -
 23 files changed, 673 insertions(+), 251 deletions(-)
 create mode 100644 include/linux/trace_recursion.h
 create mode 100644 kernel/trace/trace_recursion_record.c


[PATCH 01/11 v3] ftrace: Move the recursion testing into global headers

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

Currently, if a callback is registered to a ftrace function and its
ftrace_ops does not have the RECURSION flag set, it is encapsulated in a
helper function that does the recursion for it.

Really, all the callbacks should have their own recursion protection for
performance reasons. But they should not all implement their own. Move the
recursion helpers to global headers, so that all callbacks can use them.

Link: https://lkml.kernel.org/r/20201028115612.460535...@goodmis.org

Signed-off-by: Steven Rostedt (VMware) 
---
 include/linux/ftrace.h  |   1 +
 include/linux/trace_recursion.h | 187 
 kernel/trace/trace.h| 177 --
 3 files changed, 188 insertions(+), 177 deletions(-)
 create mode 100644 include/linux/trace_recursion.h

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 1bd3a0356ae4..0e4164a7f56d 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -7,6 +7,7 @@
 #ifndef _LINUX_FTRACE_H
 #define _LINUX_FTRACE_H
 
+#include 
 #include 
 #include 
 #include 
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
new file mode 100644
index ..dbb7b6d4c94c
--- /dev/null
+++ b/include/linux/trace_recursion.h
@@ -0,0 +1,187 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TRACE_RECURSION_H
+#define _LINUX_TRACE_RECURSION_H
+
+#include 
+#include 
+
+#ifdef CONFIG_TRACING
+
+/* Only current can touch trace_recursion */
+
+/*
+ * For function tracing recursion:
+ *  The order of these bits are important.
+ *
+ *  When function tracing occurs, the following steps are made:
+ *   If arch does not support a ftrace feature:
+ *call internal function (uses INTERNAL bits) which calls...
+ *   If callback is registered to the "global" list, the list
+ *function is called and recursion checks the GLOBAL bits.
+ *then this function calls...
+ *   The function callback, which can use the FTRACE bits to
+ *check for recursion.
+ *
+ * Now if the arch does not support a feature, and it calls
+ * the global list function which calls the ftrace callback
+ * all three of these steps will do a recursion protection.
+ * There's no reason to do one if the previous caller already
+ * did. The recursion that we are protecting against will
+ * go through the same steps again.
+ *
+ * To prevent the multiple recursion checks, if a recursion
+ * bit is set that is higher than the MAX bit of the current
+ * check, then we know that the check was made by the previous
+ * caller, and we can skip the current check.
+ */
+enum {
+   /* Function recursion bits */
+   TRACE_FTRACE_BIT,
+   TRACE_FTRACE_NMI_BIT,
+   TRACE_FTRACE_IRQ_BIT,
+   TRACE_FTRACE_SIRQ_BIT,
+
+   /* INTERNAL_BITs must be greater than FTRACE_BITs */
+   TRACE_INTERNAL_BIT,
+   TRACE_INTERNAL_NMI_BIT,
+   TRACE_INTERNAL_IRQ_BIT,
+   TRACE_INTERNAL_SIRQ_BIT,
+
+   TRACE_BRANCH_BIT,
+/*
+ * Abuse of the trace_recursion.
+ * As we need a way to maintain state if we are tracing the function
+ * graph in irq because we want to trace a particular function that
+ * was called in irq context but we have irq tracing off. Since this
+ * can only be modified by current, we can reuse trace_recursion.
+ */
+   TRACE_IRQ_BIT,
+
+   /* Set if the function is in the set_graph_function file */
+   TRACE_GRAPH_BIT,
+
+   /*
+* In the very unlikely case that an interrupt came in
+* at a start of graph tracing, and we want to trace
+* the function in that interrupt, the depth can be greater
+* than zero, because of the preempted start of a previous
+* trace. In an even more unlikely case, depth could be 2
+* if a softirq interrupted the start of graph tracing,
+* followed by an interrupt preempting a start of graph
+* tracing in the softirq, and depth can even be 3
+* if an NMI came in at the start of an interrupt function
+* that preempted a softirq start of a function that
+* preempted normal context. Luckily, it can't be
+* greater than 3, so the next two bits are a mask
+* of what the depth is when we set TRACE_GRAPH_BIT
+*/
+
+   TRACE_GRAPH_DEPTH_START_BIT,
+   TRACE_GRAPH_DEPTH_END_BIT,
+
+   /*
+* To implement set_graph_notrace, if this bit is set, we ignore
+* function graph tracing of called functions, until the return
+* function is called to clear it.
+*/
+   TRACE_GRAPH_NOTRACE_BIT,
+
+   /*
+* When transitioning between context, the preempt_count() may
+* not be correct. Allow for a single recursion to cover this case.
+*/
+   TRACE_TRANSITION_BIT,
+};
+
+#define trace_recursion_set(bit)   do { (current)->trace_recursion |= 
(1<<(bit)); } while (0)
+#define trace_recursion

[PATCH 09/11 v3] perf/ftrace: Check for rcu_is_watching() in callback function

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

If an ftrace callback requires RCU to be watching, it sets the
FTRACE_OPS_FL_RCU flag and will not be called when RCU is not "watching".
But this means that it will be called through a trampoline, which slows
down the function tracing a tad. By checking rcu_is_watching() from within
the callback itself, the RCU flag no longer needs to be set in the
ftrace_ops and the callback can safely be called directly.
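
As a rough sketch (hypothetical callback, not the perf one in this patch),
the check simply bails out early when RCU is not watching:

	#include <linux/rcupdate.h>

	static void my_callback(unsigned long ip, unsigned long parent_ip,
				struct ftrace_ops *ops, struct pt_regs *regs)
	{
		/*
		 * Without FTRACE_OPS_FL_RCU the callback can run while RCU
		 * is not watching (e.g. from the idle path), so return early.
		 */
		if (!rcu_is_watching())
			return;

		/* ... safe to touch RCU protected data from here on ... */
	}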

Link: https://lkml.kernel.org/r/20201028115613.591878...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Andrew Morton 
Cc: Jiri Olsa 
Signed-off-by: Steven Rostedt (VMware) 
---
 kernel/trace/trace_event_perf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index fd58d83861d8..a2b9fddb8148 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -441,6 +441,9 @@ perf_ftrace_function_call(unsigned long ip, unsigned long 
parent_ip,
int rctx;
int bit;
 
+   if (!rcu_is_watching())
+   return;
+
if ((unsigned long)ops->private != smp_processor_id())
return;
 
@@ -484,7 +487,6 @@ static int perf_ftrace_function_register(struct perf_event 
*event)
 {
	struct ftrace_ops *ops = &event->ftrace_ops;
 
-   ops->flags   = FTRACE_OPS_FL_RCU;
ops->func= perf_ftrace_function_call;
ops->private = (void *)(unsigned long)nr_cpu_ids;
 
-- 
2.28.0




[PATCH 08/11 v3] perf/ftrace: Add recursion protection to the ftrace callback

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

If an ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback, instead of
just calling the callback directly.

The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless their ftrace_ops states
otherwise.

Link: https://lkml.kernel.org/r/20201028115613.77...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Andrew Morton 
Cc: Jiri Olsa 
Signed-off-by: Steven Rostedt (VMware) 
---
 kernel/trace/trace_event_perf.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 643e0b19920d..fd58d83861d8 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -439,10 +439,15 @@ perf_ftrace_function_call(unsigned long ip, unsigned long 
parent_ip,
struct hlist_head head;
struct pt_regs regs;
int rctx;
+   int bit;
 
if ((unsigned long)ops->private != smp_processor_id())
return;
 
+   bit = ftrace_test_recursion_trylock();
+   if (bit < 0)
+   return;
+
event = container_of(ops, struct perf_event, ftrace_ops);
 
/*
@@ -463,13 +468,15 @@ perf_ftrace_function_call(unsigned long ip, unsigned long 
parent_ip,
 
	entry = perf_trace_buf_alloc(ENTRY_SIZE, NULL, &rctx);
if (!entry)
-   return;
+   goto out;
 
entry->ip = ip;
entry->parent_ip = parent_ip;
perf_trace_buf_submit(entry, ENTRY_SIZE, rctx, TRACE_FN,
			      1, &regs, &head, NULL);
 
+out:
+   ftrace_test_recursion_unlock(bit);
 #undef ENTRY_SIZE
 }
 
-- 
2.28.0




[PATCH 02/11 v3] ftrace: Add ftrace_test_recursion_trylock() helper function

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

To make it easier for ftrace callbacks to have recursion protection, provide
ftrace_test_recursion_trylock() and ftrace_test_recursion_unlock() helpers
that test for recursion.
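
A minimal usage sketch (the callback itself is hypothetical; the pattern
mirrors the trace_functions.c hunk below):

	static void my_callback(unsigned long ip, unsigned long parent_ip,
				struct ftrace_ops *op, struct pt_regs *regs)
	{
		int bit;

		bit = ftrace_test_recursion_trylock();
		if (bit < 0)	/* already inside this callback in this context */
			return;

		/* ... callback body, now safe against recursing into itself ... */

		ftrace_test_recursion_unlock(bit);
	}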

Link: https://lkml.kernel.org/r/20201028115612.634927...@goodmis.org

Signed-off-by: Steven Rostedt (VMware) 
---
 include/linux/trace_recursion.h | 25 +
 kernel/trace/trace_functions.c  | 12 +---
 2 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index dbb7b6d4c94c..f2a949dbfec7 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -183,5 +183,30 @@ static __always_inline void trace_clear_recursion(int bit)
current->trace_recursion = val;
 }
 
+/**
+ * ftrace_test_recursion_trylock - tests for recursion in same context
+ *
+ * Use this for ftrace callbacks. This will detect if the function
+ * tracing recursed in the same context (normal vs interrupt),
+ *
+ * Returns: -1 if a recursion happened.
+ *   >= 0 if no recursion
+ */
+static __always_inline int ftrace_test_recursion_trylock(void)
+{
+   return trace_test_and_set_recursion(TRACE_FTRACE_START, 
TRACE_FTRACE_MAX);
+}
+
+/**
+ * ftrace_test_recursion_unlock - called when function callback is complete
+ * @bit: The return of a successful ftrace_test_recursion_trylock()
+ *
+ * This is used at the end of a ftrace callback.
+ */
+static __always_inline void ftrace_test_recursion_unlock(int bit)
+{
+   trace_clear_recursion(bit);
+}
+
 #endif /* CONFIG_TRACING */
 #endif /* _LINUX_TRACE_RECURSION_H */
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 2c2126e1871d..943756c01190 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -141,22 +141,20 @@ function_trace_call(unsigned long ip, unsigned long 
parent_ip,
if (unlikely(!tr->function_enabled))
return;
 
+   bit = ftrace_test_recursion_trylock();
+   if (bit < 0)
+   return;
+
pc = preempt_count();
preempt_disable_notrace();
 
-   bit = trace_test_and_set_recursion(TRACE_FTRACE_START, 
TRACE_FTRACE_MAX);
-   if (bit < 0)
-   goto out;
-
cpu = smp_processor_id();
data = per_cpu_ptr(tr->array_buffer.data, cpu);
	if (!atomic_read(&data->disabled)) {
local_save_flags(flags);
trace_function(tr, ip, parent_ip, flags, pc);
}
-   trace_clear_recursion(bit);
-
- out:
+   ftrace_test_recursion_unlock(bit);
preempt_enable_notrace();
 }
 
-- 
2.28.0




[PATCH 10/11 v3] ftrace: Reverse what the RECURSION flag means in the ftrace_ops

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

Now that all callbacks are recursion safe, reverse the meaning of the
RECURSION flag and rename it from RECURSION_SAFE to simply RECURSION.
Now only callbacks that request recursion protection will have the added
trampoline that provides it.

Also remove the outdated comment about "PER_CPU" when determining whether
to use the ftrace_ops_assist_func.
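
A hedged sketch (names hypothetical) of what opting back into the protective
trampoline looks like after the rename:

	static void my_unsafe_callback(unsigned long ip, unsigned long parent_ip,
				       struct ftrace_ops *op, struct pt_regs *regs)
	{
		/* does no recursion checking of its own */
	}

	static struct ftrace_ops my_ops = {
		.func	= my_unsafe_callback,
		/* Ask ftrace to wrap the callback in the recursion-checking helper */
		.flags	= FTRACE_OPS_FL_RECURSION,
	};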

Link: https://lkml.kernel.org/r/20201028115613.742454...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Andrew Morton 
Cc: Jonathan Corbet 
Cc: Sebastian Andrzej Siewior 
Cc: Miroslav Benes 
Cc: Kamalesh Babulal 
Cc: Petr Mladek 
Cc: linux-...@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) 
---
 Documentation/trace/ftrace-uses.rst | 82 +
 include/linux/ftrace.h  | 12 ++---
 kernel/trace/fgraph.c   |  3 +-
 kernel/trace/ftrace.c   | 20 ---
 kernel/trace/trace_events.c |  1 -
 kernel/trace/trace_functions.c  |  2 +-
 kernel/trace/trace_selftest.c   |  7 +--
 kernel/trace/trace_stack.c  |  1 -
 8 files changed, 79 insertions(+), 49 deletions(-)

diff --git a/Documentation/trace/ftrace-uses.rst 
b/Documentation/trace/ftrace-uses.rst
index a4955f7e3d19..86cd14b8e126 100644
--- a/Documentation/trace/ftrace-uses.rst
+++ b/Documentation/trace/ftrace-uses.rst
@@ -30,8 +30,8 @@ The ftrace context
   This requires extra care to what can be done inside a callback. A callback
   can be called outside the protective scope of RCU.
 
-The ftrace infrastructure has some protections against recursions and RCU
-but one must still be very careful how they use the callbacks.
+There are helper functions to protect against recursion and to make sure
+RCU is watching. These are explained below.
 
 
 The ftrace_ops structure
@@ -108,6 +108,50 @@ The prototype of the callback function is as follows (as 
of v4.14):
at the start of the function where ftrace was tracing. Otherwise it
either contains garbage, or NULL.
 
+Protect your callback
+=
+
+As functions can be called from anywhere, and it is possible that a function
+called by a callback may also be traced, and call that same callback,
+recursion protection must be used. There are two helper functions that
+can help in this regard. If you start your code with:
+
+   int bit;
+
+   bit = ftrace_test_recursion_trylock();
+   if (bit < 0)
+   return;
+
+and end it with:
+
+   ftrace_test_recursion_unlock(bit);
+
+The code in between will be safe to use, even if it ends up calling a
+function that the callback is tracing. Note, on success,
+ftrace_test_recursion_trylock() will disable preemption, and the
+ftrace_test_recursion_unlock() will enable it again (if it was previously
+enabled).
+
+Alternatively, if the FTRACE_OPS_FL_RECURSION flag is set on the ftrace_ops
+(as explained below), then a helper trampoline will be used to test
+for recursion for the callback and no recursion test needs to be done.
+But this is at the expense of slightly more overhead from an extra
+function call.
+
+If your callback accesses any data or critical section that requires RCU
+protection, it is best to make sure that RCU is "watching", otherwise
+that data or critical section will not be protected as expected. In this
+case add:
+
+   if (!rcu_is_watching())
+   return;
+
+Alternatively, if the FTRACE_OPS_FL_RCU flag is set on the ftrace_ops
+(as explained below), then a helper trampoline will be used to test
+for rcu_is_watching for the callback and no other test needs to be done.
+But this is at the expense of slightly more overhead from an extra
+function call.
+
 
 The ftrace FLAGS
 
@@ -128,26 +172,20 @@ FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED
will not fail with this flag set. But the callback must check if
regs is NULL or not to determine if the architecture supports it.
 
-FTRACE_OPS_FL_RECURSION_SAFE
-   By default, a wrapper is added around the callback to
-   make sure that recursion of the function does not occur. That is,
-   if a function that is called as a result of the callback's execution
-   is also traced, ftrace will prevent the callback from being called
-   again. But this wrapper adds some overhead, and if the callback is
-   safe from recursion, it can set this flag to disable the ftrace
-   protection.
-
-   Note, if this flag is set, and recursion does occur, it could cause
-   the system to crash, and possibly reboot via a triple fault.
-
-   It is OK if another callback traces a function that is called by a
-   callback that is marked recursion safe. Recursion safe callbacks
-   must never trace any function that are called by the callback
-   itself or any nested functions that those functions call.
-
-   If this flag is set, it is possible that the callback will also
-   be called with preemption enabl

[PATCH 05/11 v3] kprobes/ftrace: Add recursion protection to the ftrace callback

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

If an ftrace callback does not supply its own recursion protection and
does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
make a helper trampoline to do so before calling the callback, instead of
just calling the callback directly.

The default for ftrace_ops is going to change. It will expect that handlers
provide their own recursion protection, unless their ftrace_ops states
otherwise.
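
The per-architecture hunks below all follow the same shape; condensed into
one sketch (details such as the lr_saver handling on csky differ per arch):

	void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
				   struct ftrace_ops *ops, struct pt_regs *regs)
	{
		struct kprobe *p;
		int bit;

		bit = ftrace_test_recursion_trylock();
		if (bit < 0)
			return;

		/* get_kprobe() must now run with preemption disabled explicitly */
		preempt_disable_notrace();
		p = get_kprobe((kprobe_opcode_t *)ip);
		if (unlikely(!p) || kprobe_disabled(p))
			goto out;

		/* ... invoke the kprobe pre/post handlers ... */
	out:
		preempt_enable_notrace();
		ftrace_test_recursion_unlock(bit);
	}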

Link: https://lkml.kernel.org/r/20201028115613.140212...@goodmis.org

Cc: Andrew Morton 
Cc: Masami Hiramatsu 
Cc: Guo Ren 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Cc: "Naveen N. Rao" 
Cc: Anil S Keshavamurthy 
Cc: "David S. Miller" 
Cc: linux-c...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) 
---

Changes since v2:

 - Move get_kprobe() into preempt disabled sections for various archs


 arch/csky/kernel/probes/ftrace.c | 12 ++--
 arch/parisc/kernel/ftrace.c  | 16 +---
 arch/powerpc/kernel/kprobes-ftrace.c | 11 ++-
 arch/s390/kernel/ftrace.c| 16 +---
 arch/x86/kernel/kprobes/ftrace.c | 12 ++--
 5 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
index 5264763d05be..5eb2604fdf71 100644
--- a/arch/csky/kernel/probes/ftrace.c
+++ b/arch/csky/kernel/probes/ftrace.c
@@ -13,16 +13,21 @@ int arch_check_ftrace_location(struct kprobe *p)
 void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *ops, struct pt_regs *regs)
 {
+   int bit;
bool lr_saver = false;
struct kprobe *p;
struct kprobe_ctlblk *kcb;
 
-   /* Preempt is disabled by ftrace */
+   bit = ftrace_test_recursion_trylock();
+   if (bit < 0)
+   return;
+
+   preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)ip);
if (!p) {
p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
if (unlikely(!p) || kprobe_disabled(p))
-   return;
+   goto out;
lr_saver = true;
}
 
@@ -56,6 +61,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long 
parent_ip,
 */
__this_cpu_write(current_kprobe, NULL);
}
+out:
+   preempt_enable_notrace();
+   ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 63e3ecb9da81..13d85042810a 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -207,14 +207,21 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned 
long parent_ip,
   struct ftrace_ops *ops, struct pt_regs *regs)
 {
struct kprobe_ctlblk *kcb;
-   struct kprobe *p = get_kprobe((kprobe_opcode_t *)ip);
+   struct kprobe *p;
+   int bit;
 
-   if (unlikely(!p) || kprobe_disabled(p))
+   bit = ftrace_test_recursion_trylock();
+   if (bit < 0)
return;
 
+   preempt_disable_notrace();
+   p = get_kprobe((kprobe_opcode_t *)ip);
+   if (unlikely(!p) || kprobe_disabled(p))
+   goto out;
+
if (kprobe_running()) {
kprobes_inc_nmissed_count(p);
-   return;
+   goto out;
}
 
__this_cpu_write(current_kprobe, p);
@@ -235,6 +242,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long 
parent_ip,
}
}
__this_cpu_write(current_kprobe, NULL);
+out:
+   preempt_enable_notrace();
+   ftrace_test_recursion_unlock(bit);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c 
b/arch/powerpc/kernel/kprobes-ftrace.c
index 972cb28174b2..5df8d50c65ae 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -18,10 +18,16 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long 
parent_nip,
 {
struct kprobe *p;
struct kprobe_ctlblk *kcb;
+   int bit;
 
+   bit = ftrace_test_recursion_trylock();
+   if (bit < 0)
+   return;
+
+   preempt_disable_notrace();
p = get_kprobe((kprobe_opcode_t *)nip);
if (unlikely(!p) || kprobe_disabled(p))
-   return;
+   goto out;
 
kcb = get_kprobe_ctlblk();
if (kprobe_running()) {
@@ -52,6 +58,9 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long 
parent_nip,
  

[PATCH 11/11 v3] ftrace: Add recording of functions that caused recursion

2020-11-05 Thread VMware
From: "Steven Rostedt (VMware)" 

This adds CONFIG_FTRACE_RECORD_RECURSION that will record to a file
"recursed_functions" all the functions that caused recursion while a
callback to the function tracer was running.
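
With this config, the trylock helper takes the ip/parent_ip pair so the
recursing function can be recorded; a usage sketch (callback hypothetical):

	static void my_callback(unsigned long ip, unsigned long parent_ip,
				struct ftrace_ops *op, struct pt_regs *regs)
	{
		int bit;

		/* ip/parent_ip are what show up in "recursed_functions" */
		bit = ftrace_test_recursion_trylock(ip, parent_ip);
		if (bit < 0)
			return;

		/* ... callback body ... */

		ftrace_test_recursion_unlock(bit);
	}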

Cc: Jonathan Corbet 
Cc: Guo Ren 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Cc: Kees Cook 
Cc: Anton Vorontsov 
Cc: Colin Cross 
Cc: Tony Luck 
Cc: Josh Poimboeuf 
Cc: Jiri Kosina 
Cc: Miroslav Benes 
Cc: Petr Mladek 
Cc: Joe Lawrence 
Cc: Kamalesh Babulal 
Cc: Mauro Carvalho Chehab 
Cc: Sebastian Andrzej Siewior 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-c...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: live-patch...@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) 
---

Changes since v2:

 - Use trace_recursion flags in current for protecting recursion of recursion 
recording
 - Make the recursion logic a little cleaner
 - Export GPL the recursion recording

 Documentation/trace/ftrace-uses.rst   |   6 +-
 arch/csky/kernel/probes/ftrace.c  |   2 +-
 arch/parisc/kernel/ftrace.c   |   2 +-
 arch/powerpc/kernel/kprobes-ftrace.c  |   2 +-
 arch/s390/kernel/ftrace.c |   2 +-
 arch/x86/kernel/kprobes/ftrace.c  |   2 +-
 fs/pstore/ftrace.c|   2 +-
 include/linux/trace_recursion.h   |  29 +++-
 kernel/livepatch/patch.c  |   2 +-
 kernel/trace/Kconfig  |  25 +++
 kernel/trace/Makefile |   1 +
 kernel/trace/ftrace.c |   4 +-
 kernel/trace/trace_event_perf.c   |   2 +-
 kernel/trace/trace_functions.c|   2 +-
 kernel/trace/trace_output.c   |   6 +-
 kernel/trace/trace_output.h   |   1 +
 kernel/trace/trace_recursion_record.c | 236 ++
 17 files changed, 306 insertions(+), 20 deletions(-)
 create mode 100644 kernel/trace/trace_recursion_record.c

diff --git a/Documentation/trace/ftrace-uses.rst 
b/Documentation/trace/ftrace-uses.rst
index 86cd14b8e126..5981d5691745 100644
--- a/Documentation/trace/ftrace-uses.rst
+++ b/Documentation/trace/ftrace-uses.rst
@@ -118,7 +118,7 @@ can help in this regard. If you start your code with:
 
int bit;
 
-   bit = ftrace_test_recursion_trylock();
+   bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (bit < 0)
return;
 
@@ -130,7 +130,9 @@ The code in between will be safe to use, even if it ends up 
calling a
 function that the callback is tracing. Note, on success,
 ftrace_test_recursion_trylock() will disable preemption, and the
 ftrace_test_recursion_unlock() will enable it again (if it was previously
-enabled).
+enabled). The instruction pointer (ip) and its parent (parent_ip) are passed to
+ftrace_test_recursion_trylock() to record where the recursion happened
+(if CONFIG_FTRACE_RECORD_RECURSION is set).
 
 Alternatively, if the FTRACE_OPS_FL_RECURSION flag is set on the ftrace_ops
 (as explained below), then a helper trampoline will be used to test
diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
index 5eb2604fdf71..f30b179924ef 100644
--- a/arch/csky/kernel/probes/ftrace.c
+++ b/arch/csky/kernel/probes/ftrace.c
@@ -18,7 +18,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long 
parent_ip,
struct kprobe *p;
struct kprobe_ctlblk *kcb;
 
-   bit = ftrace_test_recursion_trylock();
+   bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (bit < 0)
return;
 
diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
index 13d85042810a..1c5d3732bda2 100644
--- a/arch/parisc/kernel/ftrace.c
+++ b/arch/parisc/kernel/ftrace.c
@@ -210,7 +210,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long 
parent_ip,
struct kprobe *p;
int bit;
 
-   bit = ftrace_test_recursion_trylock();
+   bit = ftrace_test_recursion_trylock(ip, parent_ip);
if (bit < 0)
return;
 
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c 
b/arch/powerpc/kernel/kprobes-ftrace.c
index 5df8d50c65ae..fdfee39938ea 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -20,7 +20,7 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long 
parent_nip,
struct kprobe_ctlblk *kcb;
int bit;
 
-   bit = ftrace_test_recursion_trylock();
+   bit = ftrace_test_recursion_trylock(nip, parent_nip);
if (bit < 0)
return;
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index 8f31c726537a..657c1ab45408 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ 

[tip: core/static_call] tracepoint: Fix out of sync data passing by static caller

2020-10-03 Thread tip-bot2 for Steven Rostedt (VMware)
The following commit has been merged into the core/static_call branch of tip:

Commit-ID: 547305a64632813286700cb6d768bfe773df7d19
Gitweb:
https://git.kernel.org/tip/547305a64632813286700cb6d768bfe773df7d19
Author:Steven Rostedt (VMware) 
AuthorDate:Thu, 01 Oct 2020 21:27:57 -04:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 02 Oct 2020 21:18:25 +02:00

tracepoint: Fix out of sync data passing by static caller

Naresh reported a bug that appears to be a side effect of the static
calls. It happens when going from more than one tracepoint callback to
a single one, and removing the first callback on the list. The list of
tracepoint callbacks holds, for each callback, the function to call with
the parameters of that tracepoint and a pointer to its associated data.

 old_list:
	0: func = foo; data = NULL;
	1: func = bar; data = &bar_struct;

 new_list:
	0: func = bar; data = &bar_struct;

	CPU 0				CPU 1
	-----				-----
					tp_funcs = old_list;
					tp_static_caller = tp_iterator

					__DO_TRACE()

	data = tp_funcs[0].data = NULL;

					tp_funcs = new_list;
					tracepoint_update_call()
					   tp_static_caller = tp_funcs[0] = bar;
	tp_static_caller(data)
	   bar(data)
	     x = data->item = NULL->item

					BOOM!

To solve this, add a tracepoint_synchronize_unregister() between
changing tp_funcs and updating the static call, which does both a
synchronize_rcu() and a synchronize_srcu(). This ensures that when the
static call is updated to point at the single callback, that callback
receives the data it was registered with.
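
For reference, the barrier relied on here is roughly the existing helper
below (sketched for context, not part of this diff):

	/* Waits for both regular and SRCU tracepoint read sections to finish */
	void tracepoint_synchronize_unregister(void)
	{
		synchronize_srcu(&tracepoint_srcu);
		synchronize_rcu();
	}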

Fixes: d25e37d89dd2f ("tracepoint: Optimize using static_call()")
Reported-by: Naresh Kamboju 
Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
https://lore.kernel.org/linux-next/CA+G9fYvPXVRO0NV7yL=FxCmFEMYkCwdz7R=9w+_votpt824...@mail.gmail.com
---
 kernel/tracepoint.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index e92f3fb..26efd22 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -221,7 +221,7 @@ static void *func_remove(struct tracepoint_func **funcs,
return old;
 }
 
-static void tracepoint_update_call(struct tracepoint *tp, struct 
tracepoint_func *tp_funcs)
+static void tracepoint_update_call(struct tracepoint *tp, struct 
tracepoint_func *tp_funcs, bool sync)
 {
void *func = tp->iterator;
 
@@ -229,8 +229,17 @@ static void tracepoint_update_call(struct tracepoint *tp, 
struct tracepoint_func
if (!tp->static_call_key)
return;
 
-   if (!tp_funcs[1].func)
+   if (!tp_funcs[1].func) {
func = tp_funcs[0].func;
+   /*
+* If going from the iterator back to a single caller,
+* we need to synchronize with __DO_TRACE to make sure
+* that the data passed to the callback is the one that
+* belongs to that callback.
+*/
+   if (sync)
+   tracepoint_synchronize_unregister();
+   }
 
__static_call_update(tp->static_call_key, tp->static_call_tramp, func);
 }
@@ -265,7 +274,7 @@ static int tracepoint_add_func(struct tracepoint *tp,
 * include/linux/tracepoint.h using rcu_dereference_sched().
 */
rcu_assign_pointer(tp->funcs, tp_funcs);
-   tracepoint_update_call(tp, tp_funcs);
+   tracepoint_update_call(tp, tp_funcs, false);
	static_key_enable(&tp->key);
 
release_probes(old);
@@ -297,11 +306,12 @@ static int tracepoint_remove_func(struct tracepoint *tp,
tp->unregfunc();
 
	static_key_disable(&tp->key);
+   rcu_assign_pointer(tp->funcs, tp_funcs);
} else {
-   tracepoint_update_call(tp, tp_funcs);
+   rcu_assign_pointer(tp->funcs, tp_funcs);
+   tracepoint_update_call(tp, tp_funcs,
+  tp_funcs[0].func != old[0].func);
}
-
-   rcu_assign_pointer(tp->funcs, tp_funcs);
release_probes(old);
return 0;
 }


[PATCH v2 2/2] tools lib traceevent: Man page libtraceevent debug APIs

2020-09-30 Thread Tzvetomir Stoyanov (VMware)
Add a new libtraceevent man page with documentation of these debug APIs:
 tep_print_printk
 tep_print_funcs
 tep_set_test_filters
 tep_plugin_print_options

Signed-off-by: Tzvetomir Stoyanov (VMware) 
---
v2 changes:
 - Removed an extra interval from the example's code.

 .../Documentation/libtraceevent-debug.txt | 95 +++
 1 file changed, 95 insertions(+)
 create mode 100644 tools/lib/traceevent/Documentation/libtraceevent-debug.txt

diff --git a/tools/lib/traceevent/Documentation/libtraceevent-debug.txt 
b/tools/lib/traceevent/Documentation/libtraceevent-debug.txt
new file mode 100644
index ..a0be84fe8990
--- /dev/null
+++ b/tools/lib/traceevent/Documentation/libtraceevent-debug.txt
@@ -0,0 +1,95 @@
+libtraceevent(3)
+
+
+NAME
+
+tep_print_printk, tep_print_funcs, tep_set_test_filters, 
tep_plugin_print_options -
+Print libtraceevent internal information.
+
+SYNOPSIS
+
+[verse]
+--
+*#include *
+*#include *
+
+void *tep_print_printk*(struct tep_handle pass:[*]tep);
+void *tep_print_funcs*(struct tep_handle pass:[*]tep);
+void *tep_set_test_filters*(struct tep_handle pass:[*]tep, int test_filters);
+void *tep_plugin_print_options*(struct trace_seq pass:[*]s);
+--
+
+DESCRIPTION
+---
+The _tep_print_printk()_ function prints the printk string formats that were
+stored for this tracing session. The _tep_ argument is trace event parser 
context.
+
+The _tep_print_funcs()_ function prints the stored function name to address 
mapping
+for this tracing session. The _tep_ argument is trace event parser context.
+
+The _tep_set_test_filters()_ function sets a flag to test a filter string. If 
this
+flag is set, when the _tep_filter_add_filter_str()_ API is called, it will print 
the filter
+string instead of adding it. The _tep_ argument is trace event parser context.
+The _test_filters_ argument is the test flag that will be set.
+
+The _tep_plugin_print_options()_ function writes a list of the registered 
plugin options
+into _s_.
+
+EXAMPLE
+---
+[source,c]
+--
+#include 
+#include 
+...
+struct tep_handle *tep = tep_alloc();
+...
+   tep_print_printk(tep);
+...
+   tep_print_funcs(tep);
+...
+struct tep_event_filter *filter = tep_filter_alloc(tep);
+   tep_set_test_filters(tep, 1);
+   tep_filter_add_filter_str(filter, "sched/sched_wakeup:target_cpu==1");
+   tep_set_test_filters(tep, 0);
+   tep_filter_free(filter);
+...
+struct trace_seq seq;
+trace_seq_init(&seq);
+
+   tep_plugin_print_options(&seq);
+...
+--
+
+FILES
+-
+[verse]
+--
+*event-parse.h*
+   Header file to include in order to have access to the library APIs.
+*-ltraceevent*
+   Linker switch to add when building a program that uses the library.
+--
+
+SEE ALSO
+
+_libtraceevent(3)_, _trace-cmd(1)_
+
+AUTHOR
+--
+[verse]
+--
+*Steven Rostedt* , author of *libtraceevent*.
+*Tzvetomir Stoyanov* , author of this man page.
+--
+REPORTING BUGS
+--
+Report bugs to  
+
+LICENSE
+---
+libtraceevent is Free Software licensed under the GNU LGPL 2.1
+
+RESOURCES
+-
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
-- 
2.26.2



[PATCH v2 1/2] tools lib traceevent: Man page for tep_add_plugin_path() API

2020-09-30 Thread Tzvetomir Stoyanov (VMware)
Add documentation of tep_add_plugin_path() API in the libtraceevent
plugin man page.

Signed-off-by: Tzvetomir Stoyanov (VMware) 
---
v2 changes:
 - Fixed grammar mistakes, found by Steven Rostedt.

 .../Documentation/libtraceevent-plugins.txt   | 25 +--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/tools/lib/traceevent/Documentation/libtraceevent-plugins.txt 
b/tools/lib/traceevent/Documentation/libtraceevent-plugins.txt
index 4d6394397d92..4b7ac5c5217b 100644
--- a/tools/lib/traceevent/Documentation/libtraceevent-plugins.txt
+++ b/tools/lib/traceevent/Documentation/libtraceevent-plugins.txt
@@ -3,7 +3,7 @@ libtraceevent(3)
 
 NAME
 
-tep_load_plugins, tep_unload_plugins, tep_load_plugins_hook - Load / unload 
traceevent plugins.
+tep_load_plugins, tep_unload_plugins, tep_load_plugins_hook, 
tep_add_plugin_path - Load / unload traceevent plugins.
 
 SYNOPSIS
 
@@ -19,6 +19,8 @@ void *tep_load_plugins_hook*(struct tep_handle pass:[*]_tep_, 
const char pass:[*
   const char pass:[*]name,
   void pass:[*]data),
   void pass:[*]_data_);
+int *tep_add_plugin_path*(struct tep_handle pass:[*]tep, char pass:[*]path,
+ enum tep_plugin_load_priority prio);
 --
 
 DESCRIPTION
@@ -52,16 +54,33 @@ _tep_load_plugins()_. The _tep_ argument is trace event 
parser context. The
 _plugin_list_ is the list of loaded plugins, returned by
 the _tep_load_plugins()_ function.
 
-The _tep_load_plugins_hook_ function walks through all directories with plugins
+The _tep_load_plugins_hook()_ function walks through all directories with 
plugins
 and calls user specified _load_plugin()_ hook for each plugin file. Only files
 with given _suffix_ are considered to be plugins. The _data_ is a user 
specified
 context, passed to _load_plugin()_. Directories and the walk order are the same
 as in _tep_load_plugins()_ API.
 
+The _tep_add_plugin_path()_ function adds additional directories with plugins 
in
+the _tep_->plugins_dir list. It must be called before _tep_load_plugins()_ in 
order
+for the plugins from the new directories to be loaded. The _tep_ argument is
+the trace event parser context. The _path_ is the full path to the new plugin
+directory. The _prio_ argument specifies the loading priority order for the
+new directory of plugins. The loading priority is important in case of 
different
+versions of the same plugin located in multiple plugin directories.The last 
loaded
+plugin wins. The priority can be:
+[verse]
+--
+   _TEP_PLUGIN_FIRST_  - Load plugins from this directory first
+   _TEP_PLUGIN_LAST_   - Load plugins from this directory last
+--
+Where the plugins in TEP_PLUGIN_LAST will take precedence over the
+plugins in the other directories.
+
 RETURN VALUE
 
 The _tep_load_plugins()_ function returns a list of successfully loaded 
plugins,
 or NULL in case no plugins are loaded.
+The _tep_add_plugin_path()_ function returns -1 in case of an error, 0 
otherwise.
 
 EXAMPLE
 ---
@@ -71,6 +90,8 @@ EXAMPLE
 ...
 struct tep_handle *tep = tep_alloc();
 ...
+tep_add_plugin_path(tep, "~/dev_plugins", TEP_PLUGIN_LAST);
+...
 struct tep_plugin_list *plugins = tep_load_plugins(tep);
 if (plugins == NULL) {
/* no plugins are loaded */
-- 
2.26.2



[PATCH v4] tools lib traceevent: Hide non API functions

2020-09-30 Thread Tzvetomir Stoyanov (VMware)
There are internal library functions that are not declared as static.
They are used inside the library from different files. Hide them from
the library users, as they are not part of the API.
These functions are made hidden and are renamed without the prefix "tep_":
 tep_free_plugin_paths
 tep_peek_char
 tep_buffer_init
 tep_get_input_buf_ptr
 tep_get_input_buf
 tep_read_token
 tep_free_token
 tep_free_event
 tep_free_format_field
 __tep_parse_format

Link: 
https://lore.kernel.org/linux-trace-devel/e4afdd82deb5e023d53231bb13e08dca78085fb0.ca...@decadent.org.uk/
Reported-by: Ben Hutchings 
Signed-off-by: Tzvetomir Stoyanov (VMware) 
---
v1 of the patch is here: 
https://lore.kernel.org/r/20200924070609.100771-2-tz.stoya...@gmail.com
v2 changes (addressed Steven's comments):
  - Removed leading underscores from the names of newly hidden internal
functions.
v3 changes (addressed Steven's comment):
  - Moved comments from removed APIs to internal functions.
  - Fixed a typo in patch description.
v4 changes (addressed Steven's comment):
  - Coding style fixes.
  - Removed "__" prefix from hidden functions.

 tools/lib/traceevent/event-parse-api.c   |   8 +-
 tools/lib/traceevent/event-parse-local.h |  24 +++--
 tools/lib/traceevent/event-parse.c   | 125 ++-
 tools/lib/traceevent/event-parse.h   |   8 --
 tools/lib/traceevent/event-plugin.c  |   2 +-
 tools/lib/traceevent/parse-filter.c  |  23 ++---
 6 files changed, 83 insertions(+), 107 deletions(-)

diff --git a/tools/lib/traceevent/event-parse-api.c 
b/tools/lib/traceevent/event-parse-api.c
index 4faf52a65791..f8361e45d446 100644
--- a/tools/lib/traceevent/event-parse-api.c
+++ b/tools/lib/traceevent/event-parse-api.c
@@ -92,7 +92,7 @@ bool tep_test_flag(struct tep_handle *tep, enum tep_flag flag)
return false;
 }
 
-unsigned short tep_data2host2(struct tep_handle *tep, unsigned short data)
+__hidden unsigned short data2host2(struct tep_handle *tep, unsigned short data)
 {
unsigned short swap;
 
@@ -105,7 +105,7 @@ unsigned short tep_data2host2(struct tep_handle *tep, 
unsigned short data)
return swap;
 }
 
-unsigned int tep_data2host4(struct tep_handle *tep, unsigned int data)
+__hidden unsigned int data2host4(struct tep_handle *tep, unsigned int data)
 {
unsigned int swap;
 
@@ -120,8 +120,8 @@ unsigned int tep_data2host4(struct tep_handle *tep, 
unsigned int data)
return swap;
 }
 
-unsigned long long
-tep_data2host8(struct tep_handle *tep, unsigned long long data)
+__hidden  unsigned long long
+data2host8(struct tep_handle *tep, unsigned long long data)
 {
unsigned long long swap;
 
diff --git a/tools/lib/traceevent/event-parse-local.h 
b/tools/lib/traceevent/event-parse-local.h
index d805a920af6f..fd4bbcfbb849 100644
--- a/tools/lib/traceevent/event-parse-local.h
+++ b/tools/lib/traceevent/event-parse-local.h
@@ -15,6 +15,8 @@ struct event_handler;
 struct func_resolver;
 struct tep_plugins_dir;
 
+#define __hidden __attribute__((visibility ("hidden")))
+
 struct tep_handle {
int ref_count;
 
@@ -102,12 +104,20 @@ struct tep_print_parse {
struct tep_print_arg*len_as_arg;
 };
 
-void tep_free_event(struct tep_event *event);
-void tep_free_format_field(struct tep_format_field *field);
-void tep_free_plugin_paths(struct tep_handle *tep);
-
-unsigned short tep_data2host2(struct tep_handle *tep, unsigned short data);
-unsigned int tep_data2host4(struct tep_handle *tep, unsigned int data);
-unsigned long long tep_data2host8(struct tep_handle *tep, unsigned long long 
data);
+void free_tep_event(struct tep_event *event);
+void free_tep_format_field(struct tep_format_field *field);
+void free_tep_plugin_paths(struct tep_handle *tep);
+
+unsigned short data2host2(struct tep_handle *tep, unsigned short data);
+unsigned int data2host4(struct tep_handle *tep, unsigned int data);
+unsigned long long data2host8(struct tep_handle *tep, unsigned long long data);
+
+/* access to the internal parser */
+int peek_char(void);
+void init_input_buf(const char *buf, unsigned long long size);
+unsigned long long get_input_buf_ptr(void);
+const char *get_input_buf(void);
+enum tep_event_type read_token(char **tok);
+void free_token(char *tok);
 
 #endif /* _PARSE_EVENTS_INT_H */
diff --git a/tools/lib/traceevent/event-parse.c 
b/tools/lib/traceevent/event-parse.c
index 5acc18b32606..fe58843d047c 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -54,19 +54,26 @@ static int show_warning = 1;
warning(fmt, ##__VA_ARGS__);\
} while (0)
 
-static void init_input_buf(const char *buf, unsigned long long size)
+/**
+ * init_input_buf - init buffer for parsing
+ * @buf: buffer to parse
+ * @size: the size of the buffer
+ *
+ * Initializes the internal buffer that tep_read_token() will parse.
+ */
+__hidden void init_input_buf(const char

[PATCH 2/2] tools lib traceevent: Man page libtraceevent debug APIs

2020-09-29 Thread Tzvetomir Stoyanov (VMware)
Add a new libtraceevent man page with documentation of these debug APIs:
 tep_print_printk
 tep_print_funcs
 tep_set_test_filters
 tep_plugin_print_options

Signed-off-by: Tzvetomir Stoyanov (VMware) 
---
 .../Documentation/libtraceevent-debug.txt | 95 +++
 1 file changed, 95 insertions(+)
 create mode 100644 tools/lib/traceevent/Documentation/libtraceevent-debug.txt

diff --git a/tools/lib/traceevent/Documentation/libtraceevent-debug.txt 
b/tools/lib/traceevent/Documentation/libtraceevent-debug.txt
new file mode 100644
index ..9a2d1ffa2d72
--- /dev/null
+++ b/tools/lib/traceevent/Documentation/libtraceevent-debug.txt
@@ -0,0 +1,95 @@
+libtraceevent(3)
+
+
+NAME
+
+tep_print_printk, tep_print_funcs, tep_set_test_filters, 
tep_plugin_print_options -
+Print libtraceevent internal information.
+
+SYNOPSIS
+
+[verse]
+--
+*#include *
+*#include *
+
+void *tep_print_printk*(struct tep_handle pass:[*]tep);
+void *tep_print_funcs*(struct tep_handle pass:[*]tep);
+void *tep_set_test_filters*(struct tep_handle pass:[*]tep, int test_filters);
+void *tep_plugin_print_options*(struct trace_seq pass:[*]s);
+--
+
+DESCRIPTION
+---
+The _tep_print_printk()_ function prints the printk string formats that were
+stored for this tracing session. The _tep_ argument is trace event parser 
context.
+
+The _tep_print_funcs()_ function prints the stored function name to address 
mapping
+for this tracing session. The _tep_ argument is trace event parser context.
+
+The _tep_set_test_filters()_ function sets a flag to test a filter string. If 
this
+flag is set, when the _tep_filter_add_filter_str()_ API is called, it will print 
the filter
+string instead of adding it. The _tep_ argument is trace event parser context.
+The _test_filters_ argument is the test flag that will be set.
+
+The _tep_plugin_print_options()_ function writes a list of the registered 
plugin options
+into _s_.
+
+EXAMPLE
+---
+[source,c]
+--
+#include 
+#include 
+...
+struct tep_handle *tep = tep_alloc();
+...
+   tep_print_printk(tep);
+...
+   tep_print_funcs(tep);
+...
+struct tep_event_filter *filter = tep_filter_alloc(tep);
+   tep_set_test_filters(tep, 1);
+   tep_filter_add_filter_str(filter, "sche d/sched_wakeup:target_cpu==1");
+   tep_set_test_filters(tep, 0);
+   tep_filter_free(filter);
+...
+struct trace_seq seq;
+trace_seq_init(&seq);
+
+   tep_plugin_print_options(&seq);
+...
+--
+
+FILES
+-
+[verse]
+--
+*event-parse.h*
+   Header file to include in order to have access to the library APIs.
+*-ltraceevent*
+   Linker switch to add when building a program that uses the library.
+--
+
+SEE ALSO
+
+_libtraceevent(3)_, _trace-cmd(1)_
+
+AUTHOR
+--
+[verse]
+--
+*Steven Rostedt* , author of *libtraceevent*.
+*Tzvetomir Stoyanov* , author of this man page.
+--
+REPORTING BUGS
+--
+Report bugs to  
+
+LICENSE
+---
+libtraceevent is Free Software licensed under the GNU LGPL 2.1
+
+RESOURCES
+-
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
-- 
2.26.2



[PATCH 1/2] tools lib traceevent: Man page for tep_add_plugin_path() API

2020-09-29 Thread Tzvetomir Stoyanov (VMware)
Add documentation of tep_add_plugin_path() API in the libtraceevent plugin man 
page.

Signed-off-by: Tzvetomir Stoyanov (VMware) 
---
 .../Documentation/libtraceevent-plugins.txt   | 22 +--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/tools/lib/traceevent/Documentation/libtraceevent-plugins.txt 
b/tools/lib/traceevent/Documentation/libtraceevent-plugins.txt
index 4d6394397d92..e584b8c777ad 100644
--- a/tools/lib/traceevent/Documentation/libtraceevent-plugins.txt
+++ b/tools/lib/traceevent/Documentation/libtraceevent-plugins.txt
@@ -3,7 +3,7 @@ libtraceevent(3)
 
 NAME
 
-tep_load_plugins, tep_unload_plugins, tep_load_plugins_hook - Load / unload 
traceevent plugins.
+tep_load_plugins, tep_unload_plugins, tep_load_plugins_hook, 
tep_add_plugin_path - Load / unload traceevent plugins.
 
 SYNOPSIS
 
@@ -19,6 +19,8 @@ void *tep_load_plugins_hook*(struct tep_handle pass:[*]_tep_, 
const char pass:[*
   const char pass:[*]name,
   void pass:[*]data),
   void pass:[*]_data_);
+int *tep_add_plugin_path*(struct tep_handle pass:[*]tep, char pass:[*]path,
+ enum tep_plugin_load_priority prio);
 --
 
 DESCRIPTION
@@ -52,16 +54,30 @@ _tep_load_plugins()_. The _tep_ argument is trace event 
parser context. The
 _plugin_list_ is the list of loaded plugins, returned by
 the _tep_load_plugins()_ function.
 
-The _tep_load_plugins_hook_ function walks through all directories with plugins
+The _tep_load_plugins_hook()_ function walks through all directories with 
plugins
 and calls user specified _load_plugin()_ hook for each plugin file. Only files
 with given _suffix_ are considered to be plugins. The _data_ is a user 
specified
 context, passed to _load_plugin()_. Directories and the walk order are the same
 as in _tep_load_plugins()_ API.
 
+The _tep_add_plugin_path()_ functions adds additional directories with plugins 
in
+the _tep_->plugins_dir list. It must be called before _tep_load_plugins()_ in 
order
+the plugins from the new directories to be loaded. The _tep_ argument is trace 
event
+parser context. The _path_ is the full path to the new plugin directory. The 
_prio_
+argument specifies the loading priority of plugins from the new directory. The 
loading
+priority is important in case of different versions of the same plugin located 
in
+multiple plugin directories.The last loaded plugin wins. The priority can be:
+[verse]
+--
+   _TEP_PLUGIN_FIRST_  - Load plugins from this directory first
+   _TEP_PLUGIN_LAST_   - Load plugins from this directory last
+--
+
 RETURN VALUE
 
 The _tep_load_plugins()_ function returns a list of successfully loaded 
plugins,
 or NULL in case no plugins are loaded.
+The _tep_add_plugin_path()_ function returns -1 in case of an error, 0 
otherwise.
 
 EXAMPLE
 ---
@@ -71,6 +87,8 @@ EXAMPLE
 ...
 struct tep_handle *tep = tep_alloc();
 ...
+tep_add_plugin_path(tep, "~/dev_plugins", TEP_PLUGIN_LAST);
+...
 struct tep_plugin_list *plugins = tep_load_plugins(tep);
 if (plugins == NULL) {
/* no plugins are loaded */
-- 
2.26.2



[PATCH v3] tools lib traceevent: Hide non API functions

2020-09-29 Thread Tzvetomir Stoyanov (VMware)
There are internal library functions, which are not declared as a static.
They are used inside the library from different files. Hide them from
the library users, as they are not part of the API.
These functions are made hidden and are renamed without the prefix "tep_":
 tep_free_plugin_paths
 tep_peek_char
 tep_buffer_init
 tep_get_input_buf_ptr
 tep_get_input_buf
 tep_read_token
 tep_free_token
 tep_free_event
 tep_free_format_field

Link: 
https://lore.kernel.org/linux-trace-devel/e4afdd82deb5e023d53231bb13e08dca78085fb0.ca...@decadent.org.uk/
Reported-by: Ben Hutchings 
Signed-off-by: Tzvetomir Stoyanov (VMware) 
---
v1 of the patch is here: 
https://lore.kernel.org/r/20200924070609.100771-2-tz.stoya...@gmail.com
v2 changes (addressed Steven's comments):
  - Removed leading underscores from the names of newly hidden internal
functions.
v3 changes (addressed Steven's comment):
  - Moved comments from removed APIs to internal functions.
  - Fixed a typo in patch description.

 tools/lib/traceevent/event-parse-api.c   |   8 +-
 tools/lib/traceevent/event-parse-local.h |  24 +++--
 tools/lib/traceevent/event-parse.c   | 125 ++-
 tools/lib/traceevent/event-parse.h   |   8 --
 tools/lib/traceevent/event-plugin.c  |   2 +-
 tools/lib/traceevent/parse-filter.c  |  23 ++---
 6 files changed, 83 insertions(+), 107 deletions(-)

diff --git a/tools/lib/traceevent/event-parse-api.c 
b/tools/lib/traceevent/event-parse-api.c
index 4faf52a65791..f8361e45d446 100644
--- a/tools/lib/traceevent/event-parse-api.c
+++ b/tools/lib/traceevent/event-parse-api.c
@@ -92,7 +92,7 @@ bool tep_test_flag(struct tep_handle *tep, enum tep_flag flag)
return false;
 }
 
-unsigned short tep_data2host2(struct tep_handle *tep, unsigned short data)
+__hidden unsigned short data2host2(struct tep_handle *tep, unsigned short data)
 {
unsigned short swap;
 
@@ -105,7 +105,7 @@ unsigned short tep_data2host2(struct tep_handle *tep, 
unsigned short data)
return swap;
 }
 
-unsigned int tep_data2host4(struct tep_handle *tep, unsigned int data)
+__hidden unsigned int data2host4(struct tep_handle *tep, unsigned int data)
 {
unsigned int swap;
 
@@ -120,8 +120,8 @@ unsigned int tep_data2host4(struct tep_handle *tep, 
unsigned int data)
return swap;
 }
 
-unsigned long long
-tep_data2host8(struct tep_handle *tep, unsigned long long data)
+__hidden  unsigned long long
+data2host8(struct tep_handle *tep, unsigned long long data)
 {
unsigned long long swap;
 
diff --git a/tools/lib/traceevent/event-parse-local.h 
b/tools/lib/traceevent/event-parse-local.h
index d805a920af6f..fd4bbcfbb849 100644
--- a/tools/lib/traceevent/event-parse-local.h
+++ b/tools/lib/traceevent/event-parse-local.h
@@ -15,6 +15,8 @@ struct event_handler;
 struct func_resolver;
 struct tep_plugins_dir;
 
+#define __hidden __attribute__((visibility ("hidden")))
+
 struct tep_handle {
int ref_count;
 
@@ -102,12 +104,20 @@ struct tep_print_parse {
struct tep_print_arg*len_as_arg;
 };
 
-void tep_free_event(struct tep_event *event);
-void tep_free_format_field(struct tep_format_field *field);
-void tep_free_plugin_paths(struct tep_handle *tep);
-
-unsigned short tep_data2host2(struct tep_handle *tep, unsigned short data);
-unsigned int tep_data2host4(struct tep_handle *tep, unsigned int data);
-unsigned long long tep_data2host8(struct tep_handle *tep, unsigned long long 
data);
+void free_tep_event(struct tep_event *event);
+void free_tep_format_field(struct tep_format_field *field);
+void free_tep_plugin_paths(struct tep_handle *tep);
+
+unsigned short data2host2(struct tep_handle *tep, unsigned short data);
+unsigned int data2host4(struct tep_handle *tep, unsigned int data);
+unsigned long long data2host8(struct tep_handle *tep, unsigned long long data);
+
+/* access to the internal parser */
+int peek_char(void);
+void init_input_buf(const char *buf, unsigned long long size);
+unsigned long long get_input_buf_ptr(void);
+const char *get_input_buf(void);
+enum tep_event_type read_token(char **tok);
+void free_token(char *tok);
 
 #endif /* _PARSE_EVENTS_INT_H */
diff --git a/tools/lib/traceevent/event-parse.c 
b/tools/lib/traceevent/event-parse.c
index 5acc18b32606..590640e97ecc 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -54,19 +54,26 @@ static int show_warning = 1;
warning(fmt, ##__VA_ARGS__);\
} while (0)
 
-static void init_input_buf(const char *buf, unsigned long long size)
+/**
+ * init_input_buf - init buffer for parsing
+ * @buf: buffer to parse
+ * @size: the size of the buffer
+ *
+ * Initializes the internal buffer that tep_read_token() will parse.
+ */
+__hidden void init_input_buf(const char *buf, unsigned long long size)
 {
input_buf = buf;
input_buf_siz = size;
input_buf_ptr = 0;
 }
 
-const char *tep_get_input_buf(void)

[PATCH v2] tools lib traceevent: Hide non API functions

2020-09-28 Thread Tzvetomir Stoyanov (VMware)
There are internal library functions, which are not decalred as a static.
They are used inside the library from different files. Hide them from
the library users, as they are not part of the API.
These functions are made hidden and are renamed without the prefix "tep_":
 tep_free_plugin_paths
 tep_peek_char
 tep_buffer_init
 tep_get_input_buf_ptr
 tep_get_input_buf
 tep_read_token
 tep_free_token
 tep_free_event
 tep_free_format_field

Reported-by: Ben Hutchings 
Signed-off-by: Tzvetomir Stoyanov (VMware) 
---
v1 of the patch is here: 
https://lore.kernel.org/r/20200924070609.100771-2-tz.stoya...@gmail.com
v2 changes:
  - Removed leading underscores from the names of newly hidden internal
functions.

 tools/lib/traceevent/event-parse-api.c   |  8 +-
 tools/lib/traceevent/event-parse-local.h | 24 --
 tools/lib/traceevent/event-parse.c   | 96 ++--
 tools/lib/traceevent/event-parse.h   |  8 --
 tools/lib/traceevent/event-plugin.c  |  2 +-
 tools/lib/traceevent/parse-filter.c  | 23 +++---
 6 files changed, 56 insertions(+), 105 deletions(-)

diff --git a/tools/lib/traceevent/event-parse-api.c 
b/tools/lib/traceevent/event-parse-api.c
index 4faf52a65791..f8361e45d446 100644
--- a/tools/lib/traceevent/event-parse-api.c
+++ b/tools/lib/traceevent/event-parse-api.c
@@ -92,7 +92,7 @@ bool tep_test_flag(struct tep_handle *tep, enum tep_flag flag)
return false;
 }
 
-unsigned short tep_data2host2(struct tep_handle *tep, unsigned short data)
+__hidden unsigned short data2host2(struct tep_handle *tep, unsigned short data)
 {
unsigned short swap;
 
@@ -105,7 +105,7 @@ unsigned short tep_data2host2(struct tep_handle *tep, 
unsigned short data)
return swap;
 }
 
-unsigned int tep_data2host4(struct tep_handle *tep, unsigned int data)
+__hidden unsigned int data2host4(struct tep_handle *tep, unsigned int data)
 {
unsigned int swap;
 
@@ -120,8 +120,8 @@ unsigned int tep_data2host4(struct tep_handle *tep, 
unsigned int data)
return swap;
 }
 
-unsigned long long
-tep_data2host8(struct tep_handle *tep, unsigned long long data)
+__hidden  unsigned long long
+data2host8(struct tep_handle *tep, unsigned long long data)
 {
unsigned long long swap;
 
diff --git a/tools/lib/traceevent/event-parse-local.h 
b/tools/lib/traceevent/event-parse-local.h
index d805a920af6f..fd4bbcfbb849 100644
--- a/tools/lib/traceevent/event-parse-local.h
+++ b/tools/lib/traceevent/event-parse-local.h
@@ -15,6 +15,8 @@ struct event_handler;
 struct func_resolver;
 struct tep_plugins_dir;
 
+#define __hidden __attribute__((visibility ("hidden")))
+
 struct tep_handle {
int ref_count;
 
@@ -102,12 +104,20 @@ struct tep_print_parse {
struct tep_print_arg*len_as_arg;
 };
 
-void tep_free_event(struct tep_event *event);
-void tep_free_format_field(struct tep_format_field *field);
-void tep_free_plugin_paths(struct tep_handle *tep);
-
-unsigned short tep_data2host2(struct tep_handle *tep, unsigned short data);
-unsigned int tep_data2host4(struct tep_handle *tep, unsigned int data);
-unsigned long long tep_data2host8(struct tep_handle *tep, unsigned long long 
data);
+void free_tep_event(struct tep_event *event);
+void free_tep_format_field(struct tep_format_field *field);
+void free_tep_plugin_paths(struct tep_handle *tep);
+
+unsigned short data2host2(struct tep_handle *tep, unsigned short data);
+unsigned int data2host4(struct tep_handle *tep, unsigned int data);
+unsigned long long data2host8(struct tep_handle *tep, unsigned long long data);
+
+/* access to the internal parser */
+int peek_char(void);
+void init_input_buf(const char *buf, unsigned long long size);
+unsigned long long get_input_buf_ptr(void);
+const char *get_input_buf(void);
+enum tep_event_type read_token(char **tok);
+void free_token(char *tok);
 
 #endif /* _PARSE_EVENTS_INT_H */
diff --git a/tools/lib/traceevent/event-parse.c 
b/tools/lib/traceevent/event-parse.c
index 5acc18b32606..032ecb22cde9 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -54,19 +54,19 @@ static int show_warning = 1;
warning(fmt, ##__VA_ARGS__);\
} while (0)
 
-static void init_input_buf(const char *buf, unsigned long long size)
+__hidden void init_input_buf(const char *buf, unsigned long long size)
 {
input_buf = buf;
input_buf_siz = size;
input_buf_ptr = 0;
 }
 
-const char *tep_get_input_buf(void)
+__hidden const char *get_input_buf(void)
 {
return input_buf;
 }
 
-unsigned long long tep_get_input_buf_ptr(void)
+__hidden unsigned long long get_input_buf_ptr(void)
 {
return input_buf_ptr;
 }
@@ -100,26 +100,13 @@ process_defined_func(struct trace_seq *s, void *data, int 
size,
 
 static void free_func_handle(struct tep_function_handler *func);
 
-/**
- * tep_buffer_init - init buffer for parsing
- * @buf: buffer to parse
- * @size: the size of the bu

[tip: core/static_call] tracepoint: Optimize using static_call()

2020-09-01 Thread tip-bot2 for Steven Rostedt (VMware)
The following commit has been merged into the core/static_call branch of tip:

Commit-ID: d25e37d89dd2f41d7acae0429039d2f0ae8b4a07
Gitweb:
https://git.kernel.org/tip/d25e37d89dd2f41d7acae0429039d2f0ae8b4a07
Author:Steven Rostedt (VMware) 
AuthorDate:Tue, 18 Aug 2020 15:57:52 +02:00
Committer: Ingo Molnar 
CommitterDate: Tue, 01 Sep 2020 09:58:06 +02:00

tracepoint: Optimize using static_call()

Currently the tracepoint site will iterate a vector and issue indirect
calls to however many handlers are registered (ie. the vector is
long).

Using static_call() it is possible to optimize this for the common
case of only having a single handler registered. In this case the
static_call() can directly call this handler. Otherwise, if the vector
is longer than 1, call a function that iterates the whole vector like
the current code.

[peterz: updated to new interface]
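
Conceptually (all names below are hypothetical, not the kernel's actual
tracepoint macros), the optimization looks like:

	#include <linux/static_call.h>

	static void my_event_iterator(void *data, int arg);  /* walks the funcs vector */
	static void my_single_probe(void *data, int arg);    /* the only registered probe */

	DEFINE_STATIC_CALL(tp_func_my_event, my_event_iterator);

	static void my_tracepoint_site(void *data, int arg)
	{
		/* One direct, patched call instead of an indirect-call loop */
		static_call(tp_func_my_event)(data, arg);
	}

	static void my_update_call(int nr_probes)
	{
		/* One probe: call it directly.  More than one: use the iterator. */
		if (nr_probes == 1)
			static_call_update(tp_func_my_event, my_single_probe);
		else
			static_call_update(tp_func_my_event, my_event_iterator);
	}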

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: Linus Torvalds 
Link: https://lore.kernel.org/r/20200818135805.279421...@infradead.org
---
 include/linux/tracepoint-defs.h |  5 ++-
 include/linux/tracepoint.h  | 86 ++--
 include/trace/define_trace.h| 14 ++---
 kernel/tracepoint.c | 25 +++--
 4 files changed, 94 insertions(+), 36 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index b29950a..de97450 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -11,6 +11,8 @@
 #include 
 #include 
 
+struct static_call_key;
+
 struct trace_print_flags {
unsigned long   mask;
const char  *name;
@@ -30,6 +32,9 @@ struct tracepoint_func {
 struct tracepoint {
const char *name;   /* Tracepoint name */
struct static_key key;
+   struct static_call_key *static_call_key;
+   void *static_call_tramp;
+   void *iterator;
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9..3722a10 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct module;
 struct tracepoint;
@@ -92,7 +93,9 @@ extern int syscall_regfunc(void);
 extern void syscall_unregfunc(void);
 #endif /* CONFIG_HAVE_SYSCALL_TRACEPOINTS */
 
+#ifndef PARAMS
 #define PARAMS(args...) args
+#endif
 
 #define TRACE_DEFINE_ENUM(x)
 #define TRACE_DEFINE_SIZEOF(x)
@@ -148,6 +151,12 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
 
 #ifdef TRACEPOINTS_ENABLED
 
+#ifdef CONFIG_HAVE_STATIC_CALL
+#define __DO_TRACE_CALL(name)  static_call(tp_func_##name)
+#else
+#define __DO_TRACE_CALL(name)  __tracepoint_iter_##name
+#endif /* CONFIG_HAVE_STATIC_CALL */
+
 /*
  * it_func[0] is never NULL because there is at least one element in the array
  * when the array itself is non NULL.
@@ -157,12 +166,11 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * has a "void" prototype, then it is invalid to declare a function
  * as "(void *, void)".
  */
-#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
+#define __DO_TRACE(name, proto, args, cond, rcuidle)   \
do {\
struct tracepoint_func *it_func_ptr;\
-   void *it_func;  \
-   void *__data;   \
int __maybe_unused __idx = 0;   \
+   void *__data;   \
\
if (!(cond))\
return; \
@@ -182,14 +190,11 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
rcu_irq_enter_irqson(); \
}   \
\
-   it_func_ptr = rcu_dereference_raw((tp)->funcs); \
-   \
+   it_func_ptr =   \
+   rcu_dereference_raw((&__tracepoint_##name)->funcs); \
if (it_func_ptr) {  \
-   do {\
-   it_func = (it_func_ptr)->func;  \
-  

[tip: sched/core] sched: Remove struct sched_class::next field

2020-06-25 Thread tip-bot2 for Steven Rostedt (VMware)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: a87e749e8fa1aaef9b4db32e21c2795e69ce67bf
Gitweb:
https://git.kernel.org/tip/a87e749e8fa1aaef9b4db32e21c2795e69ce67bf
Author:Steven Rostedt (VMware) 
AuthorDate:Thu, 19 Dec 2019 16:44:54 -05:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 25 Jun 2020 13:45:44 +02:00

sched: Remove struct sched_class::next field

Now that the sched_class descriptors are defined in order via the linker
script vmlinux.lds.h, there's no reason to have a "next" pointer to the
previous priority structure. The order of the structures can be treated as
an array, and used to index and find the next sched_class descriptor.
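
As an illustration (hypothetical names, not the kernel's actual macros),
descriptors laid out back to back by the linker script can be walked with
plain pointer arithmetic instead of a ->next pointer:

	/* Section bounds provided by the linker script */
	extern const struct sched_class __begin_my_classes[];
	extern const struct sched_class __end_my_classes[];

	/* The highest priority class is placed at the end of the section */
	#define my_class_highest	(__end_my_classes - 1)
	#define my_class_lowest		(__begin_my_classes - 1)

	/* Walk from highest to lowest priority by decrementing the pointer */
	#define for_each_my_class(class) \
		for ((class) = my_class_highest; (class) > my_class_lowest; (class)--)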

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20191219214558.845353...@goodmis.org
---
 kernel/sched/deadline.c  | 1 -
 kernel/sched/fair.c  | 1 -
 kernel/sched/idle.c  | 1 -
 kernel/sched/rt.c| 1 -
 kernel/sched/sched.h | 1 -
 kernel/sched/stop_task.c | 1 -
 6 files changed, 6 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9e7946..c9cc1d6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2481,7 +2481,6 @@ static void prio_changed_dl(struct rq *rq, struct 
task_struct *p,
 
 const struct sched_class dl_sched_class
__attribute__((section("__dl_sched_class"))) = {
-   .next   = &rt_sched_class,
.enqueue_task   = enqueue_task_dl,
.dequeue_task   = dequeue_task_dl,
.yield_task = yield_task_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3365f6b..a63f400 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11124,7 +11124,6 @@ static unsigned int get_rr_interval_fair(struct rq *rq, 
struct task_struct *task
  */
 const struct sched_class fair_sched_class
__attribute__((section("__fair_sched_class"))) = {
-   .next   = &idle_sched_class,
.enqueue_task   = enqueue_task_fair,
.dequeue_task   = dequeue_task_fair,
.yield_task = yield_task_fair,
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f580629..336d478 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -455,7 +455,6 @@ static void update_curr_idle(struct rq *rq)
  */
 const struct sched_class idle_sched_class
__attribute__((section("__idle_sched_class"))) = {
-   /* .next is NULL */
/* no enqueue/yield_task for idle tasks */
 
/* dequeue is not valid, we print a debug message there: */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6543d44..f215eea 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2431,7 +2431,6 @@ static unsigned int get_rr_interval_rt(struct rq *rq, 
struct task_struct *task)
 
 const struct sched_class rt_sched_class
__attribute__((section("__rt_sched_class"))) = {
-   .next   = &fair_sched_class,
.enqueue_task   = enqueue_task_rt,
.dequeue_task   = dequeue_task_rt,
.yield_task = yield_task_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4165c06..549e7e6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1754,7 +1754,6 @@ extern const u32  sched_prio_to_wmult[40];
 #define RETRY_TASK ((void *)-1UL)
 
 struct sched_class {
-   const struct sched_class *next;
 
 #ifdef CONFIG_UCLAMP_TASK
int uclamp_enabled;
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index f4bbd54..394bc81 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -111,7 +111,6 @@ static void update_curr_stop(struct rq *rq)
  */
 const struct sched_class stop_sched_class
__attribute__((section("__stop_sched_class"))) = {
-   .next   = &dl_sched_class,
 
.enqueue_task   = enqueue_task_stop,
.dequeue_task   = dequeue_task_stop,


[tip: sched/core] sched: Have sched_class_highest define by vmlinux.lds.h

2020-06-25 Thread tip-bot2 for Steven Rostedt (VMware)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: c3a340f7e7eadac7662ab104ceb16432e5a4c6b2
Gitweb:
https://git.kernel.org/tip/c3a340f7e7eadac7662ab104ceb16432e5a4c6b2
Author:Steven Rostedt (VMware) 
AuthorDate:Thu, 19 Dec 2019 16:44:53 -05:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 25 Jun 2020 13:45:44 +02:00

sched: Have sched_class_highest define by vmlinux.lds.h

Now that the sched_class descriptors are defined by the linker script,
the script needs to be aware of the existence of stop_sched_class, which
is only built when SMP is enabled and is then the "highest" priority. Move
the declaration of sched_class_highest to the same location in the linker
script that inserts stop_sched_class, and this will also make it easier to
see what should be defined as the highest class, as this linker script
location defines the priorities as well.
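
A condensed sketch of the resulting pattern (mirroring the hunks below; the
loop body is illustrative only): the linker script exports begin/end symbols
around the section, and iteration walks the descriptors from highest to
lowest priority by pointer decrement.

/* Symbols the linker script now emits around SCHED_DATA. */
extern struct sched_class __begin_sched_classes[];
extern struct sched_class __end_sched_classes[];

#define sched_class_highest (__end_sched_classes - 1)
#define sched_class_lowest  (__begin_sched_classes - 1)

#define for_each_class(class) \
        for (class = sched_class_highest; class != sched_class_lowest; class--)

/* Usage sketch: visits stop (on SMP), dl, rt, fair, idle -- in that order. */
static void walk_classes_example(void)
{
        const struct sched_class *class;

        for_each_class(class)
                ; /* inspect *class here */
}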

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20191219214558.682913...@goodmis.org
---
 include/asm-generic/vmlinux.lds.h |  5 -
 kernel/sched/core.c   |  8 
 kernel/sched/sched.h  | 17 +
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 2186d7b..66fb84c 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -114,11 +114,14 @@
  * relation to each other.
  */
 #define SCHED_DATA \
+   STRUCT_ALIGN(); \
+   __begin_sched_classes = .;  \
*(__idle_sched_class)   \
*(__fair_sched_class)   \
*(__rt_sched_class) \
*(__dl_sched_class) \
-   *(__stop_sched_class)
+   *(__stop_sched_class)   \
+   __end_sched_classes = .;
 
 /*
  * Align to a 32 byte boundary equal to the
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0208b71..81640fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6646,6 +6646,14 @@ void __init sched_init(void)
unsigned long ptr = 0;
int i;
 
+   /* Make sure the linker didn't screw up */
+   BUG_ON(&idle_sched_class + 1 != &fair_sched_class ||
+  &fair_sched_class + 1 != &rt_sched_class ||
+  &rt_sched_class + 1   != &dl_sched_class);
+#ifdef CONFIG_SMP
+   BUG_ON(&dl_sched_class + 1 != &stop_sched_class);
+#endif
+
wait_bit_init();
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3368876..4165c06 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1811,7 +1811,7 @@ struct sched_class {
 #ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type);
 #endif
-};
+} __aligned(32); /* STRUCT_ALIGN(), vmlinux.lds.h */
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
@@ -1825,17 +1825,18 @@ static inline void set_next_task(struct rq *rq, struct 
task_struct *next)
next->sched_class->set_next_task(rq, next, false);
 }
 
-#ifdef CONFIG_SMP
-#define sched_class_highest (&stop_sched_class)
-#else
-#define sched_class_highest (&dl_sched_class)
-#endif
+/* Defined in include/asm-generic/vmlinux.lds.h */
+extern struct sched_class __begin_sched_classes[];
+extern struct sched_class __end_sched_classes[];
+
+#define sched_class_highest (__end_sched_classes - 1)
+#define sched_class_lowest  (__begin_sched_classes - 1)
 
 #define for_class_range(class, _from, _to) \
-   for (class = (_from); class != (_to); class = class->next)
+   for (class = (_from); class != (_to); class--)
 
 #define for_each_class(class) \
-   for_class_range(class, sched_class_highest, NULL)
+   for_class_range(class, sched_class_highest, sched_class_lowest)
 
 extern const struct sched_class stop_sched_class;
 extern const struct sched_class dl_sched_class;


[tip: sched/core] sched: Force the address order of each sched class descriptor

2020-06-25 Thread tip-bot2 for Steven Rostedt (VMware)
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 590d69796346353878b275c5512c664e3f875f24
Gitweb:
https://git.kernel.org/tip/590d69796346353878b275c5512c664e3f875f24
Author:Steven Rostedt (VMware) 
AuthorDate:Thu, 19 Dec 2019 16:44:52 -05:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 25 Jun 2020 13:45:43 +02:00

sched: Force the address order of each sched class descriptor

In order to make a micro optimization in pick_next_task(), the order of the
sched class descriptor address must be in the same order as their priority
to each other. That is:

 &idle_sched_class < &fair_sched_class < &rt_sched_class <
 &dl_sched_class < &stop_sched_class

In order to guarantee this order of the sched class descriptors, add each
one into their own data section and force the order in the linker script.
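
The ordering then lets priority checks become plain pointer comparisons.
Roughly (a hedged fragment, not a quote of the later optimization patch):
because &idle_sched_class has the lowest address and &stop_sched_class the
highest, "is this class fair or below?" reduces to a single compare.

/* Sketch of the micro optimization this ordering enables in
 * pick_next_task(); simplified, and assuming the address order
 * established by SCHED_DATA above.
 */
if (likely(prev->sched_class <= &fair_sched_class &&
           rq->nr_running == rq->cfs.h_nr_running)) {
        /* Fast path: only CFS (or idle) tasks are runnable,
         * so the fair class can pick the next task directly. */
}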

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
https://lore.kernel.org/r/157675913272.349305.8936736338884044103.stgit@localhost.localdomain
---
 include/asm-generic/vmlinux.lds.h | 13 +
 kernel/sched/deadline.c   |  3 ++-
 kernel/sched/fair.c   |  3 ++-
 kernel/sched/idle.c   |  3 ++-
 kernel/sched/rt.c |  3 ++-
 kernel/sched/stop_task.c  |  3 ++-
 6 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index db600ef..2186d7b 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -109,6 +109,18 @@
 #endif
 
 /*
+ * The order of the sched class addresses are important, as they are
+ * used to determine the order of the priority of each sched class in
+ * relation to each other.
+ */
+#define SCHED_DATA \
+   *(__idle_sched_class)   \
+   *(__fair_sched_class)   \
+   *(__rt_sched_class) \
+   *(__dl_sched_class) \
+   *(__stop_sched_class)
+
+/*
  * Align to a 32 byte boundary equal to the
  * alignment gcc 4.5 uses for a struct
  */
@@ -388,6 +400,7 @@
.rodata   : AT(ADDR(.rodata) - LOAD_OFFSET) {   \
__start_rodata = .; \
*(.rodata) *(.rodata.*) \
+   SCHED_DATA  \
RO_AFTER_INIT_DATA  /* Read only after init */  \
. = ALIGN(8);   \
__start___tracepoints_ptrs = .; \
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d4708e2..d9e7946 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2479,7 +2479,8 @@ static void prio_changed_dl(struct rq *rq, struct 
task_struct *p,
}
 }
 
-const struct sched_class dl_sched_class = {
+const struct sched_class dl_sched_class
+   __attribute__((section("__dl_sched_class"))) = {
.next   = &rt_sched_class,
.enqueue_task   = enqueue_task_dl,
.dequeue_task   = dequeue_task_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0424a0a..3365f6b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11122,7 +11122,8 @@ static unsigned int get_rr_interval_fair(struct rq *rq, 
struct task_struct *task
 /*
  * All the scheduling class methods:
  */
-const struct sched_class fair_sched_class = {
+const struct sched_class fair_sched_class
+   __attribute__((section("__fair_sched_class"))) = {
.next   = &idle_sched_class,
.enqueue_task   = enqueue_task_fair,
.dequeue_task   = dequeue_task_fair,
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8d75ca2..f580629 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -453,7 +453,8 @@ static void update_curr_idle(struct rq *rq)
 /*
  * Simple, special scheduling class for the per-CPU idle tasks:
  */
-const struct sched_class idle_sched_class = {
+const struct sched_class idle_sched_class
+   __attribute__((section("__idle_sched_class"))) = {
/* .next is NULL */
/* no enqueue/yield_task for idle tasks */
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f395ddb..6543d44 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2429,7 +2429,8 @@ static unsigned int get_rr_interval_rt(struct rq *rq, 
struct task_struct *task)
return 0;
 }
 
-const struct sched_class rt_sched_class = {
+const struct sched_class rt_sched_class
+   __attribute__((section("__rt_sched_class"))) = {
.next   = &fair_sched_class,
.enqueue_task   = enqueue_task_rt,
.dequeue_task   = dequeue_task_rt,
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_tas

[PATCH] mm: Fix a huge pud insertion race during faulting

2019-10-22 Thread VMware
From: Thomas Hellstrom 

A huge pud page can theoretically be faulted in racing with pmd_alloc()
in __handle_mm_fault(). That will lead to pmd_alloc() returning an
invalid pmd pointer. Fix this by adding a pud_trans_unstable() function
similar to pmd_trans_unstable() and check whether the pud is really stable
before using the pmd pointer.

Race:
Thread 1: Thread 2: Comment
create_huge_pud()   Fallback - not taken.
  create_huge_pud() Taken.
pmd_alloc() Returns an invalid pointer.
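
Condensed, the control flow in __handle_mm_fault() after the fix looks like
this (a sketch eliding error paths and the pmd-level handling shown in the
diff below):

retry_pud:
        if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
                ret = create_huge_pud(&vmf);
                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;
        }

        vmf.pmd = pmd_alloc(mm, vmf.pud, address);
        if (!vmf.pmd)
                return VM_FAULT_OOM;

        /* A huge pud may have been faulted in by another thread; if so,
         * the pmd pointer computed above is invalid -- re-evaluate the pud. */
        if (pud_trans_unstable(vmf.pud))
                goto retry_pud;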

Cc: Matthew Wilcox 
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Signed-off-by: Thomas Hellstrom 
---
 include/asm-generic/pgtable.h | 25 +
 mm/memory.c   |  6 ++
 2 files changed, 31 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 818691846c90..70c2058230ba 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -912,6 +912,31 @@ static inline int pud_trans_huge(pud_t pud)
 }
 #endif
 
+/* See pmd_none_or_trans_huge_or_clear_bad for discussion. */
+static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
+{
+   pud_t pudval = READ_ONCE(*pud);
+
+   if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
+   return 1;
+   if (unlikely(pud_bad(pudval))) {
+   pud_clear_bad(pud);
+   return 1;
+   }
+   return 0;
+}
+
+/* See pmd_trans_unstable for discussion. */
+static inline int pud_trans_unstable(pud_t *pud)
+{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&\
+   defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+   return pud_none_or_trans_huge_or_dev_or_clear_bad(pud);
+#else
+   return 0;
+#endif
+}
+
 #ifndef pmd_read_atomic
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..43ff372f4f07 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3914,6 +3914,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct 
*vma,
vmf.pud = pud_alloc(mm, p4d, address);
if (!vmf.pud)
return VM_FAULT_OOM;
+retry_pud:
if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
ret = create_huge_pud();
if (!(ret & VM_FAULT_FALLBACK))
@@ -3940,6 +3941,11 @@ static vm_fault_t __handle_mm_fault(struct 
vm_area_struct *vma,
vmf.pmd = pmd_alloc(mm, vmf.pud, address);
if (!vmf.pmd)
return VM_FAULT_OOM;
+
+   /* Huge pud page fault raced with pmd_alloc? */
+   if (pud_trans_unstable(vmf.pud))
+   goto retry_pud;
+
if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd();
if (!(ret & VM_FAULT_FALLBACK))
-- 
2.21.0



[tip: perf/core] perf tools: Remove unused trace_find_next_event()

2019-10-21 Thread tip-bot2 for Steven Rostedt (VMware)
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 9bdff5b6436655d42dd30253c521e86ce07b9961
Gitweb:
https://git.kernel.org/tip/9bdff5b6436655d42dd30253c521e86ce07b9961
Author:Steven Rostedt (VMware) 
AuthorDate:Thu, 17 Oct 2019 17:05:23 -04:00
Committer: Arnaldo Carvalho de Melo 
CommitterDate: Fri, 18 Oct 2019 12:07:46 -03:00

perf tools: Remove unused trace_find_next_event()

trace_find_next_event() was buggy and pretty much a useless helper. As
there are no more users, just remove it.

Signed-off-by: Steven Rostedt (VMware) 
Cc: Andrew Morton 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Tzvetomir Stoyanov 
Cc: linux-trace-de...@vger.kernel.org
Link: http://lore.kernel.org/lkml/20191017210636.224045...@goodmis.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/trace-event-parse.c | 31 +
 tools/perf/util/trace-event.h   |  2 +--
 2 files changed, 33 deletions(-)

diff --git a/tools/perf/util/trace-event-parse.c 
b/tools/perf/util/trace-event-parse.c
index 5d6bfc7..9634f0a 100644
--- a/tools/perf/util/trace-event-parse.c
+++ b/tools/perf/util/trace-event-parse.c
@@ -173,37 +173,6 @@ int parse_event_file(struct tep_handle *pevent,
return tep_parse_event(pevent, buf, size, sys);
 }
 
-struct tep_event *trace_find_next_event(struct tep_handle *pevent,
-   struct tep_event *event)
-{
-   static int idx;
-   int events_count;
-   struct tep_event *all_events;
-
-   all_events = tep_get_first_event(pevent);
-   events_count = tep_get_events_count(pevent);
-   if (!pevent || !all_events || events_count < 1)
-   return NULL;
-
-   if (!event) {
-   idx = 0;
-   return all_events;
-   }
-
-   if (idx < events_count && event == (all_events + idx)) {
-   idx++;
-   if (idx == events_count)
-   return NULL;
-   return (all_events + idx);
-   }
-
-   for (idx = 1; idx < events_count; idx++) {
-   if (event == (all_events + (idx - 1)))
-   return (all_events + idx);
-   }
-   return NULL;
-}
-
 struct flag {
const char *name;
unsigned long long value;
diff --git a/tools/perf/util/trace-event.h b/tools/perf/util/trace-event.h
index 2e15838..72fdf2a 100644
--- a/tools/perf/util/trace-event.h
+++ b/tools/perf/util/trace-event.h
@@ -47,8 +47,6 @@ void parse_saved_cmdline(struct tep_handle *pevent, char 
*file, unsigned int siz
 
 ssize_t trace_report(int fd, struct trace_event *tevent, bool repipe);
 
-struct tep_event *trace_find_next_event(struct tep_handle *pevent,
-   struct tep_event *event);
 unsigned long long read_size(struct tep_event *event, void *ptr, int size);
 unsigned long long eval_flag(const char *flag);
 


[tip: perf/core] perf scripting engines: Iterate on tep event arrays directly

2019-10-21 Thread tip-bot2 for Steven Rostedt (VMware)
The following commit has been merged into the perf/core branch of tip:

Commit-ID: a5e05abc6b8d81148b35cd8632a4a6252383d968
Gitweb:
https://git.kernel.org/tip/a5e05abc6b8d81148b35cd8632a4a6252383d968
Author:Steven Rostedt (VMware) 
AuthorDate:Thu, 17 Oct 2019 17:05:22 -04:00
Committer: Arnaldo Carvalho de Melo 
CommitterDate: Fri, 18 Oct 2019 12:07:46 -03:00

perf scripting engines: Iterate on tep event arrays directly

Instead of calling a useless (and broken) helper function to get the
next event of a tep event array, just get the array directly and iterate
over it.

Note, the broken part was from trace_find_next_event() which after this
will no longer be used, and can be removed.
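
The replacement pattern, condensed from the hunks below (sketch):

        struct tep_event **all_events;
        struct tep_event *event;
        int i, nr_events;

        nr_events  = tep_get_events_count(pevent);
        all_events = tep_list_events(pevent, TEP_EVENT_SORT_ID);

        for (i = 0; all_events && i < nr_events; i++) {
                event = all_events[i];
                /* emit a handler stub for event->system / event->name */
        }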

Committer notes:

This fixes a segfault when generating python scripts from perf.data
files with multiple tracepoint events, i.e. the following use case is
fixed by this patch:

  # perf record -e sched:* sleep 1
  [ perf record: Woken up 31 times to write data ]
  [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
  # perf script -g python
  Segmentation fault (core dumped)
  #

Reported-by: Daniel Bristot de Oliveira 
Signed-off-by: Steven Rostedt (VMware) 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Andrew Morton 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Tzvetomir Stoyanov 
Cc: linux-trace-de...@vger.kernel.org
Link: http://lkml.kernel.org/r/20191017153733.630cd...@gandalf.local.home
Link: http://lore.kernel.org/lkml/20191017210636.061448...@goodmis.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/scripting-engines/trace-event-perl.c   |  8 ++--
 tools/perf/util/scripting-engines/trace-event-python.c |  9 +++--
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/scripting-engines/trace-event-perl.c 
b/tools/perf/util/scripting-engines/trace-event-perl.c
index 1596185..741f040 100644
--- a/tools/perf/util/scripting-engines/trace-event-perl.c
+++ b/tools/perf/util/scripting-engines/trace-event-perl.c
@@ -539,10 +539,11 @@ static int perl_stop_script(void)
 
 static int perl_generate_script(struct tep_handle *pevent, const char *outfile)
 {
+   int i, not_first, count, nr_events;
+   struct tep_event **all_events;
struct tep_event *event = NULL;
struct tep_format_field *f;
char fname[PATH_MAX];
-   int not_first, count;
FILE *ofp;
 
sprintf(fname, "%s.pl", outfile);
@@ -603,8 +604,11 @@ sub print_backtrace\n\
 }\n\n\
 ");
 
+   nr_events = tep_get_events_count(pevent);
+   all_events = tep_list_events(pevent, TEP_EVENT_SORT_ID);
 
-   while ((event = trace_find_next_event(pevent, event))) {
+   for (i = 0; all_events && i < nr_events; i++) {
+   event = all_events[i];
fprintf(ofp, "sub %s::%s\n{\n", event->system, event->name);
fprintf(ofp, "\tmy (");
 
diff --git a/tools/perf/util/scripting-engines/trace-event-python.c 
b/tools/perf/util/scripting-engines/trace-event-python.c
index 5d341ef..93c03b3 100644
--- a/tools/perf/util/scripting-engines/trace-event-python.c
+++ b/tools/perf/util/scripting-engines/trace-event-python.c
@@ -1687,10 +1687,11 @@ static int python_stop_script(void)
 
 static int python_generate_script(struct tep_handle *pevent, const char 
*outfile)
 {
+   int i, not_first, count, nr_events;
+   struct tep_event **all_events;
struct tep_event *event = NULL;
struct tep_format_field *f;
char fname[PATH_MAX];
-   int not_first, count;
FILE *ofp;
 
sprintf(fname, "%s.py", outfile);
@@ -1735,7 +1736,11 @@ static int python_generate_script(struct tep_handle 
*pevent, const char *outfile
fprintf(ofp, "def trace_end():\n");
fprintf(ofp, "\tprint(\"in trace_end\")\n\n");
 
-   while ((event = trace_find_next_event(pevent, event))) {
+   nr_events = tep_get_events_count(pevent);
+   all_events = tep_list_events(pevent, TEP_EVENT_SORT_ID);
+
+   for (i = 0; all_events && i < nr_events; i++) {
+   event = all_events[i];
fprintf(ofp, "def %s__%s(", event->system, event->name);
fprintf(ofp, "event_name, ");
fprintf(ofp, "context, ");


[PATCH v2 1/2] x86/cpu/vmware: Use the full form of INL in VMWARE_HYPERCALL

2019-10-21 Thread VMware
From: Thomas Hellstrom 

LLVM's assembler doesn't accept the short form INL instruction:

  inl (%%dx)

but instead requires the output register to be explicitly specified.

This was previously fixed for the VMWARE_PORT macro. Fix it also for
the VMWARE_HYPERCALL macro.
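
For illustration only (a hypothetical helper, not part of the patch), the
form accepted by both GNU as and LLVM's integrated assembler names the
destination register explicitly:

static inline u32 vmware_inl_example(u16 port)
{
        u32 eax;

        /* "inl (%%dx)" is rejected by LLVM's integrated assembler;
         * "inl (%%dx), %%eax" works with both assemblers. */
        asm volatile("inl (%%dx), %%eax" : "=a" (eax) : "d" (port));
        return eax;
}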

Cc: clang-built-li...@googlegroups.com
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Borislav Petkov 
Fixes: b4dd4f6e3648 ("Add a header file for hypercall definitions")
Suggested-by: Sami Tolvanen 
Signed-off-by: Thomas Hellstrom 
Reviewed-by: Nick Desaulniers 
---
 arch/x86/include/asm/vmware.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmware.h b/arch/x86/include/asm/vmware.h
index e00c9e875933..3caac90f9761 100644
--- a/arch/x86/include/asm/vmware.h
+++ b/arch/x86/include/asm/vmware.h
@@ -29,7 +29,8 @@
 
 /* The low bandwidth call. The low word of edx is presumed clear. */
 #define VMWARE_HYPERCALL   \
-   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT ", %%dx; inl (%%dx)", \
+   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT ", %%dx; "\
+ "inl (%%dx), %%eax",  \
  "vmcall", X86_FEATURE_VMCALL, \
  "vmmcall", X86_FEATURE_VMW_VMMCALL)
 
-- 
2.21.0



[PATCH v2 2/2] x86/cpu/vmware: Fix platform detection VMWARE_PORT macro

2019-10-21 Thread VMware
From: Thomas Hellstrom 

The platform detection VMWARE_PORT macro uses the VMWARE_HYPERVISOR_PORT
definition, but expects it to be an integer. However, when it was moved
to the new vmware.h include file, it was changed to be a string to better
fit into the VMWARE_HYPERCALL set of macros. This obviously breaks the
platform detection VMWARE_PORT functionality.

Change the VMWARE_HYPERVISOR_PORT and VMWARE_HYPERVISOR_PORT_HB
definitions to be integers, and use __stringify() for their stringified
form where needed.
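
For reference, __stringify() is the usual two-level macro (roughly as in
include/linux/stringify.h), so its argument is macro-expanded before being
turned into a string literal:

#define __stringify_1(x...)     #x
#define __stringify(x...)       __stringify_1(x)

/* __stringify(VMWARE_HYPERVISOR_PORT) expands to __stringify_1(0x5658),
 * i.e. the string "0x5658", so the asm template still gets a string while
 * C code can keep using the integer constant. */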

Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Borislav Petkov 
Fixes: b4dd4f6e3648 ("Add a header file for hypercall definitions")
Signed-off-by: Thomas Hellstrom 
---
 arch/x86/include/asm/vmware.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/vmware.h b/arch/x86/include/asm/vmware.h
index 3caac90f9761..ac9fc51e2b18 100644
--- a/arch/x86/include/asm/vmware.h
+++ b/arch/x86/include/asm/vmware.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * The hypercall definitions differ in the low word of the %edx argument
@@ -20,8 +21,8 @@
  */
 
 /* Old port-based version */
-#define VMWARE_HYPERVISOR_PORT"0x5658"
-#define VMWARE_HYPERVISOR_PORT_HB "0x5659"
+#define VMWARE_HYPERVISOR_PORT0x5658
+#define VMWARE_HYPERVISOR_PORT_HB 0x5659
 
 /* Current vmcall / vmmcall version */
 #define VMWARE_HYPERVISOR_HB   BIT(0)
@@ -29,7 +30,7 @@
 
 /* The low bandwidth call. The low word of edx is presumed clear. */
 #define VMWARE_HYPERCALL   \
-   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT ", %%dx; "\
+   ALTERNATIVE_2("movw $" __stringify(VMWARE_HYPERVISOR_PORT) ", %%dx; " \
  "inl (%%dx), %%eax",  \
  "vmcall", X86_FEATURE_VMCALL, \
  "vmmcall", X86_FEATURE_VMW_VMMCALL)
@@ -39,7 +40,8 @@
  * HB and OUT bits set.
  */
 #define VMWARE_HYPERCALL_HB_OUT
\
-   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT_HB ", %%dx; rep outsb", \
+   ALTERNATIVE_2("movw $" __stringify(VMWARE_HYPERVISOR_PORT_HB) ", %%dx; 
" \
+ "rep outsb",  \
  "vmcall", X86_FEATURE_VMCALL, \
  "vmmcall", X86_FEATURE_VMW_VMMCALL)
 
@@ -48,7 +50,8 @@
  * HB bit set.
  */
 #define VMWARE_HYPERCALL_HB_IN \
-   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT_HB ", %%dx; rep insb", \
+   ALTERNATIVE_2("movw $" __stringify(VMWARE_HYPERVISOR_PORT_HB) ", %%dx; 
" \
+ "rep insb",   \
  "vmcall", X86_FEATURE_VMCALL, \
  "vmmcall", X86_FEATURE_VMW_VMMCALL)
 #endif
-- 
2.21.0



[PATCH v2 0/2] x86/cpu/vmware: Fixes for 5.4

2019-10-21 Thread VMware
From: Thomas Hellstrom 

Two fixes for recently introduced regressions:

Patch 1 is more or less identical to a previous patch fixing the VMW_PORT
macro on LLVM's assembler. However, that patch left out the VMW_HYPERCALL
macro (probably not configured for use), so let's fix that also.

Patch 2 fixes another VMW_PORT run-time regression at platform detection
time.

v2:
- Added an R-B for patch 1 (Nick Desaulniers)
- Improved on asm formatting in patch 2 (Sean Christopherson)

Cc: clang-built-li...@googlegroups.com
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Borislav Petkov 


Re: dma coherent memory user-space maps

2019-10-21 Thread VMware

On 10/8/19 2:34 PM, Thomas Hellström (VMware) wrote:

Hi, Christoph,

Following our previous discussion I wonder if something along the 
lines of the following could work / be acceptible


typedef unsigned long dma_pfn_t /* Opaque pfn type. Arch dependent. 
This could if needed be a struct with a pointer and an offset */


/* Similar to vmf_insert_mixed() */
vm_fault_t dma_vmf_insert_mixed(struct device *dev,
                struct vm_area_struct *vma,
                unsigned long addr,
                dma_pfn_t dpfn,
                unsigned long attrs);

/* Similar to vmf_insert_pfn_pmd() */
vm_fault_t dma_vmf_insert_pfn_pmd(struct device *dev,
                  struct vm_area_struct *vma,
                  unsigned long addr,
                  dma_pfn_t dpfn,
                  unsigned long attrs);

/* Like vmap, but takes struct dma_pfns. */
extern void *dma_vmap(struct device *dev,
          dma_pfn_t dpfns[],
          unsigned int count, unsigned long flags,
          unsigned long attrs);

/* Obtain struct dma_pfn pointers from a dma coherent allocation */
int dma_get_dpfns(struct device *dev, void *cpu_addr, dma_addr_t 
dma_addr,

          pgoff_t offset, pgoff_t num, dma_pfn_t dpfns[]);

I figure, for most if not all architectures we could use an ordinary 
pfn as dma_pfn_t, but the dma layer would still have control over how 
those pfns are obtained and how they are used in the kernel's mapping 
APIs.


If so, I could start looking at this, time permitting,  for the cases 
where the pfn can be obtained from the kernel address or from 
arch_dma_coherent_to_pfn(), and also the needed work to have a 
tailored vmap_pfn().


Thanks,
/Thomas



Ping?

Thanks,

Thomas




[PATCH 2/2] x86/cpu/vmware: Fix platform detection VMWARE_PORT macro

2019-10-18 Thread VMware
From: Thomas Hellstrom 

The platform detection VMWARE_PORT macro uses the VMWARE_HYPERVISOR_PORT
definition, but expects it to be an integer. However, when it was moved
to the new vmware.h include file, it was changed to be a string to better
fit into the VMWARE_HYPERCALL set of macros. This obviously breaks the
platform detection VMWARE_PORT functionality.

Change the VMWARE_HYPERVISOR_PORT and VMWARE_HYPERVISOR_PORT_HB
definitions to be integers, and use __stringify() for their stringified
form when needed.

Fixes: b4dd4f6e3648 ("Add a header file for hypercall definitions")
Signed-off-by: Thomas Hellstrom 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Borislav Petkov 
---
 arch/x86/include/asm/vmware.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/vmware.h b/arch/x86/include/asm/vmware.h
index f5fbe3778aef..d20eda0c6ed8 100644
--- a/arch/x86/include/asm/vmware.h
+++ b/arch/x86/include/asm/vmware.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * The hypercall definitions differ in the low word of the %edx argument
@@ -20,8 +21,8 @@
  */
 
 /* Old port-based version */
-#define VMWARE_HYPERVISOR_PORT"0x5658"
-#define VMWARE_HYPERVISOR_PORT_HB "0x5659"
+#define VMWARE_HYPERVISOR_PORT0x5658
+#define VMWARE_HYPERVISOR_PORT_HB 0x5659
 
 /* Current vmcall / vmmcall version */
 #define VMWARE_HYPERVISOR_HB   BIT(0)
@@ -29,7 +30,7 @@
 
 /* The low bandwidth call. The low word of edx is presumed clear. */
 #define VMWARE_HYPERCALL   \
-   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT   \
+   ALTERNATIVE_2("movw $" __stringify(VMWARE_HYPERVISOR_PORT)  \
  ", %%dx; inl (%%dx), %%eax",  \
  "vmcall", X86_FEATURE_VMCALL, \
  "vmmcall", X86_FEATURE_VMW_VMMCALL)
@@ -39,7 +40,8 @@
  * HB and OUT bits set.
  */
 #define VMWARE_HYPERCALL_HB_OUT
\
-   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT_HB ", %%dx; rep outsb", \
+   ALTERNATIVE_2("movw $" __stringify(VMWARE_HYPERVISOR_PORT_HB)   \
+ ", %%dx; rep outsb",  \
  "vmcall", X86_FEATURE_VMCALL, \
  "vmmcall", X86_FEATURE_VMW_VMMCALL)
 
@@ -48,7 +50,8 @@
  * HB bit set.
  */
 #define VMWARE_HYPERCALL_HB_IN \
-   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT_HB ", %%dx; rep insb", \
+   ALTERNATIVE_2("movw $" __stringify(VMWARE_HYPERVISOR_PORT_HB)   \
+ ", %%dx; rep insb",   \
  "vmcall", X86_FEATURE_VMCALL, \
  "vmmcall", X86_FEATURE_VMW_VMMCALL)
 #endif
-- 
2.21.0



[PATCH 0/2] x86/cpu/vmware: Fixes for 5.4

2019-10-18 Thread VMware
From: Thomas Hellstrom 

Two fixes for recently introduced regressions:

Patch 1 is more or less identical to a previous patch fixing the VMW_PORT
macro on LLVM's assembler. However, that patch left out the VMW_HYPERCALL
macro (probably not configured for use), so let's fix that also.

Patch 2 fixes another VMW_PORT run-time regression at platform detection
time.


[PATCH 1/2] x86/cpu/vmware: Use the full form of INL in VMWARE_HYPERCALL

2019-10-18 Thread VMware
From: Thomas Hellstrom 

LLVM's assembler doesn't accept the short form INL instruction:

  inl (%%dx)

but instead requires the output register to be explicitly specified.

This was previously fixed for the VMWARE_PORT macro. Fix it also for
the VMWARE_HYPERCALL macro.

Fixes: b4dd4f6e3648 ("Add a header file for hypercall definitions")
Suggested-by: Sami Tolvanen 
Signed-off-by: Thomas Hellstrom 
Cc: clang-built-li...@googlegroups.com
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Borislav Petkov 
---
 arch/x86/include/asm/vmware.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmware.h b/arch/x86/include/asm/vmware.h
index e00c9e875933..f5fbe3778aef 100644
--- a/arch/x86/include/asm/vmware.h
+++ b/arch/x86/include/asm/vmware.h
@@ -29,7 +29,8 @@
 
 /* The low bandwidth call. The low word of edx is presumed clear. */
 #define VMWARE_HYPERCALL   \
-   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT ", %%dx; inl (%%dx)", \
+   ALTERNATIVE_2("movw $" VMWARE_HYPERVISOR_PORT   \
+ ", %%dx; inl (%%dx), %%eax",  \
  "vmcall", X86_FEATURE_VMCALL, \
  "vmmcall", X86_FEATURE_VMW_VMMCALL)
 
-- 
2.21.0



Re: [PATCH v5 4/8] mm: Add write-protect and clean utilities for address space ranges

2019-10-16 Thread VMware

On 10/10/19 4:17 PM, Peter Zijlstra wrote:

On Thu, Oct 10, 2019 at 03:24:47PM +0200, Thomas Hellström (VMware) wrote:

On 10/10/19 3:05 PM, Peter Zijlstra wrote:

On Thu, Oct 10, 2019 at 02:43:10PM +0200, Thomas Hellström (VMware) wrote:

+/**
+ * wp_shared_mapping_range - Write-protect all ptes in an address space range
+ * @mapping: The address_space we want to write protect
+ * @first_index: The first page offset in the range
+ * @nr: Number of incremental page offsets to cover
+ *
+ * Note: This function currently skips transhuge page-table entries, since
+ * it's intended for dirty-tracking on the PTE level. It will warn on
+ * encountering transhuge write-enabled entries, though, and can easily be
+ * extended to handle them as well.
+ *
+ * Return: The number of ptes actually write-protected. Note that
+ * already write-protected ptes are not counted.
+ */
+unsigned long wp_shared_mapping_range(struct address_space *mapping,
+ pgoff_t first_index, pgoff_t nr)
+{
+   struct wp_walk wpwalk = { .total = 0 };
+
+   i_mmap_lock_read(mapping);
+   WARN_ON(walk_page_mapping(mapping, first_index, nr, &wp_walk_ops,
+ &wpwalk));
+   i_mmap_unlock_read(mapping);
+
+   return wpwalk.total;
+}

That's a read lock, this means there's concurrency to self. What happens
if someone does two concurrent wp_shared_mapping_range() on the same
mapping?

The thing is, because of pte_wrprotect() the iteration that starts last
will see a smaller pte_write range, if it completes first and does
flush_tlb_range(), it will only flush a partial range.

This is exactly what {inc,dec}_tlb_flush_pending() is for, but you're
not using mm_tlb_flush_nested() to detect the situation and do a bigger
flush.

Or if you're not needing that, then I'm missing why.

Good catch. Thanks,

Yes the read lock is not intended to protect against concurrent users but to
protect the vmas from disappearing under us. Since it fundamentally makes no
sense having two concurrent threads picking up dirty ptes on the same
address_space range we have an external range-based lock to protect against
that.

Nothing mandates/verifies the function you expose is used exclusively.
Therefore you cannot make assumptions on that range lock your user has.


However, that external lock doesn't protect other code  from concurrently
modifying ptes and having the mm's  tlb_flush_pending increased, so I guess
we unconditionally need to test for that and do a full range flush if
necessary?

Yes, something like:

if (mm_tlb_flush_nested(mm))
flush_tlb_range(walk->vma, walk->vma->vm_start, 
walk->vma->vm_end);
else  if (wpwalk->tlbflush_end > wpwalk->tlbflush_start)
flush_tlb_range(walk->vma, wpwalk->tlbflush_start, 
wpwalk->tlbflush_end);


Hi, Peter,

I've updated the patch to incorporate something similar to the above. 
Since you've looked at the patch, any chance of an R-B?


Thanks,

Thomas




Re: [RFC PATCH] mm: Fix a huge pud insertion race during faulting

2019-10-16 Thread VMware

Hi, Dan,

On 10/16/19 3:44 AM, Dan Williams wrote:

On Tue, Oct 15, 2019 at 3:06 AM Kirill A. Shutemov  wrote:

On Tue, Oct 08, 2019 at 11:37:11AM +0200, Thomas Hellström (VMware) wrote:

From: Thomas Hellstrom 

A huge pud page can theoretically be faulted in racing with pmd_alloc()
in __handle_mm_fault(). That will lead to pmd_alloc() returning an
invalid pmd pointer. Fix this by adding a pud_trans_unstable() function
similar to pmd_trans_unstable() and check whether the pud is really stable
before using the pmd pointer.

Race:
Thread 1: Thread 2: Comment
create_huge_pud()   Fallback - not taken.
 create_huge_pud() Taken.
pmd_alloc() Returns an invalid pointer.

Cc: Matthew Wilcox 
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Signed-off-by: Thomas Hellstrom 
---
RFC: We include pud_devmap() as an unstable PUD flag. Is this correct?
  Do the same for pmds?

I *think* it is correct and we should do the same for PMD, but I may be
wrong.

Dan, Matthew, could you comment on this?

The _devmap() check in these paths near _trans_unstable() has always
been about avoiding assumptions that the corresponding page might be
page cache or anonymous which for dax it's neither and does not behave
like a typical page.


The concern here is that _trans_huge() returns false for _devmap() 
pages, which means that _trans_unstable() also returns false.


Still, I figure someone could zap the entry at any time using madvise(), 
so AFAICT the entry is indeed unstable, and it's a bug not to include 
_devmap() in the _trans_unstable() functions?


Thanks,

Thomas




[PATCH v6 3/8] mm: Add a walk_page_mapping() function to the pagewalk code

2019-10-14 Thread VMware
From: Thomas Hellstrom 

For users that want to traverse all page table entries pointing into a
region of a struct address_space mapping, introduce a walk_page_mapping()
function.

The walk_page_mapping() function will initially be used for dirty-
tracking in virtual graphics drivers.
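
A usage sketch (hypothetical callback and caller, not part of the patch)
showing how a driver would hook into the new interface; per the kernel-doc
below, the caller is expected to hold mapping->i_mmap_rwsem:

static int my_pte_entry(pte_t *pte, unsigned long addr, unsigned long next,
                        struct mm_walk *walk)
{
        /* inspect or modify one pte of the shared mapping here */
        return 0;
}

static const struct mm_walk_ops my_walk_ops = {
        .pte_entry = my_pte_entry,
};

static void my_walk(struct address_space *mapping, pgoff_t first, pgoff_t nr)
{
        i_mmap_lock_read(mapping);
        WARN_ON(walk_page_mapping(mapping, first, nr, &my_walk_ops, NULL));
        i_mmap_unlock_read(mapping);
}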

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
---
 include/linux/pagewalk.h |  9 
 mm/pagewalk.c| 94 +++-
 2 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index bddd9759bab9..6ec82e92c87f 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -24,6 +24,9 @@ struct mm_walk;
  * "do page table walk over the current vma", returning
  * a negative value means "abort current page table walk
  * right now" and returning 1 means "skip the current vma"
+ * @pre_vma:if set, called before starting walk on a non-null vma.
+ * @post_vma:   if set, called after a walk on a non-null vma, provided
+ *  that @pre_vma and the vma walk succeeded.
  */
 struct mm_walk_ops {
int (*pud_entry)(pud_t *pud, unsigned long addr,
@@ -39,6 +42,9 @@ struct mm_walk_ops {
 struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
+   int (*pre_vma)(unsigned long start, unsigned long end,
+  struct mm_walk *walk);
+   void (*post_vma)(struct mm_walk *walk);
 };
 
 /**
@@ -62,5 +68,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
void *private);
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
void *private);
+int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
+ pgoff_t nr, const struct mm_walk_ops *ops,
+ void *private);
 
 #endif /* _LINUX_PAGEWALK_H */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c5fa42cab14f..ea0b9e606ad1 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -254,13 +254,23 @@ static int __walk_page_range(unsigned long start, 
unsigned long end,
 {
int err = 0;
struct vm_area_struct *vma = walk->vma;
+   const struct mm_walk_ops *ops = walk->ops;
+
+   if (vma && ops->pre_vma) {
+   err = ops->pre_vma(start, end, walk);
+   if (err)
+   return err;
+   }
 
if (vma && is_vm_hugetlb_page(vma)) {
-   if (walk->ops->hugetlb_entry)
+   if (ops->hugetlb_entry)
err = walk_hugetlb_range(start, end, walk);
} else
err = walk_pgd_range(start, end, walk);
 
+   if (vma && ops->post_vma)
+   ops->post_vma(walk);
+
return err;
 }
 
@@ -291,6 +301,11 @@ static int __walk_page_range(unsigned long start, unsigned 
long end,
  * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
  * purpose.
  *
+ * If operations need to be staged before and committed after a vma is walked,
+ * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
+ * since it is intended to handle commit-type operations, can't return any
+ * errors.
+ *
  * struct mm_walk keeps current values of some common data like vma and pmd,
  * which are useful for the access from callbacks. If you want to pass some
  * caller-specific data to callbacks, @private should be helpful.
@@ -377,3 +392,80 @@ int walk_page_vma(struct vm_area_struct *vma, const struct 
mm_walk_ops *ops,
return err;
return __walk_page_range(vma->vm_start, vma->vm_end, );
 }
+
+/**
+ * walk_page_mapping - walk all memory areas mapped into a struct 
address_space.
+ * @mapping: Pointer to the struct address_space
+ * @first_index: First page offset in the address_space
+ * @nr: Number of incremental page offsets to cover
+ * @ops:   operation to call during the walk
+ * @private:   private data for callbacks' usage
+ *
+ * This function walks all memory areas mapped into a struct address_space.
+ * The walk is limited to only the given page-size index range, but if
+ * the index boundaries cross a huge page-table entry, that entry will be
+ * included.
+ *
+ * Also see walk_page_range() for additional information.
+ *
+ * Locking:
+ *   This function can't require that the struct mm_struct::mmap_sem is held,
+ *   since @mapping may be mapped by multiple processes. Instead
+ *   @mapping->i_mmap_rwsem must be held. This might have implications in the
callbacks, and it's up to the caller to ensure that the
+ *   struct mm_struct::mmap_sem is not needed.
+ *
+ *   Also this means that a caller 

[PATCH v6 6/8] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources

2019-10-14 Thread VMware
From: Thomas Hellstrom 

With emulated coherent memory we need to be able to quickly look up
a resource from the MOB offset. Instead of traversing a linked list with
O(n) worst case, use an RBtree with O(log n) worst case complexity.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
Reviewed-by: Deepak Rawat 
---
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c   |  5 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h  | 10 +++
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c | 33 +---
 3 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
index 869aeaec2f86..18e4b329e563 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
@@ -463,6 +463,7 @@ void vmw_bo_bo_free(struct ttm_buffer_object *bo)
struct vmw_buffer_object *vmw_bo = vmw_buffer_object(bo);
 
WARN_ON(vmw_bo->dirty);
+   WARN_ON(!RB_EMPTY_ROOT(&vmw_bo->res_tree));
vmw_bo_unmap(vmw_bo);
kfree(vmw_bo);
 }
@@ -479,6 +480,7 @@ static void vmw_user_bo_destroy(struct ttm_buffer_object 
*bo)
struct vmw_buffer_object *vbo = &vmw_user_bo->vbo;
 
WARN_ON(vbo->dirty);
+   WARN_ON(!RB_EMPTY_ROOT(&vbo->res_tree));
vmw_bo_unmap(vbo);
ttm_prime_object_kfree(vmw_user_bo, prime);
 }
@@ -514,8 +516,7 @@ int vmw_bo_init(struct vmw_private *dev_priv,
memset(vmw_bo, 0, sizeof(*vmw_bo));
BUILD_BUG_ON(TTM_MAX_BO_PRIORITY <= 3);
vmw_bo->base.priority = 3;
-
-   INIT_LIST_HEAD(&vmw_bo->res_list);
+   vmw_bo->res_tree = RB_ROOT;
 
ret = ttm_bo_init(bdev, &vmw_bo->base, size,
  ttm_bo_type_device, placement,
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h 
b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 7944dbbbdd72..53f8522ae032 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -100,7 +100,7 @@ struct vmw_fpriv {
 /**
  * struct vmw_buffer_object - TTM buffer object with vmwgfx additions
  * @base: The TTM buffer object
- * @res_list: List of resources using this buffer object as a backing MOB
+ * @res_tree: RB tree of resources using this buffer object as a backing MOB
  * @pin_count: pin depth
  * @dx_query_ctx: DX context if this buffer object is used as a DX query MOB
  * @map: Kmap object for semi-persistent mappings
@@ -109,7 +109,7 @@ struct vmw_fpriv {
  */
 struct vmw_buffer_object {
struct ttm_buffer_object base;
-   struct list_head res_list;
+   struct rb_root res_tree;
s32 pin_count;
/* Not ref-counted.  Protected by binding_mutex */
struct vmw_resource *dx_query_ctx;
@@ -157,8 +157,8 @@ struct vmw_res_func;
  * pin-count greater than zero. It is not on the resource LRU lists and its
  * backup buffer is pinned. Hence it can't be evicted.
  * @func: Method vtable for this resource. Immutable.
+ * @mob_node: Node for the MOB backup rbtree. Protected by @backup reserved.
  * @lru_head: List head for the LRU list. Protected by 
@dev_priv::resource_lock.
- * @mob_head: List head for the MOB backup list. Protected by @backup reserved.
  * @binding_head: List head for the context binding list. Protected by
  * the @dev_priv::binding_mutex
  * @res_free: The resource destructor.
@@ -179,8 +179,8 @@ struct vmw_resource {
unsigned long backup_offset;
unsigned long pin_count;
const struct vmw_res_func *func;
+   struct rb_node mob_node;
struct list_head lru_head;
-   struct list_head mob_head;
struct list_head binding_head;
struct vmw_resource_dirty *dirty;
void (*res_free) (struct vmw_resource *res);
@@ -733,7 +733,7 @@ void vmw_resource_dirty_update(struct vmw_resource *res, 
pgoff_t start,
  */
 static inline bool vmw_resource_mob_attached(const struct vmw_resource *res)
 {
-   return !list_empty(&res->mob_head);
+   return !RB_EMPTY_NODE(&res->mob_node);
 }
 
 /**
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index e4c97a4cf2ff..328ad46076ff 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -40,11 +40,24 @@
 void vmw_resource_mob_attach(struct vmw_resource *res)
 {
struct vmw_buffer_object *backup = res->backup;
+   struct rb_node **new = &backup->res_tree.rb_node, *parent = NULL;
 
dma_resv_assert_held(res->backup->base.base.resv);
res->used_prio = (res->res_dirty) ? res->func->dirty_prio :
res->func->prio;
-   list_add_tail(&res->mob_head, &backup->res_list);
+
+   while (*new) {
+   struct vmw_resource *this =
+   container_of(*new, struct vmw_resource, mob_node);
+
+   parent = *new;
+   new = (res->backup_offset < 

[PATCH v6 0/8] Emulated coherent graphics memory take 2

2019-10-14 Thread VMware
From: Thomas Hellström 

Graphics APIs like OpenGL 4.4 and Vulkan require the graphics driver
to provide coherent graphics memory, meaning that the GPU sees any
content written to the coherent memory on the next GPU operation that
touches that memory, and the CPU sees any content written by the GPU
to that memory immediately after any fence object trailing the GPU
operation is signaled.

Paravirtual drivers that otherwise require explicit synchronization
need to do this by hooking up dirty tracking to pagefault handlers
and buffer object validation.

Provide mm helpers needed for this and that also allow for huge pmd-
and pud entries (patch 1-3), and the associated vmwgfx code (patch 4-7).

The code has been tested and exercised by a tailored version of mesa
where we disable all explicit synchronization and assume graphics memory
is coherent. The performance loss varies of course; a typical number is
around 5%.

I would like to merge this code through the DRM tree, so an ack to include
the new mm helpers in that merge would be greatly appreciated.

Changes since RFC:
- Merge conflict changes moved to the correct patch. Fixes intra-patchset
  compile errors.
- Be more aggressive when turning ttm vm code into helpers. This makes sure
  we can use a const qualifier on the vmwgfx vm_ops.
- Reinstate a lost comment and fix an error path that was broken when turning
  the ttm vm code into helpers.
- Remove explicit type-casts of struct vm_area_struct::vm_private_data
- Clarify the locking inversion that prevents us from using the mm
  pagewalk code.

Changes since v1:
- Removed the vmwgfx maintainer entry for as_dirty_helpers.c, updated
  commit message accordingly
- Removed the TTM patches from the series as they are merged separately
  through DRM.
Changes since v2:
- Split out the pagewalk code from as_dirty_helpers.c and document locking.
- Add pre_vma and post_vma callbacks to the pagewalk code.
- Remove huge pmd and -pud asserts that would trip when we protect vmas with
  struct address_space::i_mmap_rwsem rather than with
  struct vm_area_struct::mmap_sem.
- Do some naming cleanup in as_dirty_helpers.c
Changes since v3:
- Extensive renaming of the dirty helpers including the filename.
- Update walk_page_mapping() doc.
- Update the pagewalk code to not unconditionally split pmds if a pte_entry()
  callback is present. Update the dirty helper pmd_entry accordingly.
- Use separate walk ops for the dirty helpers.
- Update the pagewalk code to take the pagetable lock in walk_pte_range.
Changes since v4:
- Fix pte pointer confusion in patch 2/8
- Skip the pagewalk code conditional split patch for now, and update the
  mapping_dirty_helper accordingly. That problem will be solved in a cleaner
  way in a follow-up patchset.
Changes since v5:
- Fix tlb flushing when we have other pending tlb flushes.
  
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 


[PATCH v6 7/8] drm/vmwgfx: Implement an infrastructure for read-coherent resources

2019-10-14 Thread VMware
From: Thomas Hellstrom 

Similar to write-coherent resources, make sure that from the user-space
point of view, GPU rendered contents is automatically available for
reading by the CPU.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
Reviewed-by: Deepak Rawat 
---
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h   |   7 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c|  77 -
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c  | 103 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h |   2 +
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c|   3 +-
 5 files changed, 181 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h 
b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 53f8522ae032..729a2e93acf1 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -680,7 +680,8 @@ extern void vmw_resource_unreference(struct vmw_resource 
**p_res);
 extern struct vmw_resource *vmw_resource_reference(struct vmw_resource *res);
 extern struct vmw_resource *
 vmw_resource_reference_unless_doomed(struct vmw_resource *res);
-extern int vmw_resource_validate(struct vmw_resource *res, bool intr);
+extern int vmw_resource_validate(struct vmw_resource *res, bool intr,
+bool dirtying);
 extern int vmw_resource_reserve(struct vmw_resource *res, bool interruptible,
bool no_backup);
 extern bool vmw_resource_needs_backup(const struct vmw_resource *res);
@@ -724,6 +725,8 @@ void vmw_resource_mob_attach(struct vmw_resource *res);
 void vmw_resource_mob_detach(struct vmw_resource *res);
 void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
   pgoff_t end);
+int vmw_resources_clean(struct vmw_buffer_object *vbo, pgoff_t start,
+   pgoff_t end, pgoff_t *num_prefault);
 
 /**
  * vmw_resource_mob_attached - Whether a resource currently has a mob attached
@@ -1417,6 +1420,8 @@ int vmw_bo_dirty_add(struct vmw_buffer_object *vbo);
 void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res);
 void vmw_bo_dirty_clear_res(struct vmw_resource *res);
 void vmw_bo_dirty_release(struct vmw_buffer_object *vbo);
+void vmw_bo_dirty_unmap(struct vmw_buffer_object *vbo,
+   pgoff_t start, pgoff_t end);
 vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf);
 vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf);
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
index 060c1e492f25..f07aa857587c 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -155,7 +155,6 @@ static void vmw_bo_dirty_scan_mkwrite(struct 
vmw_buffer_object *vbo)
}
 }
 
-
 /**
  * vmw_bo_dirty_scan - Scan for dirty pages and add them to the dirty
  * tracking structure
@@ -173,6 +172,53 @@ void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo)
vmw_bo_dirty_scan_mkwrite(vbo);
 }
 
+/**
+ * vmw_bo_dirty_pre_unmap - write-protect and pick up dirty pages before
+ * an unmap_mapping_range operation.
+ * @vbo: The buffer object,
+ * @start: First page of the range within the buffer object.
+ * @end: Last page of the range within the buffer object + 1.
+ *
+ * If we're using the _PAGETABLE scan method, we may leak dirty pages
+ * when calling unmap_mapping_range(). This function makes sure we pick
+ * up all dirty pages.
+ */
+static void vmw_bo_dirty_pre_unmap(struct vmw_buffer_object *vbo,
+  pgoff_t start, pgoff_t end)
+{
+   struct vmw_bo_dirty *dirty = vbo->dirty;
+   unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+   struct address_space *mapping = vbo->base.bdev->dev_mapping;
+
+   if (dirty->method != VMW_BO_DIRTY_PAGETABLE || start >= end)
+   return;
+
+   wp_shared_mapping_range(mapping, start + offset, end - start);
+   clean_record_shared_mapping_range(mapping, start + offset,
+ end - start, offset,
+ &dirty->bitmap[0], &dirty->start,
+ &dirty->end);
+}
+
+/**
+ * vmw_bo_dirty_unmap - Clear all ptes pointing to a range within a bo
+ * @vbo: The buffer object,
+ * @start: First page of the range within the buffer object.
+ * @end: Last page of the range within the buffer object + 1.
+ *
+ * This is similar to ttm_bo_unmap_virtual_locked() except it takes a subrange.
+ */
+void vmw_bo_dirty_unmap(struct vmw_buffer_object *vbo,
+   pgoff_t start, pgoff_t end)
+{
+   unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+   struct address_space *mapping = vbo->base.bdev->dev_mapping;
+
+   vmw_bo_dirty_pre_unmap(vbo, start, end);
+   

[PATCH v6 4/8] mm: Add write-protect and clean utilities for address space ranges

2019-10-14 Thread VMware
From: Thomas Hellstrom 

Add two utilities to 1) write-protect and 2) clean all ptes pointing into
a range of an address space.
The utilities are intended to aid in tracking dirty pages (either
driver-allocated system memory or pci device memory).
The write-protect utility should be used in conjunction with
page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
accesses. Typically one would want to use this on sparse accesses into
large memory regions. The clean utility should be used to utilize
hardware dirtying functionality and avoid the overhead of page-faults,
typically on large accesses into small memory regions.
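
A usage sketch tying the two helpers together (hypothetical driver code;
the call shapes mirror the vmwgfx call sites later in this series):

/* 1) Write-protect the range so the next CPU write faults and is seen
 *    via page_mkwrite()/pfn_mkwrite(). */
wp_shared_mapping_range(mapping, first_index, nr);

/* 2) Later, harvest dirty state without taking faults: clean the ptes
 *    and record the dirty page offsets in a caller-provided bitmap and
 *    start/end range (fields of a hypothetical tracking struct). */
clean_record_shared_mapping_range(mapping, first_index, nr,
                                  first_index /* bitmap_pgoff */,
                                  &dirty->bitmap[0], &dirty->start,
                                  &dirty->end);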

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
---
 include/linux/mm.h |  13 +-
 mm/Kconfig |   3 +
 mm/Makefile|   1 +
 mm/mapping_dirty_helpers.c | 315 +
 4 files changed, 331 insertions(+), 1 deletion(-)
 create mode 100644 mm/mapping_dirty_helpers.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..4bc93477375e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2637,7 +2637,6 @@ typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, 
void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
   unsigned long size, pte_fn_t fn, void *data);
 
-
 #ifdef CONFIG_PAGE_POISONING
 extern bool page_poisoning_enabled(void);
 extern void kernel_poison_pages(struct page *page, int numpages, int enable);
@@ -2878,5 +2877,17 @@ static inline int pages_identical(struct page *page1, 
struct page *page2)
return !memcmp_pages(page1, page2);
 }
 
+#ifdef CONFIG_MAPPING_DIRTY_HELPERS
+unsigned long clean_record_shared_mapping_range(struct address_space *mapping,
+   pgoff_t first_index, pgoff_t nr,
+   pgoff_t bitmap_pgoff,
+   unsigned long *bitmap,
+   pgoff_t *start,
+   pgoff_t *end);
+
+unsigned long wp_shared_mapping_range(struct address_space *mapping,
+ pgoff_t first_index, pgoff_t nr);
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..550f7aceb679 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
bool
 
+config MAPPING_DIRTY_HELPERS
+bool
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..1937cc251883 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,3 +107,4 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
new file mode 100644
index ..71070dda9643
--- /dev/null
+++ b/mm/mapping_dirty_helpers.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * struct wp_walk - Private struct for pagetable walk callbacks
+ * @range: Range for mmu notifiers
+ * @tlbflush_start: Address of first modified pte
+ * @tlbflush_end: Address of last modified pte + 1
+ * @total: Total number of modified ptes
+ */
+struct wp_walk {
+   struct mmu_notifier_range range;
+   unsigned long tlbflush_start;
+   unsigned long tlbflush_end;
+   unsigned long total;
+};
+
+/**
+ * wp_pte - Write-protect a pte
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @walk: pagetable walk callback argument
+ *
+ * The function write-protects a pte and records the range in
+ * virtual address space of touched ptes for efficient range TLB flushes.
+ */
+static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+   struct wp_walk *wpwalk = walk->private;
+   pte_t ptent = *pte;
+
+   if (pte_write(ptent)) {
+   pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+
+   ptent = pte_wrprotect(old_pte);
+   ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
+   wpwalk->total++;
+   wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
+   wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
+  addr + PAGE_SIZE);
+   }
+
+   return 0;
+}
+
+/**
+ * struct clean_walk - Private struct for the clean_record_pte function.
+ * @base: struct wp_walk we derive from
+ * @bitmap_pgoff: Address_space Page 

[PATCH v6 2/8] mm: pagewalk: Take the pagetable lock in walk_pte_range()

2019-10-14 Thread VMware
From: Thomas Hellstrom 

Without the lock, anybody modifying a pte from within this function might
have it concurrently modified by someone else.

Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Suggested-by: Linus Torvalds 
Signed-off-by: Thomas Hellstrom 
Acked-by: Kirill A. Shutemov 
---
 mm/pagewalk.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..c5fa42cab14f 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -10,8 +10,9 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
unsigned long end,
pte_t *pte;
int err = 0;
const struct mm_walk_ops *ops = walk->ops;
+   spinlock_t *ptl;
 
-   pte = pte_offset_map(pmd, addr);
+   pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (;;) {
err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
@@ -22,7 +23,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
unsigned long end,
pte++;
}
 
-   pte_unmap(pte);
+   pte_unmap_unlock(pte, ptl);
return err;
 }
 
-- 
2.20.1



[PATCH v6 8/8] drm/vmwgfx: Add surface dirty-tracking callbacks

2019-10-14 Thread VMware
From: Thomas Hellstrom 

Add the callbacks necessary to implement emulated coherent memory for
surfaces. Add a flag to the gb_surface_create ioctl to indicate that
surface memory should be coherent.
Also bump the drm minor version to signal the availability of coherent
surfaces.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
Reviewed-by: Deepak Rawat 
---
 .../device_include/svga3d_surfacedefs.h   | 233 ++-
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h   |   4 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c   | 395 +-
 include/uapi/drm/vmwgfx_drm.h |   4 +-
 4 files changed, 629 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h 
b/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
index f2bfd3d80598..61414f105c67 100644
--- a/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
+++ b/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
@@ -1280,7 +1280,6 @@ svga3dsurface_get_pixel_offset(SVGA3dSurfaceFormat format,
return offset;
 }
 
-
 static inline u32
 svga3dsurface_get_image_offset(SVGA3dSurfaceFormat format,
   surf_size_struct baseLevelSize,
@@ -1375,4 +1374,236 @@ 
svga3dsurface_is_screen_target_format(SVGA3dSurfaceFormat format)
return svga3dsurface_is_dx_screen_target_format(format);
 }
 
+/**
+ * struct svga3dsurface_mip - Mipmap level information
+ * @bytes: Bytes required in the backing store of this mipmap level.
+ * @img_stride: Byte stride per image.
+ * @row_stride: Byte stride per block row.
+ * @size: The size of the mipmap.
+ */
+struct svga3dsurface_mip {
+   size_t bytes;
+   size_t img_stride;
+   size_t row_stride;
+   struct drm_vmw_size size;
+
+};
+
+/**
+ * struct svga3dsurface_cache - Cached surface information
+ * @desc: Pointer to the surface descriptor
+ * @mip: Array of mipmap level information. Valid size is @num_mip_levels.
+ * @mip_chain_bytes: Bytes required in the backing store for the whole chain
+ * of mip levels.
+ * @sheet_bytes: Bytes required in the backing store for a sheet
+ * representing a single sample.
+ * @num_mip_levels: Valid size of the @mip array. Number of mipmap levels in
+ * a chain.
+ * @num_layers: Number of slices in an array texture or number of faces in
+ * a cubemap texture.
+ */
+struct svga3dsurface_cache {
+   const struct svga3d_surface_desc *desc;
+   struct svga3dsurface_mip mip[DRM_VMW_MAX_MIP_LEVELS];
+   size_t mip_chain_bytes;
+   size_t sheet_bytes;
+   u32 num_mip_levels;
+   u32 num_layers;
+};
+
+/**
+ * struct svga3dsurface_loc - Surface location
+ * @sub_resource: Surface subresource. Defined as layer * num_mip_levels +
+ * mip_level.
+ * @x: X coordinate.
+ * @y: Y coordinate.
+ * @z: Z coordinate.
+ */
+struct svga3dsurface_loc {
+   u32 sub_resource;
+   u32 x, y, z;
+};
+
+/**
+ * svga3dsurface_subres - Compute the subresource from layer and mipmap.
+ * @cache: Surface layout data.
+ * @mip_level: The mipmap level.
+ * @layer: The surface layer (face or array slice).
+ *
+ * Return: The subresource.
+ */
+static inline u32 svga3dsurface_subres(const struct svga3dsurface_cache *cache,
+  u32 mip_level, u32 layer)
+{
+   return cache->num_mip_levels * layer + mip_level;
+}
+
+/**
+ * svga3dsurface_setup_cache - Build a surface cache entry
+ * @size: The surface base level dimensions.
+ * @format: The surface format.
+ * @num_mip_levels: Number of mipmap levels.
+ * @num_layers: Number of layers.
+ * @cache: Pointer to a struct svga3dsurface_cache object to be filled in.
+ *
+ * Return: Zero on success, -EINVAL on invalid surface layout.
+ */
+static inline int svga3dsurface_setup_cache(const struct drm_vmw_size *size,
+   SVGA3dSurfaceFormat format,
+   u32 num_mip_levels,
+   u32 num_layers,
+   u32 num_samples,
+   struct svga3dsurface_cache *cache)
+{
+   const struct svga3d_surface_desc *desc;
+   u32 i;
+
+   memset(cache, 0, sizeof(*cache));
+   cache->desc = desc = svga3dsurface_get_desc(format);
+   cache->num_mip_levels = num_mip_levels;
+   cache->num_layers = num_layers;
+   for (i = 0; i < cache->num_mip_levels; i++) {
+   struct svga3dsurface_mip *mip = &cache->mip[i];
+
+   mip->size = svga3dsurface_get_mip_size(*size, i);
+   mip->bytes = svga3dsurface_get_image_buffer_size
+   (desc, &mip->size, 0);
+   mip->row_stride =
+   __KERNEL_DIV_ROUND_UP(mip->size.width,
+ 

[PATCH v6 1/8] mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock()

2019-10-14 Thread VMware
From: Thomas Hellstrom 

The caller needs to make sure that the vma is not torn down during the
lock operation and can also use the i_mmap_rwsem for file-backed vmas.
Remove the BUG_ON. We could, as an alternative, add a test that either
vma->vm_mm->mmap_sem or vma->vm_file->f_mapping->i_mmap_rwsem are held.
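
A sketch of the alternative check mentioned above (purely illustrative,
not part of this patch; it assumes file-backed vmas are protected by
i_mmap_rwsem as described in the commit message):

	/* Hypothetical replacement for the removed assert */
	VM_WARN_ON_ONCE(!rwsem_is_locked(&vma->vm_mm->mmap_sem) &&
			!(vma->vm_file &&
			  rwsem_is_locked(&vma->vm_file->f_mapping->i_mmap_rwsem)));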

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
Acked-by: Kirill A. Shutemov 
---
 include/linux/huge_mm.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93d5cf0bc716..0b84e13e88e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -216,7 +216,6 @@ static inline int is_swap_pmd(pmd_t pmd)
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
 {
-   VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
@@ -225,7 +224,6 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
struct vm_area_struct *vma)
 {
-   VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
if (pud_trans_huge(*pud) || pud_devmap(*pud))
return __pud_trans_huge_lock(pud, vma);
else
-- 
2.20.1



[PATCH v6 5/8] drm/vmwgfx: Implement an infrastructure for write-coherent resources

2019-10-14 Thread VMware
backup_dirty;
+   u32 res_dirty : 1;
+   u32 backup_dirty : 1;
+   u32 coherent : 1;
struct vmw_buffer_object *backup;
unsigned long backup_offset;
unsigned long pin_count;
@@ -177,6 +182,7 @@ struct vmw_resource {
struct list_head lru_head;
struct list_head mob_head;
struct list_head binding_head;
+   struct vmw_resource_dirty *dirty;
void (*res_free) (struct vmw_resource *res);
void (*hw_destroy) (struct vmw_resource *res);
 };
@@ -716,6 +722,8 @@ extern void vmw_resource_evict_all(struct vmw_private 
*dev_priv);
 extern void vmw_resource_unbind_list(struct vmw_buffer_object *vbo);
 void vmw_resource_mob_attach(struct vmw_resource *res);
 void vmw_resource_mob_detach(struct vmw_resource *res);
+void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
+  pgoff_t end);
 
 /**
  * vmw_resource_mob_attached - Whether a resource currently has a mob attached
@@ -1403,6 +1411,15 @@ int vmw_host_log(const char *log);
 #define VMW_DEBUG_USER(fmt, ...)  \
DRM_DEBUG_DRIVER(fmt, ##__VA_ARGS__)
 
+/* Resource dirtying - vmwgfx_page_dirty.c */
+void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo);
+int vmw_bo_dirty_add(struct vmw_buffer_object *vbo);
+void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res);
+void vmw_bo_dirty_clear_res(struct vmw_resource *res);
+void vmw_bo_dirty_release(struct vmw_buffer_object *vbo);
+vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf);
+vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf);
+
 /**
  * VMW_DEBUG_KMS - Debug output for kernel mode-setting
  *
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
index ff86d49dc5e8..934ad7c0c342 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
@@ -2560,7 +2560,6 @@ static int vmw_cmd_dx_check_subresource(struct 
vmw_private *dev_priv,
 offsetof(typeof(*cmd), sid));
 
cmd = container_of(header, typeof(*cmd), header);
-
return vmw_cmd_res_check(dev_priv, sw_context, vmw_res_surface,
 VMW_RES_DIRTY_NONE, user_surface_converter,
 &cmd->sid, NULL);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
new file mode 100644
index ..060c1e492f25
--- /dev/null
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/**
+ *
+ * Copyright 2019 VMware, Inc., Palo Alto, CA., USA
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sub license, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDERS, AUTHORS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM,
+ * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+ * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
+ * USE OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ **/
+#include "vmwgfx_drv.h"
+
+/*
+ * Different methods for tracking dirty:
+ * VMW_BO_DIRTY_PAGETABLE - Scan the pagetable for hardware dirty bits
+ * VMW_BO_DIRTY_MKWRITE - Write-protect page table entries and record write-
+ * accesses in the VM mkwrite() callback
+ */
+enum vmw_bo_dirty_method {
+   VMW_BO_DIRTY_PAGETABLE,
+   VMW_BO_DIRTY_MKWRITE,
+};
+
+/*
+ * No dirtied pages at scan trigger a transition to the _MKWRITE method,
+ * similarly a certain percentage of dirty pages trigger a transition to
+ * the _PAGETABLE method. How many triggers should we wait for before
+ * changing method?
+ */
+#define VMW_DIRTY_NUM_CHANGE_TRIGGERS 2
+
+/* Percentage to trigger a transition to the _PAGETABLE method */
+#define VMW_DIRTY_PERCENTAGE 10
+
+/**
+ * struct vmw_bo_dirty - Dirty information for buffer objects
+ * @start: First currently dirty bit
+ * @end: Last currently dirty bit + 1
+ * @method: The currently used dirty method
+ * @change_count: Number of consecutive method ch

[RFC PATCH 0/4] mm: pagewalk: Rework callback return values and optionally skip the pte level

2019-10-10 Thread VMware
This series converts all users of pagewalk positive callback return values
to use negative values instead, so that the positive values are free for
pagewalk control. Then the return value PAGE_WALK_CONTINUE is introduced.
That value is intended for callbacks to indicate that they've handled the
entry and it's not necessary to split and go to a lower level.
Initially this is implemented only for pmd_entry(), but could (should?)
at some point be used also for pud_entry().

Finally the mapping_dirty_helpers pagewalk is modified to use the new value
indicating that it has processed a read-only huge pmd entry and does not want
it to be split and handled at the pte level.

Note: This series still needs some significant testing and another re-audit
to verify that there are no more pagewalk users relying on positive return
values.
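
As a rough illustration of the intended callback usage (the callback name
below is made up; only PAGE_WALK_CONTINUE comes from the series), a
pmd_entry() that fully handles a huge entry and asks the walker not to
split it could look like:

	static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
				     unsigned long end, struct mm_walk *walk)
	{
		pmd_t pmdval = pmd_read_atomic(pmd);

		/* Huge entry handled here: skip splitting and the pte level */
		if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
			return PAGE_WALK_CONTINUE;

		/* Otherwise let the walker descend to the pte level as usual */
		return 0;
	}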


[RFC PATCH 1/4] mm: Have the mempolicy pagewalk to avoid positive callback return codes

2019-10-10 Thread VMware
From: Linus Torvalds 

The pagewalk code is being reworked to have positive callback return codes
do walk control. Avoid using positive return codes: "1" is replaced by
"-EBUSY".

Co-developed-by: Thomas Hellstrom 
Signed-off-by: Thomas Hellstrom 
---
 mm/mempolicy.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..df34c7498c27 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -482,8 +482,8 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, 
unsigned long addr,
  *
  * queue_pages_pte_range() has three possible return values:
  * 0 - pages are placed on the right node or queued successfully.
- * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
+ * -EBUSY - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ *  specified.
  * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
  *on a node that does not follow the policy.
  */
@@ -503,7 +503,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long 
addr,
if (ptl) {
ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
if (ret != 2)
-   return ret;
+   return (ret == 1) ? -EBUSY : ret;
}
/* THP was split, fall through to pte walk */
 
@@ -546,7 +546,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long 
addr,
cond_resched();
 
if (has_unmovable)
-   return 1;
+   return -EBUSY;
 
return addr != end ? -EIO : 0;
 }
@@ -669,9 +669,9 @@ static const struct mm_walk_ops queue_pages_walk_ops = {
  * passed via @private.
  *
  * queue_pages_range() has three possible return values:
- * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
- * specified.
  * 0 - queue pages successfully or no misplaced page.
+ * -EBUSY - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
+ * specified.
  * -EIO - there is misplaced page and only MPOL_MF_STRICT was specified.
  */
 static int
@@ -1285,7 +1285,7 @@ static long do_mbind(unsigned long start, unsigned long 
len,
ret = queue_pages_range(mm, start, end, nmask,
  flags | MPOL_MF_INVERT, &pagelist);
 
-   if (ret < 0) {
+   if (ret < 0 && ret != -EBUSY) {
err = -EIO;
goto up_out;
}
@@ -1303,7 +1303,7 @@ static long do_mbind(unsigned long start, unsigned long 
len,
putback_movable_pages(&pagelist);
}
 
-   if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
+   if ((ret < 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
err = -EIO;
} else
putback_movable_pages(&pagelist);
-- 
2.21.0



[RFC PATCH 2/4] fs: task_mmu: Have the pagewalk avoid positive callback return codes

2019-10-10 Thread VMware
From: Thomas Hellstrom 

The pagewalk code is being reworked to have positive callback return codes
mean "walk control". Avoid using positive return codes: "1" is replaced by
"-ENOBUFS".

Signed-off-by: Thomas Hellstrom 
---
 fs/proc/task_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9442631fd4af..ef11969d9ba1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1265,7 +1265,7 @@ struct pagemapread {
 #define PM_SWAPBIT_ULL(62)
 #define PM_PRESENT BIT_ULL(63)
 
-#define PM_END_OF_BUFFER1
+#define PM_END_OF_BUFFER(-ENOBUFS)
 
 static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
 {
-- 
2.21.0



[RFC PATCH 3/4] mm: pagewalk: Disallow user positive callback return values and use them for walk control

2019-10-10 Thread VMware
From: Linus Torvalds 

When we have both a pmd_entry() and a pte_entry() callback, in some
situations it is desirable not to traverse the pte level.
Reserve positive callback return values for walk control and define a
return value PAGE_WALK_CONTINUE that means skip lower level traversal
and continue the walk.

Co-developed-by: Thomas Hellstrom 
Signed-off-by: Thomas Hellstrom 
---
 include/linux/pagewalk.h | 6 ++
 mm/pagewalk.c| 9 ++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 6ec82e92c87f..d9e5d1927315 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -4,6 +4,12 @@
 
 #include 
 
+/*
+ * pmd_entry() Return code meaning skip to next entry.
+ * Don't look for lower levels
+ */
+#define PAGE_WALK_CONTINUE 1
+
 struct mm_walk;
 
 /**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ea0b9e606ad1..d2483d432fda 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -52,8 +52,12 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
unsigned long end,
 */
if (ops->pmd_entry)
err = ops->pmd_entry(pmd, addr, next, walk);
-   if (err)
+   if (err < 0)
break;
+   if (err == PAGE_WALK_CONTINUE) {
+   err = 0;
+   continue;
+   }
 
/*
 * Check this here so we only break down trans_huge
@@ -291,8 +295,7 @@ static int __walk_page_range(unsigned long start, unsigned 
long end,
  *
  *  - 0  : succeeded to handle the current entry, and if you don't reach the
  * end address yet, continue to walk.
- *  - >0 : succeeded to handle the current entry, and return to the caller
- * with caller specific value.
+ *  - >0 : Reserved for walk control. Use only PAGE_WALK_XX values.
  *  - <0 : failed to handle the current entry, and return to the caller
  * with error code.
  *
-- 
2.21.0



[RFC PATCH 4/4] mm: mapping_dirty_helpers: Handle huge pmds correctly

2019-10-10 Thread VMware
From: Thomas Hellstrom 

We always do dirty tracking on the PTE level. This means that any huge
pmds we encounter should be read-only and not dirty: We can just skip
those. Write-enabled huge pmds should not exist. They should have been
split when made write-enabled. Warn and attempt to split them.

Signed-off-by: Thomas Hellstrom 
---
 mm/mapping_dirty_helpers.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 799b9154b48f..f61bb9de1530 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -115,11 +115,18 @@ static int clean_record_pte(pte_t *pte, unsigned long 
addr,
 static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long 
end,
  struct mm_walk *walk)
 {
-   /* Dirty-tracking should be handled on the pte level */
pmd_t pmdval = pmd_read_atomic(pmd);
 
-   if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
-   WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
+   /*
+* Dirty-tracking should be handled on the pte level, and write-
+* enabled huge PMDS should never have been created. Warn on those.
+* Read-only huge PMDS can't be dirty so we just skip them.
+*/
+   if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
+   if (WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval)))
+   return 0;
+   return PAGE_WALK_CONTINUE;
+   }
 
return 0;
 }
-- 
2.21.0



Re: [PATCH v5 4/8] mm: Add write-protect and clean utilities for address space ranges

2019-10-10 Thread VMware

On 10/10/19 3:05 PM, Peter Zijlstra wrote:

On Thu, Oct 10, 2019 at 02:43:10PM +0200, Thomas Hellström (VMware) wrote:


+/**
+ * struct wp_walk - Private struct for pagetable walk callbacks
+ * @range: Range for mmu notifiers
+ * @tlbflush_start: Address of first modified pte
+ * @tlbflush_end: Address of last modified pte + 1
+ * @total: Total number of modified ptes
+ */
+struct wp_walk {
+   struct mmu_notifier_range range;
+   unsigned long tlbflush_start;
+   unsigned long tlbflush_end;
+   unsigned long total;
+};
+
+/**
+ * wp_pte - Write-protect a pte
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @walk: pagetable walk callback argument
+ *
+ * The function write-protects a pte and records the range in
+ * virtual address space of touched ptes for efficient range TLB flushes.
+ */
+static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+   struct wp_walk *wpwalk = walk->private;
+   pte_t ptent = *pte;
+
+   if (pte_write(ptent)) {
+   pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+
+   ptent = pte_wrprotect(old_pte);
+   ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
+   wpwalk->total++;
+   wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
+   wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
+  addr + PAGE_SIZE);
+   }
+
+   return 0;
+}
+/*
+ * wp_clean_pre_vma - The pagewalk pre_vma callback.
+ *
+ * The pre_vma callback performs the cache flush, stages the tlb flush
+ * and calls the necessary mmu notifiers.
+ */
+static int wp_clean_pre_vma(unsigned long start, unsigned long end,
+   struct mm_walk *walk)
+{
+   struct wp_walk *wpwalk = walk->private;
+
+   wpwalk->tlbflush_start = end;
+   wpwalk->tlbflush_end = start;
+
+   mmu_notifier_range_init(&wpwalk->range, MMU_NOTIFY_PROTECTION_PAGE, 0,
+   walk->vma, walk->mm, start, end);
+   mmu_notifier_invalidate_range_start(&wpwalk->range);
+   flush_cache_range(walk->vma, start, end);
+
+   /*
+* We're not using tlb_gather_mmu() since typically
+* only a small subrange of PTEs are affected, whereas
+* tlb_gather_mmu() records the full range.
+*/
+   inc_tlb_flush_pending(walk->mm);
+
+   return 0;
+}
+
+/*
+ * wp_clean_post_vma - The pagewalk post_vma callback.
+ *
+ * The post_vma callback performs the tlb flush and calls necessary mmu
+ * notifiers.
+ */
+static void wp_clean_post_vma(struct mm_walk *walk)
+{
+   struct wp_walk *wpwalk = walk->private;
+
+   if (wpwalk->tlbflush_end > wpwalk->tlbflush_start)
+   flush_tlb_range(walk->vma, wpwalk->tlbflush_start,
+   wpwalk->tlbflush_end);
+
+   mmu_notifier_invalidate_range_end(&wpwalk->range);
+   dec_tlb_flush_pending(walk->mm);
+}
+/**
+ * wp_shared_mapping_range - Write-protect all ptes in an address space range
+ * @mapping: The address_space we want to write protect
+ * @first_index: The first page offset in the range
+ * @nr: Number of incremental page offsets to cover
+ *
+ * Note: This function currently skips transhuge page-table entries, since
+ * it's intended for dirty-tracking on the PTE level. It will warn on
+ * encountering transhuge write-enabled entries, though, and can easily be
+ * extended to handle them as well.
+ *
+ * Return: The number of ptes actually write-protected. Note that
+ * already write-protected ptes are not counted.
+ */
+unsigned long wp_shared_mapping_range(struct address_space *mapping,
+ pgoff_t first_index, pgoff_t nr)
+{
+   struct wp_walk wpwalk = { .total = 0 };
+
+   i_mmap_lock_read(mapping);
+   WARN_ON(walk_page_mapping(mapping, first_index, nr, &wp_walk_ops,
+ &wpwalk));
+   i_mmap_unlock_read(mapping);
+
+   return wpwalk.total;
+}

That's a read lock, this means there's concurrency to self. What happens
if someone does two concurrent wp_shared_mapping_range() on the same
mapping?

The thing is, because of pte_wrprotect() the iteration that starts last
will see a smaller pte_write range, if it completes first and does
flush_tlb_range(), it will only flush a partial range.

This is exactly what {inc,dec}_tlb_flush_pending() is for, but you're
not using mm_tlb_flush_nested() to detect the situation and do a bigger
flush.

Or if you're not needing that, then I'm missing why.


Good catch. Thanks,

Yes the read lock is not intended to protect against concurrent users 
but to protect the vmas from disappearing under us. Since it 
fundamentally makes no sense having two concurrent threads picking up 
dirty ptes on the same address_space range we have an external 
range-bas
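
A minimal sketch of the kind of nested-flush handling suggested above,
assuming mm_tlb_flush_nested() is used to detect a concurrent flush
(illustrative only, not code from this thread):

	static void wp_clean_post_vma(struct mm_walk *walk)
	{
		struct wp_walk *wpwalk = walk->private;

		if (mm_tlb_flush_nested(walk->mm))
			/* Concurrent modifiers: flush the whole notifier range */
			flush_tlb_range(walk->vma, wpwalk->range.start,
					wpwalk->range.end);
		else if (wpwalk->tlbflush_end > wpwalk->tlbflush_start)
			flush_tlb_range(walk->vma, wpwalk->tlbflush_start,
					wpwalk->tlbflush_end);

		mmu_notifier_invalidate_range_end(&wpwalk->range);
		dec_tlb_flush_pending(walk->mm);
	}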

[PATCH v5 7/8] drm/vmwgfx: Implement an infrastructure for read-coherent resources

2019-10-10 Thread VMware
From: Thomas Hellstrom 

Similar to write-coherent resources, make sure that from the user-space
point of view, GPU rendered contents is automatically available for
reading by the CPU.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
Reviewed-by: Deepak Rawat 
---
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h   |   7 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c|  77 -
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c  | 103 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_resource_priv.h |   2 +
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c|   3 +-
 5 files changed, 181 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h 
b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 53f8522ae032..729a2e93acf1 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -680,7 +680,8 @@ extern void vmw_resource_unreference(struct vmw_resource 
**p_res);
 extern struct vmw_resource *vmw_resource_reference(struct vmw_resource *res);
 extern struct vmw_resource *
 vmw_resource_reference_unless_doomed(struct vmw_resource *res);
-extern int vmw_resource_validate(struct vmw_resource *res, bool intr);
+extern int vmw_resource_validate(struct vmw_resource *res, bool intr,
+bool dirtying);
 extern int vmw_resource_reserve(struct vmw_resource *res, bool interruptible,
bool no_backup);
 extern bool vmw_resource_needs_backup(const struct vmw_resource *res);
@@ -724,6 +725,8 @@ void vmw_resource_mob_attach(struct vmw_resource *res);
 void vmw_resource_mob_detach(struct vmw_resource *res);
 void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
   pgoff_t end);
+int vmw_resources_clean(struct vmw_buffer_object *vbo, pgoff_t start,
+   pgoff_t end, pgoff_t *num_prefault);
 
 /**
  * vmw_resource_mob_attached - Whether a resource currently has a mob attached
@@ -1417,6 +1420,8 @@ int vmw_bo_dirty_add(struct vmw_buffer_object *vbo);
 void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res);
 void vmw_bo_dirty_clear_res(struct vmw_resource *res);
 void vmw_bo_dirty_release(struct vmw_buffer_object *vbo);
+void vmw_bo_dirty_unmap(struct vmw_buffer_object *vbo,
+   pgoff_t start, pgoff_t end);
 vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf);
 vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf);
 
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
index 060c1e492f25..f07aa857587c 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -155,7 +155,6 @@ static void vmw_bo_dirty_scan_mkwrite(struct 
vmw_buffer_object *vbo)
}
 }
 
-
 /**
  * vmw_bo_dirty_scan - Scan for dirty pages and add them to the dirty
  * tracking structure
@@ -173,6 +172,53 @@ void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo)
vmw_bo_dirty_scan_mkwrite(vbo);
 }
 
+/**
+ * vmw_bo_dirty_pre_unmap - write-protect and pick up dirty pages before
+ * an unmap_mapping_range operation.
+ * @vbo: The buffer object,
+ * @start: First page of the range within the buffer object.
+ * @end: Last page of the range within the buffer object + 1.
+ *
+ * If we're using the _PAGETABLE scan method, we may leak dirty pages
+ * when calling unmap_mapping_range(). This function makes sure we pick
+ * up all dirty pages.
+ */
+static void vmw_bo_dirty_pre_unmap(struct vmw_buffer_object *vbo,
+  pgoff_t start, pgoff_t end)
+{
+   struct vmw_bo_dirty *dirty = vbo->dirty;
+   unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+   struct address_space *mapping = vbo->base.bdev->dev_mapping;
+
+   if (dirty->method != VMW_BO_DIRTY_PAGETABLE || start >= end)
+   return;
+
+   wp_shared_mapping_range(mapping, start + offset, end - start);
+   clean_record_shared_mapping_range(mapping, start + offset,
+ end - start, offset,
+ &dirty->bitmap[0], &dirty->start,
+ &dirty->end);
+}
+
+/**
+ * vmw_bo_dirty_unmap - Clear all ptes pointing to a range within a bo
+ * @vbo: The buffer object,
+ * @start: First page of the range within the buffer object.
+ * @end: Last page of the range within the buffer object + 1.
+ *
+ * This is similar to ttm_bo_unmap_virtual_locked() except it takes a subrange.
+ */
+void vmw_bo_dirty_unmap(struct vmw_buffer_object *vbo,
+   pgoff_t start, pgoff_t end)
+{
+   unsigned long offset = drm_vma_node_start(&vbo->base.base.vma_node);
+   struct address_space *mapping = vbo->base.bdev->dev_mapping;
+
+   vmw_bo_dirty_pre_unmap(vbo, start, end);
+   

[PATCH v5 1/8] mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock()

2019-10-10 Thread VMware
From: Thomas Hellstrom 

The caller needs to make sure that the vma is not torn down during the
lock operation and can also use the i_mmap_rwsem for file-backed vmas.
Remove the BUG_ON. We could, as an alternative, add a test that either
vma->vm_mm->mmap_sem or vma->vm_file->f_mapping->i_mmap_rwsem are held.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
Acked-by: Kirill A. Shutemov 
---
 include/linux/huge_mm.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93d5cf0bc716..0b84e13e88e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -216,7 +216,6 @@ static inline int is_swap_pmd(pmd_t pmd)
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
 {
-   VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
@@ -225,7 +224,6 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
struct vm_area_struct *vma)
 {
-   VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
if (pud_trans_huge(*pud) || pud_devmap(*pud))
return __pud_trans_huge_lock(pud, vma);
else
-- 
2.21.0



[PATCH v5 3/8] mm: Add a walk_page_mapping() function to the pagewalk code

2019-10-10 Thread VMware
From: Thomas Hellstrom 

For users that want to traverse all page table entries pointing into a
region of a struct address_space mapping, introduce a walk_page_mapping()
function.

The walk_page_mapping() function will initially be used for dirty-
tracking in virtual graphics drivers.
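
For orientation, a caller would typically look like the sketch below
(the callback and function names here are assumptions, not part of the
patch; the locking requirement is taken from the documentation added in
this patch):

	static const struct mm_walk_ops example_ops = {
		.pte_entry = example_pte_entry,	/* assumed callback */
	};

	static void walk_mapping_example(struct address_space *mapping,
					 pgoff_t first, pgoff_t nr, void *priv)
	{
		/* i_mmap_rwsem, not mmap_sem, protects the vmas here */
		i_mmap_lock_read(mapping);
		WARN_ON(walk_page_mapping(mapping, first, nr, &example_ops, priv));
		i_mmap_unlock_read(mapping);
	}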

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
---
 include/linux/pagewalk.h |  9 
 mm/pagewalk.c| 94 +++-
 2 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index bddd9759bab9..6ec82e92c87f 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -24,6 +24,9 @@ struct mm_walk;
  * "do page table walk over the current vma", returning
  * a negative value means "abort current page table walk
  * right now" and returning 1 means "skip the current vma"
+ * @pre_vma:if set, called before starting walk on a non-null vma.
+ * @post_vma:   if set, called after a walk on a non-null vma, provided
+ *  that @pre_vma and the vma walk succeeded.
  */
 struct mm_walk_ops {
int (*pud_entry)(pud_t *pud, unsigned long addr,
@@ -39,6 +42,9 @@ struct mm_walk_ops {
 struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
+   int (*pre_vma)(unsigned long start, unsigned long end,
+  struct mm_walk *walk);
+   void (*post_vma)(struct mm_walk *walk);
 };
 
 /**
@@ -62,5 +68,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
void *private);
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
void *private);
+int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
+ pgoff_t nr, const struct mm_walk_ops *ops,
+ void *private);
 
 #endif /* _LINUX_PAGEWALK_H */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c5fa42cab14f..ea0b9e606ad1 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -254,13 +254,23 @@ static int __walk_page_range(unsigned long start, 
unsigned long end,
 {
int err = 0;
struct vm_area_struct *vma = walk->vma;
+   const struct mm_walk_ops *ops = walk->ops;
+
+   if (vma && ops->pre_vma) {
+   err = ops->pre_vma(start, end, walk);
+   if (err)
+   return err;
+   }
 
if (vma && is_vm_hugetlb_page(vma)) {
-   if (walk->ops->hugetlb_entry)
+   if (ops->hugetlb_entry)
err = walk_hugetlb_range(start, end, walk);
} else
err = walk_pgd_range(start, end, walk);
 
+   if (vma && ops->post_vma)
+   ops->post_vma(walk);
+
return err;
 }
 
@@ -291,6 +301,11 @@ static int __walk_page_range(unsigned long start, unsigned 
long end,
  * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
  * purpose.
  *
+ * If operations need to be staged before and committed after a vma is walked,
+ * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
+ * since it is intended to handle commit-type operations, can't return any
+ * errors.
+ *
  * struct mm_walk keeps current values of some common data like vma and pmd,
  * which are useful for the access from callbacks. If you want to pass some
  * caller-specific data to callbacks, @private should be helpful.
@@ -377,3 +392,80 @@ int walk_page_vma(struct vm_area_struct *vma, const struct 
mm_walk_ops *ops,
return err;
return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
 }
+
+/**
+ * walk_page_mapping - walk all memory areas mapped into a struct 
address_space.
+ * @mapping: Pointer to the struct address_space
+ * @first_index: First page offset in the address_space
+ * @nr: Number of incremental page offsets to cover
+ * @ops:   operation to call during the walk
+ * @private:   private data for callbacks' usage
+ *
+ * This function walks all memory areas mapped into a struct address_space.
+ * The walk is limited to only the given page-size index range, but if
+ * the index boundaries cross a huge page-table entry, that entry will be
+ * included.
+ *
+ * Also see walk_page_range() for additional information.
+ *
+ * Locking:
+ *   This function can't require that the struct mm_struct::mmap_sem is held,
+ *   since @mapping may be mapped by multiple processes. Instead
+ *   @mapping->i_mmap_rwsem must be held. This might have implications in the
+ *   callbacks, and it's up to the caller to ensure that the
+ *   struct mm_struct::mmap_sem is not needed.
+ *
+ *   Also this means that a caller 

[PATCH v5 2/8] mm: pagewalk: Take the pagetable lock in walk_pte_range()

2019-10-10 Thread VMware
From: Thomas Hellstrom 

Without the lock, anybody modifying a pte from within this function might
have it concurrently modified by someone else.

Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Suggested-by: Linus Torvalds 
Signed-off-by: Thomas Hellstrom 
---
 mm/pagewalk.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..c5fa42cab14f 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -10,8 +10,9 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
unsigned long end,
pte_t *pte;
int err = 0;
const struct mm_walk_ops *ops = walk->ops;
+   spinlock_t *ptl;
 
-   pte = pte_offset_map(pmd, addr);
+   pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (;;) {
err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
@@ -22,7 +23,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
unsigned long end,
pte++;
}
 
-   pte_unmap(pte);
+   pte_unmap_unlock(pte, ptl);
return err;
 }
 
-- 
2.21.0



[PATCH v5 4/8] mm: Add write-protect and clean utilities for address space ranges

2019-10-10 Thread VMware
From: Thomas Hellstrom 

Add two utilities to 1) write-protect and 2) clean all ptes pointing into
a range of an address space.
The utilities are intended to aid in tracking dirty pages (either
driver-allocated system memory or pci device memory).
The write-protect utility should be used in conjunction with
page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
accesses. Typically one would want to use this on sparse accesses into
large memory regions. The clean utility should be used to take advantage of
hardware dirtying functionality and avoid the overhead of page-faults,
typically on large accesses into small memory regions.
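
For illustration (the variables below are assumptions, not part of this
patch), a driver can combine the two utilities, much as the vmwgfx patches
later in the series do before tearing down a mapping: write-protect the
range, then harvest the remaining dirty state into a caller-provided bitmap:

	/* Arm write page-faults on the range */
	wp_shared_mapping_range(mapping, first_index, nr);

	/* Pick up dirty pages into the bitmap, cleaning the ptes */
	clean_record_shared_mapping_range(mapping, first_index, nr,
					  bitmap_pgoff, bitmap,
					  &start, &end);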

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
---
 include/linux/mm.h |  13 +-
 mm/Kconfig |   3 +
 mm/Makefile|   1 +
 mm/mapping_dirty_helpers.c | 312 +
 4 files changed, 328 insertions(+), 1 deletion(-)
 create mode 100644 mm/mapping_dirty_helpers.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..4bc93477375e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2637,7 +2637,6 @@ typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, 
void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
   unsigned long size, pte_fn_t fn, void *data);
 
-
 #ifdef CONFIG_PAGE_POISONING
 extern bool page_poisoning_enabled(void);
 extern void kernel_poison_pages(struct page *page, int numpages, int enable);
@@ -2878,5 +2877,17 @@ static inline int pages_identical(struct page *page1, 
struct page *page2)
return !memcmp_pages(page1, page2);
 }
 
+#ifdef CONFIG_MAPPING_DIRTY_HELPERS
+unsigned long clean_record_shared_mapping_range(struct address_space *mapping,
+   pgoff_t first_index, pgoff_t nr,
+   pgoff_t bitmap_pgoff,
+   unsigned long *bitmap,
+   pgoff_t *start,
+   pgoff_t *end);
+
+unsigned long wp_shared_mapping_range(struct address_space *mapping,
+ pgoff_t first_index, pgoff_t nr);
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..550f7aceb679 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
bool
 
+config MAPPING_DIRTY_HELPERS
+bool
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..1937cc251883 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,3 +107,4 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
new file mode 100644
index ..799b9154b48f
--- /dev/null
+++ b/mm/mapping_dirty_helpers.c
@@ -0,0 +1,312 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * struct wp_walk - Private struct for pagetable walk callbacks
+ * @range: Range for mmu notifiers
+ * @tlbflush_start: Address of first modified pte
+ * @tlbflush_end: Address of last modified pte + 1
+ * @total: Total number of modified ptes
+ */
+struct wp_walk {
+   struct mmu_notifier_range range;
+   unsigned long tlbflush_start;
+   unsigned long tlbflush_end;
+   unsigned long total;
+};
+
+/**
+ * wp_pte - Write-protect a pte
+ * @pte: Pointer to the pte
+ * @addr: The virtual page address
+ * @walk: pagetable walk callback argument
+ *
+ * The function write-protects a pte and records the range in
+ * virtual address space of touched ptes for efficient range TLB flushes.
+ */
+static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+   struct wp_walk *wpwalk = walk->private;
+   pte_t ptent = *pte;
+
+   if (pte_write(ptent)) {
+   pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+
+   ptent = pte_wrprotect(old_pte);
+   ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
+   wpwalk->total++;
+   wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
+   wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
+  addr + PAGE_SIZE);
+   }
+
+   return 0;
+}
+
+/**
+ * struct clean_walk - Private struct for the clean_record_pte function.
+ * @base: struct wp_walk we derive from
+ * @bitmap_pgoff: Address_space Page 

[PATCH v5 8/8] drm/vmwgfx: Add surface dirty-tracking callbacks

2019-10-10 Thread VMware
From: Thomas Hellstrom 

Add the callbacks necessary to implement emulated coherent memory for
surfaces. Add a flag to the gb_surface_create ioctl to indicate that
surface memory should be coherent.
Also bump the drm minor version to signal the availability of coherent
surfaces.

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
Reviewed-by: Deepak Rawat 
---
 .../device_include/svga3d_surfacedefs.h   | 233 ++-
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h   |   4 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c   | 395 +-
 include/uapi/drm/vmwgfx_drm.h |   4 +-
 4 files changed, 629 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h 
b/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
index f2bfd3d80598..61414f105c67 100644
--- a/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
+++ b/drivers/gpu/drm/vmwgfx/device_include/svga3d_surfacedefs.h
@@ -1280,7 +1280,6 @@ svga3dsurface_get_pixel_offset(SVGA3dSurfaceFormat format,
return offset;
 }
 
-
 static inline u32
 svga3dsurface_get_image_offset(SVGA3dSurfaceFormat format,
   surf_size_struct baseLevelSize,
@@ -1375,4 +1374,236 @@ 
svga3dsurface_is_screen_target_format(SVGA3dSurfaceFormat format)
return svga3dsurface_is_dx_screen_target_format(format);
 }
 
+/**
+ * struct svga3dsurface_mip - Mipmap level information
+ * @bytes: Bytes required in the backing store of this mipmap level.
+ * @img_stride: Byte stride per image.
+ * @row_stride: Byte stride per block row.
+ * @size: The size of the mipmap.
+ */
+struct svga3dsurface_mip {
+   size_t bytes;
+   size_t img_stride;
+   size_t row_stride;
+   struct drm_vmw_size size;
+
+};
+
+/**
+ * struct svga3dsurface_cache - Cached surface information
+ * @desc: Pointer to the surface descriptor
+ * @mip: Array of mipmap level information. Valid size is @num_mip_levels.
+ * @mip_chain_bytes: Bytes required in the backing store for the whole chain
+ * of mip levels.
+ * @sheet_bytes: Bytes required in the backing store for a sheet
+ * representing a single sample.
+ * @num_mip_levels: Valid size of the @mip array. Number of mipmap levels in
+ * a chain.
+ * @num_layers: Number of slices in an array texture or number of faces in
+ * a cubemap texture.
+ */
+struct svga3dsurface_cache {
+   const struct svga3d_surface_desc *desc;
+   struct svga3dsurface_mip mip[DRM_VMW_MAX_MIP_LEVELS];
+   size_t mip_chain_bytes;
+   size_t sheet_bytes;
+   u32 num_mip_levels;
+   u32 num_layers;
+};
+
+/**
+ * struct svga3dsurface_loc - Surface location
+ * @sub_resource: Surface subresource. Defined as layer * num_mip_levels +
+ * mip_level.
+ * @x: X coordinate.
+ * @y: Y coordinate.
+ * @z: Z coordinate.
+ */
+struct svga3dsurface_loc {
+   u32 sub_resource;
+   u32 x, y, z;
+};
+
+/**
+ * svga3dsurface_subres - Compute the subresource from layer and mipmap.
+ * @cache: Surface layout data.
+ * @mip_level: The mipmap level.
+ * @layer: The surface layer (face or array slice).
+ *
+ * Return: The subresource.
+ */
+static inline u32 svga3dsurface_subres(const struct svga3dsurface_cache *cache,
+  u32 mip_level, u32 layer)
+{
+   return cache->num_mip_levels * layer + mip_level;
+}
+
+/**
+ * svga3dsurface_setup_cache - Build a surface cache entry
+ * @size: The surface base level dimensions.
+ * @format: The surface format.
+ * @num_mip_levels: Number of mipmap levels.
+ * @num_layers: Number of layers.
+ * @cache: Pointer to a struct svga3dsurface_cache object to be filled in.
+ *
+ * Return: Zero on success, -EINVAL on invalid surface layout.
+ */
+static inline int svga3dsurface_setup_cache(const struct drm_vmw_size *size,
+   SVGA3dSurfaceFormat format,
+   u32 num_mip_levels,
+   u32 num_layers,
+   u32 num_samples,
+   struct svga3dsurface_cache *cache)
+{
+   const struct svga3d_surface_desc *desc;
+   u32 i;
+
+   memset(cache, 0, sizeof(*cache));
+   cache->desc = desc = svga3dsurface_get_desc(format);
+   cache->num_mip_levels = num_mip_levels;
+   cache->num_layers = num_layers;
+   for (i = 0; i < cache->num_mip_levels; i++) {
+   struct svga3dsurface_mip *mip = &cache->mip[i];
+
+   mip->size = svga3dsurface_get_mip_size(*size, i);
+   mip->bytes = svga3dsurface_get_image_buffer_size
+   (desc, &mip->size, 0);
+   mip->row_stride =
+   __KERNEL_DIV_ROUND_UP(mip->size.width,
+ 

[PATCH v5 0/8] Emulated coherent graphics memory take 2

2019-10-10 Thread VMware
From: Thomas Hellström 

Graphics APIs like OpenGL 4.4 and Vulkan require the graphics driver
to provide coherent graphics memory, meaning that the GPU sees any
content written to the coherent memory on the next GPU operation that
touches that memory, and the CPU sees any content written by the GPU
to that memory immediately after any fence object trailing the GPU
operation is signaled.

Paravirtual drivers that otherwise require explicit synchronization
need to do this by hooking up dirty tracking to pagefault handlers
and buffer object validation.

Provide the mm helpers needed for this, which also allow for huge pmd-
and pud entries (patches 1-4), and the associated vmwgfx code (patches 5-8).

The code has been tested and exercised by a tailored version of mesa
where we disable all explicit synchronization and assume graphics memory
is coherent. The performance loss varies of course; a typical number is
around 5%.

I would like to merge this code through the DRM tree, so an ack to include
the new mm helpers in that merge would be greatly appreciated.

Changes since RFC:
- Merge conflict changes moved to the correct patch. Fixes intra-patchset
  compile errors.
- Be more aggressive when turning ttm vm code into helpers. This makes sure
  we can use a const qualifier on the vmwgfx vm_ops.
- Reinstate a lost comment and fix an error path that was broken when turning
  the ttm vm code into helpers.
- Remove explicit type-casts of struct vm_area_struct::vm_private_data
- Clarify the locking inversion that makes us not being able to use the mm
  pagewalk code.

Changes since v1:
- Removed the vmwgfx maintainer entry for as_dirty_helpers.c, updated
  commit message accordingly
- Removed the TTM patches from the series as they are merged separately
  through DRM.
Changes since v2:
- Split out the pagewalk code from as_dirty_helpers.c and document locking.
- Add pre_vma and post_vma callbacks to the pagewalk code.
- Remove huge pmd and -pud asserts that would trip when we protect vmas with
  struct address_space::i_mmap_rwsem rather than with
  struct vm_area_struct::mmap_sem.
- Do some naming cleanup in as_dirty_helpers.c
Changes since v3:
- Extensive renaming of the dirty helpers including the filename.
- Update walk_page_mapping() doc.
- Update the pagewalk code to not unconditionally split pmds if a pte_entry()
  callback is present. Update the dirty helper pmd_entry accordingly.
- Use separate walk ops for the dirty helpers.
- Update the pagewalk code to take the pagetable lock in walk_pte_range.
Changes since v4:
- Fix pte pointer confusion in patch 2/8
- Skip the pagewalk code conditional split patch for now, and update the
  mapping_dirty_helper accordingly. That problem will be solved in a cleaner
  way in a follow-up patchset.
  
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 


[PATCH v5 6/8] drm/vmwgfx: Use an RBtree instead of linked list for MOB resources

2019-10-10 Thread VMware
From: Thomas Hellstrom 

With emulated coherent memory we need to be able to quickly look up
a resource from the MOB offset. Instead of traversing a linked list with
O(n) worst case, use an RBtree with O(log n) worst case complexity.
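
The lookup side is not visible in the quoted hunks; roughly, and with only
the field names taken from this patch (the offset variable is assumed), an
offset lookup becomes a standard rbtree descent:

	struct vmw_resource *res = NULL;
	struct rb_node *node = vbo->res_tree.rb_node;

	while (node) {
		struct vmw_resource *cur =
			container_of(node, struct vmw_resource, mob_node);

		if (offset < cur->backup_offset)
			node = node->rb_left;
		else if (offset > cur->backup_offset)
			node = node->rb_right;
		else {
			res = cur;	/* Exact match found */
			break;
		}
	}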

Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Huang Ying 
Cc: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: Thomas Hellstrom 
Reviewed-by: Deepak Rawat 
---
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c   |  5 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h  | 10 +++
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c | 33 +---
 3 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
index 869aeaec2f86..18e4b329e563 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
@@ -463,6 +463,7 @@ void vmw_bo_bo_free(struct ttm_buffer_object *bo)
struct vmw_buffer_object *vmw_bo = vmw_buffer_object(bo);
 
WARN_ON(vmw_bo->dirty);
+   WARN_ON(!RB_EMPTY_ROOT(&vmw_bo->res_tree));
vmw_bo_unmap(vmw_bo);
kfree(vmw_bo);
 }
@@ -479,6 +480,7 @@ static void vmw_user_bo_destroy(struct ttm_buffer_object 
*bo)
struct vmw_buffer_object *vbo = &vmw_user_bo->vbo;
 
WARN_ON(vbo->dirty);
+   WARN_ON(!RB_EMPTY_ROOT(&vbo->res_tree));
vmw_bo_unmap(vbo);
ttm_prime_object_kfree(vmw_user_bo, prime);
 }
@@ -514,8 +516,7 @@ int vmw_bo_init(struct vmw_private *dev_priv,
memset(vmw_bo, 0, sizeof(*vmw_bo));
BUILD_BUG_ON(TTM_MAX_BO_PRIORITY <= 3);
vmw_bo->base.priority = 3;
-
-   INIT_LIST_HEAD(&vmw_bo->res_list);
+   vmw_bo->res_tree = RB_ROOT;
 
ret = ttm_bo_init(bdev, &vmw_bo->base, size,
  ttm_bo_type_device, placement,
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h 
b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 7944dbbbdd72..53f8522ae032 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -100,7 +100,7 @@ struct vmw_fpriv {
 /**
  * struct vmw_buffer_object - TTM buffer object with vmwgfx additions
  * @base: The TTM buffer object
- * @res_list: List of resources using this buffer object as a backing MOB
+ * @res_tree: RB tree of resources using this buffer object as a backing MOB
  * @pin_count: pin depth
  * @dx_query_ctx: DX context if this buffer object is used as a DX query MOB
  * @map: Kmap object for semi-persistent mappings
@@ -109,7 +109,7 @@ struct vmw_fpriv {
  */
 struct vmw_buffer_object {
struct ttm_buffer_object base;
-   struct list_head res_list;
+   struct rb_root res_tree;
s32 pin_count;
/* Not ref-counted.  Protected by binding_mutex */
struct vmw_resource *dx_query_ctx;
@@ -157,8 +157,8 @@ struct vmw_res_func;
  * pin-count greater than zero. It is not on the resource LRU lists and its
  * backup buffer is pinned. Hence it can't be evicted.
  * @func: Method vtable for this resource. Immutable.
+ * @mob_node: Node for the MOB backup rbtree. Protected by @backup reserved.
  * @lru_head: List head for the LRU list. Protected by 
@dev_priv::resource_lock.
- * @mob_head: List head for the MOB backup list. Protected by @backup reserved.
  * @binding_head: List head for the context binding list. Protected by
  * the @dev_priv::binding_mutex
  * @res_free: The resource destructor.
@@ -179,8 +179,8 @@ struct vmw_resource {
unsigned long backup_offset;
unsigned long pin_count;
const struct vmw_res_func *func;
+   struct rb_node mob_node;
struct list_head lru_head;
-   struct list_head mob_head;
struct list_head binding_head;
struct vmw_resource_dirty *dirty;
void (*res_free) (struct vmw_resource *res);
@@ -733,7 +733,7 @@ void vmw_resource_dirty_update(struct vmw_resource *res, 
pgoff_t start,
  */
 static inline bool vmw_resource_mob_attached(const struct vmw_resource *res)
 {
-   return !list_empty(&res->mob_head);
+   return !RB_EMPTY_NODE(&res->mob_node);
 }
 
 /**
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index e4c97a4cf2ff..328ad46076ff 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -40,11 +40,24 @@
 void vmw_resource_mob_attach(struct vmw_resource *res)
 {
struct vmw_buffer_object *backup = res->backup;
+   struct rb_node **new = &backup->res_tree.rb_node, *parent = NULL;
 
dma_resv_assert_held(res->backup->base.base.resv);
res->used_prio = (res->res_dirty) ? res->func->dirty_prio :
res->func->prio;
-   list_add_tail(&res->mob_head, &backup->res_list);
+
+   while (*new) {
+   struct vmw_resource *this =
+   container_of(*new, struct vmw_resource, mob_node);
+
+   parent = *new;
+   new = (res->backup_offset < 

[PATCH v5 5/8] drm/vmwgfx: Implement an infrastructure for write-coherent resources

2019-10-10 Thread VMware
backup_dirty;
+   u32 res_dirty : 1;
+   u32 backup_dirty : 1;
+   u32 coherent : 1;
struct vmw_buffer_object *backup;
unsigned long backup_offset;
unsigned long pin_count;
@@ -177,6 +182,7 @@ struct vmw_resource {
struct list_head lru_head;
struct list_head mob_head;
struct list_head binding_head;
+   struct vmw_resource_dirty *dirty;
void (*res_free) (struct vmw_resource *res);
void (*hw_destroy) (struct vmw_resource *res);
 };
@@ -716,6 +722,8 @@ extern void vmw_resource_evict_all(struct vmw_private 
*dev_priv);
 extern void vmw_resource_unbind_list(struct vmw_buffer_object *vbo);
 void vmw_resource_mob_attach(struct vmw_resource *res);
 void vmw_resource_mob_detach(struct vmw_resource *res);
+void vmw_resource_dirty_update(struct vmw_resource *res, pgoff_t start,
+  pgoff_t end);
 
 /**
  * vmw_resource_mob_attached - Whether a resource currently has a mob attached
@@ -1403,6 +1411,15 @@ int vmw_host_log(const char *log);
 #define VMW_DEBUG_USER(fmt, ...)  \
DRM_DEBUG_DRIVER(fmt, ##__VA_ARGS__)
 
+/* Resource dirtying - vmwgfx_page_dirty.c */
+void vmw_bo_dirty_scan(struct vmw_buffer_object *vbo);
+int vmw_bo_dirty_add(struct vmw_buffer_object *vbo);
+void vmw_bo_dirty_transfer_to_res(struct vmw_resource *res);
+void vmw_bo_dirty_clear_res(struct vmw_resource *res);
+void vmw_bo_dirty_release(struct vmw_buffer_object *vbo);
+vm_fault_t vmw_bo_vm_fault(struct vm_fault *vmf);
+vm_fault_t vmw_bo_vm_mkwrite(struct vm_fault *vmf);
+
 /**
  * VMW_DEBUG_KMS - Debug output for kernel mode-setting
  *
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
index ff86d49dc5e8..934ad7c0c342 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
@@ -2560,7 +2560,6 @@ static int vmw_cmd_dx_check_subresource(struct 
vmw_private *dev_priv,
 offsetof(typeof(*cmd), sid));
 
cmd = container_of(header, typeof(*cmd), header);
-
return vmw_cmd_res_check(dev_priv, sw_context, vmw_res_surface,
 VMW_RES_DIRTY_NONE, user_surface_converter,
 &cmd->sid, NULL);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
new file mode 100644
index ..060c1e492f25
--- /dev/null
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/**
+ *
+ * Copyright 2019 VMware, Inc., Palo Alto, CA., USA
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sub license, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDERS, AUTHORS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM,
+ * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+ * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
+ * USE OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ **/
+#include "vmwgfx_drv.h"
+
+/*
+ * Different methods for tracking dirty:
+ * VMW_BO_DIRTY_PAGETABLE - Scan the pagetable for hardware dirty bits
+ * VMW_BO_DIRTY_MKWRITE - Write-protect page table entries and record write-
+ * accesses in the VM mkwrite() callback
+ */
+enum vmw_bo_dirty_method {
+   VMW_BO_DIRTY_PAGETABLE,
+   VMW_BO_DIRTY_MKWRITE,
+};
+
+/*
+ * No dirtied pages at scan trigger a transition to the _MKWRITE method,
+ * similarly a certain percentage of dirty pages trigger a transition to
+ * the _PAGETABLE method. How many triggers should we wait for before
+ * changing method?
+ */
+#define VMW_DIRTY_NUM_CHANGE_TRIGGERS 2
+
+/* Percentage to trigger a transition to the _PAGETABLE method */
+#define VMW_DIRTY_PERCENTAGE 10
+
+/**
+ * struct vmw_bo_dirty - Dirty information for buffer objects
+ * @start: First currently dirty bit
+ * @end: Last currently dirty bit + 1
+ * @method: The currently used dirty method
+ * @change_count: Number of consecutive method ch
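The comment above is cut off by the archive; for illustration, here is a toy
userspace model of the _MKWRITE flavour it describes: write-protect the pages,
record the touched range in a mkwrite-style hook, and re-protect on scan. All
names and types here (toy_bo, toy_mkwrite, ...) are made-up stand-ins, not the
driver code.

#include <stdio.h>
#include <stdbool.h>

#define TOY_NUM_PAGES 16

/* Toy buffer object: which pages are write-enabled and which are dirty. */
struct toy_bo {
        bool write_enabled[TOY_NUM_PAGES];
        bool dirty[TOY_NUM_PAGES];
        unsigned long start, end;       /* dirty range: [start, end) */
};

/* mkwrite-style hook: the first write to a write-protected page lands here. */
static void toy_mkwrite(struct toy_bo *bo, unsigned long page)
{
        if (bo->write_enabled[page])
                return;                 /* already writable: no fault, nothing to record */

        bo->write_enabled[page] = true; /* let the write proceed */
        bo->dirty[page] = true;         /* and remember it for the next scan */
        if (page < bo->start)
                bo->start = page;
        if (page + 1 > bo->end)
                bo->end = page + 1;
}

/* scan: hand the dirty range to the consumer and re-write-protect it. */
static void toy_dirty_scan(struct toy_bo *bo)
{
        printf("dirty range: [%lu, %lu)\n", bo->start, bo->end);
        for (unsigned long p = bo->start; p < bo->end; p++) {
                bo->dirty[p] = false;
                bo->write_enabled[p] = false;
        }
        bo->start = TOY_NUM_PAGES;      /* empty range again */
        bo->end = 0;
}

int main(void)
{
        struct toy_bo bo = { .start = TOY_NUM_PAGES, .end = 0 };

        toy_mkwrite(&bo, 3);            /* userspace writes pages 3 and 7 */
        toy_mkwrite(&bo, 7);
        toy_dirty_scan(&bo);            /* reports [3, 8) */
        return 0;
}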

Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

2019-10-10 Thread VMware

On 10/10/19 4:07 AM, Linus Torvalds wrote:

On Wed, Oct 9, 2019 at 6:10 PM Thomas Hellström (VMware)
 wrote:

Your original patch does exactly the same!

Oh, no. You misread my original patch.

Look again.

The logic in my original patch was very different. It said that

  - *if* we have a pmd_entry function, then we obviously call that one.

 And if - after calling the pmd_entry function - we are still a
hugepage, then we skip the pte_entry case entirely.

And part of skipping is obviously "don't split" - but it never had
that "don't split and then call the pte walker" case.

  - and if we *don't* have a pmd_entry function, but we do have a
pte_entry function, then we always split before calling it.

Notice the difference?
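A minimal standalone sketch of that control flow, using made-up stand-in types
and names (struct toy_pmd, toy_walk_one_pmd(), ...) rather than the real
mm/pagewalk.c code:

#include <stdio.h>
#include <stdbool.h>

/* Toy stand-in types for illustration only -- not the kernel's. */
struct toy_pmd {
        bool huge;                      /* transparent-hugepage mapping? */
        int nr_ptes;                    /* only meaningful when !huge */
};

struct toy_walk_ops {
        int (*pmd_entry)(struct toy_pmd *pmd, unsigned long addr);
        int (*pte_entry)(unsigned long addr);
};

static void toy_split_huge_pmd(struct toy_pmd *pmd)
{
        pmd->huge = false;              /* pretend the pmd now maps 512 ptes */
        pmd->nr_ptes = 512;
}

static int toy_walk_pte_range(const struct toy_walk_ops *ops,
                              struct toy_pmd *pmd, unsigned long addr)
{
        /* Only possible on a non-huge pmd: the ptes have to exist. */
        for (int i = 0; i < pmd->nr_ptes; i++) {
                int err = ops->pte_entry(addr + i * 4096UL);

                if (err)
                        return err;
        }
        return 0;
}

/* The decision points described in the mail above. */
static int toy_walk_one_pmd(const struct toy_walk_ops *ops,
                            struct toy_pmd *pmd, unsigned long addr)
{
        if (ops->pmd_entry) {
                int err = ops->pmd_entry(pmd, addr);

                if (err)
                        return err;
                if (pmd->huge)          /* still huge: skip the pte level */
                        return 0;
        } else if (ops->pte_entry) {
                toy_split_huge_pmd(pmd); /* the pte walker needs real ptes */
        }

        if (!ops->pte_entry)
                return 0;

        return toy_walk_pte_range(ops, pmd, addr);
}

static int show_pmd(struct toy_pmd *pmd, unsigned long addr)
{
        printf("pmd_entry at %#lx, huge=%d\n", addr, (int)pmd->huge);
        return 0;
}

static int show_pte(unsigned long addr)
{
        (void)addr;
        return 0;
}

int main(void)
{
        struct toy_pmd pmd = { .huge = true };
        const struct toy_walk_ops ops = { .pmd_entry = show_pmd,
                                          .pte_entry = show_pte };

        /* Huge pmd + pmd_entry: the pte level is skipped, and never split. */
        return toy_walk_one_pmd(&ops, &pmd, 0x200000UL);
}

The point of the huge check after the callback is that skipping is the only
valid way to "not split": nothing afterwards touches ptes that do not exist.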


From what I can tell, my patch is doing the same. At least that always 
was the intention. To determine whether to skip pte and skip split, your 
patch uses


/* No pte level at all? */
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)
|| pmd_devmap(*pmd))
continue;

whereas my patch does

            if (pmd_trans_unstable(pmd))
goto again;
/* Fall through */

which is the same (pmd_trans_unstable() is the same test as you do, but
not racy). Yes, it's missing the test for pmd_devmap(), but I think
that's an mm bug that has been discussed elsewhere, and we also rerun
because a huge / none pmd at this (FALLBACK) point is probably a race
and unintended.




But I think the "change pmd_entry to have a sane return code" is a
simpler and more flexible model, and then the pmd_entry code can just
let the pte walker split the pmd if needed.


OK, let's aim for that then.

Thanks,

Thomas




So I liked that part of your patch.

Linus





Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

2019-10-09 Thread VMware

On 10/10/19 1:51 AM, Linus Torvalds wrote:

On Wed, Oct 9, 2019 at 3:31 PM Thomas Hellström (VMware)
 wrote:

On 10/9/19 10:20 PM, Linus Torvalds wrote:

You *have* to call split_huge_pmd() if you're going to call the
pte_entry() function.

End of story.

So is it that you want pte_entry() to be strictly called for *each*
virtual address, even if we have a pmd_entry()?

Thomas, you're not reading the emails.

You are conflating two issues:

  (a) the conditional split_huge_pmd()

  (b) the what to do about the return value of pmd_entry().

and I'm now FOR THE LAST TIME telling you that (a) is completely
wrong, buggy crap. It will not happen. I missed that part of the patch
in my original read-through, because it was so senseless.

Get it through you head. The "conditional split_huge_pmd()" thing is
wrong, wrong, wrong.

And it is COMPLETELY WRONG regardless of any "each virtual address"
thing that you keep bringing up. The "each virtual address" argument
is irrelevant, pointless, and does not matter.

So stop bringing that thing up. Really. The conditional
split_huge_pmd() is wrong.

It's wrong, because the whole notion of "don't split pmd and then
call the pte walker" is completely idiotic and utterly senseless,
because without the splitting the pte's DO NOT EXIST.

What part of that is not getting through?


In that case I completely follow your arguments, meaning we skip this
patch completely?

Well, yes and no.

The part of the patch that is complete and utter garbage, and that you
really need to *understand* why it's complete and utter garbage is
this part:

 if (!ops->pte_entry)
 continue;
-
-   split_huge_pmd(walk->vma, pmd, addr);
+   if (!ops->pmd_entry)
+   split_huge_pmd(walk->vma, pmd, addr);
 if (pmd_trans_unstable(pmd))
 goto again;
 err = walk_pte_range(pmd, addr, next, walk);

Look, you cannot call "walk_pte_range()" without calling
split_huge_pmd(), because walk_pte_range() cannot deal with a huge
pmd.

walk_pte_range() does that pte_offset_map() on the pmd (well, with
your other patch in place it does the locking version), and then it
walks every pte entry of it. But that cannot possibly work if you
didn't split it.


Thank you for your patience!

Yes, I very well *do* understand that we need to split a huge pmd to 
walk the pte range, and I've never been against removing that 
conditional. What I've said is that it is pointless anyway, because 
we've already verified that the only path coming from the pmd_entry 
(with the patch applied) has the pmd *already split* and stable.


Your original patch does exactly the same!

So please let's move on from the splitting issue. We don't disagree 
here. The conditional is gone to never be resurrected.




Now, that gets us back to the (b) part above - what to do about the
return value of "pmd_entry()".

What *would* make sense, for example, is saying "ok, pmd_entry()
already handled this, so we'll just continue, and not bother with the
pte_range() at all".

Note how that is VERY VERY different from "let's not split".

Yes, for that case we also don't split, but we don't split because
we're not even walking it. See the difference?

Not splitting and then walking: wrong, insane, and not going to happen.

Nor splitting because we're not going to walk it: sane, and we already
have one such case.

But the "don't split and then don't skip the level" makes zero sense
what-so-ever. Ever. Under no circumstances can that be valid as per
above.
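For illustration, a tiny sketch of a walker acting on such a return
convention; the exact values (negative, zero, positive) are assumptions taken
from this discussion, not an established API:

#include <stdio.h>

/*
 * Assumed convention (illustration only):
 *   < 0  error: abort the walk
 *   = 0  not handled: split the pmd and walk the ptes
 *   > 0  handled: skip the pte level for this pmd entirely
 */
enum toy_walk_action { TOY_WALK_ERR, TOY_WALK_PTES, TOY_WALK_SKIP };

static enum toy_walk_action toy_act_on_pmd_entry(int ret)
{
        if (ret < 0)
                return TOY_WALK_ERR;
        if (ret > 0)
                return TOY_WALK_SKIP;   /* handled: don't split, don't walk */
        return TOY_WALK_PTES;           /* split_huge_pmd() + pte walk */
}

int main(void)
{
        printf("ret=-22 -> %d\n", toy_act_on_pmd_entry(-22));
        printf("ret=0   -> %d\n", toy_act_on_pmd_entry(0));
        printf("ret=1   -> %d\n", toy_act_on_pmd_entry(1));
        return 0;
}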


Agreed.



There's also the argument that right now, there are no users that
actually want to skip the level.

Even your use case doesn't really want that, because in your use-case,
the only case that would do it is the error case that isn't supposed
to happen.

And if it *is* supposed to happen, in many ways it would be more
sensible to just return a positive value saying "I took care of it, go
on to the next entry", wouldn't you agree?


Indeed.



Now, I actually tried to look through every single existing pmd_entry
case, because I wanted to see who is returning positive values right
now, and which of them might *want* to say "ok, I handled it" or "now
do the pte walking".

Quite a lot of them seem to really want to do that "ok, now do the pte
walking", because they currently do it inside the pmd function because
of how the original interface was badly done. So while we _currently_
do not have a "ok, I did everything for this pmd, skip it" vs a "ok,
continue with pte" case, it clearly does make sense. No question about
it.

I did find one single user of a positive return value:
queue_pages_pte_range() returns

  * 0 - pages are placed on the right node or queued successfully.
  * 1 - there is unmovable p

Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

2019-10-09 Thread VMware

On 10/10/19 12:30 AM, Thomas Hellström (VMware) wrote:

On 10/9/19 10:20 PM, Linus Torvalds wrote:

On Wed, Oct 9, 2019 at 1:06 PM Thomas Hellström (VMware)
 wrote:

On 10/9/19 9:20 PM, Linus Torvalds wrote:

Don't you get it? There *is* no PTE level if you didn't split.

Hmm, this paragraph makes me think we have very different
perceptions about what I'm trying to achieve.

It's not about what you're trying to achieve.

It's about the actual code.

You cannot do that


- split_huge_pmd(walk->vma, pmd, addr);
+   if (!ops->pmd_entry)
+   split_huge_pmd(walk->vma, pmd, addr);

it's insane.

You *have* to call split_huge_pmd() if you're going to call the
pte_entry() function.

I don't understand why you are arguing. This is not about "feelings"
and "intentions" or about "trying to achieve".

This is about cold hard "you can't do that", and this is now the third
time I tell you _why_ you can't do that: you can't walk the last level
if you don't _have_ a last level. You have to split the pmd to do so.

It's not so much arguing but rather trying to understand your concerns
and your perception of what the final code should look like.


End of story.


So is it that you want pte_entry() to be strictly called for *each* 
virtual address, even if we have a pmd_entry()?
In that case I completely follow your arguments, meaning we skip this 
patch completely?


Or if you're still OK with your original patch

https://lore.kernel.org/lkml/CAHk-=wj5nifpouyd6zugy4k7vovoaxqt-xhdrjd6j5hifbw...@mail.gmail.com/

I'd happily use that instead.

Thanks,

Thomas




Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

2019-10-09 Thread VMware

On 10/9/19 10:20 PM, Linus Torvalds wrote:

On Wed, Oct 9, 2019 at 1:06 PM Thomas Hellström (VMware)
 wrote:

On 10/9/19 9:20 PM, Linus Torvalds wrote:

Don't you get it? There *is* no PTE level if you didn't split.

Hmm, this paragraph makes me think we have very different perceptions about
what I'm trying to achieve.

It's not about what you're trying to achieve.

It's about the actual code.

You cannot do that


-   split_huge_pmd(walk->vma, pmd, addr);
+   if (!ops->pmd_entry)
+   split_huge_pmd(walk->vma, pmd, addr);

it's insane.

You *have* to call split_huge_pmd() if you're going to call the
pte_entry() function.

I don't understand why you are arguing. This is not about "feelings"
and "intentions" or about "trying to achieve".

This is about cold hard "you can't do that", and this is now the third
time I tell you _why_ you can't do that: you can't walk the last level
if you don't _have_ a last level. You have to split the pmd to do so.

It's not so much arguing but rather trying to understand your concerns
and your perception of what the final code should look like.


End of story.


So is it that you want pte_entry() to be strictly called for *each* 
virtual address, even if we have a pmd_entry()?
In that case I completely follow your arguments, meaning we skip this 
patch completely?


My take on the change was that pmd_entry() returning 0 would mean we
could skip the pte level completely, so nothing for which
split_huge_pmd() wasn't a NOP would ever pass down to the next level,
similar to how pud_entry() does things. FWIW, see


https://lore.kernel.org/lkml/20191004123732.xpr3vroee5mhg...@box.shutemov.name/

and we could in the long run transform the pte walk in many pmd_entry 
callbacks into pte_entry callbacks.






I wanted the pte level to *only* get called for *pre-existing* pte entries.

Again, I told you what the solution was.

But the fact is, it's not what your other code even wants or does.

Seriously. You have two cases you care about in your callbacks

  - an actual hugepmd. This is an ERROR for you and you do a huge
WARN_ON() for it to let people know.

No, it's typically a NOP, since the hugepmd should be read-only.
Write-enabled huge pages are split in fault().


  - regular pages. This is what your other code actually handles.

So for the hugepmd case, you have two choices:

  - handle it by splitting and deal with the regular pages: "return 0;"

Well, this is not what we want to do, and it is the reason we have the
patch in the first place.


  - actually error out: "return -EINVAL".


/Thomas




Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

2019-10-09 Thread VMware

On 10/9/19 9:20 PM, Linus Torvalds wrote:


No. Your logic is garbage. The above code is completely broken.

YOU CAN NOT AVOID THE SPLIT AND THEN GO ON AT THE PTE LEVEL.

Don't you get it? There *is* no PTE level if you didn't split.


Hmm, this paragraph makes me think we have very different perceptions about
what I'm trying to achieve.

I wanted the pte level to *only* get called for *pre-existing* pte entries.
Surely those must be able to exist even if we don't split occasional huge pmds 
in the pagewalk code?



So what you should do is to just always return 0 in your pmd_entry().
Boom, done. The only reason for the pmd_entry existing at all is to
get the warning. Then, if you don't want to split it, you make that
warning just return an error (or a positive value) instead and say
"ok, that was bad, we don't handle it at all".

And in some _future_ life, if anybody wants to actually say "yeah,
let's not split it", make it have some "yeah I handled it" case.


Well yes, this is exactly what I want. Because any huge pmd we encounter 
should be read-only.
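A hypothetical pmd_entry callback along those lines, as a toy standalone
sketch: a read-only huge pmd is treated as handled, a writable one trips a
warning and errors out, and everything else falls through to the pte walk.
TOY_WARN_ON, toy_dirty_pmd_entry and the positive-means-handled return are
all assumptions for illustration, not the actual driver callback:

#include <stdio.h>
#include <stdbool.h>
#include <errno.h>

/* Toy stand-ins; not the kernel's types or helpers. */
struct toy_pmd { bool huge; bool write; };

#define TOY_WARN_ON(cond) \
        ((cond) ? (fprintf(stderr, "WARN_ON(%s)\n", #cond), true) : false)

static int toy_dirty_pmd_entry(struct toy_pmd *pmd)
{
        if (pmd->huge) {
                if (TOY_WARN_ON(pmd->write))
                        return -EINVAL; /* unexpected writable huge pmd: abort */
                return 1;               /* handled: skip the pte level */
        }
        return 0;                       /* walk the ptes normally */
}

int main(void)
{
        struct toy_pmd ro_huge = { .huge = true, .write = false };
        struct toy_pmd rw_huge = { .huge = true, .write = true };
        struct toy_pmd normal  = { .huge = false };

        printf("ro huge : %d\n", toy_dirty_pmd_entry(&ro_huge));
        printf("rw huge : %d\n", toy_dirty_pmd_entry(&rw_huge));
        printf("normal  : %d\n", toy_dirty_pmd_entry(&normal));
        return 0;
}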


/Thomas




Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

2019-10-09 Thread VMware

On 10/9/19 6:21 PM, Linus Torvalds wrote:

On Wed, Oct 9, 2019 at 8:27 AM Kirill A. Shutemov  wrote:

Do we have any current user that expects split_huge_pmd() in this scenario?

No. There are no current users of the pmd callback and the pte
callback at all, that I could find.

But it looks like the new drm use does want an "I can't handle the
hugepage, please split it and I'll do the ptes instead".

Nope, it handles the hugepages by ignoring them, since they should be
read-only; but if pmd_entry() was called with something other than a
hugepage, then it requests the fallback, but never a split.


/Thomas



