[RFC v3 22/22] samples/landlock: Add sandbox example

2016-09-14 Thread Mickaël Salaün
Add a basic sandbox tool to create a process isolated from some parts of
the system. This isolation can depend on the current cgroup.

Example with the current process hierarchy (seccomp):

  $ ls /home
  user1
  $ LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
  ./samples/landlock/sandbox /bin/sh -i
  Launching a new sandboxed process.
  $ ls /home
  ls: cannot open directory '/home': Permission denied

Example with a cgroup:

  $ mkdir /sys/fs/cgroup/sandboxed
  $ ls /home
  user1
  $ LANDLOCK_CGROUPS='/sys/fs/cgroup/sandboxed' \
  LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
  ./samples/landlock/sandbox
  Ready to sandbox with cgroups.
  $ ls /home
  user1
  $ echo $$ > /sys/fs/cgroup/sandboxed/cgroup.procs
  $ ls /home
  ls: cannot open directory '/home': Permission denied

Changes since v2:
* use BPF_PROG_ATTACH for cgroup handling

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
---
 samples/Makefile|   2 +-
 samples/landlock/.gitignore |   1 +
 samples/landlock/Makefile   |  16 +++
 samples/landlock/sandbox.c  | 307 
 4 files changed, 325 insertions(+), 1 deletion(-)
 create mode 100644 samples/landlock/.gitignore
 create mode 100644 samples/landlock/Makefile
 create mode 100644 samples/landlock/sandbox.c

diff --git a/samples/Makefile b/samples/Makefile
index 1a20169d85ac..a2dcd57ca7ac 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -2,4 +2,4 @@
 
 obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ trace_events/ livepatch/ \
   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
-  configfs/ connector/ v4l/ trace_printk/
+  configfs/ connector/ v4l/ trace_printk/ landlock/
diff --git a/samples/landlock/.gitignore b/samples/landlock/.gitignore
new file mode 100644
index ..f6c6da930a30
--- /dev/null
+++ b/samples/landlock/.gitignore
@@ -0,0 +1 @@
+/sandbox
diff --git a/samples/landlock/Makefile b/samples/landlock/Makefile
new file mode 100644
index ..d1044b2afd27
--- /dev/null
+++ b/samples/landlock/Makefile
@@ -0,0 +1,16 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+hostprogs-$(CONFIG_SECURITY_LANDLOCK) := sandbox
+sandbox-objs := sandbox.o
+
+always := $(hostprogs-y)
+
+HOSTCFLAGS += -I$(objtree)/usr/include
+
+# Trick to allow make to be run from this directory
+all:
+   $(MAKE) -C ../../ $$PWD/
+
+clean:
+   $(MAKE) -C ../../ M=$$PWD clean
diff --git a/samples/landlock/sandbox.c b/samples/landlock/sandbox.c
new file mode 100644
index ..9d6ac00cdd23
--- /dev/null
+++ b/samples/landlock/sandbox.c
@@ -0,0 +1,307 @@
+/*
+ * Landlock LSM - Sandbox example
+ *
+ * Copyright (C) 2016  Mickaël Salaün 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 3, as
+ * published by the Free Software Foundation.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include  /* open() */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../../tools/include/linux/filter.h"
+
+#include "../bpf/libbpf.c"
+
+#ifndef seccomp
+static int seccomp(unsigned int op, unsigned int flags, void *args)
+{
+   errno = 0;
+   return syscall(__NR_seccomp, op, flags, args);
+}
+#endif
+
+static int landlock_prog_load(const struct bpf_insn *insns, int prog_len,
+   enum landlock_hook_id hook_id, __u64 access)
+{
+   union bpf_attr attr = {
+   .prog_type = BPF_PROG_TYPE_LANDLOCK,
+   .insns = ptr_to_u64((void *) insns),
+   .insn_cnt = prog_len / sizeof(struct bpf_insn),
+   .license = ptr_to_u64((void *) "GPL"),
+   .log_buf = ptr_to_u64(bpf_log_buf),
+   .log_size = LOG_BUF_SIZE,
+   .log_level = 1,
+   .prog_subtype.landlock_hook = {
+   .id = hook_id,
+   .origin = LANDLOCK_FLAG_ORIGIN_SECCOMP |
+   LANDLOCK_FLAG_ORIGIN_SYSCALL |
+   LANDLOCK_FLAG_ORIGIN_INTERRUPT,
+   .access = access,
+   },
+   };
+
+   /* assign one field outside of struct init to make sure any
+* padding is zero initialized
+*/
+   attr.kern_version = 0;
+
+   bpf_log_buf[0] = 0;
+
+   return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
+#define ARRAY_SIZE(a)  (sizeof(a) / sizeof(a[0]))
+
+static int apply_sandbox(const char **allowed_paths, int path_nb, const char
+   **cgroup_paths, int cgroup_nb)
+{
+   __u32 key;
+   int i, ret = 0, map_fs = -1, offset;
+
+   /* set up the test sandbox */
+   if (prctl(PR_SE

[RFC v3 20/22] landlock: Add update and debug access flags

2016-09-14 Thread Mickaël Salaün
For now, the update and debug accesses are only available to a process
with CAP_SYS_ADMIN. This could change in the future.

The capability check is done statically, when loading an eBPF program,
according to the current process. If the process has enough rights and
sets the appropriate access flags, then the dedicated functions or data
will be accessible.

With the update access, the following functions are available:
* bpf_map_lookup_elem
* bpf_map_update_elem
* bpf_map_delete_elem
* bpf_tail_call

With the debug access, the following functions are available:
* bpf_trace_printk
* bpf_get_prandom_u32
* bpf_get_current_pid_tgid
* bpf_get_current_uid_gid
* bpf_get_current_comm

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: Kees Cook 
Cc: Sargun Dhillon 
---
 include/uapi/linux/bpf.h |  4 +++-
 security/landlock/lsm.c  | 54 
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3cc52e51357f..8cfc2de2ab76 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -584,7 +584,9 @@ enum landlock_hook_id {
 #define _LANDLOCK_FLAG_ORIGIN_MASK ((1 << 3) - 1)
 
 /* context of function access flags */
-#define _LANDLOCK_FLAG_ACCESS_MASK ((1ULL << 0) - 1)
+#define LANDLOCK_FLAG_ACCESS_UPDATE(1 << 0)
+#define LANDLOCK_FLAG_ACCESS_DEBUG (1 << 1)
+#define _LANDLOCK_FLAG_ACCESS_MASK ((1ULL << 2) - 1)
 
 /* Handle check flags */
 #define LANDLOCK_FLAG_FS_DENTRY(1 << 0)
diff --git a/security/landlock/lsm.c b/security/landlock/lsm.c
index 2a15839a08c8..56c45abe979c 100644
--- a/security/landlock/lsm.c
+++ b/security/landlock/lsm.c
@@ -202,11 +202,57 @@ static int landlock_run_prog(enum landlock_hook_id hook_id, __u64 args[6])
 static const struct bpf_func_proto *bpf_landlock_func_proto(
enum bpf_func_id func_id, union bpf_prog_subtype *prog_subtype)
 {
+   bool access_update = !!(prog_subtype->landlock_hook.access &
+   LANDLOCK_FLAG_ACCESS_UPDATE);
+   bool access_debug = !!(prog_subtype->landlock_hook.access &
+   LANDLOCK_FLAG_ACCESS_DEBUG);
+
switch (func_id) {
case BPF_FUNC_landlock_cmp_fs_prop_with_struct_file:
return &bpf_landlock_cmp_fs_prop_with_struct_file_proto;
case BPF_FUNC_landlock_cmp_fs_beneath_with_struct_file:
return &bpf_landlock_cmp_fs_beneath_with_struct_file_proto;
+
+   /* access_update */
+   case BPF_FUNC_map_lookup_elem:
+   if (access_update)
+   return &bpf_map_lookup_elem_proto;
+   return NULL;
+   case BPF_FUNC_map_update_elem:
+   if (access_update)
+   return &bpf_map_update_elem_proto;
+   return NULL;
+   case BPF_FUNC_map_delete_elem:
+   if (access_update)
+   return &bpf_map_delete_elem_proto;
+   return NULL;
+   case BPF_FUNC_tail_call:
+   if (access_update)
+   return &bpf_tail_call_proto;
+   return NULL;
+
+   /* access_debug */
+   case BPF_FUNC_trace_printk:
+   if (access_debug)
+   return bpf_get_trace_printk_proto();
+   return NULL;
+   case BPF_FUNC_get_prandom_u32:
+   if (access_debug)
+   return &bpf_get_prandom_u32_proto;
+   return NULL;
+   case BPF_FUNC_get_current_pid_tgid:
+   if (access_debug)
+   return &bpf_get_current_pid_tgid_proto;
+   return NULL;
+   case BPF_FUNC_get_current_uid_gid:
+   if (access_debug)
+   return &bpf_get_current_uid_gid_proto;
+   return NULL;
+   case BPF_FUNC_get_current_comm:
+   if (access_debug)
+   return &bpf_get_current_comm_proto;
+   return NULL;
+
default:
return NULL;
}
@@ -348,6 +394,14 @@ static inline bool bpf_landlock_is_valid_subtype(
if (prog_subtype->landlock_hook.access & ~_LANDLOCK_FLAG_ACCESS_MASK)
return false;
 
+   /* check access flags */
+   if (prog_subtype->landlock_hook.access & LANDLOCK_FLAG_ACCESS_UPDATE &&
+   !capable(CAP_SYS_ADMIN))
+   return false;
+   if (prog_subtype->landlock_hook.access & LANDLOCK_FLAG_ACCESS_DEBUG &&
+   !capable(CAP_SYS_ADMIN))
+   return false;
+
return true;
 }
 
-- 
2.9.3



[RFC v3 16/22] bpf/cgroup,landlock: Handle Landlock hooks per cgroup

2016-09-14 Thread Mickaël Salaün
This allows new eBPF programs to be added to the Landlock hooks dedicated
to a cgroup thanks to the BPF_PROG_ATTACH command. As for socket eBPF
programs, the Landlock hooks attached to a cgroup are propagated to the
nested cgroups. However, when a new Landlock program is attached to one
of these nested cgroups, this cgroup hierarchy forks the Landlock hooks.
This design is simple and matches the current CONFIG_BPF_CGROUP
inheritance. The difference lies in the fact that Landlock programs can
only be stacked, not removed. This matches the append-only seccomp
behavior. Userland is free to handle Landlock hooks attached to a cgroup
in more complicated ways (e.g. continuous inheritance), but care should
be taken to properly handle error cases (e.g. memory allocation errors).
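
The attach step can be sketched from userland as follows. This is a
hedged sketch, not the sample code from this series: it assumes a kernel
carrying this patch, and the BPF_CGROUP_LANDLOCK value defined below is a
placeholder for builds whose headers do not have it. On a kernel without
the patch the call simply fails.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* BPF_CGROUP_LANDLOCK is added by this patch; this value is a
 * placeholder for headers built without it. */
#ifndef BPF_CGROUP_LANDLOCK
#define BPF_CGROUP_LANDLOCK 2
#endif

/* Attach an already loaded Landlock program to a cgroup: every process
 * in this cgroup (and its nested cgroups) becomes subject to it. */
static int landlock_cgroup_attach(int cgroup_fd, int prog_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cgroup_fd;
	attr.attach_bpf_fd = prog_fd;
	attr.attach_type = BPF_CGROUP_LANDLOCK;
	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

The cgroup_fd would come from opening a directory such as
/sys/fs/cgroup/sandboxed. Because Landlock programs can only be stacked,
a second attach on the same cgroup adds a rule instead of replacing the
first one.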

Changes since v2:
* new design based on BPF_PROG_ATTACH (suggested by Alexei Starovoitov)

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: Daniel Mack 
Cc: David S. Miller 
Cc: Kees Cook 
Cc: Tejun Heo 
Link: https://lkml.kernel.org/r/20160826021432.ga8...@ast-mbp.thefacebook.com
Link: https://lkml.kernel.org/r/20160827204307.ga43...@ast-mbp.thefacebook.com
---
 include/linux/bpf-cgroup.h  |  7 +++
 include/linux/cgroup-defs.h |  2 ++
 include/linux/landlock.h|  9 +
 include/uapi/linux/bpf.h|  1 +
 kernel/bpf/cgroup.c | 33 ++---
 kernel/bpf/syscall.c| 11 +++
 security/landlock/lsm.c | 40 +++-
 security/landlock/manager.c | 32 
 8 files changed, 131 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 6cca7924ee17..439c681159e2 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -14,8 +14,15 @@ struct sk_buff;
 extern struct static_key_false cgroup_bpf_enabled_key;
 #define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
 
+#ifdef CONFIG_SECURITY_LANDLOCK
+struct landlock_hooks;
+#endif /* CONFIG_SECURITY_LANDLOCK */
+
 union bpf_object {
struct bpf_prog *prog;
+#ifdef CONFIG_SECURITY_LANDLOCK
+   struct landlock_hooks *hooks;
+#endif /* CONFIG_SECURITY_LANDLOCK */
 };
 
 struct cgroup_bpf {
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 861b4677fc5b..fe1023bf7b9d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -301,8 +301,10 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+#ifdef CONFIG_CGROUP_BPF
/* used to store eBPF programs */
struct cgroup_bpf bpf;
+#endif /* CONFIG_CGROUP_BPF */
 
/* ids of the ancestors at each level including self */
int ancestor_ids[];
diff --git a/include/linux/landlock.h b/include/linux/landlock.h
index 932ae57fa70e..179a848110f3 100644
--- a/include/linux/landlock.h
+++ b/include/linux/landlock.h
@@ -19,6 +19,9 @@
 #include  /* struct seccomp_filter */
 #endif /* CONFIG_SECCOMP_FILTER */
 
+#ifdef CONFIG_CGROUP_BPF
+#include  /* struct cgroup */
+#endif /* CONFIG_CGROUP_BPF */
 
 #ifdef CONFIG_SECCOMP_FILTER
 struct landlock_seccomp_ret {
@@ -65,6 +68,7 @@ struct landlock_hooks {
 
 
 struct landlock_hooks *new_landlock_hooks(void);
+void get_landlock_hooks(struct landlock_hooks *hooks);
 void put_landlock_hooks(struct landlock_hooks *hooks);
 
 #ifdef CONFIG_SECCOMP_FILTER
@@ -73,5 +77,10 @@ int landlock_seccomp_set_hook(unsigned int flags,
const char __user *user_bpf_fd);
 #endif /* CONFIG_SECCOMP_FILTER */
 
+#ifdef CONFIG_CGROUP_BPF
+struct landlock_hooks *landlock_cgroup_set_hook(struct cgroup *cgrp,
+   struct bpf_prog *prog);
+#endif /* CONFIG_CGROUP_BPF */
+
 #endif /* CONFIG_SECURITY_LANDLOCK */
 #endif /* _LINUX_LANDLOCK_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 905dcace7255..12e61508f879 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -124,6 +124,7 @@ enum bpf_prog_type {
 enum bpf_attach_type {
BPF_CGROUP_INET_INGRESS,
BPF_CGROUP_INET_EGRESS,
+   BPF_CGROUP_LANDLOCK,
__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 7b75fa692617..1c18fe46958a 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
 EXPORT_SYMBOL(cgroup_bpf_enabled_key);
@@ -31,7 +32,15 @@ void cgroup_bpf_put(struct cgroup *cgrp)
union bpf_object pinned = cgrp->bpf.pinned[type];
 
if (pinned.prog) {
-   bpf_prog_put(pinned.prog);
+   switch (type) {
+   case BPF_CGROUP_LANDLOCK:
+#ifdef CONFIG_SECURITY_LANDLOCK
+   put_landlock_hooks(pinned.hooks);
+   break;
+#endif /* CONFIG_SECU

[RFC v3 07/22] landlock: Handle file comparisons

2016-09-14 Thread Mickaël Salaün
Add eBPF functions to compare file system accesses with a Landlock file
system handle:
* bpf_landlock_cmp_fs_prop_with_struct_file(prop, map, map_op, file)
  This function makes it possible to compare the dentry, inode, device or
  mount point of the currently accessed file with a reference handle.
* bpf_landlock_cmp_fs_beneath_with_struct_file(opt, map, map_op, file)
  This function allows an eBPF program to check if the currently accessed
  file is the same as, or beneath, a reference handle.

The goal of a file system handle is to abstract kernel objects such as a
struct file or a struct inode. Userland can create this kind of handle
thanks to the BPF_MAP_UPDATE_ELEM command. The element is a struct
landlock_handle containing the handle type (e.g.
BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) and a file descriptor. This could
also be any description able to match a struct file or a struct inode
(e.g. a path or a glob string).

Changes since v2:
* add MNT_INTERNAL check to only add file handle from user-visible FS
  (e.g. no anonymous inode)
* replace struct file* with struct path* in map_landlock_handle
* add BPF protos
* fix bpf_landlock_cmp_fs_prop_with_struct_file()

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
Link: https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com
---
 include/linux/bpf.h|  10 +++
 include/uapi/linux/bpf.h   |  49 +++
 kernel/bpf/arraymap.c  |  21 +
 kernel/bpf/verifier.c  |   8 ++
 security/landlock/Makefile |   2 +-
 security/landlock/checker_fs.c | 179 +
 security/landlock/checker_fs.h |  20 +
 security/landlock/lsm.c|   6 ++
 8 files changed, 294 insertions(+), 1 deletion(-)
 create mode 100644 security/landlock/checker_fs.c
 create mode 100644 security/landlock/checker_fs.h

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 36c3e482239c..f7325c17f720 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -87,6 +87,7 @@ enum bpf_arg_type {
ARG_ANYTHING,   /* any (initialized) argument is ok */
 
ARG_PTR_TO_STRUCT_FILE, /* pointer to struct file */
+   ARG_CONST_PTR_TO_LANDLOCK_HANDLE_FS,/* pointer to Landlock FS handle */
 };
 
 /* type of values returned from helper functions */
@@ -148,6 +149,7 @@ enum bpf_reg_type {
 
/* Landlock */
PTR_TO_STRUCT_FILE,
+   CONST_PTR_TO_LANDLOCK_HANDLE_FS,
 };
 
 struct bpf_prog;
@@ -214,6 +216,9 @@ struct bpf_array {
 #ifdef CONFIG_SECURITY_LANDLOCK
 struct map_landlock_handle {
u32 type; /* enum bpf_map_handle_type */
+   union {
+   struct path path;
+   };
 };
 #endif /* CONFIG_SECURITY_LANDLOCK */
 
@@ -348,6 +353,11 @@ extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 
+#ifdef CONFIG_SECURITY_LANDLOCK
+extern const struct bpf_func_proto bpf_landlock_cmp_fs_prop_with_struct_file_proto;
+extern const struct bpf_func_proto bpf_landlock_cmp_fs_beneath_with_struct_file_proto;
+#endif /* CONFIG_SECURITY_LANDLOCK */
+
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ad87003fe892..905dcace7255 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -92,10 +92,20 @@ enum bpf_map_type {
 
 enum bpf_map_array_type {
BPF_MAP_ARRAY_TYPE_UNSPEC,
+   BPF_MAP_ARRAY_TYPE_LANDLOCK_FS,
 };
 
 enum bpf_map_handle_type {
BPF_MAP_HANDLE_TYPE_UNSPEC,
+   BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD,
+   /* BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_GLOB, */
+};
+
+enum bpf_map_array_op {
+   BPF_MAP_ARRAY_OP_UNSPEC,
+   BPF_MAP_ARRAY_OP_OR,
+   BPF_MAP_ARRAY_OP_AND,
+   BPF_MAP_ARRAY_OP_XOR,
 };
 
 enum bpf_prog_type {
@@ -434,6 +444,34 @@ enum bpf_func_id {
 */
BPF_FUNC_skb_change_tail,
 
+   /**
+* bpf_landlock_cmp_fs_prop_with_struct_file(prop, map, map_op, file)
+* Compare file system handles with a struct file
+*
+* @prop: properties to check against (e.g. LANDLOCK_FLAG_FS_DENTRY)
+* @map: handles to compare against
+* @map_op: which elements of the map to use (e.g. BPF_MAP_ARRAY_OP_OR)
+* @file: struct file address to compare with (taken from the context)
+*
+* Return: 0 if the file match the handles, 1 otherwise, or a negative
+* value if an error occurred.
+*/
+   BPF_FUNC_landlock_cmp_fs_prop_with_struct_file,
+
+   /**
+* bpf_landlock_cmp_fs_beneath_with_struct_file(opt, map, map_op, file)
+* Check if a struct file is a leaf of file system ha

[RFC v3 08/22] seccomp: Fix documentation for struct seccomp_filter

2016-09-14 Thread Mickaël Salaün
Signed-off-by: Mickaël Salaün 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Will Drewry 
---
 kernel/seccomp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 0db7c8a2afe2..dccfc05cb3ec 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,8 +41,7 @@
  * outside of a lifetime-guarded section.  In general, this
  * is only needed for handling filters shared across tasks.
  * @prev: points to a previously installed, or inherited, filter
- * @len: the number of instructions in the program
- * @insnsi: the BPF program instructions to evaluate
+ * @prog: the BPF program to evaluate
  *
  * seccomp_filter objects are organized in a tree linked via the @prev
  * pointer.  For any task, it appears to be a singly-linked list starting
-- 
2.9.3



[RFC v3 10/22] seccomp: Split put_seccomp_filter() with put_seccomp()

2016-09-14 Thread Mickaël Salaün
The semantics are unchanged. This will be useful for the Landlock
integration with seccomp (next commit).

Signed-off-by: Mickaël Salaün 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Will Drewry 
---
 include/linux/seccomp.h |  5 +++--
 kernel/fork.c   |  2 +-
 kernel/seccomp.c| 18 +-
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index a0459a7315ce..ffdab7cdd162 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -102,13 +102,14 @@ static inline int seccomp_mode(struct seccomp *s)
 #endif /* CONFIG_SECCOMP */
 
 #ifdef CONFIG_SECCOMP_FILTER
-extern void put_seccomp_filter(struct task_struct *tsk);
+extern void put_seccomp(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
 #else  /* CONFIG_SECCOMP_FILTER */
-static inline void put_seccomp_filter(struct task_struct *tsk)
+static inline void put_seccomp(struct task_struct *tsk)
 {
return;
 }
+
 static inline void get_seccomp_filter(struct task_struct *tsk)
 {
return;
diff --git a/kernel/fork.c b/kernel/fork.c
index 3584f521e3a6..99df46f157cf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -276,7 +276,7 @@ void free_task(struct task_struct *tsk)
free_thread_stack(tsk);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
-   put_seccomp_filter(tsk);
+   put_seccomp(tsk);
arch_release_task_struct(tsk);
free_task_struct(tsk);
 }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1867bbfa7c6c..92b15083b1b2 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -36,6 +36,8 @@
 /* Limit any path through the tree to 256KB worth of instructions. */
 #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
 
+static void put_seccomp_filter(struct seccomp_filter *filter);
+
 /*
  * Endianness is explicitly ignored and left for BPF program authors to manage
  * as per the specific architecture.
@@ -286,7 +288,7 @@ static inline void seccomp_sync_threads(void)
 * current's path will hold a reference.  (This also
 * allows a put before the assignment.)
 */
-   put_seccomp_filter(thread);
+   put_seccomp_filter(thread->seccomp.filter);
smp_store_release(&thread->seccomp.filter,
  caller->seccomp.filter);
 
@@ -448,10 +450,11 @@ static inline void seccomp_filter_free(struct seccomp_filter *filter)
}
 }
 
-/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
-void put_seccomp_filter(struct task_struct *tsk)
+/* put_seccomp_filter - decrements the ref count of a filter */
+static void put_seccomp_filter(struct seccomp_filter *filter)
 {
-   struct seccomp_filter *orig = tsk->seccomp.filter;
+   struct seccomp_filter *orig = filter;
+
/* Clean up single-reference branches iteratively. */
while (orig && atomic_dec_and_test(&orig->usage)) {
struct seccomp_filter *freeme = orig;
@@ -460,6 +463,11 @@ void put_seccomp_filter(struct task_struct *tsk)
}
 }
 
+void put_seccomp(struct task_struct *tsk)
+{
+   put_seccomp_filter(tsk->seccomp.filter);
+}
+
 /**
  * seccomp_send_sigsys - signals the task to allow in-process syscall emulation
  * @syscall: syscall number to send to userland
@@ -871,7 +879,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
if (copy_to_user(data, fprog->filter, bpf_classic_proglen(fprog)))
ret = -EFAULT;
 
-   put_seccomp_filter(task);
+   put_seccomp_filter(task->seccomp.filter);
return ret;
 
 out:
-- 
2.9.3



[RFC v3 11/22] seccomp,landlock: Handle Landlock hooks per process hierarchy

2016-09-14 Thread Mickaël Salaün
A Landlock program will be triggered according to its subtype/origin
bitfield. The LANDLOCK_FLAG_ORIGIN_SECCOMP value will trigger the
Landlock program when a seccomp filter returns RET_LANDLOCK. Moreover, it
is possible to return a 16-bit cookie which will be readable by the
Landlock programs in their context.

Only seccomp filters loaded from the same thread, and before a Landlock
program, can trigger it through LANDLOCK_FLAG_ORIGIN_SECCOMP. Multiple
Landlock programs can be triggered by one or more seccomp filters. This
way, each RET_LANDLOCK (with a specific cookie) will trigger all the
allowed Landlock programs once.
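
A seccomp filter handing over to Landlock could look like the sketch
below. SECCOMP_RET_LANDLOCK is introduced by this series, so the value
defined here is a placeholder, not the official constant; only the shape
of the filter (a classic BPF return action carrying a 16-bit cookie in
its low bits, like SECCOMP_RET_DATA) is taken from the text above.

```c
#include <linux/filter.h>
#include <stdint.h>

/* Placeholder for the action value added by this series. The low 16
 * bits carry the cookie made readable from the Landlock context. */
#define SECCOMP_RET_LANDLOCK	0x00080000U
#define LANDLOCK_COOKIE_MASK	0x0000ffffU

static uint32_t landlock_ret(uint16_t cookie)
{
	return SECCOMP_RET_LANDLOCK | (cookie & LANDLOCK_COOKIE_MASK);
}

/* One-instruction filter: for every syscall, trigger the Landlock
 * programs attached to this thread with cookie 42. */
struct sock_filter landlock_trigger[] = {
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LANDLOCK | 42),
};
```

A real filter would first inspect struct seccomp_data (syscall number,
arguments) and only return this action for the syscalls of interest; the
filter itself is installed with seccomp(SECCOMP_SET_MODE_FILTER, ...) as
usual.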

Changes since v2:
* Landlock programs can now be run without seccomp filter but for any
  syscall (from the process) or interruption
* move Landlock related functions and structs into security/landlock/*
  (to manage cgroups as well)
* fix seccomp filter handling: run Landlock programs for each of their
  legitimate seccomp filter
* properly clean up all seccomp results
* cosmetic changes to ease the understanding
* fix some ifdef

Signed-off-by: Mickaël Salaün 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Will Drewry 
Cc: Andrew Morton 
---
 include/linux/landlock.h |  77 ++
 include/linux/seccomp.h  |  26 +
 include/uapi/linux/seccomp.h |   2 +
 kernel/fork.c|  23 +++-
 kernel/seccomp.c |  68 +++-
 security/landlock/Makefile   |   2 +-
 security/landlock/common.h   |  27 +
 security/landlock/lsm.c  |  96 -
 security/landlock/manager.c  | 242 +++
 9 files changed, 552 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/landlock.h
 create mode 100644 security/landlock/common.h
 create mode 100644 security/landlock/manager.c

diff --git a/include/linux/landlock.h b/include/linux/landlock.h
new file mode 100644
index ..932ae57fa70e
--- /dev/null
+++ b/include/linux/landlock.h
@@ -0,0 +1,77 @@
+/*
+ * Landlock LSM - Public headers
+ *
+ * Copyright (C) 2016  Mickaël Salaün 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_LANDLOCK_H
+#define _LINUX_LANDLOCK_H
+#ifdef CONFIG_SECURITY_LANDLOCK
+
+#include  /* _LANDLOCK_HOOK_LAST */
+#include  /* atomic_t */
+
+#ifdef CONFIG_SECCOMP_FILTER
+#include  /* struct seccomp_filter */
+#endif /* CONFIG_SECCOMP_FILTER */
+
+
+#ifdef CONFIG_SECCOMP_FILTER
+struct landlock_seccomp_ret {
+   struct landlock_seccomp_ret *prev;
+   struct seccomp_filter *filter;
+   u16 cookie;
+   bool triggered;
+};
+#endif /* CONFIG_SECCOMP_FILTER */
+
+struct landlock_rule {
+   atomic_t usage;
+   struct landlock_rule *prev;
+   /*
+* List of filters (through filter->thread_prev) allowed to trigger
+* this Landlock program.
+*/
+   struct bpf_prog *prog;
+#ifdef CONFIG_SECCOMP_FILTER
+   struct seccomp_filter *thread_filter;
+#endif /* CONFIG_SECCOMP_FILTER */
+};
+
+/**
+ * struct landlock_hooks - Landlock hook programs enforced on a thread
+ *
+ * This is used for low performance impact when forking a process. Instead of
+ * copying the full array and incrementing the usage field of each entries,
+ * only create a pointer to struct landlock_hooks and increment the usage
+ * field.
+ *
+ * A new struct landlock_hooks must be created thanks to a call to
+ * new_landlock_hooks().
+ *
+ * @usage: reference count to manage the object lifetime. When a thread need to
+ * add Landlock programs and if @usage is greater than 1, then the
+ * thread must duplicate struct landlock_hooks to not change the
+ * children' rules as well.
+ */
+struct landlock_hooks {
+   atomic_t usage;
+   struct landlock_rule *rules[_LANDLOCK_HOOK_LAST];
+};
+
+
+struct landlock_hooks *new_landlock_hooks(void);
+void put_landlock_hooks(struct landlock_hooks *hooks);
+
+#ifdef CONFIG_SECCOMP_FILTER
+void put_landlock_ret(struct landlock_seccomp_ret *landlock_ret);
+int landlock_seccomp_set_hook(unsigned int flags,
+   const char __user *user_bpf_fd);
+#endif /* CONFIG_SECCOMP_FILTER */
+
+#endif /* CONFIG_SECURITY_LANDLOCK */
+#endif /* _LINUX_LANDLOCK_H */
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index ffdab7cdd162..3cb90bf43a24 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -10,6 +10,10 @@
 #include 
 #include 
 
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+#include 
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -19,6 +23,7 @@
  * is only needed for handling filters shared across tasks.
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
+ * @thread_prev: points to

[RFC v3 00/22] Landlock LSM: Unprivileged sandboxing

2016-09-14 Thread Mickaël Salaün
Hi,

This series, an improvement of the previous RFC [1], is a proof of concept to
fill some missing parts of seccomp, such as the ability to check syscall
argument pointers or to create more dynamic security policies. The goal of this
new stackable Linux Security Module (LSM) called Landlock is to allow any
process, including unprivileged ones, to create powerful security sandboxes
comparable to the Seatbelt/XNU Sandbox or the OpenBSD Pledge. This kind of
sandbox helps to mitigate the security impact of bugs or unexpected/malicious
behaviors in userland applications.


# Landlock LSM

This series is mainly focused on cgroups while keeping the possibility to
enforce access rules through seccomp. eBPF programs are used to create
security rules. They are very limited (i.e. they can only call a whitelist of
functions) and cannot cause a denial of service (i.e. no loops). A new
dedicated eBPF map type makes it possible to collect Landlock handles and
compare them with system resources (e.g. files or network connections).

The approach taken is to add the minimum amount of code while still allowing
userland to create quite complex access rules. A dedicated security policy
language such as those used by SELinux, AppArmor and other major LSMs requires
a lot of code and is aimed at a trusted process (i.e. root/administrator).


# eBPF

To get an expressive language while still being safe and small, Landlock is
based on eBPF. Landlock should be usable by untrusted processes and must
therefore expose a minimal attack surface. The eBPF bytecode is minimal yet
powerful, widely used and designed to be used by not-so-trusted applications.
Reusing this code avoids reproducing the same mistakes and minimizes new code
while still taking a generic approach. The only new features are a new kind of
arraymap and a few dedicated eBPF functions.

An eBPF program has access to an eBPF context which contains the LSM hook
arguments (as seccomp-bpf does with syscall arguments). They can be used
directly or passed to helper functions according to their types. It is then
possible to do complex access checks without race conditions or inconsistent
evaluation (i.e. incorrect mirroring of the OS code and state [2]).

There is one eBPF program subtype per LSM hook. This allows statically
checking which context accesses are performed by an eBPF program. This is
needed to prevent kernel address leaks and to ensure the right use of LSM hook
arguments with eBPF functions. Moreover, this safe pointer handling removes
the need for runtime checks or abstract data, which improves performance. Any
user can add multiple Landlock eBPF programs per LSM hook. They are stacked
and evaluated one after the other (cf. seccomp-bpf).


# LSM hooks

Contrary to syscalls, LSM hooks are security checkpoints and are not
architecture dependent. They are designed to match a security need reflected
by a security policy (e.g. access to a file). Exposing parts of some LSM hooks
instead of using the syscall API for sandboxing should help to avoid the bugs
and hacks encountered by the first RFC. Instead of redoing the work of the LSM
hooks through syscalls, we should use and expose them, as the policies of
access control LSMs do.

Only a subset of the hooks is meaningful for an unprivileged sandbox mechanism
(e.g. file system or network access control). Landlock uses an abstraction of
raw LSM hooks, which allows dealing with possible future changes of the LSM
hook API. Moreover, thanks to the eBPF program typing (per LSM hook) used by
Landlock, it should not be hard to make such evolutions backward compatible.


# Use case scenario

First, a process needs to create a new dedicated eBPF map containing handles.
These handles are references to system resources (e.g. a file or a directory)
and are grouped in one or multiple maps to be efficiently managed and checked
in batches. This kind of map can be passed to Landlock eBPF functions to
compare, for example, with a file access request. The handles are only
accessible from the eBPF programs created by the same thread.

The loaded Landlock eBPF programs can be triggered by a seccomp filter
returning RET_LANDLOCK. In addition, a cookie (16-bit value) can be passed
from a seccomp filter to the eBPF programs. This allows flexible security
policies between seccomp and Landlock.

Another way to enforce a Landlock security policy is to attach Landlock
programs to a dedicated cgroup. All the processes in this cgroup will then be
subject to this policy. For unprivileged processes, this can be done thanks to
cgroup delegation.

A triggered Landlock eBPF program can allow or deny an access, according to
its subtype (i.e. LSM hook), thanks to errno return values.


# Sandbox example with process hierarchy sandboxing (seccomp)

  $ ls /home
  user1
  $ LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
  ./samples/landlock/sandbox /bin/sh -i
  Launching a new sandboxed process.
  $ ls /home
  ls: cannot open directory '/home': Permission denied


# Sandbox example with conditional access 

[RFC v3 04/22] bpf: Set register type according to is_valid_access()

2016-09-14 Thread Mickaël Salaün
This fixes a pointer leak when an unprivileged eBPF program reads a pointer
value from the context. Even if is_valid_access() returns a pointer
type, the eBPF verifier replaces it with UNKNOWN_VALUE. The register
value containing an address is then allowed to leak. Moreover, this
prevented unprivileged eBPF programs from using functions with (legitimate)
pointer arguments.

This bug was not a problem until now because the only unprivileged eBPF
program allowed is of type BPF_PROG_TYPE_SOCKET_FILTER and all the types
from its context are UNKNOWN_VALUE.

Signed-off-by: Mickaël Salaün 
Fixes: 969bf05eb3ce ("bpf: direct packet access")
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
---
 kernel/bpf/verifier.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c0c4a92dae8c..608cbffb0e86 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -794,10 +794,8 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
}
err = check_ctx_access(env, off, size, t, ®_type);
if (!err && t == BPF_READ && value_regno >= 0) {
-   mark_reg_unknown_value(state->regs, value_regno);
-   if (env->allow_ptr_leaks)
-   /* note that reg.[id|off|range] == 0 */
-   state->regs[value_regno].type = reg_type;
+   /* note that reg.[id|off|range] == 0 */
+   state->regs[value_regno].type = reg_type;
}
 
} else if (reg->type == FRAME_PTR || reg->type == PTR_TO_STACK) {
-- 
2.9.3



[RFC v3 19/22] landlock: Add interrupted origin

2016-09-14 Thread Mickaël Salaün
This third origin of hook calls should cover all possible trigger paths
(e.g. page fault). Landlock eBPF programs can then take decisions
accordingly.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: Kees Cook 
---
 include/uapi/linux/bpf.h |  3 ++-
 security/landlock/lsm.c  | 17 +++--
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 12e61508f879..3cc52e51357f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -580,7 +580,8 @@ enum landlock_hook_id {
 /* Trigger type */
 #define LANDLOCK_FLAG_ORIGIN_SYSCALL   (1 << 0)
 #define LANDLOCK_FLAG_ORIGIN_SECCOMP   (1 << 1)
-#define _LANDLOCK_FLAG_ORIGIN_MASK ((1 << 2) - 1)
+#define LANDLOCK_FLAG_ORIGIN_INTERRUPT (1 << 2)
+#define _LANDLOCK_FLAG_ORIGIN_MASK ((1 << 3) - 1)
 
 /* context of function access flags */
 #define _LANDLOCK_FLAG_ACCESS_MASK ((1ULL << 0) - 1)
diff --git a/security/landlock/lsm.c b/security/landlock/lsm.c
index 000dd0c7ec3d..2a15839a08c8 100644
--- a/security/landlock/lsm.c
+++ b/security/landlock/lsm.c
@@ -17,6 +17,7 @@
 #include  /* FIELD_SIZEOF() */
 #include 
 #include 
+#include  /* in_interrupt() */
 #include  /* struct seccomp_* */
 #include  /* uintptr_t */
 
@@ -109,6 +110,7 @@ static int landlock_run_prog(enum landlock_hook_id hook_id, __u64 args[6])
 #endif /* CONFIG_CGROUP_BPF */
struct landlock_rule *rule;
u32 hook_idx = get_index(hook_id);
+   u16 current_call;
 
struct landlock_data ctx = {
.hook = hook_id,
@@ -128,6 +130,16 @@ static int landlock_run_prog(enum landlock_hook_id hook_id, __u64 args[6])
 * prioritize fine-grained policies (i.e. per thread), and return early.
 */
 
+   if (unlikely(in_interrupt())) {
+   current_call = LANDLOCK_FLAG_ORIGIN_INTERRUPT;
+#ifdef CONFIG_SECCOMP_FILTER
+   /* bypass landlock_ret evaluation */
+   goto seccomp_int;
+#endif /* CONFIG_SECCOMP_FILTER */
+   } else {
+   current_call = LANDLOCK_FLAG_ORIGIN_SYSCALL;
+   }
+
 #ifdef CONFIG_SECCOMP_FILTER
/* seccomp triggers and landlock_ret cleanup */
ctx.origin = LANDLOCK_FLAG_ORIGIN_SECCOMP;
@@ -164,8 +176,9 @@ static int landlock_run_prog(enum landlock_hook_id hook_id, __u64 args[6])
return -ret;
ctx.cookie = 0;
 
+seccomp_int:
/* syscall trigger */
-   ctx.origin = LANDLOCK_FLAG_ORIGIN_SYSCALL;
+   ctx.origin = current_call;
ret = landlock_run_prog_for_syscall(hook_idx, &ctx,
current->seccomp.landlock_hooks);
if (ret)
@@ -175,7 +188,7 @@ static int landlock_run_prog(enum landlock_hook_id hook_id, __u64 args[6])
 #ifdef CONFIG_CGROUP_BPF
/* syscall trigger */
if (cgroup_bpf_enabled) {
-   ctx.origin = LANDLOCK_FLAG_ORIGIN_SYSCALL;
+   ctx.origin = current_call;
/* get the default cgroup associated with the current thread */
cgrp = task_css_set(current)->dfl_cgrp;
ret = landlock_run_prog_for_syscall(hook_idx, &ctx,
-- 
2.9.3



[RFC v3 14/22] bpf/cgroup: Make cgroup_bpf_update() return an error code

2016-09-14 Thread Mickaël Salaün
This will be useful to support Landlock in the next commits.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Daniel Mack 
Cc: David S. Miller 
Cc: Tejun Heo 
---
 include/linux/bpf-cgroup.h |  4 ++--
 kernel/bpf/cgroup.c|  3 ++-
 kernel/bpf/syscall.c   | 10 ++
 kernel/cgroup.c|  6 --
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 2234042d7f61..6cca7924ee17 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -31,13 +31,13 @@ struct cgroup_bpf {
 void cgroup_bpf_put(struct cgroup *cgrp);
 void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
 
-void __cgroup_bpf_update(struct cgroup *cgrp,
+int __cgroup_bpf_update(struct cgroup *cgrp,
 struct cgroup *parent,
 struct bpf_prog *prog,
 enum bpf_attach_type type);
 
 /* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
-void cgroup_bpf_update(struct cgroup *cgrp,
+int cgroup_bpf_update(struct cgroup *cgrp,
   struct bpf_prog *prog,
   enum bpf_attach_type type);
 
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 782878ec4f2d..7b75fa692617 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -83,7 +83,7 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
  *
  * Must be called with cgroup_mutex held.
  */
-void __cgroup_bpf_update(struct cgroup *cgrp,
+int __cgroup_bpf_update(struct cgroup *cgrp,
 struct cgroup *parent,
 struct bpf_prog *prog,
 enum bpf_attach_type type)
@@ -117,6 +117,7 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
bpf_prog_put(old_pinned.prog);
static_branch_dec(&cgroup_bpf_enabled_key);
}
+   return 0;
 }
 
 /**
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 45a91d59..c978f2d9a1b3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -831,6 +831,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 {
struct bpf_prog *prog;
struct cgroup *cgrp;
+   int result;
 
if (!capable(CAP_NET_ADMIN))
return -EPERM;
@@ -858,10 +859,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
return PTR_ERR(cgrp);
}
 
-   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   result = cgroup_bpf_update(cgrp, prog, attr->attach_type);
cgroup_put(cgrp);
 
-   return 0;
+   return result;
 }
 
 #define BPF_PROG_DETACH_LAST_FIELD attach_type
@@ -869,6 +870,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 static int bpf_prog_detach(const union bpf_attr *attr)
 {
struct cgroup *cgrp;
+   int result = 0;
 
if (!capable(CAP_NET_ADMIN))
return -EPERM;
@@ -883,7 +885,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
if (IS_ERR(cgrp))
return PTR_ERR(cgrp);
 
-   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   result = cgroup_bpf_update(cgrp, NULL, attr->attach_type);
cgroup_put(cgrp);
break;
 
@@ -891,7 +893,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
return -EINVAL;
}
 
-   return 0;
+   return result;
 }
 #endif /* CONFIG_CGROUP_BPF */
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 87324ce481b1..48b650a640a9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -6450,15 +6450,17 @@ static __init int cgroup_namespaces_init(void)
 subsys_initcall(cgroup_namespaces_init);
 
 #ifdef CONFIG_CGROUP_BPF
-void cgroup_bpf_update(struct cgroup *cgrp,
+int cgroup_bpf_update(struct cgroup *cgrp,
   struct bpf_prog *prog,
   enum bpf_attach_type type)
 {
struct cgroup *parent = cgroup_parent(cgrp);
+   int result;
 
mutex_lock(&cgroup_mutex);
-   __cgroup_bpf_update(cgrp, parent, prog, type);
+   result = __cgroup_bpf_update(cgrp, parent, prog, type);
mutex_unlock(&cgroup_mutex);
+   return result;
 }
 #endif /* CONFIG_CGROUP_BPF */
 
-- 
2.9.3



[RFC v3 13/22] bpf/cgroup: Replace struct bpf_prog with union bpf_object

2016-09-14 Thread Mickaël Salaün
This allows CONFIG_CGROUP_BPF to manage different types of pointers
instead of only eBPF programs. This will be useful for the next commits
to support Landlock with cgroups.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Daniel Mack 
Cc: David S. Miller 
Cc: Tejun Heo 
---
 include/linux/bpf-cgroup.h |  8 ++--
 kernel/bpf/cgroup.c| 44 +++-
 2 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index fc076de74ab9..2234042d7f61 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -14,14 +14,18 @@ struct sk_buff;
 extern struct static_key_false cgroup_bpf_enabled_key;
 #define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
 
+union bpf_object {
+   struct bpf_prog *prog;
+};
+
 struct cgroup_bpf {
/*
 * Store two sets of bpf_prog pointers, one for programs that are
 * pinned directly to this cgroup, and one for those that are effective
 * when this cgroup is accessed.
 */
-   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
-   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+   union bpf_object pinned[MAX_BPF_ATTACH_TYPE];
+   union bpf_object effective[MAX_BPF_ATTACH_TYPE];
 };
 
 void cgroup_bpf_put(struct cgroup *cgrp);
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 21d168c3ad35..782878ec4f2d 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -20,18 +20,18 @@ DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
 EXPORT_SYMBOL(cgroup_bpf_enabled_key);
 
 /**
- * cgroup_bpf_put() - put references of all bpf programs
+ * cgroup_bpf_put() - put references of all bpf objects
  * @cgrp: the cgroup to modify
  */
 void cgroup_bpf_put(struct cgroup *cgrp)
 {
unsigned int type;
 
-   for (type = 0; type < ARRAY_SIZE(cgrp->bpf.prog); type++) {
-   struct bpf_prog *prog = cgrp->bpf.prog[type];
+   for (type = 0; type < ARRAY_SIZE(cgrp->bpf.pinned); type++) {
+   union bpf_object pinned = cgrp->bpf.pinned[type];
 
-   if (prog) {
-   bpf_prog_put(prog);
+   if (pinned.prog) {
+   bpf_prog_put(pinned.prog);
static_branch_dec(&cgroup_bpf_enabled_key);
}
}
@@ -47,11 +47,12 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
unsigned int type;
 
for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
-   struct bpf_prog *e;
+   union bpf_object e;
 
-   e = rcu_dereference_protected(parent->bpf.effective[type],
- lockdep_is_held(&cgroup_mutex));
-   rcu_assign_pointer(cgrp->bpf.effective[type], e);
+   e.prog = rcu_dereference_protected(
+   parent->bpf.effective[type].prog,
+   lockdep_is_held(&cgroup_mutex));
+   rcu_assign_pointer(cgrp->bpf.effective[type].prog, e.prog);
}
 }
 
@@ -87,32 +88,33 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
 struct bpf_prog *prog,
 enum bpf_attach_type type)
 {
-   struct bpf_prog *old_prog, *effective;
+   union bpf_object obj, old_pinned, effective;
struct cgroup_subsys_state *pos;
 
-   old_prog = xchg(cgrp->bpf.prog + type, prog);
+   obj.prog = prog;
+   old_pinned = xchg(cgrp->bpf.pinned + type, obj);
 
-   effective = (!prog && parent) ?
-   rcu_dereference_protected(parent->bpf.effective[type],
+   effective.prog = (!obj.prog && parent) ?
+   rcu_dereference_protected(parent->bpf.effective[type].prog,
  lockdep_is_held(&cgroup_mutex)) :
-   prog;
+   obj.prog;
 
css_for_each_descendant_pre(pos, &cgrp->self) {
struct cgroup *desc = container_of(pos, struct cgroup, self);
 
/* skip the subtree if the descendant has its own program */
-   if (desc->bpf.prog[type] && desc != cgrp)
+   if (desc->bpf.pinned[type].prog && desc != cgrp)
pos = css_rightmost_descendant(pos);
else
-   rcu_assign_pointer(desc->bpf.effective[type],
-  effective);
+   rcu_assign_pointer(desc->bpf.effective[type].prog,
+  effective.prog);
}
 
-   if (prog)
+   if (obj.prog)
static_branch_inc(&cgroup_bpf_enabled_key);
 
-   if (old_prog) {
-   bpf_prog_put(old_prog);
+   if (old_pinned.prog) {
+   bpf_prog_put(old_pinned.prog);
static_branch_dec(&cgroup_bpf_enabled_key);
}
 }
@@ -151,7 +153,7 @@ 

[RFC v3 02/22] bpf: Move u64_to_ptr() to BPF headers and inline it

2016-09-14 Thread Mickaël Salaün
This helper will be useful for arraymap (next commit).

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: David S. Miller 
Cc: Daniel Borkmann 
---
 include/linux/bpf.h  | 6 ++
 kernel/bpf/syscall.c | 6 --
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9a904f63f8c1..fa9a988400d9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -274,6 +274,12 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
+
+/* helper to convert user pointers passed inside __aligned_u64 fields */
+static inline void __user *u64_to_ptr(__u64 val)
+{
+   return (void __user *) (unsigned long) val;
+}
 #else
 static inline void bpf_register_prog_type(struct bpf_prog_type_list *tl)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1a8592a082ce..776c752604b0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -252,12 +252,6 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd)
return map;
 }
 
-/* helper to convert user pointers passed inside __aligned_u64 fields */
-static void __user *u64_to_ptr(__u64 val)
-{
-   return (void __user *) (unsigned long) val;
-}
-
 int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 {
return -ENOTSUPP;
-- 
2.9.3



[RFC v3 06/22] landlock: Add LSM hooks

2016-09-14 Thread Mickaël Salaün
Add LSM hooks which can be used by userland through Landlock (eBPF)
programs. These programs are limited to a whitelist of functions (cf.
next commit). The eBPF program context is depicted by the struct
landlock_data (cf. include/uapi/linux/bpf.h):
* hook: LSM hook ID
* origin: what triggered this Landlock program (syscall, dedicated
  seccomp return or interruption)
* cookie: the 16-bit value from the seccomp filter that triggered this
  Landlock program
* args[6]: array of some LSM hook arguments

The LSM hook arguments can contain raw values as integers or
(unleakable) pointers. The only way to use the pointers is to pass them
to an eBPF function according to their types (e.g. the
bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct
file pointer).

For each Landlock program, the subtype allows specifying, through the
"id" field, which LSM hook the program is dedicated to. The "origin"
field must contain each trigger for which the Landlock program will
be called (e.g. every syscall and/or seccomp filters returning
RET_LANDLOCK). The "access" bitfield can be used to allow a program to
access a specific feature from a Landlock hook (i.e. context value or
function). The flag guarding this feature may only be enabled according
to the capabilities of the process loading the program.

For now, there are three hooks for file system access control:
* file_open
* file_permission
* mmap_file

Changes since v2:
* use subtypes instead of dedicated eBPF program types for each hook
  (suggested by Alexei Starovoitov)
* replace convert_ctx_access() with subtype check
* use an array of Landlock program list instead of a single list
* handle running Landlock programs without needing a seccomp filter
* use, check and expose "origin" to Landlock programs
* mask the unused struct cred * (suggested by Andy Lutomirski)

Changes since v1:
* revamp access control from a syscall-based to a LSM hooks-based
* do not use audit cache
* no race conditions by design
* architecture agnostic
* switch from cBPF to eBPF (suggested by Daniel Borkmann)
* new BPF context

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
Cc: Will Drewry 
Link: https://lkml.kernel.org/r/20160827205559.ga43...@ast-mbp.thefacebook.com
Link: https://lkml.kernel.org/r/20160827180642.ga38...@ast-mbp.thefacebook.com
Link: https://lkml.kernel.org/r/CALCETrUK1umtXMEXXKzMAccNQCVTPA8_XNDf01B5=gazujw...@mail.gmail.com
Link: https://lkml.kernel.org/r/20160827204307.ga43...@ast-mbp.thefacebook.com
---
 include/linux/bpf.h|   5 +
 include/linux/lsm_hooks.h  |   5 +
 include/uapi/linux/bpf.h   |  37 
 kernel/bpf/syscall.c   |  10 +-
 kernel/bpf/verifier.c  |   6 ++
 security/Makefile  |   2 +
 security/landlock/Makefile |   3 +
 security/landlock/lsm.c| 222 +
 security/security.c|   1 +
 9 files changed, 289 insertions(+), 2 deletions(-)
 create mode 100644 security/landlock/Makefile
 create mode 100644 security/landlock/lsm.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9aa01d9d3d80..36c3e482239c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -85,6 +85,8 @@ enum bpf_arg_type {
 
ARG_PTR_TO_CTX, /* pointer to context */
ARG_ANYTHING,   /* any (initialized) argument is ok */
+
+   ARG_PTR_TO_STRUCT_FILE, /* pointer to struct file */
 };
 
 /* type of values returned from helper functions */
@@ -143,6 +145,9 @@ enum bpf_reg_type {
 */
PTR_TO_PACKET,
PTR_TO_PACKET_END,   /* skb->data + headlen */
+
+   /* Landlock */
+   PTR_TO_STRUCT_FILE,
 };
 
 struct bpf_prog;
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 558adfa5c8a8..069af34301d4 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1933,5 +1933,10 @@ void __init loadpin_add_hooks(void);
 #else
 static inline void loadpin_add_hooks(void) { };
 #endif
+#ifdef CONFIG_SECURITY_LANDLOCK
+extern void __init landlock_add_hooks(void);
+#else
+static inline void __init landlock_add_hooks(void) { }
+#endif /* CONFIG_SECURITY_LANDLOCK */
 
 #endif /* ! __LINUX_LSM_HOOKS_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 667b6ef3ff1e..ad87003fe892 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -108,6 +108,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SOCKET,
+   BPF_PROG_TYPE_LANDLOCK,
 };
 
 enum bpf_attach_type {
@@ -528,6 +529,23 @@ struct xdp_md {
__u32 data_end;
 };
 
+/* LSM hooks */
+enum landlock_hook_id {
+   LANDLOCK_HOOK_UNSPEC,
+   LANDLOCK_HOOK_FILE_OPEN,
+   LANDLOCK_HOOK_FILE_PERMISSION,
+   LANDLOCK_HOOK_MMAP_FILE,
+};
+#define _LANDLOCK_HOOK_LAST LANDLOCK_HOOK_MMAP_FILE
+
+/* Trigger ty

[RFC v3 21/22] bpf,landlock: Add optional skb pointer in the Landlock context

2016-09-14 Thread Mickaël Salaün
This is a proof of concept to expose optional values that could depend
on the process access rights.

There are two dedicated flags: LANDLOCK_FLAG_ACCESS_SKB_READ and
LANDLOCK_FLAG_ACCESS_SKB_WRITE. Each of them can be activated to access
eBPF functions manipulating an skb for reading or writing.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: Kees Cook 
Cc: Sargun Dhillon 
---
 include/linux/bpf.h  |  2 ++
 include/uapi/linux/bpf.h |  7 ++-
 kernel/bpf/verifier.c|  6 ++
 security/landlock/lsm.c  | 26 ++
 4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f7325c17f720..218973777612 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -88,6 +88,7 @@ enum bpf_arg_type {
 
ARG_PTR_TO_STRUCT_FILE, /* pointer to struct file */
ARG_CONST_PTR_TO_LANDLOCK_HANDLE_FS,/* pointer to Landlock FS 
handle */
+   ARG_PTR_TO_STRUCT_SKB,  /* pointer to struct skb */
 };
 
 /* type of values returned from helper functions */
@@ -150,6 +151,7 @@ enum bpf_reg_type {
/* Landlock */
PTR_TO_STRUCT_FILE,
CONST_PTR_TO_LANDLOCK_HANDLE_FS,
+   PTR_TO_STRUCT_SKB,
 };
 
 struct bpf_prog;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8cfc2de2ab76..7d9e56952ed9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -586,7 +586,9 @@ enum landlock_hook_id {
 /* context of function access flags */
 #define LANDLOCK_FLAG_ACCESS_UPDATE(1 << 0)
 #define LANDLOCK_FLAG_ACCESS_DEBUG (1 << 1)
-#define _LANDLOCK_FLAG_ACCESS_MASK ((1ULL << 2) - 1)
+#define LANDLOCK_FLAG_ACCESS_SKB_READ  (1 << 2)
+#define LANDLOCK_FLAG_ACCESS_SKB_WRITE (1 << 3)
+#define _LANDLOCK_FLAG_ACCESS_MASK ((1ULL << 4) - 1)
 
 /* Handle check flags */
 #define LANDLOCK_FLAG_FS_DENTRY(1 << 0)
@@ -619,12 +621,15 @@ struct landlock_handle {
  * @args: LSM hook arguments, see include/linux/lsm_hooks.h for there
  *description and the LANDLOCK_HOOK* definitions from
  *security/landlock/lsm.c for their types.
+ * @opt_skb: optional skb pointer, accessible with the
+ *   LANDLOCK_FLAG_ACCESS_SKB_* flags for network-related hooks.
  */
 struct landlock_data {
__u32 hook; /* enum landlock_hook_id */
__u16 origin; /* LANDLOCK_FLAG_ORIGIN_* */
__u16 cookie; /* seccomp RET_LANDLOCK */
__u64 args[6];
+   __u64 opt_skb;
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8d7b18574f5a..a95154c1a60f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -247,6 +247,7 @@ static const char * const reg_type_str[] = {
[PTR_TO_PACKET_END] = "pkt_end",
[PTR_TO_STRUCT_FILE]= "struct_file",
[CONST_PTR_TO_LANDLOCK_HANDLE_FS] = "landlock_handle_fs",
+   [PTR_TO_STRUCT_SKB] = "struct_skb",
 };
 
 static void print_verifier_state(struct verifier_state *state)
@@ -559,6 +560,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
case CONST_PTR_TO_MAP:
case PTR_TO_STRUCT_FILE:
case CONST_PTR_TO_LANDLOCK_HANDLE_FS:
+   case PTR_TO_STRUCT_SKB:
return true;
default:
return false;
@@ -984,6 +986,10 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
expected_type = CONST_PTR_TO_LANDLOCK_HANDLE_FS;
if (type != expected_type)
goto err_type;
+   } else if (arg_type == ARG_PTR_TO_STRUCT_SKB) {
+   expected_type = PTR_TO_STRUCT_SKB;
+   if (type != expected_type)
+   goto err_type;
} else if (arg_type == ARG_PTR_TO_STACK ||
   arg_type == ARG_PTR_TO_RAW_STACK) {
expected_type = PTR_TO_STACK;
diff --git a/security/landlock/lsm.c b/security/landlock/lsm.c
index 56c45abe979c..8b0e6f0eb6b7 100644
--- a/security/landlock/lsm.c
+++ b/security/landlock/lsm.c
@@ -281,6 +281,7 @@ static bool __is_valid_access(int off, int size, enum bpf_access_type type,
break;
case offsetof(struct landlock_data, args[0]) ...
offsetof(struct landlock_data, args[5]):
+   case offsetof(struct landlock_data, opt_skb):
expected_size = sizeof(__u64);
break;
default:
@@ -299,6 +300,13 @@ static bool __is_valid_access(int off, int size, enum bpf_access_type type,
if (*reg_type == NOT_INIT)
return false;
break;
+   case offsetof(struct landlock_data, opt_skb):
+   if (!(prog_subtype->landlock_hook.access &
+   (LANDLOCK_FLAG_ACCESS_SKB_READ |
+LANDLOCK_FLAG_ACCESS_SKB_WRITE)))
+   return false;
+ 

[RFC v3 12/22] bpf: Cosmetic change for bpf_prog_attach()

2016-09-14 Thread Mickaël Salaün
Move code outside a switch/case to ease code factoring (cf. next
commit).

This applies on top of Daniel Mack's "Add eBPF hooks for cgroups":
https://lkml.kernel.org/r/1473696735-11269-1-git-send-email-dan...@zonque.org

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Daniel Mack 
---
 kernel/bpf/syscall.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f22e3b63d253..45a91d59 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -843,23 +843,24 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_CGROUP_INET_EGRESS:
prog = bpf_prog_get_type(attr->attach_bpf_fd,
 BPF_PROG_TYPE_CGROUP_SOCKET);
-   if (IS_ERR(prog))
-   return PTR_ERR(prog);
-
-   cgrp = cgroup_get_from_fd(attr->target_fd);
-   if (IS_ERR(cgrp)) {
-   bpf_prog_put(prog);
-   return PTR_ERR(cgrp);
-   }
-
-   cgroup_bpf_update(cgrp, prog, attr->attach_type);
-   cgroup_put(cgrp);
break;
 
default:
return -EINVAL;
}
 
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+
return 0;
 }
 
-- 
2.9.3



[RFC v3 03/22] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles

2016-09-14 Thread Mickaël Salaün
This new arraymap looks like a set and brings new properties:
* strong typing of entries: the eBPF functions get the array type of
  elements instead of CONST_PTR_TO_MAP (e.g.
  CONST_PTR_TO_LANDLOCK_HANDLE_FS);
* force sequential filling (i.e. replace or append-only update), which
  allows quick browsing of all entries.

This strong typing is useful to statically check if the content of a map
can be passed to an eBPF function. For example, Landlock uses it to store
and manage kernel objects (e.g. struct file) instead of dealing with
userland raw data. This improves efficiency and ensures that an eBPF
program can only call functions with the right high-level arguments.

The enum bpf_map_handle_type lists low-level types (e.g.
BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when
updating a map entry (handle). These handle types are used to infer a
high-level arraymap type, listed in enum bpf_map_array_type
(e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS).

For now, this new arraymap is only used by Landlock LSM (cf. next
commits) but it could be useful for other needs.

Changes since v2:
* add a RLIMIT_NOFILE-based limit to the maximum number of arraymap
  handle entries (suggested by Andy Lutomirski)
* remove useless checks

Changes since v1:
* arraymap of handles replace custom checker groups
* simpler userland API

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: Kees Cook 
Link: https://lkml.kernel.org/r/calcetrwwtiz3kztkegow24-dvhqq6lftwexh77fd2g5o71y...@mail.gmail.com
---
 include/linux/bpf.h  |  14 
 include/uapi/linux/bpf.h |  18 +
 kernel/bpf/arraymap.c| 203 +++
 kernel/bpf/verifier.c|  12 ++-
 4 files changed, 246 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fa9a988400d9..eae4ce4542c1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -13,6 +13,10 @@
 #include 
 #include 
 
+#ifdef CONFIG_SECURITY_LANDLOCK
+#include  /* struct file */
+#endif /* CONFIG_SECURITY_LANDLOCK */
+
 struct perf_event;
 struct bpf_map;
 
@@ -38,6 +42,7 @@ struct bpf_map_ops {
 struct bpf_map {
atomic_t refcnt;
enum bpf_map_type map_type;
+   enum bpf_map_array_type map_array_type;
u32 key_size;
u32 value_size;
u32 max_entries;
@@ -187,6 +192,9 @@ struct bpf_array {
 */
enum bpf_prog_type owner_prog_type;
bool owner_jited;
+#ifdef CONFIG_SECURITY_LANDLOCK
+   u32 n_entries;  /* number of entries in a handle array */
+#endif /* CONFIG_SECURITY_LANDLOCK */
union {
char value[0] __aligned(8);
void *ptrs[0] __aligned(8);
@@ -194,6 +202,12 @@ struct bpf_array {
};
 };
 
+#ifdef CONFIG_SECURITY_LANDLOCK
+struct map_landlock_handle {
+   u32 type; /* enum bpf_map_handle_type */
+};
+#endif /* CONFIG_SECURITY_LANDLOCK */
+
 #define MAX_TAIL_CALL_CNT 32
 
 struct bpf_event_entry {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7cd36166f9b7..b68de57f7ab8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -87,6 +87,15 @@ enum bpf_map_type {
BPF_MAP_TYPE_PERCPU_ARRAY,
BPF_MAP_TYPE_STACK_TRACE,
BPF_MAP_TYPE_CGROUP_ARRAY,
+   BPF_MAP_TYPE_LANDLOCK_ARRAY,
+};
+
+enum bpf_map_array_type {
+   BPF_MAP_ARRAY_TYPE_UNSPEC,
+};
+
+enum bpf_map_handle_type {
+   BPF_MAP_HANDLE_TYPE_UNSPEC,
 };
 
 enum bpf_prog_type {
@@ -510,4 +519,13 @@ struct xdp_md {
__u32 data_end;
 };
 
+/* Map handle entry */
+struct landlock_handle {
+   __u32 type; /* enum bpf_map_handle_type */
+   union {
+   __u32 fd;
+   __aligned_u64 glob;
+   };
+} __attribute__((aligned(8)));
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index a2ac051c342f..94256597eacd 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -16,6 +16,13 @@
 #include 
 #include 
 #include 
+#include  /* fput() */
+#include  /* struct file */
+
+#ifdef CONFIG_SECURITY_LANDLOCK
+#include  /* RLIMIT_NOFILE */
+#include  /* rlimit() */
+#endif /* CONFIG_SECURITY_LANDLOCK */
 
 static void bpf_array_free_percpu(struct bpf_array *array)
 {
@@ -580,3 +587,199 @@ static int __init register_cgroup_array_map(void)
 }
 late_initcall(register_cgroup_array_map);
 #endif
+
+#ifdef CONFIG_SECURITY_LANDLOCK
+static struct bpf_map *landlock_array_map_alloc(union bpf_attr *attr)
+{
+   if (attr->value_size != sizeof(struct landlock_handle))
+   return ERR_PTR(-EINVAL);
+   attr->value_size = sizeof(struct map_landlock_handle);
+
+   return array_map_alloc(attr);
+}
+
+static void landlock_put_handle(struct map_landlock_handle *handle)
+{
+   enum bpf_map_handle_type handle_type = handle->type;
+
+   switch (handle_type) {
+   case BPF_MAP_HANDLE_TYPE_UNSPEC:
+   default:
+   

[RFC v3 17/22] cgroup: Add access check for cgroup_get_from_fd()

2016-09-14 Thread Mickaël Salaün
Add a security access check for cgroup-backed FDs. The "cgroup.procs" file
of the corresponding cgroup must be readable to identify the cgroup, and
writable to prove that the current process can manage this cgroup (e.g.
through delegation). This is similar to the check done by
cgroup_procs_write_permission().

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: Daniel Mack 
Cc: David S. Miller 
Cc: Kees Cook 
Cc: Tejun Heo 
---
 include/linux/cgroup.h |  2 +-
 kernel/bpf/arraymap.c  |  2 +-
 kernel/bpf/syscall.c   |  6 +++---
 kernel/cgroup.c| 16 +++-
 4 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c4688742ddc4..5767d471e292 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -87,7 +87,7 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
   struct cgroup_subsys *ss);
 
 struct cgroup *cgroup_get_from_path(const char *path);
-struct cgroup *cgroup_get_from_fd(int fd);
+struct cgroup *cgroup_get_from_fd(int fd, int access_mask);
 
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index edaab4c87292..1d4de8e0ab13 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -552,7 +552,7 @@ static void *cgroup_fd_array_get_ptr(struct bpf_map *map,
 struct file *map_file /* not used */,
 int fd)
 {
-   return cgroup_get_from_fd(fd);
+   return cgroup_get_from_fd(fd, MAY_READ);
 }
 
 static void cgroup_fd_array_put_ptr(void *ptr)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e9c5add327e6..f90225dbbb59 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DEFINE_PER_CPU(int, bpf_prog_active);
 
@@ -863,7 +864,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
if (IS_ERR(prog))
return PTR_ERR(prog);
 
-   cgrp = cgroup_get_from_fd(attr->target_fd);
+   cgrp = cgroup_get_from_fd(attr->target_fd, MAY_WRITE);
if (IS_ERR(cgrp)) {
bpf_prog_put(prog);
return PTR_ERR(cgrp);
@@ -891,10 +892,9 @@ static int bpf_prog_detach(const union bpf_attr *attr)
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-   cgrp = cgroup_get_from_fd(attr->target_fd);
+   cgrp = cgroup_get_from_fd(attr->target_fd, MAY_WRITE);
if (IS_ERR(cgrp))
return PTR_ERR(cgrp);
-
result = cgroup_bpf_update(cgrp, NULL, attr->attach_type);
cgroup_put(cgrp);
break;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 48b650a640a9..3bbaf3f02ed2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -6241,17 +6241,20 @@ EXPORT_SYMBOL_GPL(cgroup_get_from_path);
 /**
  * cgroup_get_from_fd - get a cgroup pointer from a fd
  * @fd: fd obtained by open(cgroup2_dir)
+ * @access_mask: contains the permission mask
  *
  * Find the cgroup from a fd which should be obtained
  * by opening a cgroup directory.  Returns a pointer to the
  * cgroup on success. ERR_PTR is returned if the cgroup
  * cannot be found.
  */
-struct cgroup *cgroup_get_from_fd(int fd)
+struct cgroup *cgroup_get_from_fd(int fd, int access_mask)
 {
struct cgroup_subsys_state *css;
struct cgroup *cgrp;
struct file *f;
+   struct inode *inode;
+   int ret;
 
f = fget_raw(fd);
if (!f)
@@ -6268,6 +6271,17 @@ struct cgroup *cgroup_get_from_fd(int fd)
return ERR_PTR(-EBADF);
}
 
+   ret = -ENOMEM;
+   inode = kernfs_get_inode(f->f_path.dentry->d_sb, cgrp->procs_file.kn);
+   if (inode) {
+   ret = inode_permission(inode, access_mask);
+   iput(inode);
+   }
+   if (ret) {
+   cgroup_put(cgrp);
+   return ERR_PTR(ret);
+   }
+
return cgrp;
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_fd);
-- 
2.9.3



[RFC v3 15/22] bpf/cgroup: Move capability check

2016-09-14 Thread Mickaël Salaün
This will be useful for adding more BPF attach types with different
capability checks.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Daniel Mack 
---
 kernel/bpf/syscall.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c978f2d9a1b3..8599596fd6cf 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -833,15 +833,15 @@ static int bpf_prog_attach(const union bpf_attr *attr)
struct cgroup *cgrp;
int result;
 
-   if (!capable(CAP_NET_ADMIN))
-   return -EPERM;
-
if (CHECK_ATTR(BPF_PROG_ATTACH))
return -EINVAL;
 
switch (attr->attach_type) {
case BPF_CGROUP_INET_INGRESS:
case BPF_CGROUP_INET_EGRESS:
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
prog = bpf_prog_get_type(attr->attach_bpf_fd,
 BPF_PROG_TYPE_CGROUP_SOCKET);
break;
@@ -872,15 +872,15 @@ static int bpf_prog_detach(const union bpf_attr *attr)
struct cgroup *cgrp;
int result = 0;
 
-   if (!capable(CAP_NET_ADMIN))
-   return -EPERM;
-
if (CHECK_ATTR(BPF_PROG_DETACH))
return -EINVAL;
 
switch (attr->attach_type) {
case BPF_CGROUP_INET_INGRESS:
case BPF_CGROUP_INET_EGRESS:
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
cgrp = cgroup_get_from_fd(attr->target_fd);
if (IS_ERR(cgrp))
return PTR_ERR(cgrp);
-- 
2.9.3



[RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Mickaël Salaün
Add a new per-cgroup flag, CGRP_NO_NEW_PRIVS. This flag is initially
set for every cgroup except the root. The flag is cleared when a new
process without the no_new_privs flag is attached to the cgroup.

If a cgroup is landlocked, then any new attempt, by an unprivileged
process, to attach a process without no_new_privs to this cgroup will
be denied.

This allows Landlock rules to be managed safely with cgroup delegation,
as with seccomp.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: Daniel Mack 
Cc: David S. Miller 
Cc: Kees Cook 
Cc: Tejun Heo 
---
 include/linux/cgroup-defs.h |  7 +++
 kernel/bpf/syscall.c|  7 ---
 kernel/cgroup.c | 44 ++--
 security/landlock/manager.c |  7 +++
 4 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index fe1023bf7b9d..ce0e4c90ae7d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -59,6 +59,13 @@ enum {
 * specified at mount time and thus is implemented here.
 */
CGRP_CPUSET_CLONE_CHILDREN,
+   /*
+* Keep track of the no_new_privs property of processes in the cgroup.
+* This is useful to quickly check if all processes in the cgroup have
+* their no_new_privs bit on. This flag is initially set to true but
+* ANDed with every processes coming in the cgroup.
+*/
+   CGRP_NO_NEW_PRIVS,
 };
 
 /* cgroup_root->flags */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f90225dbbb59..ff8b53a8a2a0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -849,9 +849,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 
case BPF_CGROUP_LANDLOCK:
 #ifdef CONFIG_SECURITY_LANDLOCK
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
+   /*
+* security/capability check done in landlock_cgroup_set_hook()
+* called by cgroup_bpf_update()
+*/
prog = bpf_prog_get_type(attr->attach_bpf_fd,
BPF_PROG_TYPE_LANDLOCK);
break;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3bbaf3f02ed2..913e2d3b6d55 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -62,6 +62,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define CREATE_TRACE_POINTS
@@ -1985,6 +1986,7 @@ static void init_cgroup_root(struct cgroup_root *root,
strcpy(root->name, opts->name);
if (opts->cpuset_clone_children)
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
+   /* no CGRP_NO_NEW_PRIVS flag for the root */
 }
 
 static int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
@@ -2812,14 +2814,35 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
LIST_HEAD(preloaded_csets);
struct task_struct *task;
int ret;
+#if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_SECURITY_LANDLOCK)
+   bool no_new_privs;
+#endif /* CONFIG_CGROUP_BPF && CONFIG_SECURITY_LANDLOCK */
 
if (!cgroup_may_migrate_to(dst_cgrp))
return -EBUSY;
 
+   task = leader;
+#if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_SECURITY_LANDLOCK)
+   no_new_privs = !!(dst_cgrp->flags & BIT_ULL(CGRP_NO_NEW_PRIVS));
+   do {
+   no_new_privs = no_new_privs && task_no_new_privs(task);
+   if (!no_new_privs) {
+   if (dst_cgrp->bpf.pinned[BPF_CGROUP_LANDLOCK].hooks &&
+   security_capable_noaudit(current_cred(),
+   current_user_ns(),
+   CAP_SYS_ADMIN) != 0)
+   return -EPERM;
+   clear_bit(CGRP_NO_NEW_PRIVS, &dst_cgrp->flags);
+   break;
+   }
+   if (!threadgroup)
+   break;
+   } while_each_thread(leader, task);
+#endif /* CONFIG_CGROUP_BPF && CONFIG_SECURITY_LANDLOCK */
+
/* look up all src csets */
spin_lock_irq(&css_set_lock);
rcu_read_lock();
-   task = leader;
do {
cgroup_migrate_add_src(task_css_set(task), dst_cgrp,
   &preloaded_csets);
@@ -4345,9 +4368,22 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
return -EBUSY;
 
mutex_lock(&cgroup_mutex);
-
percpu_down_write(&cgroup_threadgroup_rwsem);
 
+#if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_SECURITY_LANDLOCK)
+   if (!(from->flags & BIT_ULL(CGRP_NO_NEW_PRIVS))) {
+   if (to->bpf.pinned[BPF_CGROUP_LANDLOCK].hooks &&
+   security_capable_noaudit(current_cred(),
+   current_user_ns(), CAP_SYS_ADMIN) !=

[RFC v3 09/22] seccomp: Move struct seccomp_filter in seccomp.h

2016-09-14 Thread Mickaël Salaün
Make struct seccomp_filter public because of the upcoming use of
the new thread_prev field added for the Landlock LSM.

Signed-off-by: Mickaël Salaün 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Will Drewry 
---
 include/linux/seccomp.h | 27 ++-
 kernel/seccomp.c| 26 --
 2 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index ecc296c137cd..a0459a7315ce 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -10,7 +10,32 @@
 #include 
 #include 
 
-struct seccomp_filter;
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section.  In general, this
+ * is only needed for handling filters shared across tasks.
+ * @prev: points to a previously installed, or inherited, filter
+ * @prog: the BPF program to evaluate
+ *
+ * seccomp_filter objects are organized in a tree linked via the @prev
+ * pointer.  For any task, it appears to be a singly-linked list starting
+ * with current->seccomp.filter, the most recently attached or inherited filter.
+ * However, multiple filters may share a @prev node, by way of fork(), which
+ * results in a unidirectional tree existing in memory.  This is similar to
+ * how namespaces work.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+   atomic_t usage;
+   struct seccomp_filter *prev;
+   struct bpf_prog *prog;
+};
+
 /**
  * struct seccomp - the state of a seccomp'ed process
  *
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dccfc05cb3ec..1867bbfa7c6c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -33,32 +33,6 @@
 #include 
 #include 
 
-/**
- * struct seccomp_filter - container for seccomp BPF programs
- *
- * @usage: reference count to manage the object lifetime.
- * get/put helpers should be used when accessing an instance
- * outside of a lifetime-guarded section.  In general, this
- * is only needed for handling filters shared across tasks.
- * @prev: points to a previously installed, or inherited, filter
- * @prog: the BPF program to evaluate
- *
- * seccomp_filter objects are organized in a tree linked via the @prev
- * pointer.  For any task, it appears to be a singly-linked list starting
- * with current->seccomp.filter, the most recently attached or inherited filter.
- * However, multiple filters may share a @prev node, by way of fork(), which
- * results in a unidirectional tree existing in memory.  This is similar to
- * how namespaces work.
- *
- * seccomp_filter objects should never be modified after being attached
- * to a task_struct (other than @usage).
- */
-struct seccomp_filter {
-   atomic_t usage;
-   struct seccomp_filter *prev;
-   struct bpf_prog *prog;
-};
-
 /* Limit any path through the tree to 256KB worth of instructions. */
 #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
 
-- 
2.9.3



[RFC v3 01/22] landlock: Add Kconfig

2016-09-14 Thread Mickaël Salaün
Add the initial Landlock Kconfig, needed to split the Landlock eBPF and
seccomp parts and ease the review.

Changes from v2:
* add dependencies on seccomp filter and/or cgroups (with eBPF program
  attach support)

Signed-off-by: Mickaël Salaün 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
---
 security/Kconfig  |  1 +
 security/landlock/Kconfig | 23 +++
 2 files changed, 24 insertions(+)
 create mode 100644 security/landlock/Kconfig

diff --git a/security/Kconfig b/security/Kconfig
index 118f4549404e..c63194c561c5 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -164,6 +164,7 @@ source security/tomoyo/Kconfig
 source security/apparmor/Kconfig
 source security/loadpin/Kconfig
 source security/yama/Kconfig
+source security/landlock/Kconfig
 
 source security/integrity/Kconfig
 
diff --git a/security/landlock/Kconfig b/security/landlock/Kconfig
new file mode 100644
index ..dec64270b06d
--- /dev/null
+++ b/security/landlock/Kconfig
@@ -0,0 +1,23 @@
+config SECURITY_LANDLOCK
+   bool "Landlock sandbox support"
+   depends on SECURITY
+   depends on BPF_SYSCALL
+   depends on SECCOMP_FILTER || CGROUP_BPF
+   default y
+   help
+ Landlock is a stacked LSM which allows any user to load a security
+ policy to restrict their processes (i.e. create a sandbox). The
+ policy is a list of stacked eBPF programs for some LSM hooks. Each
+ program can do some access comparison to check if an access request
+ is legitimate.
+
+ You need to enable seccomp filter and/or cgroups (with eBPF programs
+ attached support) to apply a security policy to either a process
+ hierarchy (e.g. application with built-in sandboxing) or a group of
+ processes (e.g. container sandboxing). It is recommended to enable
+ both seccomp filter and cgroups.
+
+ Further information about eBPF can be found in
+ Documentation/networking/filter.txt
+
+ If you are unsure how to answer this question, answer Y.
-- 
2.9.3



Re: [PATCH net-next 0/4] rxrpc: Support IPv6

2016-09-14 Thread David Howells
David Howells  wrote:

> Here is a set of patches that add IPv6 support.  They need to be applied on
> top of the just-posted miscellaneous fix patches.  They are:

This subset needs to be made to depend on CONFIG_IPV6.

David


[RFC v3 05/22] bpf,landlock: Add eBPF program subtype and is_valid_subtype() verifier

2016-09-14 Thread Mickaël Salaün
The goal of the program subtype is to allow different static
fine-grained verifications for a single program type.

The struct bpf_verifier_ops gets a new optional function:
is_valid_subtype(). This new verifier callback is called at the beginning
of the eBPF program verification to check whether the (optional) program
subtype is valid.

For now, only Landlock eBPF programs are using a program subtype but
this could be used by other program types in the future.

Cf. the next commit to see how the subtype is used by Landlock LSM.

Signed-off-by: Mickaël Salaün 
Link: https://lkml.kernel.org/r/20160827205559.ga43...@ast-mbp.thefacebook.com
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: David S. Miller 
---
 include/linux/bpf.h  |  8 ++--
 include/linux/filter.h   |  1 +
 include/uapi/linux/bpf.h |  9 +
 kernel/bpf/syscall.c |  5 +++--
 kernel/bpf/verifier.c|  9 +++--
 kernel/trace/bpf_trace.c | 12 
 net/core/filter.c| 21 +
 7 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index eae4ce4542c1..9aa01d9d3d80 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -149,17 +149,21 @@ struct bpf_prog;
 
 struct bpf_verifier_ops {
/* return eBPF function prototype for verification */
-   const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+   const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id,
+   union bpf_prog_subtype *prog_subtype);
 
/* return true if 'size' wide access at offset 'off' within bpf_context
 * with 'type' (read or write) is allowed
 */
bool (*is_valid_access)(int off, int size, enum bpf_access_type type,
-   enum bpf_reg_type *reg_type);
+   enum bpf_reg_type *reg_type,
+   union bpf_prog_subtype *prog_subtype);
 
u32 (*convert_ctx_access)(enum bpf_access_type type, int dst_reg,
  int src_reg, int ctx_off,
  struct bpf_insn *insn, struct bpf_prog *prog);
+
+   bool (*is_valid_subtype)(union bpf_prog_subtype *prog_subtype);
 };
 
 struct bpf_prog_type_list {
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c521adfe..88470cdd3ee1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -406,6 +406,7 @@ struct bpf_prog {
kmemcheck_bitfield_end(meta);
u32 len;/* Number of filter blocks */
enum bpf_prog_type  type;   /* Type of BPF program */
+   union bpf_prog_subtype  subtype;/* For fine-grained verifications */
struct bpf_prog_aux *aux;   /* Auxiliary fields */
struct sock_fprog_kern  *orig_prog; /* Original BPF program */
unsigned int(*bpf_func)(const struct sk_buff *skb,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b68de57f7ab8..667b6ef3ff1e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -127,6 +127,14 @@ enum bpf_attach_type {
 
 #define BPF_F_NO_PREALLOC  (1U << 0)
 
+union bpf_prog_subtype {
+   struct {
+   __u32   id; /* enum landlock_hook_id */
+   __u16   origin; /* LANDLOCK_FLAG_ORIGIN_* */
+   __aligned_u64   access; /* LANDLOCK_FLAG_ACCESS_* */
+   } landlock_hook;
+} __attribute__((aligned(8)));
+
 union bpf_attr {
struct { /* anonymous struct used by BPF_MAP_CREATE command */
__u32   map_type;   /* one of enum bpf_map_type */
@@ -155,6 +163,7 @@ union bpf_attr {
__u32   log_size;   /* size of user buffer */
__aligned_u64   log_buf;/* user supplied buffer */
	__u32		kern_version;	/* checked when prog_type=kprobe */
+	union bpf_prog_subtype prog_subtype;	/* checked when prog_type=landlock */
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 776c752604b0..8b3f4d2b4802 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -572,7 +572,7 @@ static void fixup_bpf_calls(struct bpf_prog *prog)
continue;
}
 
-   fn = prog->aux->ops->get_func_proto(insn->imm);
+   fn = prog->aux->ops->get_func_proto(insn->imm, &prog->subtype);
		/* all functions that have prototype and verifier allowed
		 * programs to call them, must be real in-kernel functions
 */
@@ -710,7 +710,7 @@ struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type)
 EXPORT_SYMBOL_GPL(bpf_prog_get_type);
 
 /* last field in 'union bpf_attr' used by this command */
-#define

RE: [RFC 07/11] Add support for memory registration verbs

2016-09-14 Thread Amrani, Ram
> > +static inline struct qedr_ah *get_qedr_ah(struct ib_ah *ibah) {
> > +   return container_of(ibah, struct qedr_ah, ibah); }
> 
> Little surprising to find that here... how is the ah related to this patch?

Thanks, Sagi. Will move into a proper location.


Re: [RFC 00/11] QLogic RDMA Driver (qedr) RFC

2016-09-14 Thread Sagi Grimberg

> SRQ is not part of the RFC (but we do have it and NVMF was tested
> with it).

Nice, I have plans to make SRQs better usable for our ULPs so it'd be
good to have it.

> That's good to know. Are there plans on implementing XRC?
> Right now it looks like none of the ULPs make use of it.


The problem with XRC is that it needs to be reflected on the
wire protocol for it to actually be useful for our ULPs. Not
sure if its worth the effort...


Re: [net-next PATCH 00/11] iw_cxgb4,cxgbit: remove duplicate code

2016-09-14 Thread Or Gerlitz
On Tue, Sep 13, 2016 at 6:53 PM, Varun Prakash  wrote:
> This patch series removes duplicate code from
> iw_cxgb4 and cxgbit by adding common function definitions in libcxgb.

Is that a bunch of misc functionality, or can you provide a more
high-level description of what you are cleaning out? Also, what other
areas are you planning to refactor following the review comments we had
on the target driver?

Or.


Re: [PATCH] MAINTAINERS: Remove myself from PA Semi entries

2016-09-14 Thread Wolfram Sang

> > I was hoping to have Michael merge this since the bulk of the platform is 
> > under him,
> > cc:ing you mostly to be aware that I am orphaning a driver in your 
> > subsystems.
> 
> Let me answer for Jean since I took over I2C in November 2012 ;) I'd
> think the entry can go completely. The last 'F:' tag for the platform
> catches the I2C driver anyhow. But in general:

To make it crystal clear: I meant the I2C entry for PASEMI could go.





Re: [PATCH v3 5/9] ARM: dts: sun8i-h3: add sun8i-emac ethernet driver

2016-09-14 Thread LABBE Corentin
On Mon, Sep 12, 2016 at 09:29:33AM +0200, Maxime Ripard wrote:
> On Fri, Sep 09, 2016 at 02:45:13PM +0200, Corentin Labbe wrote:
> > The sun8i-emac is an ethernet MAC hardware that support 10/100/1000
> > speed.
> > 
> > This patch enable the sun8i-emac on the Allwinner H3 SoC Device-tree.
> > The SoC H3 have an internal PHY, so optionals syscon and ephy are set.
> > 
> > Signed-off-by: Corentin Labbe 
> > ---
> >  arch/arm/boot/dts/sun8i-h3.dtsi | 19 +++
> >  1 file changed, 19 insertions(+)
> > 
> > diff --git a/arch/arm/boot/dts/sun8i-h3.dtsi 
> > b/arch/arm/boot/dts/sun8i-h3.dtsi
> > index a39da6f..a3ac476 100644
> > --- a/arch/arm/boot/dts/sun8i-h3.dtsi
> > +++ b/arch/arm/boot/dts/sun8i-h3.dtsi
> > @@ -50,6 +50,10 @@
> >  / {
> > interrupt-parent = <&gic>;
> >  
> > +   aliases {
> > +   ethernet0 = &emac;
> > +   };
> > +
> 
> This needs to be done at the board level.
> 

ok

> > cpus {
> > #address-cells = <1>;
> > #size-cells = <0>;
> > @@ -446,6 +450,21 @@
> > status = "disabled";
> > };
> >  
> > +   emac: ethernet@1c3 {
> > +   compatible = "allwinner,sun8i-h3-emac";
> > +   syscon = <&syscon>;
> > +   reg = <0x01c3 0x104>;
> > +   reg-names = "emac";
> 
> You don't need reg-names anymore.
> 

ok

> > +   interrupts = ;
> > +   resets = <&ccu RST_BUS_EMAC>, <&ccu RST_BUS_EPHY>;
> > +   reset-names = "ahb", "ephy";
> > +   clocks = <&ccu CLK_BUS_EMAC>, <&ccu CLK_BUS_EPHY>;
> > +   clock-names = "ahb", "ephy";
> 
> I still believe that having the same node for both the PHY and the MAC
> is wrong.
> 

Ok I have moved clock/reset of ephy in its node.

Thanks

Regards

Corentin Labbe


Re: [PATCH] MAINTAINERS: Remove myself from PA Semi entries

2016-09-14 Thread Michael Ellerman
Olof Johansson  writes:

> The platform is old, very few users and I lack bandwidth to keep after
> it these days.
>
> Mark the base platform as well as the drivers as orphans, patches have
> been flowing through the fallback maintainers for a while already.

Sorry to see you go, but thanks for keeping an eye on it as long as you
did!

> Jean, Dave,
>
> I was hoping to have Michael merge this since the bulk of the platform is 
> under him,
> cc:ing you mostly to be aware that I am orphaning a driver in your subsystems.

I'll merge it unless I hear otherwise from Dave.

Should we go the whole hog and just do as below? I think most folks use
get_maintainers.pl these days, so this should have basically the same
effect. Happy to go with your original version though if you prefer.

cheers


diff --git a/MAINTAINERS b/MAINTAINERS
index 0bbe4b105c34..8ca1c25d870d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7049,6 +7049,7 @@ N:powermac
 N: powernv
 N: [^a-z0-9]ps3
 N: pseries
+N: pasemi
 
 LINUX FOR POWER MACINTOSH
 M: Benjamin Herrenschmidt 
@@ -7098,14 +7099,6 @@ S:   Maintained
 F: arch/powerpc/platforms/83xx/
 F: arch/powerpc/platforms/85xx/
 
-LINUX FOR POWERPC PA SEMI PWRFICIENT
-M: Olof Johansson 
-L: linuxppc-...@lists.ozlabs.org
-S: Maintained
-F: arch/powerpc/platforms/pasemi/
-F: drivers/*/*pasemi*
-F: drivers/*/*/*pasemi*
-
 LINUX SECURITY MODULE (LSM) FRAMEWORK
 M: Chris Wright 
 L: linux-security-mod...@vger.kernel.org
@@ -8773,18 +8766,6 @@ W:   http://wireless.kernel.org/en/users/Drivers/p54
 S: Maintained
 F: drivers/net/wireless/intersil/p54/
 
-PA SEMI ETHERNET DRIVER
-M: Olof Johansson 
-L: netdev@vger.kernel.org
-S: Maintained
-F: drivers/net/ethernet/pasemi/*
-
-PA SEMI SMBUS DRIVER
-M: Olof Johansson 
-L: linux-...@vger.kernel.org
-S: Maintained
-F: drivers/i2c/busses/i2c-pasemi.c
-
 PADATA PARALLEL EXECUTION MECHANISM
 M: Steffen Klassert 
 L: linux-cry...@vger.kernel.org


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Thomas Graf
[Sorry for the repost, gmail decided to start sending HTML crap along
 overnight for some reason]

On 09/13/16 at 09:42pm, Alexei Starovoitov wrote:
> On Tue, Sep 13, 2016 at 07:24:08PM +0200, Pablo Neira Ayuso wrote:
> > Then you have to explain me how can anyone else than systemd use this
> > infrastructure?
> 
> Jokes aside. I'm puzzled why systemd is even being mentioned here.
> Here we use tupperware (our internal container management system) that
> is heavily using cgroups and has nothing to do with systemd.

Just confirming that we are planning to use this decoupled from
systemd as well.  I fail to see how this is at all systemd specific.

> For us this cgroup+bpf is _not_ for filterting and _not_ for security.
> We run a ton of tasks in cgroups that launch all sorts of
> things on their own. We need to monitor what they do from networking
> point of view. Therefore bpf programs need to monitor the traffic in
> particular part of cgroup hierarchy. Not globally and no pass/drop decisions.

+10, although filtering/drop is a valid use case, the really strong
use case is definitely introspection at networking level. Statistics,
monitoring, verification of application correctness, etc. 

I don't see why this is at all an either or discussion. If nft wants
cgroups integration similar to this effort, I see no reason why that
should stop this effort.


RE: [RFC 02/11] Add RoCE driver framework

2016-09-14 Thread Amrani, Ram
> > +   if ((event != NETDEV_CHANGENAME) && (event != NETDEV_CHANGEADDR))
> 
> nit: You don't really need the extra parens here.
> 
Sure, thanks. Will remove.
 



Re: [PATCH RFC 2/6] rhashtable: Call library function alloc_bucket_locks

2016-09-14 Thread Thomas Graf
On 09/09/16 at 04:19pm, Tom Herbert wrote:
> To allocate the array of bucket locks for the hash table we now
> call library function alloc_bucket_spinlocks. This function is
> based on the old alloc_bucket_locks in rhashtable and should
> produce the same effect.
> 
> Signed-off-by: Tom Herbert 

Acked-by: Thomas Graf 


Re: [PATCH] [RFC] proc connector: add namespace events

2016-09-14 Thread Jiri Benc
On Tue, 13 Sep 2016 16:42:43 +0200, Alban Crequy wrote:
> Note that I will probably not have the chance to spend more time on
> this patch soon because Iago will explore other methods with
> eBPF+kprobes instead. eBPF+kprobes would not have the same API
> stability though. I was curious to see if anyone would find the
> namespace addition to proc connector interesting for other projects.

Yes, this is a sorely missing feature. I don't care how this is done
(proc connector or something else) but the feature itself is quite
important for system management daemons. In particular, we need an
application that monitors network configuration changes on the machine,
displays the current configuration and records history of the changes.
This is currently impossible to do reliably if net name spaces are in
use - which they are with OpenStack and Docker and similar things in
place on those machines. The current tools try to do things like
monitoring /var/run/netns which is obviously unreliable and broken.

There are actually two (orthogonal) problems here: apart of the one
described above, it's also startup of such daemon. There's currently no
way to find all current name spaces from the user space. We'll need an
API for this, too.

And no, eBPF is not the answer. This should just work like any other
system daemon. I can't imagine that we would need llvm compiler and
kernel sources/debuginfo/whatever on every machine that runs such
daemon.

Thanks,

 Jiri


Re: [PATCH RFC 4/6] rhashtable: abstract out function to get hash

2016-09-14 Thread Thomas Graf
On 09/09/16 at 04:19pm, Tom Herbert wrote:
> Split out most of rht_key_hashfn which is calculating the hash into
> its own function. This way the hash function can be called separately to
> get the hash value.
> 
> Signed-off-by: Tom Herbert 

Acked-by: Thomas Graf 


Re: [RFC 07/11] Add support for memory registration verbs

2016-09-14 Thread Sagi Grimberg



+struct qedr_mr *__qedr_alloc_mr(struct ib_pd *ibpd, int max_page_list_len)
+{
+   struct qedr_pd *pd = get_qedr_pd(ibpd);
+   struct qedr_dev *dev = get_qedr_dev(ibpd->device);
+   struct qedr_mr *mr;
+   int rc = -ENOMEM;
+
+   DP_VERBOSE(dev, QEDR_MSG_MR,
+  "qedr_alloc_frmr pd = %d max_page_list_len= %d\n", pd->pd_id,
+  max_page_list_len);
+
+   mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+   if (!mr)
+   return ERR_PTR(rc);
+
+   mr->dev = dev;
+   mr->type = QEDR_MR_FRMR;
+
+   rc = init_mr_info(dev, &mr->info, max_page_list_len, 1);
+   if (rc)
+   goto err0;
+
+   rc = dev->ops->rdma_alloc_tid(dev->rdma_ctx, &mr->hw_mr.itid);
+   if (rc) {
+   DP_ERR(dev, "roce alloc tid returned an error %d\n", rc);
+   goto err0;
+   }
+
+   /* Index only, 18 bit long, lkey = itid << 8 | key */
+   mr->hw_mr.tid_type = QED_RDMA_TID_FMR;
+   mr->hw_mr.key = 0;
+   mr->hw_mr.pd = pd->pd_id;


Do you have a real MR<->PD association in HW? If so, can you point me to the
code that binds it? If not, any reason not to expose the local_dma_lkey?


Yes, we send the pd id to the FW in function qed_rdma_register_tid.


Right, thanks.


In any case, if we didn't have the association in HW
Wouldn't the local_dma_lkey be relevant only to dma_mr ? ( code snippet above 
refers to FMR)


I was just sticking to a location in the code where you associate
MR<->PD...

The local_dma_lkey is a special key that spans the entire memory space
and unlike the notorious dma_mr, its not associated with a PD.

See the code in ib_alloc_pd(), if the device does not support a single
device local dma lkey, the core allocates a dma mr associated with
the pd. If your device has such a key, you can save a dma mr allocation
for each pd in the system.


Re: [PATCH RFC 1/6] spinlock: Add library function to allocate spinlock buckets array

2016-09-14 Thread Thomas Graf
On 09/09/16 at 04:19pm, Tom Herbert wrote:
> Add two new library functions alloc_bucket_spinlocks and
> free_bucket_spinlocks. These are use to allocate and free an array
> of spinlocks that are useful as locks for hash buckets. The interface
> specifies the maximum number of spinlocks in the array as well
> as a CPU multiplier to derive the number of spinlocks to allocate.
> The number to allocated is rounded up to a power of two to make
> the array amenable to hash lookup.
> 
> Signed-off-by: Tom Herbert 

Acked-by: Thomas Graf 


[patch net-next v9 2/3] net: core: Add offload stats to if_stats_msg

2016-09-14 Thread Jiri Pirko
From: Nogah Frankel 

Add a nested attribute of offload stats to if_stats_msg
named IFLA_STATS_LINK_OFFLOAD_XSTATS.
Under it, add SW stats, meaning stats only per packets that went via
slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 include/uapi/linux/if_link.h |   9 
 net/core/rtnetlink.c | 111 +--
 2 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9bf3aec..2351776 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -826,6 +826,7 @@ enum {
IFLA_STATS_LINK_64,
IFLA_STATS_LINK_XSTATS,
IFLA_STATS_LINK_XSTATS_SLAVE,
+   IFLA_STATS_LINK_OFFLOAD_XSTATS,
__IFLA_STATS_MAX,
 };
 
@@ -845,6 +846,14 @@ enum {
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
 
+/* These are stats embedded into IFLA_STATS_LINK_OFFLOAD_XSTATS */
+enum {
+   IFLA_OFFLOAD_XSTATS_UNSPEC,
+   IFLA_OFFLOAD_XSTATS_CPU_HIT, /* struct rtnl_link_stats64 */
+   __IFLA_OFFLOAD_XSTATS_MAX
+};
+#define IFLA_OFFLOAD_XSTATS_MAX (__IFLA_OFFLOAD_XSTATS_MAX - 1)
+
 /* XDP section */
 
 enum {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 937e459..ae2048a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3577,6 +3577,91 @@ static bool stats_attr_valid(unsigned int mask, int attrid, int idxattr)
   (!idxattr || idxattr == attrid);
 }
 
+#define IFLA_OFFLOAD_XSTATS_FIRST (IFLA_OFFLOAD_XSTATS_UNSPEC + 1)
+static int rtnl_get_offload_stats_attr_size(int attr_id)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return sizeof(struct rtnl_link_stats64);
+   }
+
+   return 0;
+}
+
+static int rtnl_get_offload_stats(struct sk_buff *skb, struct net_device *dev,
+ int *prividx)
+{
+   struct nlattr *attr = NULL;
+   int attr_id, size;
+   void *attr_data;
+   int err;
+
+   if (!(dev->netdev_ops && dev->netdev_ops->ndo_has_offload_stats &&
+ dev->netdev_ops->ndo_get_offload_stats))
+   return -ENODATA;
+
+   for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST;
+attr_id <= IFLA_OFFLOAD_XSTATS_MAX; attr_id++) {
+   if (attr_id < *prividx)
+   continue;
+
+   size = rtnl_get_offload_stats_attr_size(attr_id);
+   if (!size)
+   continue;
+
+   if (!dev->netdev_ops->ndo_has_offload_stats(attr_id))
+   continue;
+
+   attr = nla_reserve_64bit(skb, attr_id, size,
+IFLA_OFFLOAD_XSTATS_UNSPEC);
+   if (!attr)
+   goto nla_put_failure;
+
+   attr_data = nla_data(attr);
+   memset(attr_data, 0, size);
+   err = dev->netdev_ops->ndo_get_offload_stats(attr_id, dev,
+attr_data);
+   if (err)
+   goto get_offload_stats_failure;
+   }
+
+   if (!attr)
+   return -ENODATA;
+
+   *prividx = 0;
+   return 0;
+
+nla_put_failure:
+   err = -EMSGSIZE;
+get_offload_stats_failure:
+   *prividx = attr_id;
+   return err;
+}
+
+static int rtnl_get_offload_stats_size(const struct net_device *dev)
+{
+   int nla_size = 0;
+   int attr_id;
+   int size;
+
+   if (!(dev->netdev_ops && dev->netdev_ops->ndo_has_offload_stats &&
+ dev->netdev_ops->ndo_get_offload_stats))
+   return 0;
+
+   for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST;
+attr_id <= IFLA_OFFLOAD_XSTATS_MAX; attr_id++) {
+   if (!dev->netdev_ops->ndo_has_offload_stats(attr_id))
+   continue;
+   size = rtnl_get_offload_stats_attr_size(attr_id);
+   nla_size += nla_total_size_64bit(size);
+   }
+
+   if (nla_size != 0)
+   nla_size += nla_total_size(0);
+
+   return nla_size;
+}
+
 static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
   int type, u32 pid, u32 seq, u32 change,
   unsigned int flags, unsigned int filter_mask,
@@ -3586,6 +3671,7 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, 
struct net_device *dev,
struct nlmsghdr *nlh;
struct nlattr *attr;
int s_prividx = *prividx;
+   int err;
 
ASSERT_RTNL();
 
@@ -3614,8 +3700,6 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, 
struct net_device *dev,
const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
 
if (ops && ops->fill_linkxstats) {
-   int err;
-
*idxattr = IFLA_STATS_LINK_XSTATS;
attr = nla_nest_start(skb,
 

[patch net-next v9 0/3] return offloaded stats as default and expose original sw stats

2016-09-14 Thread Jiri Pirko
From: Jiri Pirko 

The problem we try to handle is about offloaded forwarded packets
which are not seen by kernel. Let me try to draw it:

port1   port2 (HW stats are counted here)
  \  /
   \/
\  /
 --(A) ASIC --(B)--
|
   (C)
|
   CPU (SW stats are counted here)


Now we have couple of flows for TX and RX (direction does not matter here):

1) port1->A->ASIC->C->CPU

   For this flow, HW and SW stats are equal.

2) port1->A->ASIC->C->CPU->C->ASIC->B->port2

   For this flow, HW and SW stats are equal.

3) port1->A->ASIC->B->port2

   For this flow, SW stats are 0.

The purpose of this patchset is to provide facility for user to
find out the difference between flows 1+2 and 3. In other words, user
will be able to see the statistics for the slow-path (through kernel).

Also note that HW stats are what someone calls "accumulated" stats.
Every packet counted by SW is also counted by HW. Not the other way around.

As a default the accumulated stats (HW) will be exposed to user
so the userspace apps can react properly.

This patchset add the SW stats (flows 1+2) under offload related stats, so
in the future we can expose other offload related stat in a similar way.

---
v8->v9:
- patch 2/3
 - add using of idxattr and prividx
v7->v8:
- patch 2/3
 - move helping const from uapi to rtnetlink
 - cancel driver xstat nesting if it is empty
v6->v7:
- patch 1/3:
 - ndo interface changed to get the wanted stats type as an input.
 - change commit message.
- patch 2/3:
 - create a nesting for offloaded stat and put SW stats under it.
 - change the ndo call to indicate which offload stats we wants.
 - change commit message.
- patch 3/3:
 - change ndo implementation to match the changes in the previous patches.
 - change commit message.
v5->v6:
- patch 2/4 was dropped as requested by Roopa
- patch 1/3:
 - comment changed to indicate that default stats are combined stats
 - commit massage changed
- patch 2/3: (previously 3/4)
 - SW stats return nothing if there is no SW stats ndo
v4->v5:
- updated cover letter
- patch3/4:
  - using memcpy directly to copy stats as requested by DaveM
v3->v4:
- patch1/4:
  - fixed "return ()" pointed out by EricD
- patch2/4:
  - fixed if_nlmsg_size as pointed out by EricD
v2->v3:
- patch1/4:
  - added dev_have_sw_stats helper
- patch2/4:
  - avoided memcpy as requested by DaveM
- patch3/4:
  - use new dev_have_sw_stats helper
v1->v2:
- patch3/4:
  - fixed NULL initialization

Nogah Frankel (3):
  netdevice: Add offload statistics ndo
  net: core: Add offload stats to if_stats_msg
  mlxsw: spectrum: Implement offload stats ndo and expose HW stats by
default

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 129 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   5 +
 include/linux/netdevice.h  |  12 +++
 include/uapi/linux/if_link.h   |   9 ++
 net/core/rtnetlink.c   | 111 -
 5 files changed, 255 insertions(+), 11 deletions(-)

-- 
2.5.5



[patch net-next v9 1/3] netdevice: Add offload statistics ndo

2016-09-14 Thread Jiri Pirko
From: Nogah Frankel 

Add a new ndo to return statistics for offloaded operation.
Since there can be many different offloaded operation with many
stats types, the ndo gets an attribute id by which it knows which
stats are wanted. The ndo also gets a void pointer to be cast according
to the attribute id.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 include/linux/netdevice.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2095b6a..a10d8d1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -924,6 +924,14 @@ struct netdev_xdp {
  * 3. Update dev->stats asynchronously and atomically, and define
  *neither operation.
  *
+ * bool (*ndo_has_offload_stats)(int attr_id)
+ * Return true if this device supports offload stats of this attr_id.
+ *
+ * int (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev,
+ * void *attr_data)
+ * Get statistics for offload operations by attr_id. Write it into the
+ * attr_data pointer.
+ *
  * int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid);
  * If device supports VLAN filtering this function is called when a
  * VLAN id is registered.
@@ -1155,6 +1163,10 @@ struct net_device_ops {
 
struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev,
 struct rtnl_link_stats64 
*storage);
+   bool(*ndo_has_offload_stats)(int attr_id);
+   int (*ndo_get_offload_stats)(int attr_id,
+const struct 
net_device *dev,
+void *attr_data);
struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
 
int (*ndo_vlan_rx_add_vid)(struct net_device *dev,
-- 
2.5.5



[patch net-next v9 3/3] mlxsw: spectrum: Implement offload stats ndo and expose HW stats by default

2016-09-14 Thread Jiri Pirko
From: Nogah Frankel 

Change the default statistics ndo to return HW statistics
(like the one returned by ethtool_ops).
The HW stats are collected to a cache by delayed work every 1 sec.
Implement the offload stat ndo.
Add a function to get SW statistics, to be called from this function.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 129 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   5 +
 2 files changed, 127 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 27bbcaf..171f8dd 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -819,9 +819,9 @@ err_span_port_mtu_update:
return err;
 }
 
-static struct rtnl_link_stats64 *
-mlxsw_sp_port_get_stats64(struct net_device *dev,
- struct rtnl_link_stats64 *stats)
+int
+mlxsw_sp_port_get_sw_stats64(const struct net_device *dev,
+struct rtnl_link_stats64 *stats)
 {
struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
struct mlxsw_sp_port_pcpu_stats *p;
@@ -848,6 +848,107 @@ mlxsw_sp_port_get_stats64(struct net_device *dev,
tx_dropped  += p->tx_dropped;
}
stats->tx_dropped   = tx_dropped;
+   return 0;
+}
+
+bool mlxsw_sp_port_has_offload_stats(int attr_id)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return true;
+   }
+
+   return false;
+}
+
+int mlxsw_sp_port_get_offload_stats(int attr_id, const struct net_device *dev,
+   void *sp)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return mlxsw_sp_port_get_sw_stats64(dev, sp);
+   }
+
+   return -EINVAL;
+}
+
+static int mlxsw_sp_port_get_stats_raw(struct net_device *dev, int grp,
+  int prio, char *ppcnt_pl)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+
+   mlxsw_reg_ppcnt_pack(ppcnt_pl, mlxsw_sp_port->local_port, grp, prio);
+   return mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ppcnt), ppcnt_pl);
+}
+
+static int mlxsw_sp_port_get_hw_stats(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+   char ppcnt_pl[MLXSW_REG_PPCNT_LEN];
+   int err;
+
+   err = mlxsw_sp_port_get_stats_raw(dev, MLXSW_REG_PPCNT_IEEE_8023_CNT,
+ 0, ppcnt_pl);
+   if (err)
+   goto out;
+
+   stats->tx_packets =
+   mlxsw_reg_ppcnt_a_frames_transmitted_ok_get(ppcnt_pl);
+   stats->rx_packets =
+   mlxsw_reg_ppcnt_a_frames_received_ok_get(ppcnt_pl);
+   stats->tx_bytes =
+   mlxsw_reg_ppcnt_a_octets_transmitted_ok_get(ppcnt_pl);
+   stats->rx_bytes =
+   mlxsw_reg_ppcnt_a_octets_received_ok_get(ppcnt_pl);
+   stats->multicast =
+   mlxsw_reg_ppcnt_a_multicast_frames_received_ok_get(ppcnt_pl);
+
+   stats->rx_crc_errors =
+   mlxsw_reg_ppcnt_a_frame_check_sequence_errors_get(ppcnt_pl);
+   stats->rx_frame_errors =
+   mlxsw_reg_ppcnt_a_alignment_errors_get(ppcnt_pl);
+
+   stats->rx_length_errors = (
+   mlxsw_reg_ppcnt_a_in_range_length_errors_get(ppcnt_pl) +
+   mlxsw_reg_ppcnt_a_out_of_range_length_field_get(ppcnt_pl) +
+   mlxsw_reg_ppcnt_a_frame_too_long_errors_get(ppcnt_pl));
+
+   stats->rx_errors = (stats->rx_crc_errors +
+   stats->rx_frame_errors + stats->rx_length_errors);
+
+out:
+   return err;
+}
+
+static void update_stats_cache(struct work_struct *work)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port =
+   container_of(work, struct mlxsw_sp_port,
+hw_stats.update_dw.work);
+
+   if (!netif_carrier_ok(mlxsw_sp_port->dev))
+   goto out;
+
+   mlxsw_sp_port_get_hw_stats(mlxsw_sp_port->dev,
+  mlxsw_sp_port->hw_stats.cache);
+
+out:
+   mlxsw_core_schedule_dw(&mlxsw_sp_port->hw_stats.update_dw,
+  MLXSW_HW_STATS_UPDATE_TIME);
+}
+
+/* Return the stats from a cache that is updated periodically,
+ * as this function might get called in an atomic context.
+ */
+static struct rtnl_link_stats64 *
+mlxsw_sp_port_get_stats64(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+
+   memcpy(stats, mlxsw_sp_port->hw_stats.cache, sizeof(*stats));
+
return stats;
 }
 
@@ -1209,6 +1310,8 @@ static const struct net_device_ops 
mlxsw_sp_port_netdev_ops = {
.ndo_set_mac_address= mlxsw_sp_port_set_mac_a

Re: [RFC 02/11] Add RoCE driver framework

2016-09-14 Thread Mintz, Yuval
> > >> >> +uint debug;
> > >> >> +module_param(debug, uint, 0);
> > > >>> +MODULE_PARM_DESC(debug, "Default debug msglevel");
> > >>
> > >> >Why are you adding this as a module parameter?
> > >>
> > >>  I believe this is mostly to follow same line as qede which also defines
> > > > 'debug' module parameter for allowing easy user control of debug
> > > > prints [& specifically for probe prints, which can't be controlled
> > > > otherwise].
> >
> > > Can you give us an example where dynamic debug and tracing infrastructures
> > > are not enough?
> >
> > > AFAIK, most of these debug module parameters are legacy copy/paste
> > > code which is useless in real life scenarios.
> >
> > Define 'enough'; Using dynamic debug you can provide all the necessary
> > information and at an even better granularity that's achieved by suggested
> > infrastructure,  but is harder for an end-user to use. Same goes for 
> > tracing.
> >
> > The 'debug' option provides an easy grouping for prints related to a 
> > specific
> > area in the driver.
> 
> It is hard to agree with you that user which knows how-to load modules
> with parameters won't success to enable debug prints.

I think you're giving too much credit to the end-user. :-D

> In addition, global increase in debug level for whole driver will create
> printk storm in dmesg and give nothing to debuggability.

So basically, what you're claiming is that ethtool 'msglvl' setting for devices
is completely obselete. While this *might* be true, we use it extensively
in our qede and qed drivers; The debug module parameter merely provides
a manner of setting the debug value prior to initial probe for all interfaces.
qedr follows the same practice.



Re: [PATCH RFC 5/6] net: Generic resolver backend

2016-09-14 Thread Thomas Graf
On 09/09/16 at 04:19pm, Tom Herbert wrote:
> diff --git a/net/core/resolver.c b/net/core/resolver.c
> new file mode 100644
> index 000..61b36c5
> --- /dev/null
> +++ b/net/core/resolver.c
> @@ -0,0 +1,267 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

This include list could be stripped down a bit. ila, lwt, fib, ...

> +
> +static struct net_rslv_ent *net_rslv_new_ent(struct net_rslv *nrslv,
> +  void *key)

Comment above that net_rslv_get_lock() must be held?

> +{
> + struct net_rslv_ent *nrent;
> + int err;
> +
> + nrent = kzalloc(sizeof(*nrent) + nrslv->obj_size, GFP_KERNEL);

GFP_ATOMIC since you typically hold net_rslv_get_lock() spinlock?

> + if (!nrent)
> + return ERR_PTR(-EAGAIN);
> +
> + /* Key is always at beginning of object data */
> + memcpy(nrent->object, key, nrslv->params.key_len);
> +
> + /* Initialize user data */
> + if (nrslv->rslv_init)
> + nrslv->rslv_init(nrslv, nrent);
> +
> + /* Put in hash table */
> + err = rhashtable_lookup_insert_fast(&nrslv->rhash_table,
> + &nrent->node, nrslv->params);
> + if (err)
> + return ERR_PTR(err);
> +
> + if (nrslv->timeout) {
> + /* Schedule timeout for resolver */
> + INIT_DELAYED_WORK(&nrent->timeout_work, net_rslv_delayed_work);

Should this be done before inserting into rhashtable?

> + schedule_delayed_work(&nrent->timeout_work, nrslv->timeout);
> + }
> +
> + nrent->nrslv = nrslv;

Same here.  net_rslv_cancel_all_delayed_work() walking the rhashtable could
see ->nrslv as NULL.



Re: [RFC PATCH] xen-netback: fix error handling on netback_probe()

2016-09-14 Thread Wei Liu
CC xen-devel as well.

On Tue, Sep 13, 2016 at 02:11:27PM +0200, Filipe Manco wrote:
> In case of error during netback_probe() (e.g. an entry missing on the
> xenstore) netback_remove() is called on the new device, which will set
> the device backend state to XenbusStateClosed by calling
> set_backend_state(). However, the backend state wasn't initialized by
> netback_probe() at this point, which will cause and invalid transaction
> and set_backend_state() to BUG().
> 
> Initialize the backend state at the beginning of netback_probe() to
> XenbusStateInitialising, and create a new valid state transaction on
> set_backend_state(), from XenbusStateInitialising to XenbusStateClosed.
> 
> Signed-off-by: Filipe Manco 

There is a state machine right before set_backend_state. You would also
need to update that.

According to the definition of XenbusStateInitialising, this patch looks
plausible to me.

Wei.

> ---
>  drivers/net/xen-netback/xenbus.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/xen-netback/xenbus.c 
> b/drivers/net/xen-netback/xenbus.c
> index 6a31f2610c23..c0e5f6994d01 100644
> --- a/drivers/net/xen-netback/xenbus.c
> +++ b/drivers/net/xen-netback/xenbus.c
> @@ -270,6 +270,7 @@ static int netback_probe(struct xenbus_device *dev,
>  
>   be->dev = dev;
>   dev_set_drvdata(&dev->dev, be);
> + be->state = XenbusStateInitialising;
>  
>   sg = 1;
>  
> @@ -515,6 +516,15 @@ static void set_backend_state(struct backend_info *be,
>  {
>   while (be->state != state) {
>   switch (be->state) {
> + case XenbusStateInitialising:
> + switch (state) {
> + case XenbusStateClosed:
> + backend_switch_state(be, XenbusStateClosed);
> + break;
> + default:
> + BUG();
> + }
> + break;
>   case XenbusStateClosed:
>   switch (state) {
>   case XenbusStateInitWait:
> -- 
> 2.7.4
> 


RE: [RFC 03/11] Add support for RoCE HW init

2016-09-14 Thread Amrani, Ram
> > +   dev->max_sge = min_t(u32, RDMA_MAX_SGE_PER_SQ_WQE,
> > +RDMA_MAX_SGE_PER_RQ_WQE);
> 
> Our kernel target mode consumers sort of rely on max_sge_rd, you need to
> make sure to set it too.
Good catch. Thanks!



RE: [RFC 08/11] Add support for data path

2016-09-14 Thread Amrani, Ram
> > +   pbe = (struct regpair *)pbl_table->va;
> > +   num_pbes = 0;
> > +
> > +   for (i = 0; i < mr->npages &&
> > +(total_num_pbes != mr->info.pbl_info.num_pbes); i++) {
> > +   u64 buf_addr = mr->pages[i];
> > +
> > +   pbe->lo = cpu_to_le32((u32)buf_addr);
> > +   pbe->hi = cpu_to_le32((u32)upper_32_bits(buf_addr));
> 
> Thats a shame... you could have easily set the buf_addr correctly in
> qedr_set_page...
> 
> I think you could have also set the pbe directly from set_page if you have 
> access
> to pbl_table from your mr context (and if I understand correctly I think you 
> do,
> mr->info.pbl_table)...
I see what you mean, we can surely improve here and will. Thanks.


RE: [RFC 07/11] Add support for memory registeration verbs

2016-09-14 Thread Kalderon, Michal
> >>> +struct qedr_mr *__qedr_alloc_mr(struct ib_pd *ibpd, int
> >>> +max_page_list_len) {
> >>> + struct qedr_pd *pd = get_qedr_pd(ibpd);
> >>> + struct qedr_dev *dev = get_qedr_dev(ibpd->device);
> >>> + struct qedr_mr *mr;
> >>> + int rc = -ENOMEM;
> >>> +
> >>> + DP_VERBOSE(dev, QEDR_MSG_MR,
> >>> +"qedr_alloc_frmr pd = %d max_page_list_len= %d\n", pd-
> >>> pd_id,
> >>> +max_page_list_len);
> >>> +
> >>> + mr = kzalloc(sizeof(*mr), GFP_KERNEL);
> >>> + if (!mr)
> >>> + return ERR_PTR(rc);
> >>> +
> >>> + mr->dev = dev;
> >>> + mr->type = QEDR_MR_FRMR;
> >>> +
> >>> + rc = init_mr_info(dev, &mr->info, max_page_list_len, 1);
> >>> + if (rc)
> >>> + goto err0;
> >>> +
> >>> + rc = dev->ops->rdma_alloc_tid(dev->rdma_ctx, &mr->hw_mr.itid);
> >>> + if (rc) {
> >>> + DP_ERR(dev, "roce alloc tid returned an error %d\n", rc);
> >>> + goto err0;
> >>> + }
> >>> +
> >>> + /* Index only, 18 bit long, lkey = itid << 8 | key */
> >>> + mr->hw_mr.tid_type = QED_RDMA_TID_FMR;
> >>> + mr->hw_mr.key = 0;
> >>> + mr->hw_mr.pd = pd->pd_id;
> >>
> >> Do you have a real MR<->PD association in HW? If so, can you point me
> >> to the code that binds it? If not, any reason not to expose the
> local_dma_lkey?
> >>
> > Yes, we send the pd id to the FW in function qed_rdma_register_tid.
> 
> Right, thanks.
> 
> > In any case, if we didn't have the association in HW Wouldn't the
> > local_dma_lkey be relevant only to dma_mr ? ( code snippet above
> > refers to FMR)
> 
> I was just sticking to a location in the code where you associate MR<->PD...
> 
> The local_dma_lkey is a special key that spans the entire memory space and
> unlike the notorious dma_mr, its not associated with a PD.
> 
> See the code in ib_alloc_pd(), if the device does not support a single device
> local dma lkey, the core allocates a dma mr associated with the pd. If your
> device has such a key, you can save a dma mr allocation for each pd in the
> system.
Managed to miss this. Our device supports such a key, we'll add support. 


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Pablo Neira Ayuso
On Tue, Sep 13, 2016 at 09:42:19PM -0700, Alexei Starovoitov wrote:
[...]
> For us this cgroup+bpf is _not_ for filterting and _not_ for security.

If your goal is monitoring, then convert these hooks not to allow to
issue a verdict on the packet, so this becomes inoquous in the same
fashion as the tracing infrastructure.

[...]
> I'd really love to have an alternative to bpf for such tasks,
> but you seem to spend all the energy arguing against bpf whereas
> nft still has a lot to be desired.

Please Alexei, stop that FUD. Anyone that has spent just one day using
the bpf tooling and infrastructure knows you have problems to
resolve...


[PATCH V2 0/3] net-next: dsa: add QCA8K support

2016-09-14 Thread John Crispin
This series is based on the AR8xxx series posted by Matthieu Olivari in may
2015. The following changes were made since then

* fixed the nitpicks from the previous review
* updated to latest API
* turned it into an mdio device
* added callbacks for fdb, bridge offloading, stp, eee, port status
* fixed several minor issues to the port setup and arp learning
* changed the namespacing as this driver to qca8k

The driver has so far only been tested on qca8337/N. It should work on other QCA
switches such as the qca8327 with minor changes.

John Crispin (3):
  Documentation: devicetree: add qca8k binding
  net-next: dsa: add Qualcomm tag RX/TX handler
  net-next: dsa: add new driver for qca8xxx family

 .../devicetree/bindings/net/dsa/qca8k.txt  |   88 ++
 drivers/net/dsa/Kconfig|9 +
 drivers/net/dsa/Makefile   |1 +
 drivers/net/dsa/qca8k.c|  968 
 drivers/net/dsa/qca8k.h|  180 
 include/net/dsa.h  |1 +
 net/dsa/Kconfig|3 +
 net/dsa/Makefile   |1 +
 net/dsa/dsa.c  |3 +
 net/dsa/dsa_priv.h |2 +
 net/dsa/tag_qca.c  |  138 +++
 11 files changed, 1394 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/dsa/qca8k.txt
 create mode 100644 drivers/net/dsa/qca8k.c
 create mode 100644 drivers/net/dsa/qca8k.h
 create mode 100644 net/dsa/tag_qca.c

-- 
1.7.10.4



[PATCH V2 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-14 Thread John Crispin
This patch contains initial support for the QCA8337 switch. It
will detect a QCA8337 switch, if present and declared in the DT.

Each port will be represented through a standalone net_device interface,
as for other DSA switches. CPU can communicate with any of the ports by
setting an IP@ on ethN interface. Most of the extra callbacks of the DSA
subsystem are already supported, such as bridge offloading, stp, fdb.

Signed-off-by: John Crispin 
---
Changes in V2
* add proper locking for the FDB table
* remove udelay when changing the page. neither datasheet nore SDK code
  requires a sleep
* add a cond_resched to the busy wait loop
* use nested locking when accessing the mdio bus
* remove the phy_to_port() wrappers
* remove mmd access function and use existing phy helpers
* fix a copy/paste bug breaking the eee callbacks
* use default vid 1 when fdb entries are added fro vid 0
* remove the phy id check and add a switch id check instead
* add error handling to the mdio read/write functions
* remove inline usage

 drivers/net/dsa/Kconfig  |9 +
 drivers/net/dsa/Makefile |1 +
 drivers/net/dsa/qca8k.c  |  968 ++
 drivers/net/dsa/qca8k.h  |  180 +
 4 files changed, 1158 insertions(+)
 create mode 100644 drivers/net/dsa/qca8k.c
 create mode 100644 drivers/net/dsa/qca8k.h

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index de6d044..0659846 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -25,4 +25,13 @@ source "drivers/net/dsa/b53/Kconfig"
 
 source "drivers/net/dsa/mv88e6xxx/Kconfig"
 
+config NET_DSA_QCA8K
+   tristate "Qualcomm Atheros QCA8K Ethernet switch family support"
+   depends on NET_DSA
+   select NET_DSA_TAG_QCA
+   select REGMAP
+   ---help---
+ This enables support for the Qualcomm Atheros QCA8K Ethernet
+ switch chips.
+
 endmenu
diff --git a/drivers/net/dsa/Makefile b/drivers/net/dsa/Makefile
index ca1e71b..8346e4f 100644
--- a/drivers/net/dsa/Makefile
+++ b/drivers/net/dsa/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_NET_DSA_MV88E6060) += mv88e6060.o
 obj-$(CONFIG_NET_DSA_BCM_SF2)  += bcm_sf2.o
+obj-$(CONFIG_NET_DSA_QCA8K)+= qca8k.o
 
 obj-y  += b53/
 obj-y  += mv88e6xxx/
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
new file mode 100644
index 000..76a550f
--- /dev/null
+++ b/drivers/net/dsa/qca8k.c
@@ -0,0 +1,968 @@
+/*
+ * Copyright (C) 2009 Felix Fietkau 
+ * Copyright (C) 2011-2012 Gabor Juhos 
+ * Copyright (c) 2015, The Linux Foundation. All rights reserved.
+ * Copyright (c) 2016 John Crispin 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "qca8k.h"
+
+#define MIB_DESC(_s, _o, _n)   \
+   {   \
+   .size = (_s),   \
+   .offset = (_o), \
+   .name = (_n),   \
+   }
+
+static const struct qca8k_mib_desc ar8327_mib[] = {
+   MIB_DESC(1, 0x00, "RxBroad"),
+   MIB_DESC(1, 0x04, "RxPause"),
+   MIB_DESC(1, 0x08, "RxMulti"),
+   MIB_DESC(1, 0x0c, "RxFcsErr"),
+   MIB_DESC(1, 0x10, "RxAlignErr"),
+   MIB_DESC(1, 0x14, "RxRunt"),
+   MIB_DESC(1, 0x18, "RxFragment"),
+   MIB_DESC(1, 0x1c, "Rx64Byte"),
+   MIB_DESC(1, 0x20, "Rx128Byte"),
+   MIB_DESC(1, 0x24, "Rx256Byte"),
+   MIB_DESC(1, 0x28, "Rx512Byte"),
+   MIB_DESC(1, 0x2c, "Rx1024Byte"),
+   MIB_DESC(1, 0x30, "Rx1518Byte"),
+   MIB_DESC(1, 0x34, "RxMaxByte"),
+   MIB_DESC(1, 0x38, "RxTooLong"),
+   MIB_DESC(2, 0x3c, "RxGoodByte"),
+   MIB_DESC(2, 0x44, "RxBadByte"),
+   MIB_DESC(1, 0x4c, "RxOverFlow"),
+   MIB_DESC(1, 0x50, "Filtered"),
+   MIB_DESC(1, 0x54, "TxBroad"),
+   MIB_DESC(1, 0x58, "TxPause"),
+   MIB_DESC(1, 0x5c, "TxMulti"),
+   MIB_DESC(1, 0x60, "TxUnderRun"),
+   MIB_DESC(1, 0x64, "Tx64Byte"),
+   MIB_DESC(1, 0x68, "Tx128Byte"),
+   MIB_DESC(1, 0x6c, "Tx256Byte"),
+   MIB_DESC(1, 0x70, "Tx512Byte"),
+   MIB_DESC(1, 0x74, "Tx1024Byte"),
+   MIB_DESC(1, 0x78, "Tx1518Byte"),
+   MIB_DESC(1, 0x7c, "TxMaxByte"),
+   MIB_DESC(1, 0x80, "TxOverSize"),
+   MIB_DESC(2, 0x84, "TxByte"),
+   MIB_DESC(1, 0x8c, "TxCollision"),
+   MIB_DESC(1, 0x90, "TxAbortCol"),
+   MIB_DESC(1, 0x94, "TxMultiCol"),
+   MIB_DESC(1, 0x98, "TxSingleCol"),
+   MIB_DESC(1, 0x9c, "TxExcDefer"),
+   MIB_DESC(1, 0xa0, "

[PATCH V2 1/3] Documentation: devicetree: add qca8k binding

2016-09-14 Thread John Crispin
Add device-tree binding for ar8xxx switch families.

Cc: devicet...@vger.kernel.org
Signed-off-by: John Crispin 
---
Changes in V2
* fixup ecample to include phy nodes and corresponding phandles
* add a note explaining why we need to phy nodes

 .../devicetree/bindings/net/dsa/qca8k.txt  |   88 
 1 file changed, 88 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/dsa/qca8k.txt

diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.txt 
b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
new file mode 100644
index 000..2c1582a
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
@@ -0,0 +1,88 @@
+* Qualcomm Atheros QCA8xxx switch family
+
+Required properties:
+
+- compatible: should be "qca,qca8337"
+- #size-cells: must be 0
+- #address-cells: must be 1
+
+Subnodes:
+
+The integrated switch subnode should be specified according to the binding
+described in dsa/dsa.txt. As the QCA8K switches do not have a N:N mapping of
+port and PHY id, each subnode describing a port needs to have a valid phandle
+referencing the internal PHY connected to it.
+
+Example:
+
+
+   &mdio0 {
+   phy_port1: phy@0 {
+   reg = <0>;
+   };
+
+   phy_port2: phy@1 {
+   reg = <1>;
+   };
+
+   phy_port3: phy@2 {
+   reg = <2>;
+   };
+
+   phy_port4: phy@3 {
+   reg = <3>;
+   };
+
+   phy_port5: phy@4 {
+   reg = <4>;
+   };
+
+   switch0@0 {
+   compatible = "qca,qca8337";
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   reg = <0>;
+
+   ports {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   port@0 {
+   reg = <0>;
+   label = "cpu";
+   ethernet = <&gmac1>;
+   phy-mode = "rgmii";
+   };
+
+   port@1 {
+   reg = <1>;
+   label = "lan1";
+   phy-handle = <&phy_port1>;
+   };
+
+   port@2 {
+   reg = <2>;
+   label = "lan2";
+   phy-handle = <&phy_port2>;
+   };
+
+   port@3 {
+   reg = <3>;
+   label = "lan3";
+   phy-handle = <&phy_port3>;
+   };
+
+   port@4 {
+   reg = <4>;
+   label = "lan4";
+   phy-handle = <&phy_port4>;
+   };
+
+   port@5 {
+   reg = <5>;
+   label = "wan";
+   phy-handle = <&phy_port5>;
+   };
+   };
+   };
+   };
-- 
1.7.10.4



[PATCH V2 2/3] net-next: dsa: add Qualcomm tag RX/TX handler

2016-09-14 Thread John Crispin
Add support for the 2-bytes Qualcomm tag that gigabit switches such as
the QCA8337/N might insert when receiving packets, or that we need
to insert while targeting specific switch ports. The tag is inserted
directly behind the ethernet header.

Signed-off-by: John Crispin 
---
* fix some comments
* remove dead code
* rename variable from phy->reg

 include/net/dsa.h  |1 +
 net/dsa/Kconfig|3 ++
 net/dsa/Makefile   |1 +
 net/dsa/dsa.c  |3 ++
 net/dsa/dsa_priv.h |2 +
 net/dsa/tag_qca.c  |  138 
 6 files changed, 148 insertions(+)
 create mode 100644 net/dsa/tag_qca.c

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 24ee961..7fdd63e 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -26,6 +26,7 @@ enum dsa_tag_protocol {
DSA_TAG_PROTO_TRAILER,
DSA_TAG_PROTO_EDSA,
DSA_TAG_PROTO_BRCM,
+   DSA_TAG_PROTO_QCA,
DSA_TAG_LAST,   /* MUST BE LAST */
 };
 
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index ff7736f..96e47c5 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -38,4 +38,7 @@ config NET_DSA_TAG_EDSA
 config NET_DSA_TAG_TRAILER
bool
 
+config NET_DSA_TAG_QCA
+   bool
+
 endif
diff --git a/net/dsa/Makefile b/net/dsa/Makefile
index 8af4ded..a3380ed 100644
--- a/net/dsa/Makefile
+++ b/net/dsa/Makefile
@@ -7,3 +7,4 @@ dsa_core-$(CONFIG_NET_DSA_TAG_BRCM) += tag_brcm.o
 dsa_core-$(CONFIG_NET_DSA_TAG_DSA) += tag_dsa.o
 dsa_core-$(CONFIG_NET_DSA_TAG_EDSA) += tag_edsa.o
 dsa_core-$(CONFIG_NET_DSA_TAG_TRAILER) += tag_trailer.o
+dsa_core-$(CONFIG_NET_DSA_TAG_QCA) += tag_qca.o
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index d8d267e..66e31ac 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -54,6 +54,9 @@ const struct dsa_device_ops *dsa_device_ops[DSA_TAG_LAST] = {
 #ifdef CONFIG_NET_DSA_TAG_BRCM
[DSA_TAG_PROTO_BRCM] = &brcm_netdev_ops,
 #endif
+#ifdef CONFIG_NET_DSA_TAG_QCA
+   [DSA_TAG_PROTO_QCA] = &qca_netdev_ops,
+#endif
[DSA_TAG_PROTO_NONE] = &none_ops,
 };
 
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 00077a9..6cfd738 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -81,5 +81,7 @@ extern const struct dsa_device_ops trailer_netdev_ops;
 /* tag_brcm.c */
 extern const struct dsa_device_ops brcm_netdev_ops;
 
+/* tag_qca.c */
+extern const struct dsa_device_ops qca_netdev_ops;
 
 #endif
diff --git a/net/dsa/tag_qca.c b/net/dsa/tag_qca.c
new file mode 100644
index 000..0c90cac
--- /dev/null
+++ b/net/dsa/tag_qca.c
@@ -0,0 +1,138 @@
+/*
+ * Copyright (c) 2015, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include "dsa_priv.h"
+
+#define QCA_HDR_LEN2
+#define QCA_HDR_VERSION0x2
+
+#define QCA_HDR_RECV_VERSION_MASK  GENMASK(15, 14)
+#define QCA_HDR_RECV_VERSION_S 14
+#define QCA_HDR_RECV_PRIORITY_MASK GENMASK(13, 11)
+#define QCA_HDR_RECV_PRIORITY_S11
+#define QCA_HDR_RECV_TYPE_MASK GENMASK(10, 6)
+#define QCA_HDR_RECV_TYPE_S6
+#define QCA_HDR_RECV_FRAME_IS_TAGGED   BIT(3)
+#define QCA_HDR_RECV_SOURCE_PORT_MASK  GENMASK(2, 0)
+
+#define QCA_HDR_XMIT_VERSION_MASK  GENMASK(15, 14)
+#define QCA_HDR_XMIT_VERSION_S 14
+#define QCA_HDR_XMIT_PRIORITY_MASK GENMASK(13, 11)
+#define QCA_HDR_XMIT_PRIORITY_S	11
+#define QCA_HDR_XMIT_CONTROL_MASK  GENMASK(10, 8)
+#define QCA_HDR_XMIT_CONTROL_S 8
+#define QCA_HDR_XMIT_FROM_CPU  BIT(7)
+#define QCA_HDR_XMIT_DP_BIT_MASK   GENMASK(6, 0)
+
+static struct sk_buff *qca_tag_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   u16 *phdr, hdr;
+
+   dev->stats.tx_packets++;
+   dev->stats.tx_bytes += skb->len;
+
+   if (skb_cow_head(skb, 0) < 0)
+   goto out_free;
+
+   skb_push(skb, QCA_HDR_LEN);
+
+   memmove(skb->data, skb->data + QCA_HDR_LEN, 2 * ETH_ALEN);
+   phdr = (u16 *)(skb->data + 2 * ETH_ALEN);
+
+   /* Set the version field, and set destination port information */
+   hdr = QCA_HDR_VERSION << QCA_HDR_XMIT_VERSION_S |
+   QCA_HDR_XMIT_FROM_CPU |
+   BIT(p->port);
+
+   *phdr = htons(hdr);
+
+   return skb;
+
+out_free:
+   kfree_skb(skb);
+   return NULL;
+}
+
+static int qca_tag_rcv(struct sk_buff *skb, struct net_device *dev,
+  struct packet_type *pt, struct net_device *orig_dev)
+{
+   struct dsa_switch_tree *dst = de
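The receive handler is truncated above, but the QCA_HDR_RECV_* masks already define the header layout. Below is a minimal userspace sketch of how such a 2-byte header would be decoded (hypothetical helper names; the numeric masks are written out because GENMASK() is kernel-only):

```c
#include <assert.h>
#include <stdint.h>

/* Field layout mirrors the QCA_HDR_RECV_* masks in tag_qca.c above. */
#define QCA_HDR_RECV_VERSION_MASK	0xc000u	/* GENMASK(15, 14) */
#define QCA_HDR_RECV_VERSION_S		14
#define QCA_HDR_RECV_FRAME_IS_TAGGED	(1u << 3)
#define QCA_HDR_RECV_SOURCE_PORT_MASK	0x0007u	/* GENMASK(2, 0) */

/* Extract the 2-bit header version from the upper bits. */
static unsigned int qca_hdr_version(uint16_t hdr)
{
	return (hdr & QCA_HDR_RECV_VERSION_MASK) >> QCA_HDR_RECV_VERSION_S;
}

/* Extract the 3-bit source port from the lowest bits. */
static unsigned int qca_hdr_source_port(uint16_t hdr)
{
	return hdr & QCA_HDR_RECV_SOURCE_PORT_MASK;
}

/* Check whether the frame carried a tag at all. */
static int qca_hdr_is_tagged(uint16_t hdr)
{
	return !!(hdr & QCA_HDR_RECV_FRAME_IS_TAGGED);
}
```

The rcv path would presumably reject frames whose version field does not equal QCA_HDR_VERSION before trusting the port bits.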

Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Thomas Graf
On 09/14/16 at 12:30pm, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 09:42:19PM -0700, Alexei Starovoitov wrote:
> [...]
> > For us this cgroup+bpf is _not_ for filtering and _not_ for security.
> 
> If your goal is monitoring, then convert these hooks not to allow
> issuing a verdict on the packet, so this becomes innocuous in the same
> fashion as the tracing infrastructure.

Why? How is this at all offensive? We have three parties voicing
interest in this work for both monitoring and security. At least
two specific use cases have been described.  It builds on top of
existing infrastructure and nicely complements other ongoing work.
Why not both?


Re: [PATCH v4 01/16] vmxnet3: Move PCI Id to pci_ids.h

2016-09-14 Thread Yuval Shaia
Please update vmxnet3_drv.c accordingly.

Yuval

On Sun, Sep 11, 2016 at 09:49:11PM -0700, Adit Ranadive wrote:
> The VMXNet3 PCI Id will be shared with our paravirtual RDMA driver.
> Moved it to the shared location in pci_ids.h.
> 
> Suggested-by: Leon Romanovsky 
> Signed-off-by: Adit Ranadive 
> ---
> ---
>  drivers/net/vmxnet3/vmxnet3_int.h | 3 +--
>  include/linux/pci_ids.h   | 1 +
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/vmxnet3/vmxnet3_int.h 
> b/drivers/net/vmxnet3/vmxnet3_int.h
> index 74fc030..2bd6bf8 100644
> --- a/drivers/net/vmxnet3/vmxnet3_int.h
> +++ b/drivers/net/vmxnet3/vmxnet3_int.h
> @@ -119,9 +119,8 @@ enum {
>  };
>  
>  /*
> - * PCI vendor and device IDs.
> + * Maximum devices supported.
>   */
> -#define PCI_DEVICE_ID_VMWARE_VMXNET30x07B0
>  #define MAX_ETHERNET_CARDS   10
>  #define MAX_PCI_PASSTHRU_DEVICE  6
>  
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index c58752f..98bb455 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -2251,6 +2251,7 @@
>  #define PCI_DEVICE_ID_RASTEL_2PORT   0x2000
>  
>  #define PCI_VENDOR_ID_VMWARE 0x15ad
> +#define PCI_DEVICE_ID_VMWARE_VMXNET3 0x07b0
>  
>  #define PCI_VENDOR_ID_ZOLTRIX0x15b0
>  #define PCI_DEVICE_ID_ZOLTRIX_2BD0   0x2bd0
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] net/mlx4_en: fix off by one in error handling

2016-09-14 Thread Sebastian Ott
If an error occurs in mlx4_init_eq_table the index used in the
err_out_unmap label is one too big which results in a panic in
mlx4_free_eq. This patch fixes the index in the error path.

Signed-off-by: Sebastian Ott 
---
 drivers/net/ethernet/mellanox/mlx4/eq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c 
b/drivers/net/ethernet/mellanox/mlx4/eq.c
index f613977..cf8f8a7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -1305,8 +1305,8 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
return 0;
 
 err_out_unmap:
-   while (i >= 0)
-   mlx4_free_eq(dev, &priv->eq_table.eq[i--]);
+   while (i > 0)
+   mlx4_free_eq(dev, &priv->eq_table.eq[--i]);
 #ifdef CONFIG_RFS_ACCEL
for (i = 1; i <= dev->caps.num_ports; i++) {
if (mlx4_priv(dev)->port[i].rmap) {
-- 
2.5.5
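The fix changes the meaning of i in the error path: it now counts the entries that were successfully initialized (the entry at index i itself failed), and the loop frees exactly those. A minimal userspace model of the two loop forms (mock names, not the driver code):

```c
#include <assert.h>

/* Records which indices were "freed". */
static int freed[8];

static void mock_free_eq(int idx)
{
	freed[idx]++;
}

/* Buggy form: also frees index i, which was never initialized. */
static void cleanup_buggy(int i)
{
	while (i >= 0)
		mock_free_eq(i--);
}

/* Fixed form, as in the patch: frees exactly indices 0..i-1. */
static void cleanup_fixed(int i)
{
	while (i > 0)
		mock_free_eq(--i);
}
```

With i entries initialized before the failure, cleanup_fixed(i) touches only those i entries, while cleanup_buggy(i) additionally "frees" the uninitialized entry at index i — the panic described in the changelog.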



Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Daniel Mack
Hi Pablo,

On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
>> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
>>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
 This is v5 of the patch set to allow eBPF programs for network
 filtering and accounting to be attached to cgroups, so that they apply
 to all sockets of all tasks placed in that cgroup. The logic also
 allows to be extended for other cgroup based eBPF logic.
>>>
>>> 1) This infrastructure can only be useful to systemd, or any similar
>>>orchestration daemon. Look, you can only apply filtering policies
>>>to processes that are launched by systemd, so this only works
>>>for server processes.
>>
>> Sorry, but both statements aren't true. The eBPF policies apply to every
>> process that is placed in a cgroup, and my example program in 6/6 shows
>> how that can be done from the command line.
> 
> Then you have to explain to me how anyone other than systemd can use
> this infrastructure?

I have no idea what makes you think this is limited to systemd. As I
said, I provided an example for userspace that works from the command
line. The same limitations apply as for all other users of cgroups.

> My main point is that those processes *need* to be launched by the
> orchestrator, which is what I was referring to as 'server processes'.

Yes, that's right. But as I said, this rule applies to many other kernel
concepts, so I don't see any real issue.

>> That's a limitation that applies to many more control mechanisms in the
>> kernel, and it's something that can easily be solved with fork+exec.
> 
> As long as you have control to launch the processes yes, but this
> will not work in other scenarios. Just like cgroup net_cls and friends
> are broken for filtering for things that you have no control to
> fork+exec.

Probably, but that's only solvable with rules that store the full cgroup
path then, and do a string comparison (!) for each packet flying by.

>> That's just as transparent as SO_ATTACH_FILTER. What kind of
>> introspection mechanism do you have in mind?
> 
> SO_ATTACH_FILTER is called from the process itself, so this is a local
> filtering policy that you apply to your own process.

Not necessarily. You can as well do it the inetd way, and pass the
socket to a process that is launched on demand, but do SO_ATTACH_FILTER
+ SO_LOCK_FILTER  in the middle. What happens with payload on the socket
is not transparent to the launched binary at all. The proposed cgroup
eBPF solution implements a very similar behavior in that regard.

>> It's about filtering outgoing network packets of applications, and
>> providing them with L2 information for filtering purposes. I don't think
>> that's a very specific use-case.
>>
>> When the feature is not used at all, the added costs on the output path
>> are close to zero, due to the use of static branches.
> 
> *You're proposing a socket filtering facility that hooks layer 2
> output path*!

As I said, I'm open to discussing that. In order to make it work for L3,
the LL_OFF issues need to be solved, as Daniel explained. Daniel,
Alexei, any idea how much work that would be?

> That is only a rough ~30 lines kernel patchset to support this in
> netfilter and only one extra input hook, with potential access to
> conntrack and better integration with other existing subsystems.

Care to share the patches for that? I'd really like to have a look.

And FWIW, I agree with Thomas - there is nothing wrong with having
multiple options to use for such use-cases.


Thanks,
Daniel



Re: [PATCH v4 06/16] IB/pvrdma: Add paravirtual rdma device

2016-09-14 Thread Yuval Shaia
No more comments.
Reviewed-by: Yuval Shaia 

On Sun, Sep 11, 2016 at 09:49:16PM -0700, Adit Ranadive wrote:
> This patch adds the main device-level structures and functions to be used
> to provide RDMA functionality. Also, we define conversion functions from
> the IB core stack structures to the device-specific ones.
> 
> Reviewed-by: Jorgen Hansen 
> Reviewed-by: George Zhang 
> Reviewed-by: Aditya Sarwade 
> Reviewed-by: Bryan Tan 
> Signed-off-by: Adit Ranadive 
> ---
> Changes v3->v4:
>  - Renamed pvrdma_flush_cqe to _pvrdma_flush_cqe since we hold a lock
>  to call it.
>  - Added wrapper functions for writing to UARs for CQ/QP.
>  - The conversion functions are updated as func_name(dst, src) format.
>  - Renamed max_gs to max_sg.
>  - Added work struct for net device events.
>  - priviledged -> privileged.
> 
> Changes v2->v3:
>  - Removed VMware vendor id redefinition.
>  - Removed the boolean in pvrdma_cmd_post.
> ---
>  drivers/infiniband/hw/pvrdma/pvrdma.h | 473 
> ++
>  1 file changed, 473 insertions(+)
>  create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma.h
> 
> diff --git a/drivers/infiniband/hw/pvrdma/pvrdma.h 
> b/drivers/infiniband/hw/pvrdma/pvrdma.h
> new file mode 100644
> index 000..fedd7cb
> --- /dev/null
> +++ b/drivers/infiniband/hw/pvrdma/pvrdma.h
> @@ -0,0 +1,473 @@
> +/*
> + * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of EITHER the GNU General Public License
> + * version 2 as published by the Free Software Foundation or the BSD
> + * 2-Clause License. This program is distributed in the hope that it
> + * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
> + * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU General Public License version 2 for more details at
> + * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program available in the file COPYING in the main
> + * directory of this source tree.
> + *
> + * The BSD 2-Clause License
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + *  - Redistributions of source code must retain the above
> + *copyright notice, this list of conditions and the following
> + *disclaimer.
> + *
> + *  - Redistributions in binary form must reproduce the above
> + *copyright notice, this list of conditions and the following
> + *disclaimer in the documentation and/or other materials
> + *provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> + * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> + * OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef __PVRDMA_H__
> +#define __PVRDMA_H__
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "pvrdma_defs.h"
> +#include "pvrdma_dev_api.h"
> +#include "pvrdma_verbs.h"
> +
> +/* NOT the same as BIT_MASK(). */
> +#define PVRDMA_MASK(n) ((n << 1) - 1)
> +
> +/*
> + * VMware PVRDMA PCI device id.
> + */
> +#define PCI_DEVICE_ID_VMWARE_PVRDMA  0x0820
> +
> +struct pvrdma_dev;
> +
> +struct pvrdma_page_dir {
> + dma_addr_t dir_dma;
> + u64 *dir;
> + int ntables;
> + u64 **tables;
> + u64 npages;
> + void **pages;
> +};
> +
> +struct pvrdma_cq {
> + struct ib_cq ibcq;
> + int offset;
> + spinlock_t cq_lock; /* Poll lock. */
> + struct pvrdma_uar_map *uar;
> + struct ib_umem *umem;
> + struct pvrdma_ring_state *ring_state;
> + struct pvrdma_page_dir pdir;
> + u32 cq_handle;
> + bool is_kernel;
> + atomic_t refcnt;
> + wait_queue_head_t wait;
> +};
> +
> +struct pvrdma_id_table {
> + u32 last;
> + u32 top;
> + u32 max;
> + u32 mask;
> + spinlock_t lock; /* Table lock. */
> + unsigned long *table;
> +};
> +
> +struct pvrdma_uar_map {
> + unsigned long pfn;
> + void __iomem *map;
> + int index;
> +};
> +
> +struct pvrdma_uar

Re: [PATCH v4 07/16] IB/pvrdma: Add helper functions

2016-09-14 Thread Yuval Shaia
No more comments.
Reviewed-by: Yuval Shaia 

On Sun, Sep 11, 2016 at 09:49:17PM -0700, Adit Ranadive wrote:
> This patch adds helper functions to store guest page addresses in a page
> directory structure. The page directory pointer is passed down to the
> backend which then maps the entire memory for the RDMA object by
> traversing the directory. We add some more helper functions for converting
> to/from RDMA stack address handles from/to PVRDMA ones.
> 
> Reviewed-by: Jorgen Hansen 
> Reviewed-by: George Zhang 
> Reviewed-by: Aditya Sarwade 
> Reviewed-by: Bryan Tan 
> Signed-off-by: Adit Ranadive 
> ---
> Changes v3->v4:
>  - Updated conversion functions to func_name(dst, src) format.
>  - Removed unneeded local variables.
> ---
>  drivers/infiniband/hw/pvrdma/pvrdma_misc.c | 303 
> +
>  1 file changed, 303 insertions(+)
>  create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_misc.c
> 
> diff --git a/drivers/infiniband/hw/pvrdma/pvrdma_misc.c 
> b/drivers/infiniband/hw/pvrdma/pvrdma_misc.c
> new file mode 100644
> index 000..1f12cd6
> --- /dev/null
> +++ b/drivers/infiniband/hw/pvrdma/pvrdma_misc.c
> @@ -0,0 +1,303 @@
> +/*
> + * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of EITHER the GNU General Public License
> + * version 2 as published by the Free Software Foundation or the BSD
> + * 2-Clause License. This program is distributed in the hope that it
> + * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
> + * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU General Public License version 2 for more details at
> + * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program available in the file COPYING in the main
> + * directory of this source tree.
> + *
> + * The BSD 2-Clause License
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + *  - Redistributions of source code must retain the above
> + *copyright notice, this list of conditions and the following
> + *disclaimer.
> + *
> + *  - Redistributions in binary form must reproduce the above
> + *copyright notice, this list of conditions and the following
> + *disclaimer in the documentation and/or other materials
> + *provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> + * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> + * OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "pvrdma.h"
> +
> +int pvrdma_page_dir_init(struct pvrdma_dev *dev, struct pvrdma_page_dir 
> *pdir,
> +  u64 npages, bool alloc_pages)
> +{
> + u64 i;
> +
> + if (npages > PVRDMA_PAGE_DIR_MAX_PAGES)
> + return -EINVAL;
> +
> + memset(pdir, 0, sizeof(*pdir));
> +
> + pdir->dir = dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE,
> +&pdir->dir_dma, GFP_KERNEL);
> + if (!pdir->dir)
> + goto err;
> +
> + pdir->ntables = PVRDMA_PAGE_DIR_TABLE(npages - 1) + 1;
> + pdir->tables = kcalloc(pdir->ntables, sizeof(*pdir->tables),
> +GFP_KERNEL);
> + if (!pdir->tables)
> + goto err;
> +
> + for (i = 0; i < pdir->ntables; i++) {
> + pdir->tables[i] = dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE,
> +  &pdir->dir[i], GFP_KERNEL);
> + if (!pdir->tables[i])
> + goto err;
> + }
> +
> + pdir->npages = npages;
> +
> + if (alloc_pages) {
> + pdir->pages = kcalloc(npages, sizeof(*pdir->pages),
> +   GFP_KERNEL);
> + if (!pdir->pages)
> + goto err;
> +
> + for (i = 0; i < pdir->npages; i++) {
> + dma_addr_t page_dma;
> +
> + pdir->pages[i] = dma_alloc_coherent(&dev->pdev->d
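The exact PVRDMA_PAGE_DIR_TABLE() constants are not shown in this excerpt; assuming 4 KiB tables of 512 u64 entries each, the two-level indexing used by pvrdma_page_dir_init() above can be sketched like this (hypothetical helper names):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: a 4 KiB table holds 512 u64 page addresses, and
 * the directory holds one u64 entry per table. */
#define PD_ENTRIES_PER_TABLE 512

/* Which table a given page index lives in. */
static uint64_t pd_table_index(uint64_t page)
{
	return page / PD_ENTRIES_PER_TABLE;
}

/* Slot of the page within its table. */
static uint64_t pd_page_index(uint64_t page)
{
	return page % PD_ENTRIES_PER_TABLE;
}

/* Number of tables needed for npages (npages > 0), matching the
 * quoted "PVRDMA_PAGE_DIR_TABLE(npages - 1) + 1" expression. */
static uint64_t pd_ntables(uint64_t npages)
{
	return pd_table_index(npages - 1) + 1;
}
```

The backend then walks dir -> tables -> pages to map the whole RDMA object from the single dir_dma address passed down.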

Re: [PATCH v3 net 1/1] net sched actions: fix GETing actions

2016-09-14 Thread Jamal Hadi Salim

On 16-09-13 03:47 PM, Jamal Hadi Salim wrote:

On 16-09-13 12:20 PM, Cong Wang wrote:

On Mon, Sep 12, 2016 at 4:07 PM, Jamal Hadi Salim 
wrote:


[..]

I am still trying to understand this piece, so here you hold the refcnt
for the same action used by the later iteration? Otherwise there is
almost none user inbetween hold and release...

The comment you add is not clear to me, we use RTNL/RCU to
sync destroy and replace, so how could that happen?



I was worried about the destroy() hitting an error in that function.
If an action already existed and all we asked for was to
replace some attribute, it would be deleted. That was the way the code was
before your changes, so I just restored it to its original form.



And I have verified this is needed. I went and made gact return
a failure if you replace something. I added a gact action; then
when I replaced it, the operation failed. And when it failed, the existing
action was deleted.
I then tried another experiment and batch-replaced several actions,
including the one I knew would fail. I placed the failing action in
the middle and hallelujah, all the actions before the middle one got
deleted.

So please ACK this so we can move forward.

cheers,
jamal


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Daniel Borkmann

On 09/14/2016 12:30 PM, Pablo Neira Ayuso wrote:

On Tue, Sep 13, 2016 at 09:42:19PM -0700, Alexei Starovoitov wrote:
[...]

For us this cgroup+bpf is _not_ for filtering and _not_ for security.


If your goal is monitoring, then convert these hooks not to allow
issuing a verdict on the packet, so this becomes innocuous in the same
fashion as the tracing infrastructure.

[...]

I'd really love to have an alternative to bpf for such tasks,
but you seem to spend all the energy arguing against bpf whereas
nft still has a lot to be desired.


Please Alexei, stop that FUD. Anyone that has spent just one day using
the bpf tooling and infrastructure knows you have problems to
resolve...


Not quite sure on the spreading of FUD, but sounds like we should all
get back to technical things to resolve. ;)


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Daniel Borkmann

On 09/14/2016 01:13 PM, Daniel Mack wrote:

On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:

On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:

On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:

On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:

This is v5 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic also
allows to be extended for other cgroup based eBPF logic.


1) This infrastructure can only be useful to systemd, or any similar
orchestration daemon. Look, you can only apply filtering policies
to processes that are launched by systemd, so this only works
for server processes.


Sorry, but both statements aren't true. The eBPF policies apply to every
process that is placed in a cgroup, and my example program in 6/6 shows
how that can be done from the command line.


Then you have to explain to me how anyone other than systemd can use
this infrastructure?


I have no idea what makes you think this is limited to systemd. As I
said, I provided an example for userspace that works from the command
line. The same limitations apply as for all other users of cgroups.


My main point is that those processes *need* to be launched by the
orchestrator, which is what I was referring to as 'server processes'.


Yes, that's right. But as I said, this rule applies to many other kernel
concepts, so I don't see any real issue.


That's a limitation that applies to many more control mechanisms in the
kernel, and it's something that can easily be solved with fork+exec.


As long as you have control to launch the processes yes, but this
will not work in other scenarios. Just like cgroup net_cls and friends
are broken for filtering for things that you have no control to
fork+exec.


Probably, but that's only solvable with rules that store the full cgroup
path then, and do a string comparison (!) for each packet flying by.


That's just as transparent as SO_ATTACH_FILTER. What kind of
introspection mechanism do you have in mind?


SO_ATTACH_FILTER is called from the process itself, so this is a local
filtering policy that you apply to your own process.


Not necessarily. You can as well do it the inetd way, and pass the
socket to a process that is launched on demand, but do SO_ATTACH_FILTER
+ SO_LOCK_FILTER  in the middle. What happens with payload on the socket
is not transparent to the launched binary at all. The proposed cgroup
eBPF solution implements a very similar behavior in that regard.


It's about filtering outgoing network packets of applications, and
providing them with L2 information for filtering purposes. I don't think
that's a very specific use-case.

When the feature is not used at all, the added costs on the output path
are close to zero, due to the use of static branches.


*You're proposing a socket filtering facility that hooks layer 2
output path*!


As I said, I'm open to discussing that. In order to make it work for L3,
the LL_OFF issues need to be solved, as Daniel explained. Daniel,
Alexei, any idea how much work that would be?


Not much. You simply need to declare your own struct bpf_verifier_ops
with a get_func_proto() handler that handles BPF_FUNC_skb_load_bytes,
and verifier in do_check() loop would need to handle that these ld_abs/
ld_ind are rejected for BPF_PROG_TYPE_CGROUP_SOCKET.


That is only a rough ~30 lines kernel patchset to support this in
netfilter and only one extra input hook, with potential access to
conntrack and better integration with other existing subsystems.


Care to share the patches for that? I'd really like to have a look.

And FWIW, I agree with Thomas - there is nothing wrong with having
multiple options to use for such use-cases.


Thanks,
Daniel





Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-14 Thread Marcelo
Hi Jia,

On Wed, Sep 14, 2016 at 01:58:42PM +0800, hejianet wrote:
> Hi Marcelo
> 
> 
> On 9/13/16 2:57 AM, Marcelo wrote:
> > On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:
> > > This is to use the generic interface snmp_get_cpu_field{,64}_batch to
> > > aggregate the data by going through all the items of each cpu 
> > > sequentially.
> > > Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid 
> > > build
> > > warning "the frame size" larger than 1024 on s390.
> > Yeah about that, did you test it with stack overflow detection?
> > These arrays can be quite large.
> > 
> > One more below..
> Do you think it is acceptable if the stack usage is a little larger than 1024?
> e.g. 1120
> I can't find any other way to reduce the stack usage except use "static" 
> before
> unsigned long buff[TCP_MIB_MAX]
> 
> PS. sizeof buff is about TCP_MIB_MAX(116)*8=928
> B.R.

That's pretty much the question. Linux has the option on some archs to
run with 4Kb (4KSTACKS option), so this function alone would be using
25% of it in this last case. While on x86_64, it uses 16Kb (6538b8ea886e
("x86_64: expand kernel stack to 16K")).

Adding static to it is not an option as it actually makes the variable
shared amongst the CPUs (and then you have concurrency issues), plus the
fact that it's always allocated, even while not in use.

Others here certainly know better than me if it's okay to make such
usage of the stack.

> > > +static int netstat_seq_show_ipext(struct seq_file *seq, void *v)
> > > +{
> > > + int i;
> > > + u64 buff64[IPSTATS_MIB_MAX];
> > > + struct net *net = seq->private;
> > >   seq_puts(seq, "\nIpExt:");
> > >   for (i = 0; snmp4_ipextstats_list[i].name != NULL; i++)
> > >   seq_printf(seq, " %s", snmp4_ipextstats_list[i].name);
> > >   seq_puts(seq, "\nIpExt:");
> > You're missing a memset() call here.

Not sure if you missed this one or not..

Thanks,
Marcelo
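The missing memset() matters because the batch helper accumulates into the caller's buffer. A minimal userspace sketch of the zero-then-accumulate pattern under discussion (mock data layout, not the kernel's percpu API):

```c
#include <string.h>

#define MIB_MAX 4
#define NCPUS   2

/* Stand-in for the kernel's per-cpu MIB counters. */
static unsigned long percpu_mibs[NCPUS][MIB_MAX];

/* Batch aggregation: the buffer must be zeroed first -- this is the
 * point of the missing memset() noted in the review above. Without
 * it, whatever garbage is on the stack gets added to the totals. */
static void snmp_fold_batch(unsigned long *buff)
{
	int cpu, i;

	memset(buff, 0, MIB_MAX * sizeof(*buff));
	for (cpu = 0; cpu < NCPUS; cpu++)
		for (i = 0; i < MIB_MAX; i++)
			buff[i] += percpu_mibs[cpu][i];
}
```

Walking the cpus in the outer loop and the items in the inner loop is what gives the cache-miss reduction the series is after: each cpu's counter block is traversed sequentially.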


Re: [PATCH v4 05/16] IB/pvrdma: Add functions for Verbs support

2016-09-14 Thread Yuval Shaia
On Sun, Sep 11, 2016 at 09:49:15PM -0700, Adit Ranadive wrote:
> +
> +/**
> + * pvrdma_alloc_pd - allocate protection domain
> + * @ibdev: the IB device
> + * @context: user context
> + * @udata: user data
> + *
> + * @return: the ib_pd protection domain pointer on success, otherwise errno.
> + */
> +struct ib_pd *pvrdma_alloc_pd(struct ib_device *ibdev,
> +   struct ib_ucontext *context,
> +   struct ib_udata *udata)
> +{
> + struct pvrdma_pd *pd;
> + struct pvrdma_dev *dev = to_vdev(ibdev);
> + union pvrdma_cmd_req req;
> + union pvrdma_cmd_resp rsp;
> + struct pvrdma_cmd_create_pd *cmd = &req.create_pd;
> + struct pvrdma_cmd_create_pd_resp *resp = &rsp.create_pd_resp;
> + int ret;
> + void *ptr;
> +
> + /* Check allowed max pds */
> + if (!atomic_add_unless(&dev->num_pds, 1, dev->dsr->caps.max_pd))
> + return ERR_PTR(-EINVAL);
> +
> + memset(cmd, 0, sizeof(*cmd));
> + cmd->hdr.cmd = PVRDMA_CMD_CREATE_PD;
> + cmd->ctx_handle = (context) ? to_vucontext(context)->ctx_handle : 0;
> + ret = pvrdma_cmd_post(dev, &req, &rsp);
> + if (ret < 0) {
> + dev_warn(&dev->pdev->dev,
> +  "failed to allocate protection domain, error: %d\n",
> +  ret);
> + ptr = ERR_PTR(ret);
> + goto err;
> + } else if (resp->hdr.ack != PVRDMA_CMD_CREATE_PD_RESP) {
> + dev_warn(&dev->pdev->dev,
> +  "unknown response for allocate protection domain\n");
> + ptr = ERR_PTR(-EFAULT);
> + goto err;
> + }
> +
> + pd = kmalloc(sizeof(*pd), GFP_KERNEL);
> + if (!pd) {
> + ptr = ERR_PTR(-ENOMEM);
> + goto err;
> + }

I know that this was my suggestion but also remember that you raised a
(correct) argument that it is preferable to do the other allocations first
and free them if the command fails, rather than the other way around, where
a failure of memory allocation (like here) forces us to issue the opposite
command (pvrdma_dealloc_pd in this case).

So either accept your way (better) or call pvrdma_dealloc_pd when kmalloc
fails.

> +
> + pd->privileged = !context;
> + pd->pd_handle = resp->pd_handle;
> + pd->pdn = resp->pd_handle;
> +
> + if (context) {
> + if (ib_copy_to_udata(udata, &pd->pdn, sizeof(__u32))) {
> + dev_warn(&dev->pdev->dev,
> +  "failed to copy back protection domain\n");
> + pvrdma_dealloc_pd(&pd->ibpd);
> + return ERR_PTR(-EFAULT);
> + }
> + }
> +
> + /* u32 pd handle */
> + return  &pd->ibpd;
> +
> +err:
> + atomic_dec(&dev->num_pds);
> + return ptr;
> +}


[PATCH 1/2] batman-adv: Add missing refcnt for last_candidate

2016-09-14 Thread Simon Wunderlich
From: Sven Eckelmann 

batadv_find_router dereferences last_bonding_candidate from
orig_node without making sure that it has a valid reference. This reference
has to be retrieved by increasing the reference counter while holding
neigh_list_lock. The lock is required to avoid that
batadv_last_bonding_replace removes the current last_bonding_candidate,
reduces the reference counter and maybe destroys the object in this
process.

Fixes: f3b3d9018975 ("batman-adv: add bonding again")
Signed-off-by: Sven Eckelmann 
Signed-off-by: Marek Lindner 
Signed-off-by: Simon Wunderlich 
---
 net/batman-adv/routing.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c
index 7602c00..3d19947 100644
--- a/net/batman-adv/routing.c
+++ b/net/batman-adv/routing.c
@@ -470,6 +470,29 @@ static int batadv_check_unicast_packet(struct batadv_priv 
*bat_priv,
 }
 
 /**
+ * batadv_last_bonding_get - Get last_bonding_candidate of orig_node
+ * @orig_node: originator node whose last bonding candidate should be retrieved
+ *
+ * Return: last bonding candidate of router or NULL if not found
+ *
+ * The object is returned with refcounter increased by 1.
+ */
+static struct batadv_orig_ifinfo *
+batadv_last_bonding_get(struct batadv_orig_node *orig_node)
+{
+   struct batadv_orig_ifinfo *last_bonding_candidate;
+
+   spin_lock_bh(&orig_node->neigh_list_lock);
+   last_bonding_candidate = orig_node->last_bonding_candidate;
+
+   if (last_bonding_candidate)
+   kref_get(&last_bonding_candidate->refcount);
+   spin_unlock_bh(&orig_node->neigh_list_lock);
+
+   return last_bonding_candidate;
+}
+
+/**
  * batadv_last_bonding_replace - Replace last_bonding_candidate of orig_node
  * @orig_node: originator node whose bonding candidates should be replaced
  * @new_candidate: new bonding candidate or NULL
@@ -539,7 +562,7 @@ batadv_find_router(struct batadv_priv *bat_priv,
 * router - obviously there are no other candidates.
 */
rcu_read_lock();
-   last_candidate = orig_node->last_bonding_candidate;
+   last_candidate = batadv_last_bonding_get(orig_node);
if (last_candidate)
last_cand_router = rcu_dereference(last_candidate->router);
 
@@ -631,6 +654,9 @@ next:
batadv_orig_ifinfo_put(next_candidate);
}
 
+   if (last_candidate)
+   batadv_orig_ifinfo_put(last_candidate);
+
return router;
 }
 
-- 
2.9.3
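The key point of the fix is that the reference must be taken while the pointer is still protected by neigh_list_lock, so a concurrent batadv_last_bonding_replace() cannot drop the last reference in between. A single-threaded userspace model of the pattern (mock types; the real code uses kref and spinlocks):

```c
#include <stddef.h>

/* Mock refcounted object standing in for batadv_orig_ifinfo. */
struct obj {
	int refcount;
};

static struct obj *current_candidate;

/* Model of batadv_last_bonding_get(): read the pointer and bump the
 * refcount inside the same critical section (lock calls shown as
 * comments since this model is single-threaded). */
static struct obj *candidate_get(void)
{
	struct obj *o;

	/* spin_lock_bh(&neigh_list_lock); */
	o = current_candidate;
	if (o)
		o->refcount++;	/* kref_get() */
	/* spin_unlock_bh(&neigh_list_lock); */
	return o;
}

/* Model of batadv_orig_ifinfo_put(): drop the caller's reference. */
static void candidate_put(struct obj *o)
{
	if (o)
		o->refcount--;
}
```

Taking the refcount outside the lock would leave a window where the replace path sees the old pointer unreferenced and frees it — exactly the use-after-free the patch closes.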



[PATCH 2/2] batman-adv: fix elp packet data reservation

2016-09-14 Thread Simon Wunderlich
From: Linus Lüssing 

The skb_reserve() call only reserved headroom for the mac header, but
not the elp packet header itself.

Fix this by skb_put()'ing towards the skb tail instead of
skb_push()'ing towards the skb head.

Fixes: d6f94d91f766 ("batman-adv: ELP - adding basic infrastructure")
Signed-off-by: Linus Lüssing 
Signed-off-by: Marek Lindner 
Signed-off-by: Sven Eckelmann 
Signed-off-by: Simon Wunderlich 
---
 net/batman-adv/bat_v_elp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/batman-adv/bat_v_elp.c b/net/batman-adv/bat_v_elp.c
index 7d17001..ee08540 100644
--- a/net/batman-adv/bat_v_elp.c
+++ b/net/batman-adv/bat_v_elp.c
@@ -335,7 +335,7 @@ int batadv_v_elp_iface_enable(struct batadv_hard_iface 
*hard_iface)
goto out;
 
skb_reserve(hard_iface->bat_v.elp_skb, ETH_HLEN + NET_IP_ALIGN);
-   elp_buff = skb_push(hard_iface->bat_v.elp_skb, BATADV_ELP_HLEN);
+   elp_buff = skb_put(hard_iface->bat_v.elp_skb, BATADV_ELP_HLEN);
elp_packet = (struct batadv_elp_packet *)elp_buff;
memset(elp_packet, 0, BATADV_ELP_HLEN);
 
-- 
2.9.3
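The difference between the two calls: skb_reserve() advances both data and tail, skb_push() moves data back toward the head (consuming the headroom that was reserved for the mac header), while skb_put() advances tail into unused tailroom. A minimal offset-based model (hypothetical helper names, not the real sk_buff API):

```c
/* Offsets into a fixed buffer stand in for the head/data/tail
 * pointers of a freshly allocated skb (head is always offset 0). */
struct mini_skb {
	int data;
	int tail;
	int size;
};

/* skb_reserve(): shift data and tail forward, creating headroom. */
static void skb_reserve_m(struct mini_skb *s, int len)
{
	s->data += len;
	s->tail += len;
}

/* skb_push(): move data back toward the head, eating headroom.
 * Returns the new data offset. */
static int skb_push_m(struct mini_skb *s, int len)
{
	s->data -= len;
	return s->data;
}

/* skb_put(): grow the buffer at the tail, eating tailroom.
 * Returns the old tail offset (start of the new area). */
static int skb_put_m(struct mini_skb *s, int len)
{
	int old_tail = s->tail;

	s->tail += len;
	return old_tail;
}
```

In the patched code, only ETH_HLEN + NET_IP_ALIGN of headroom was reserved, so skb_push()'ing BATADV_ELP_HLEN placed the ELP header inside the room meant for the mac header; skb_put() instead places it after the reserved area.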



RE: [RFC 07/11] Add support for memory registeration verbs

2016-09-14 Thread Kalderon, Michal
> > +struct qedr_mr *__qedr_alloc_mr(struct ib_pd *ibpd, int
> > +max_page_list_len) {
> > +   struct qedr_pd *pd = get_qedr_pd(ibpd);
> > +   struct qedr_dev *dev = get_qedr_dev(ibpd->device);
> > +   struct qedr_mr *mr;
> > +   int rc = -ENOMEM;
> > +
> > +   DP_VERBOSE(dev, QEDR_MSG_MR,
> > +  "qedr_alloc_frmr pd = %d max_page_list_len= %d\n", pd-
> >pd_id,
> > +  max_page_list_len);
> > +
> > +   mr = kzalloc(sizeof(*mr), GFP_KERNEL);
> > +   if (!mr)
> > +   return ERR_PTR(rc);
> > +
> > +   mr->dev = dev;
> > +   mr->type = QEDR_MR_FRMR;
> > +
> > +   rc = init_mr_info(dev, &mr->info, max_page_list_len, 1);
> > +   if (rc)
> > +   goto err0;
> > +
> > +   rc = dev->ops->rdma_alloc_tid(dev->rdma_ctx, &mr->hw_mr.itid);
> > +   if (rc) {
> > +   DP_ERR(dev, "roce alloc tid returned an error %d\n", rc);
> > +   goto err0;
> > +   }
> > +
> > +   /* Index only, 18 bit long, lkey = itid << 8 | key */
> > +   mr->hw_mr.tid_type = QED_RDMA_TID_FMR;
> > +   mr->hw_mr.key = 0;
> > +   mr->hw_mr.pd = pd->pd_id;
> 
> Do you have a real MR<->PD association in HW? If so, can you point me to the
> code that binds it? If not, any reason not to expose the local_dma_lkey?
> 
Yes, we send the PD id to the FW in qed_rdma_register_tid(). In any case, even
if we didn't have the association in HW, wouldn't the local_dma_lkey be
relevant only to a DMA MR? (The code snippet above refers to an FMR.)
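The lkey layout named in the snippet's comment ("Index only, 18 bit long, lkey = itid << 8 | key") can be sketched as a small helper — a hypothetical standalone function, not the driver's actual code, with the field widths taken only from that comment:

```c
#include <assert.h>
#include <stdint.h>

/* Compose an lkey from an 18-bit TID index and an 8-bit key, as the
 * comment in the patch describes: lkey = itid << 8 | key. The 18-bit
 * width is an assumption from that comment. */
static uint32_t qedr_make_lkey(uint32_t itid, uint8_t key)
{
	assert(itid < (1u << 18));	/* index is only 18 bits long */
	return (itid << 8) | key;
}
```

Shifting the index above the key byte means the low 8 bits can be varied (e.g. on re-registration) without allocating a new TID.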

> > +struct ib_mr *qedr_get_dma_mr(struct ib_pd *ibpd, int acc) {
> > +   struct qedr_dev *dev = get_qedr_dev(ibpd->device);
> > +   struct qedr_pd *pd = get_qedr_pd(ibpd);
> > +   struct qedr_mr *mr;
> > +   int rc;
> > +
> > +   if (acc & IB_ACCESS_MW_BIND) {
> > +   DP_ERR(dev, "Unsupported access flags received for dma
> mr\n");
> > +   return ERR_PTR(-EINVAL);
> > +   }
> 
> This check looks like it really belongs in the core, it would help everyone 
> if you
> move it...
> 
> Although I know Christoph is trying to get rid of this API altogether...
Sure, will do.
 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the 
> body
> of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] pull request for net: batman-adv 2016-09-14

2016-09-14 Thread Simon Wunderlich
Hi David,

here are some more bugfix patches which we would like to have integrated
into net.

Please pull or let me know of any problem!

Thank you,
  Simon

The following changes since commit 1fe323aa1b2390a0c57fb0b06a782f128d49094c:

  sctp: use event->chunk when it's valid (2016-08-08 14:31:23 -0700)

are available in the git repository at:

  git://git.open-mesh.org/linux-merge.git tags/batadv-net-for-davem-20160914

for you to fetch changes up to 1e5d343b8f23770e8ac5d31f5c439826bdb35148:

  batman-adv: fix elp packet data reservation (2016-08-26 15:22:31 +0200)


Here are two batman-adv bugfix patches:

 - Fix reference counting for last_bonding_candidate, by Sven Eckelmann

 - Fix head room reservation for ELP packets, by Linus Luessing


Linus Lüssing (1):
  batman-adv: fix elp packet data reservation

Sven Eckelmann (1):
  batman-adv: Add missing refcnt for last_candidate

 net/batman-adv/bat_v_elp.c |  2 +-
 net/batman-adv/routing.c   | 28 +++-
 2 files changed, 28 insertions(+), 2 deletions(-)


Re: [PATCH v4 09/16] IB/pvrdma: Add support for Completion Queues

2016-09-14 Thread Yuval Shaia
On Sun, Sep 11, 2016 at 09:49:19PM -0700, Adit Ranadive wrote:
> +
> +static int pvrdma_poll_one(struct pvrdma_cq *cq, struct pvrdma_qp **cur_qp,
> +struct ib_wc *wc)
> +{
> + struct pvrdma_dev *dev = to_vdev(cq->ibcq.device);
> + int has_data;
> + unsigned int head;
> + bool tried = false;
> + struct pvrdma_cqe *cqe;
> +
> +retry:
> + has_data = pvrdma_idx_ring_has_data(&cq->ring_state->rx,
> + cq->ibcq.cqe, &head);
> + if (has_data == 0) {
> + if (tried)
> + return -EAGAIN;
> +
> + /* Pass down POLL to give physical HCA a chance to poll. */
> + pvrdma_write_uar_cq(dev, cq->cq_handle | PVRDMA_UAR_CQ_POLL);
> +
> + tried = true;
> + goto retry;
> + } else if (has_data == PVRDMA_INVALID_IDX) {

I didn't go through the entire life cycle of the RX ring's head and tail, but
you need to make sure that the PVRDMA_INVALID_IDX error is a recoverable one,
i.e. that there is a chance the next call to pvrdma_poll_one will succeed.
Otherwise it is an endless loop.

> + dev_err(&dev->pdev->dev, "CQ ring state invalid\n");
> + return -EAGAIN;
> + }
> +
> + cqe = get_cqe(cq, head);
> +
> + /* Ensure cqe is valid. */
> + rmb();
> + if (dev->qp_tbl[cqe->qp & 0x])
> + *cur_qp = (struct pvrdma_qp *)dev->qp_tbl[cqe->qp & 0x];
> + else
> + return -EAGAIN;
> +
> + wc->opcode = pvrdma_wc_opcode_to_ib(cqe->opcode);
> + wc->status = pvrdma_wc_status_to_ib(cqe->status);
> + wc->wr_id = cqe->wr_id;
> + wc->qp = &(*cur_qp)->ibqp;
> + wc->byte_len = cqe->byte_len;
> + wc->ex.imm_data = cqe->imm_data;
> + wc->src_qp = cqe->src_qp;
> + wc->wc_flags = pvrdma_wc_flags_to_ib(cqe->wc_flags);
> + wc->pkey_index = cqe->pkey_index;
> + wc->slid = cqe->slid;
> + wc->sl = cqe->sl;
> + wc->dlid_path_bits = cqe->dlid_path_bits;
> + wc->port_num = cqe->port_num;
> + wc->vendor_err = 0;
> +
> + /* Update shared ring state */
> + pvrdma_idx_ring_inc(&cq->ring_state->rx.cons_head, cq->ibcq.cqe);
> +
> + return 0;
> +}
> +
> +/**
> + * pvrdma_poll_cq - poll for work completion queue entries
> + * @ibcq: completion queue
> + * @num_entries: the maximum number of entries
> + * @entry: pointer to work completion array
> + *
> + * @return: number of polled completion entries
> + */
> +int pvrdma_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
> +{
> + struct pvrdma_cq *cq = to_vcq(ibcq);
> + struct pvrdma_qp *cur_qp = NULL;
> + unsigned long flags;
> + int npolled;
> +
> + if (num_entries < 1 || wc == NULL)
> + return 0;
> +
> + spin_lock_irqsave(&cq->cq_lock, flags);
> + for (npolled = 0; npolled < num_entries; ++npolled) {
> + if (pvrdma_poll_one(cq, &cur_qp, wc + npolled))
> + break;
> + }
> +
> + spin_unlock_irqrestore(&cq->cq_lock, flags);
> +
> + /* Ensure we do not return errors from poll_cq */
> + return npolled;
> +}
> +
> +/**
> + * pvrdma_resize_cq - resize CQ
> + * @ibcq: the completion queue
> + * @entries: CQ entries
> + * @udata: user data
> + *
> + * @return: -EOPNOTSUPP as CQ resize is not supported.
> + */
> +int pvrdma_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata)
> +{
> + return -EOPNOTSUPP;
> +}
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net v1] fib_rules: interface group matching

2016-09-14 Thread Vincent Bernat
When a user wants to assign a routing table to a group of incoming
interfaces, the current solutions are:

 - one IP rule for each interface (scalability problems)
 - use of the fwmark and devgroup matchers (doesn't work with internal route
   lookups, used for example by RPF)
 - use of VRF devices (more complex)

Each interface can be assigned to a numeric group using IFLA_GROUP. This
commit enables a user to reference such a group into an IP rule. Here is
an example of output of iproute2:

$ ip rule show
0:  from all lookup local
32764:  from all iifgroup 2 lookup 2
32765:  from all iifgroup 1 lookup 1
32766:  from all lookup main
32767:  from all lookup default

Signed-off-by: Vincent Bernat 
---
 include/net/fib_rules.h|  6 -
 include/uapi/linux/fib_rules.h |  2 ++
 net/core/fib_rules.c   | 57 +-
 3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 456e4a6006ab..a96b186ccd02 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -28,6 +28,8 @@ struct fib_rule {
u32 pref;
int suppress_ifgroup;
int suppress_prefixlen;
+   int iifgroup;
+   int oifgroup;
chariifname[IFNAMSIZ];
charoifname[IFNAMSIZ];
struct rcu_head rcu;
@@ -92,7 +94,9 @@ struct fib_rules_ops {
[FRA_SUPPRESS_PREFIXLEN] = { .type = NLA_U32 }, \
[FRA_SUPPRESS_IFGROUP] = { .type = NLA_U32 }, \
[FRA_GOTO]  = { .type = NLA_U32 }, \
-   [FRA_L3MDEV]= { .type = NLA_U8 }
+   [FRA_L3MDEV]= { .type = NLA_U8 }, \
+   [FRA_IIFGROUP] = { .type = NLA_U32 }, \
+   [FRA_OIFGROUP] = { .type = NLA_U32 }
 
 static inline void fib_rule_get(struct fib_rule *rule)
 {
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 14404b3ebb89..0bf5a5e94d9a 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -51,6 +51,8 @@ enum {
FRA_OIFNAME,
FRA_PAD,
FRA_L3MDEV, /* iif or oif is l3mdev goto its table */
+   FRA_IIFGROUP,   /* interface group */
+   FRA_OIFGROUP,
__FRA_MAX
 };
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index be4629c344a6..f8ed6ba85c72 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -37,6 +37,9 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
r->suppress_prefixlen = -1;
r->suppress_ifgroup = -1;
 
+   r->iifgroup = -1;
+   r->oifgroup = -1;
+
/* The lock is not required here, the list in unreacheable
 * at the moment this function is called */
list_add_tail(&r->list, &ops->rules_list);
@@ -193,6 +196,30 @@ static int fib_rule_match(struct fib_rule *rule, struct 
fib_rules_ops *ops,
if (rule->l3mdev && !l3mdev_fib_rule_match(rule->fr_net, fl, arg))
goto out;
 
+   if (rule->iifgroup != -1) {
+   struct net_device *dev;
+
+   rcu_read_lock();
+   dev = dev_get_by_index_rcu(rule->fr_net, fl->flowi_iif);
+   if (!dev || dev->group != rule->iifgroup) {
+   rcu_read_unlock();
+   goto out;
+   }
+   rcu_read_unlock();
+   }
+
+   if (rule->oifgroup != -1) {
+   struct net_device *dev;
+
+   rcu_read_lock();
+   dev = dev_get_by_index_rcu(rule->fr_net, fl->flowi_oif);
+   if (!dev || dev->group != rule->oifgroup) {
+   rcu_read_unlock();
+   goto out;
+   }
+   rcu_read_unlock();
+   }
+
ret = ops->match(rule, fl, flags);
 out:
return (rule->flags & FIB_RULE_INVERT) ? !ret : ret;
@@ -305,6 +332,12 @@ static int rule_exists(struct fib_rules_ops *ops, struct 
fib_rule_hdr *frh,
if (r->l3mdev != rule->l3mdev)
continue;
 
+   if (r->iifgroup != rule->iifgroup)
+   continue;
+
+   if (r->oifgroup != rule->oifgroup)
+   continue;
+
if (!ops->compare(r, frh, tb))
continue;
return 1;
@@ -391,6 +424,16 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr 
*nlh)
goto errout_free;
}
 
+   if (tb[FRA_IIFGROUP])
+   rule->iifgroup = nla_get_u32(tb[FRA_IIFGROUP]);
+   else
+   rule->iifgroup = -1;
+
+   if (tb[FRA_OIFGROUP])
+   rule->oifgroup = nla_get_u32(tb[FRA_OIFGROUP]);
+   else
+   rule->oifgroup = -1;
+
rule->action = frh->action;
rule->flags = frh->flags;
rule->table = frh_get_table(frh, tb);
@@ -552,6 +595,14 @@ 

Re: [PATCH v4 11/16] IB/pvrdma: Add support for memory regions

2016-09-14 Thread Yuval Shaia
No more comments.
Reviewed-by: Yuval Shaia 

On Sun, Sep 11, 2016 at 09:49:21PM -0700, Adit Ranadive wrote:
> This patch adds support for creating and destroying memory regions. The
> PVRDMA device supports User MRs, DMA MRs (no remote read/write support),
> and Fast Register MRs.
> 
> Reviewed-by: Jorgen Hansen 
> Reviewed-by: George Zhang 
> Reviewed-by: Aditya Sarwade 
> Reviewed-by: Bryan Tan 
> Signed-off-by: Adit Ranadive 
> ---
> Changes v3->v4:
>  - Changed access flag check for DMA MR to using bit operation.
>  - Removed some local variables.
> 
> Changes v2->v3:
>  - Removed boolean in pvrdma_cmd_post.
> ---
>  drivers/infiniband/hw/pvrdma/pvrdma_mr.c | 332 
> +++
>  1 file changed, 332 insertions(+)
>  create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_mr.c
> 
> diff --git a/drivers/infiniband/hw/pvrdma/pvrdma_mr.c 
> b/drivers/infiniband/hw/pvrdma/pvrdma_mr.c
> new file mode 100644
> index 000..6163f17
> --- /dev/null
> +++ b/drivers/infiniband/hw/pvrdma/pvrdma_mr.c
> @@ -0,0 +1,332 @@
> +/*
> + * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of EITHER the GNU General Public License
> + * version 2 as published by the Free Software Foundation or the BSD
> + * 2-Clause License. This program is distributed in the hope that it
> + * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
> + * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU General Public License version 2 for more details at
> + * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program available in the file COPYING in the main
> + * directory of this source tree.
> + *
> + * The BSD 2-Clause License
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + *  - Redistributions of source code must retain the above
> + *copyright notice, this list of conditions and the following
> + *disclaimer.
> + *
> + *  - Redistributions in binary form must reproduce the above
> + *copyright notice, this list of conditions and the following
> + *disclaimer in the documentation and/or other materials
> + *provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> + * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> + * OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include 
> +#include 
> +
> +#include "pvrdma.h"
> +
> +/**
> + * pvrdma_get_dma_mr - get a DMA memory region
> + * @pd: protection domain
> + * @acc: access flags
> + *
> + * @return: ib_mr pointer on success, otherwise returns an errno.
> + */
> +struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
> +{
> + struct pvrdma_dev *dev = to_vdev(pd->device);
> + struct pvrdma_user_mr *mr;
> + union pvrdma_cmd_req req;
> + union pvrdma_cmd_resp rsp;
> + struct pvrdma_cmd_create_mr *cmd = &req.create_mr;
> + struct pvrdma_cmd_create_mr_resp *resp = &rsp.create_mr_resp;
> + int ret;
> +
> + if (!(acc & IB_ACCESS_LOCAL_WRITE)) {
> + dev_warn(&dev->pdev->dev,
> +  "unsupported dma mr access flags %#x\n", acc);
> + return ERR_PTR(-EOPNOTSUPP);
> + }
> +
> + mr = kzalloc(sizeof(*mr), GFP_KERNEL);
> + if (!mr)
> + return ERR_PTR(-ENOMEM);
> +
> + memset(cmd, 0, sizeof(*cmd));
> + cmd->hdr.cmd = PVRDMA_CMD_CREATE_MR;
> + cmd->pd_handle = to_vpd(pd)->pd_handle;
> + cmd->access_flags = acc;
> + cmd->flags = PVRDMA_MR_FLAG_DMA;
> +
> + ret = pvrdma_cmd_post(dev, &req, &rsp);
> + if (ret < 0) {
> + dev_warn(&dev->pdev->dev, "could not get DMA mem region\n");
> + kfree(mr);
> + return ERR_PTR(ret);
> + }
> +
> + mr->mmr.mr_handle = resp->mr_handle;
> + mr->ibmr.lkey = resp->lkey;
> + mr->ibmr.rkey = resp->rkey;
> +
> + return &mr->ibmr;
> +}
> +
> +/**
> + * pvrdma_reg_user_mr - register a userspace memory region
> + * @pd: protection

Re: [PATCH v4 05/16] IB/pvrdma: Add functions for Verbs support

2016-09-14 Thread Christoph Hellwig
> + props->max_fmr = dev->dsr->caps.max_fmr;
> + props->max_map_per_fmr = dev->dsr->caps.max_map_per_fmr;

Please don't add FMR support to any new drivers.


Re: [net v1] fib_rules: interface group matching

2016-09-14 Thread Vincent Bernat
 ❦ 14 September 2016 14:40 CEST, Vincent Bernat  :

> Each interface can be assigned to a numeric group using IFLA_GROUP. This
> commit enables a user to reference such a group into an IP rule. Here is
> an example of output of iproute2:
>
> $ ip rule show
> 0:  from all lookup local
> 32764:  from all iifgroup 2 lookup 2
> 32765:  from all iifgroup 1 lookup 1
> 32766:  from all lookup main
> 32767:  from all lookup default

The patch for iproute2 is available here (I didn't post it inline to avoid
confusing patchwork):
 http://paste.debian.net/821247/
-- 
Say what you mean, simply and directly.
- The Elements of Programming Style (Kernighan & Plauger)


Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-14 Thread Tariq Toukan



On 08/09/2016 12:31 PM, Or Gerlitz wrote:

On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer
 wrote:

On Wed, 7 Sep 2016 23:55:42 +0300
Or Gerlitz  wrote:


On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:

From: Rana Shahout 

Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.

When XDP is on we make sure to change channels RQs type to
MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
ensure "page per packet".

On XDP set, we fail if HW LRO is set and request the user to turn it off.
Since on ConnectX4-LX HW LRO is always on by default, this will be annoying,
but we prefer not to enforce LRO off from the XDP set function.

Full channels reset (close/open) is required only when setting XDP
on/off.

When XDP set is called just to exchange programs, we update each RQ's XDP
program on the fly. For synchronization with the current data path RX
activity of that RQ, we temporarily disable the RQ and ensure the RX path is
not running, then quickly update and re-enable it. For that we do:
 - rq.state = disabled
 - napi_synnchronize
 - xchg(rq->xdp_prg)
 - rq.state = enabled
 - napi_schedule // Just in case we've missed an IRQ
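The steps above can be sketched with a toy model — napi_synchronize()/xchg() are real kernel primitives, but the struct and the drain step here are stand-ins for illustration, not the mlx5e code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an RQ for illustrating the swap sequence. */
enum rq_state { RQ_ENABLED, RQ_DISABLED };

struct model_rq {
	enum rq_state state;
	void *xdp_prog;
};

static void *swap_xdp_prog(struct model_rq *rq, void *new_prog)
{
	void *old;

	rq->state = RQ_DISABLED;	/* 1. rq.state = disabled */
	/* 2. napi_synchronize(): wait for in-flight RX to observe it */
	old = rq->xdp_prog;		/* 3. xchg(rq->xdp_prg, new) */
	rq->xdp_prog = new_prog;
	rq->state = RQ_ENABLED;		/* 4. rq.state = enabled */
	/* 5. napi_schedule(): kick RX in case an IRQ was missed */
	return old;
}
```

The point of the sequence is that the RX path never observes a half-updated program pointer: it is either quiesced or sees the fully installed new program.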

Packet rate performance testing was done with pktgen 64B packets on the TX
side, and a TC drop action on the RX side compared to XDP fast drop.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
 1. Baseline, Before this patch with TC drop action
 2. This patch with TC drop action
 3. This patch with XDP RX fast drop

Streams   Baseline (TC drop)   TC drop    XDP fast drop
-------------------------------------------------------
1         5.51Mpps             5.14Mpps   13.5Mpps

This (13.5M PPS) is less than 50% of the result we presented @ the XDP
summit, which was obtained by Rana. Please see if/how much this grows if you
use more sender threads, but have all of them xmit the same stream/flows, so
we're on one ring. That (XDP with a single RX ring getting packets from N
remote TX rings) would be your canonical baseline for any further numbers.

Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
that you should be able to reach 23Mpps on a single CPU.  This is
an XDP-drop simulation with order-0 pages being recycled through my
page_pool code, plus avoiding the cache misses (notice you are using a
CPU E5-2680 with DDIO, thus you should only see an L3 cache miss).

so this takes up from 13M to 23M, good.

Could you explain why the move from order-3 to order-0 is hurting the
performance so much (drop from 32M to 23M), any way we can overcome that?

The issue is not moving from high-order to order-0.
It's moving from Striding RQ to non-Striding RQ without using a
page-reuse mechanism (not a cache).
In the current memory scheme, each 64B packet consumes a 4K page, including
an allocate/release cycle (from the cache in this case, but still...).
I believe that once we implement page reuse for non-Striding RQ we'll
hit 32M PPS again.

The 23Mpps number looks like some HW limitation, as the increase was

not HW, I think. As I said, Rana got 32M with striding RQ when she was
using order-3
(or did we use order-5?)

order-5.

is not proportional to the page-allocator overhead I removed (and the CPU
freq starts to decrease).  I also did scaling tests to more CPUs, which
showed it scaled up to 40Mpps (you reported 45M).  And at the Phy RX
level I see 60Mpps (the 50G max is 74Mpps).




Re: [RFC 02/11] Add RoCE driver framework

2016-09-14 Thread Leon Romanovsky
On Wed, Sep 14, 2016 at 08:15:23AM +, Mintz, Yuval wrote:
> > > >> >> +uint debug;
> > > >> >> +module_param(debug, uint, 0);
> > > > >>> +MODULE_PARM_DESC(debug, "Default debug msglevel");
> > > >>
> > > >> >Why are you adding this as a module parameter?
> > > >>
> > > >>  I believe this is mostly to follow same line as qede which also 
> > > >>defines
> > > > > 'debug' module parameter for allowing easy user control of debug
> > > > > prints [& specifically for probe prints, which can't be controlled
> > > > > otherwise].
> > >
> > > > Can you give us an example where dynamic debug and tracing 
> > > > infrastructures
> > > > are not enough?
> > >
> > > > AFAIK, most of these debug module parameters are legacy copy/paste
> > > > code which is useless in real life scenarios.
> > >
> > > Define 'enough'; Using dynamic debug you can provide all the necessary
> > > information and at an even better granularity that's achieved by suggested
> > > infrastructure,  but is harder for an end-user to use. Same goes for 
> > > tracing.
> > >
> > > The 'debug' option provides an easy grouping for prints related to a 
> > > specific
> > > area in the driver.
> >
> > It is hard to agree with you that a user who knows how to load modules
> > with parameters won't succeed in enabling debug prints.
>
> I think you're giving too much credit to the end-user. :-D
>
> > In addition, a global increase in the debug level for the whole driver
> > will create a printk storm in dmesg and contribute nothing to
> > debuggability.
>
> So basically, what you're claiming is that the ethtool 'msglvl' setting for
> devices is completely obsolete. While this *might* be true, we use it
> extensively in our qede and qed drivers; the debug module parameter merely
> provides a manner of setting the debug value prior to initial probe for all
> interfaces. qedr follows the same practice.

Thanks for this excellent example. Ethtool 'msglvl' adjusts this
dynamically, while your debug argument works only at module load
time.

If you want dynamic prints, you have two options:
1. Add support of ethtool to whole RDMA stack.
2. Use dynamic tracing infrastructure.

Which option do you prefer?

>


signature.asc
Description: PGP signature


[PATCH 9/9] net: ethernet: ti: cpts: switch to readl/writel_relaxed()

2016-09-14 Thread Grygorii Strashko
Switch to the readl/writel_relaxed() APIs, because the CPTS IP is reused
on Keystone 2 SoCs, where both LE and BE modes are supported.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpts.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index cbe0974..0226582 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -31,8 +31,8 @@
 
 #include "cpts.h"
 
-#define cpts_read32(c, r)  __raw_readl(&c->reg->r)
-#define cpts_write32(c, v, r)  __raw_writel(v, &c->reg->r)
+#define cpts_read32(c, r)  readl_relaxed(&c->reg->r)
+#define cpts_write32(c, v, r)  writel_relaxed(v, &c->reg->r)
 
 static int event_expired(struct cpts_event *event)
 {
-- 
2.9.3



[PATCH 8/9] net: ethernet: ti: cpts: fix overflow check period

2016-09-14 Thread Grygorii Strashko
The CPTS driver uses an 8 s period for overflow checking, with the
assumption that the CPTS rftclk will not exceed 500 MHz. But that's not true
on some TI platforms (Keystone 2). As a result, it is possible for the CPTS
counter to overflow more than once between two readings.

Hence, fix it by selecting the overflow check period dynamically as
max_sec_before_overflow / 2, where
 max_sec_before_overflow = max_counter_val / rftclk_freq.
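The calculation above can be sketched as follows — a model under stated assumptions: a 32-bit free-running counter (the cc.mask value), and an example jiffy rate of 100, neither taken from a real kernel config:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HZ 100UL	/* example jiffy rate, an assumption */

/* Overflow-check period in jiffies for a counter of the given width
 * running at the given input clock rate. */
static unsigned long ov_check_period(uint64_t counter_mask, uint64_t freq_hz)
{
	/* seconds until the counter wraps at this input clock rate */
	uint64_t maxsec = counter_mask / freq_hz;

	/* check twice per wrap window so an overflow is never missed */
	return (unsigned long)(MODEL_HZ * maxsec / 2);
}
```

At 500 MHz a 32-bit counter wraps after ~8.5 s, so the check runs every 4 s; at 1 GHz the wrap time shrinks to ~4.3 s and the period correctly drops to ~2 s, where the old fixed 8 s period could miss a wrap entirely.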

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpts.c | 16 +++-
 drivers/net/ethernet/ti/cpts.h |  4 +---
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index 8046a21..cbe0974 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -251,7 +251,7 @@ static void cpts_overflow_check(struct work_struct *work)
cpts_write32(cpts, TS_PEND_EN, int_enable);
cpts_ptp_gettime(&cpts->info, &ts);
pr_debug("cpts overflow check at %lld.%09lu\n", ts.tv_sec, ts.tv_nsec);
-   schedule_delayed_work(&cpts->overflow_work, CPTS_OVERFLOW_PERIOD);
+   schedule_delayed_work(&cpts->overflow_work, cpts->ov_check_period);
 }
 
 static int cpts_match(struct sk_buff *skb, unsigned int ptp_class,
@@ -391,7 +391,7 @@ int cpts_register(struct cpts *cpts)
}
cpts->phc_index = ptp_clock_index(cpts->clock);
 
-   schedule_delayed_work(&cpts->overflow_work, CPTS_OVERFLOW_PERIOD);
+   schedule_delayed_work(&cpts->overflow_work, cpts->ov_check_period);
return 0;
 
 err_ptp:
@@ -427,9 +427,6 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
u64 ns;
u64 frac;
 
-   if (cpts->cc_mult || cpts->cc.shift)
-   return;
-
freq = clk_get_rate(cpts->refclk);
 
/* Calc the maximum number of seconds which we can run before
@@ -442,11 +439,20 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
else if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
maxsec = 600;
 
+   /* Calc overflow check period (maxsec / 2) */
+   cpts->ov_check_period = (HZ * maxsec) / 2;
+   dev_info(cpts->dev, "cpts: overflow check period %lu\n",
+cpts->ov_check_period);
+
+   if (cpts->cc_mult || cpts->cc.shift)
+   return;
+
clocks_calc_mult_shift(&mult, &shift, freq, NSEC_PER_SEC, maxsec);
 
cpts->cc_mult = mult;
cpts->cc.mult = mult;
cpts->cc.shift = shift;
+
/* Check calculations and inform if not precise */
frac = 0;
ns = cyclecounter_cyc2ns(&cpts->cc, freq, cpts->cc.mask, &frac);
diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h
index 47026ec..e0e4a62b 100644
--- a/drivers/net/ethernet/ti/cpts.h
+++ b/drivers/net/ethernet/ti/cpts.h
@@ -97,9 +97,6 @@ enum {
CPTS_EV_TX,   /* Ethernet Transmit Event */
 };
 
-/* This covers any input clock up to about 500 MHz. */
-#define CPTS_OVERFLOW_PERIOD (HZ * 8)
-
 #define CPTS_FIFO_DEPTH 16
 #define CPTS_MAX_EVENTS 32
 
@@ -127,6 +124,7 @@ struct cpts {
struct list_head events;
struct list_head pool;
struct cpts_event pool_data[CPTS_MAX_EVENTS];
+   unsigned long ov_check_period;
 };
 
 int cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb);
-- 
2.9.3



[PATCH 2/9] net: ethernet: ti: cpsw: minimize direct access to struct cpts

2016-09-14 Thread Grygorii Strashko
This will provide more flexibility in changing CPTS internals and is also
required for further changes.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw.c | 28 +++-
 drivers/net/ethernet/ti/cpts.h | 41 +
 2 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index b743bb1d..9b900f0 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1481,7 +1481,7 @@ static netdev_tx_t cpsw_ndo_start_xmit(struct sk_buff 
*skb,
}
 
if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
-   cpsw->cpts->tx_enable)
+   cpts_is_tx_enabled(cpsw->cpts))
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 
skb_tx_timestamp(skb);
@@ -1519,7 +1519,8 @@ static void cpsw_hwtstamp_v1(struct cpsw_common *cpsw)
struct cpsw_slave *slave = &cpsw->slaves[cpsw->data.active_slave];
u32 ts_en, seq_id;
 
-   if (!cpsw->cpts->tx_enable && !cpsw->cpts->rx_enable) {
+   if (!cpts_is_tx_enabled(cpsw->cpts) &&
+   !cpts_is_rx_enabled(cpsw->cpts)) {
slave_write(slave, 0, CPSW1_TS_CTL);
return;
}
@@ -1527,10 +1528,10 @@ static void cpsw_hwtstamp_v1(struct cpsw_common *cpsw)
seq_id = (30 << CPSW_V1_SEQ_ID_OFS_SHIFT) | ETH_P_1588;
ts_en = EVENT_MSG_BITS << CPSW_V1_MSG_TYPE_OFS;
 
-   if (cpsw->cpts->tx_enable)
+   if (cpts_is_tx_enabled(cpsw->cpts))
ts_en |= CPSW_V1_TS_TX_EN;
 
-   if (cpsw->cpts->rx_enable)
+   if (cpts_is_rx_enabled(cpsw->cpts))
ts_en |= CPSW_V1_TS_RX_EN;
 
slave_write(slave, ts_en, CPSW1_TS_CTL);
@@ -1553,20 +1554,20 @@ static void cpsw_hwtstamp_v2(struct cpsw_priv *priv)
case CPSW_VERSION_2:
ctrl &= ~CTRL_V2_ALL_TS_MASK;
 
-   if (cpsw->cpts->tx_enable)
+   if (cpts_is_tx_enabled(cpsw->cpts))
ctrl |= CTRL_V2_TX_TS_BITS;
 
-   if (cpsw->cpts->rx_enable)
+   if (cpts_is_rx_enabled(cpsw->cpts))
ctrl |= CTRL_V2_RX_TS_BITS;
break;
case CPSW_VERSION_3:
default:
ctrl &= ~CTRL_V3_ALL_TS_MASK;
 
-   if (cpsw->cpts->tx_enable)
+   if (cpts_is_tx_enabled(cpsw->cpts))
ctrl |= CTRL_V3_TX_TS_BITS;
 
-   if (cpsw->cpts->rx_enable)
+   if (cpts_is_rx_enabled(cpsw->cpts))
ctrl |= CTRL_V3_RX_TS_BITS;
break;
}
@@ -1602,7 +1603,7 @@ static int cpsw_hwtstamp_set(struct net_device *dev, 
struct ifreq *ifr)
 
switch (cfg.rx_filter) {
case HWTSTAMP_FILTER_NONE:
-   cpts->rx_enable = 0;
+   cpts_rx_enable(cpts, 0);
break;
case HWTSTAMP_FILTER_ALL:
case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
@@ -1618,14 +1619,14 @@ static int cpsw_hwtstamp_set(struct net_device *dev, 
struct ifreq *ifr)
case HWTSTAMP_FILTER_PTP_V2_EVENT:
case HWTSTAMP_FILTER_PTP_V2_SYNC:
case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
-   cpts->rx_enable = 1;
+   cpts_rx_enable(cpts, 1);
cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;
break;
default:
return -ERANGE;
}
 
-   cpts->tx_enable = cfg.tx_type == HWTSTAMP_TX_ON;
+   cpts_tx_enable(cpts, cfg.tx_type == HWTSTAMP_TX_ON);
 
switch (cpsw->version) {
case CPSW_VERSION_1:
@@ -1654,8 +1655,9 @@ static int cpsw_hwtstamp_get(struct net_device *dev, 
struct ifreq *ifr)
return -EOPNOTSUPP;
 
cfg.flags = 0;
-   cfg.tx_type = cpts->tx_enable ? HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
-   cfg.rx_filter = (cpts->rx_enable ?
+   cfg.tx_type = cpts_is_tx_enabled(cpts) ?
+ HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
+   cfg.rx_filter = (cpts_is_rx_enabled(cpts) ?
 HWTSTAMP_FILTER_PTP_V2_EVENT : HWTSTAMP_FILTER_NONE);
 
return copy_to_user(ifr->ifr_data, &cfg, sizeof(cfg)) ? -EFAULT : 0;
diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h
index a68780d..fec753c 100644
--- a/drivers/net/ethernet/ti/cpts.h
+++ b/drivers/net/ethernet/ti/cpts.h
@@ -132,6 +132,29 @@ void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff 
*skb);
 void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb);
 int cpts_register(struct device *dev, struct cpts *cpts, u32 mult, u32 shift);
 void cpts_unregister(struct cpts *cpts);
+
+static inline void cpts_rx_enable(struct cpts *cpts, int enable)
+{
+   if (cpts)
+   cpts->rx_enable = enable;
+}
+
+static inline bool cpts_is_rx_enabled(struct cpts *cpts)
+{
+   return cpts && !!cpts->rx_enable;
+}
+
+static inline void cpts_tx_enable(struct 

[PATCH 4/9] net: ethernet: ti: cpts: move dt props parsing to cpts driver

2016-09-14 Thread Grygorii Strashko
Move DT property parsing into the CPTS driver to simplify consumers' code
and to ease porting the CPTS driver to other SoCs (like Keystone 2) in the
future.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw.c | 16 +---
 drivers/net/ethernet/ti/cpsw.h |  2 --
 drivers/net/ethernet/ti/cpts.c | 29 ++---
 drivers/net/ethernet/ti/cpts.h |  5 +++--
 4 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index dfd5707..3db8fec 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2311,18 +2311,6 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
}
data->active_slave = prop;
 
-   if (of_property_read_u32(node, "cpts_clock_mult", &prop)) {
-   dev_err(&pdev->dev, "Missing cpts_clock_mult property in the 
DT.\n");
-   return -EINVAL;
-   }
-   data->cpts_clock_mult = prop;
-
-   if (of_property_read_u32(node, "cpts_clock_shift", &prop)) {
-   dev_err(&pdev->dev, "Missing cpts_clock_shift property in the 
DT.\n");
-   return -EINVAL;
-   }
-   data->cpts_clock_shift = prop;
-
data->slave_data = devm_kzalloc(&pdev->dev, data->slaves
* sizeof(struct cpsw_slave_data),
GFP_KERNEL);
@@ -2742,9 +2730,7 @@ static int cpsw_probe(struct platform_device *pdev)
goto clean_dma_ret;
}
 
-   cpsw->cpts = cpts_create(cpsw->dev, cpts_regs,
-cpsw->data.cpts_clock_mult,
-cpsw->data.cpts_clock_shift);
+   cpsw->cpts = cpts_create(cpsw->dev, cpts_regs, cpsw->dev->of_node);
if (IS_ERR(cpsw->cpts)) {
ret = PTR_ERR(cpsw->cpts);
goto clean_ale_ret;
diff --git a/drivers/net/ethernet/ti/cpsw.h b/drivers/net/ethernet/ti/cpsw.h
index 16b54c6..6c3037a 100644
--- a/drivers/net/ethernet/ti/cpsw.h
+++ b/drivers/net/ethernet/ti/cpsw.h
@@ -31,8 +31,6 @@ struct cpsw_platform_data {
u32 channels;   /* number of cpdma channels (symmetric) */
u32 slaves; /* number of slave cpgmac ports */
u32 active_slave; /* time stamping, ethtool and SIOCGMIIPHY slave */
-   u32 cpts_clock_mult;  /* convert input clock ticks to nanoseconds */
-   u32 cpts_clock_shift; /* convert input clock ticks to nanoseconds */
u32 ale_entries;/* ale table size */
u32 bd_ram_size;  /*buffer descriptor ram size */
u32 mac_control;/* Mac control register */
diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index a46478e..1ee64c6 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -388,10 +388,31 @@ void cpts_unregister(struct cpts *cpts)
clk_disable(cpts->refclk);
 }
 
+static int cpts_of_parse(struct cpts *cpts, struct device_node *node)
+{
+   int ret = -EINVAL;
+   u32 prop;
+
+   if (of_property_read_u32(node, "cpts_clock_mult", &prop))
+   goto  of_error;
+   cpts->cc_mult = prop;
+
+   if (of_property_read_u32(node, "cpts_clock_shift", &prop))
+   goto  of_error;
+   cpts->cc.shift = prop;
+
+   return 0;
+
+of_error:
+   dev_err(cpts->dev, "CPTS: Missing property in the DT.\n");
+   return ret;
+}
+
 struct cpts *cpts_create(struct device *dev, void __iomem *regs,
-u32 mult, u32 shift)
+struct device_node *node)
 {
struct cpts *cpts;
+   int ret;
 
if (!regs || !dev)
return ERR_PTR(-EINVAL);
@@ -405,6 +426,10 @@ struct cpts *cpts_create(struct device *dev, void __iomem *regs,
spin_lock_init(&cpts->lock);
INIT_DELAYED_WORK(&cpts->overflow_work, cpts_overflow_check);
 
+   ret = cpts_of_parse(cpts, node);
+   if (ret)
+   return ERR_PTR(ret);
+
cpts->refclk = devm_clk_get(dev, "cpts");
if (IS_ERR(cpts->refclk)) {
dev_err(dev, "Failed to get cpts refclk\n");
@@ -415,8 +440,6 @@ struct cpts *cpts_create(struct device *dev, void __iomem *regs,
 
cpts->cc.read = cpts_systim_read;
cpts->cc.mask = CLOCKSOURCE_MASK(32);
-   cpts->cc.shift = shift;
-   cpts->cc_mult = mult;
 
return cpts;
 }
diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h
index 0c02f48..a865193 100644
--- a/drivers/net/ethernet/ti/cpts.h
+++ b/drivers/net/ethernet/ti/cpts.h
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -133,7 +134,7 @@ void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb);
 int cpts_register(struct cpts *cpts);
 void cpts_unregister(struct cpts *cpts);
 struct cpts *cpts_create(struct device *dev, void __iomem *regs,
-u

[PATCH 0/9] net: ethernet: ti: cpts: update and fixes

2016-09-14 Thread Grygorii Strashko
Hi,

This is a preparation series intended to clean up and optimize the TI CPTS
driver to facilitate further integration with other TI SoCs like Keystone 2.
It also includes some non-critical fixes:
 net: ethernet: ti: exclude cpts from build when disabled
 net: ethernet: ti: cpts: fix overflow check period
 net: ethernet: ti: cpts: clean up event list if event pool is empty


Grygorii Strashko (7):
  net: ethernet: ti: exclude cpts from build when disabled
  net: ethernet: ti: cpsw: minimize direct access to struct cpts
  net: ethernet: ti: cpts: rework initialization/deinitialization
  net: ethernet: ti: cpts: move dt props parsing to cpts driver
  net: ethernet: ti: cpts: calc mult and shift from refclk freq
  net: ethernet: ti: cpts: fix overflow check period
  net: ethernet: ti: cpts: switch to readl/writel_relaxed()

WingMan Kwok (2):
  net: ethernet: ti: cpts: add return value to tx and rx timestamp functions
  net: ethernet: ti: cpts: clean up event list if event pool is empty

 Documentation/devicetree/bindings/net/cpsw.txt |   4 +-
 drivers/net/ethernet/ti/Makefile   |   3 +-
 drivers/net/ethernet/ti/cpsw.c |  83 
 drivers/net/ethernet/ti/cpsw.h |   2 -
 drivers/net/ethernet/ti/cpts.c | 256 ++---
 drivers/net/ethernet/ti/cpts.h |  93 +++--
 6 files changed, 319 insertions(+), 122 deletions(-)

-- 
2.9.3



[PATCH 1/9] net: ethernet: ti: exclude cpts from build when disabled

2016-09-14 Thread Grygorii Strashko
The TI CPTS feature is declared as optional, but the cpts.c module is
always included in the build.
Exclude cpts.c from the build when CPTS is disabled in Kconfig and
optimize the usage of CONFIG_TI_CPTS.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/Makefile |  3 ++-
 drivers/net/ethernet/ti/cpsw.c   | 21 -
 drivers/net/ethernet/ti/cpts.c   |  8 
 drivers/net/ethernet/ti/cpts.h   | 14 --
 4 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/ti/Makefile b/drivers/net/ethernet/ti/Makefile
index d420d94..1e7c10b 100644
--- a/drivers/net/ethernet/ti/Makefile
+++ b/drivers/net/ethernet/ti/Makefile
@@ -12,8 +12,9 @@ obj-$(CONFIG_TI_DAVINCI_MDIO) += davinci_mdio.o
 obj-$(CONFIG_TI_DAVINCI_CPDMA) += davinci_cpdma.o
 obj-$(CONFIG_TI_CPSW_PHY_SEL) += cpsw-phy-sel.o
 obj-$(CONFIG_TI_CPSW_ALE) += cpsw_ale.o
+obj-$(CONFIG_TI_CPTS) += cpts.o
 obj-$(CONFIG_TI_CPSW) += ti_cpsw.o
-ti_cpsw-y := cpsw.o cpts.o
+ti_cpsw-y := cpsw.o
 
 obj-$(CONFIG_TI_KEYSTONE_NETCP) += keystone_netcp.o
 keystone_netcp-y := netcp_core.o
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c6cff3d..b743bb1d 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1514,7 +1514,6 @@ fail:
 }
 
 #ifdef CONFIG_TI_CPTS
-
 static void cpsw_hwtstamp_v1(struct cpsw_common *cpsw)
 {
struct cpsw_slave *slave = &cpsw->slaves[cpsw->data.active_slave];
@@ -1661,7 +1660,16 @@ static int cpsw_hwtstamp_get(struct net_device *dev, struct ifreq *ifr)
 
return copy_to_user(ifr->ifr_data, &cfg, sizeof(cfg)) ? -EFAULT : 0;
 }
+#else
+static int cpsw_hwtstamp_get(struct net_device *dev, struct ifreq *ifr)
+{
+   return -EOPNOTSUPP;
+}
 
+static int cpsw_hwtstamp_set(struct net_device *dev, struct ifreq *ifr)
+{
+   return -EOPNOTSUPP;
+}
 #endif /*CONFIG_TI_CPTS*/
 
 static int cpsw_ndo_ioctl(struct net_device *dev, struct ifreq *req, int cmd)
@@ -1674,12 +1682,10 @@ static int cpsw_ndo_ioctl(struct net_device *dev, struct ifreq *req, int cmd)
return -EINVAL;
 
switch (cmd) {
-#ifdef CONFIG_TI_CPTS
case SIOCSHWTSTAMP:
return cpsw_hwtstamp_set(dev, req);
case SIOCGHWTSTAMP:
return cpsw_hwtstamp_get(dev, req);
-#endif
}
 
if (!cpsw->slaves[slave_no].phy)
@@ -1935,10 +1941,10 @@ static void cpsw_set_msglevel(struct net_device *ndev, u32 value)
priv->msg_enable = value;
 }
 
+#ifdef CONFIG_TI_CPTS
 static int cpsw_get_ts_info(struct net_device *ndev,
struct ethtool_ts_info *info)
 {
-#ifdef CONFIG_TI_CPTS
struct cpsw_common *cpsw = ndev_to_cpsw(ndev);
 
info->so_timestamping =
@@ -1955,7 +1961,12 @@ static int cpsw_get_ts_info(struct net_device *ndev,
info->rx_filters =
(1 << HWTSTAMP_FILTER_NONE) |
(1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
+   return 0;
+}
 #else
+static int cpsw_get_ts_info(struct net_device *ndev,
+   struct ethtool_ts_info *info)
+{
info->so_timestamping =
SOF_TIMESTAMPING_TX_SOFTWARE |
SOF_TIMESTAMPING_RX_SOFTWARE |
@@ -1963,9 +1974,9 @@ static int cpsw_get_ts_info(struct net_device *ndev,
info->phc_index = -1;
info->tx_types = 0;
info->rx_filters = 0;
-#endif
return 0;
 }
+#endif
 
 static int cpsw_get_settings(struct net_device *ndev,
 struct ethtool_cmd *ecmd)
diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index 85a55b4..aaab08e 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -31,8 +31,6 @@
 
 #include "cpts.h"
 
-#ifdef CONFIG_TI_CPTS
-
 #define cpts_read32(c, r)  __raw_readl(&c->reg->r)
 #define cpts_write32(c, v, r)  __raw_writel(v, &c->reg->r)
 
@@ -350,12 +348,9 @@ void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
skb_tstamp_tx(skb, &ssh);
 }
 
-#endif /*CONFIG_TI_CPTS*/
-
 int cpts_register(struct device *dev, struct cpts *cpts,
  u32 mult, u32 shift)
 {
-#ifdef CONFIG_TI_CPTS
int err, i;
unsigned long flags;
 
@@ -391,18 +386,15 @@ int cpts_register(struct device *dev, struct cpts *cpts,
schedule_delayed_work(&cpts->overflow_work, CPTS_OVERFLOW_PERIOD);
 
cpts->phc_index = ptp_clock_index(cpts->clock);
-#endif
return 0;
 }
 
 void cpts_unregister(struct cpts *cpts)
 {
-#ifdef CONFIG_TI_CPTS
if (cpts->clock) {
ptp_clock_unregister(cpts->clock);
cancel_delayed_work_sync(&cpts->overflow_work);
}
if (cpts->refclk)
cpts_clk_release(cpts);
-#endif
 }
diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h
index 69a46b9..a68780d 100644
--- a/drivers/net/ethernet/ti/cpts.h
+++ b/drivers/net/ethernet/ti/cpts.h
@@ -130,6 +130,8 @@ struct cpts {
 #ifdef CONFIG_TI_C

[PATCH 6/9] net: ethernet: ti: cpts: clean up event list if event pool is empty

2016-09-14 Thread Grygorii Strashko
From: WingMan Kwok 

When a CPTS user does not exit gracefully by disabling CPTS
timestamping and leaving a joined multicast group, the system
continues to receive and timestamp the PTP packets, which eventually
occupy all the event list entries. When this happens, the added code
tries to remove the list entries that have expired.

Signed-off-by: WingMan Kwok 
Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpts.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index 970d4e2..ff8bb85 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -57,22 +57,48 @@ static int cpts_fifo_pop(struct cpts *cpts, u32 *high, u32 *low)
return -1;
 }
 
+static int cpts_event_list_clean_up(struct cpts *cpts)
+{
+   struct list_head *this, *next;
+   struct cpts_event *event;
+   int removed = 0;
+
+   list_for_each_safe(this, next, &cpts->events) {
+   event = list_entry(this, struct cpts_event, list);
+   if (event_expired(event)) {
+   list_del_init(&event->list);
+   list_add(&event->list, &cpts->pool);
+   ++removed;
+   }
+   }
+   return removed;
+}
+
 /*
  * Returns zero if matching event type was found.
  */
 static int cpts_fifo_read(struct cpts *cpts, int match)
 {
int i, type = -1;
+   int removed;
u32 hi, lo;
struct cpts_event *event;
 
for (i = 0; i < CPTS_FIFO_DEPTH; i++) {
if (cpts_fifo_pop(cpts, &hi, &lo))
break;
+
if (list_empty(&cpts->pool)) {
-   pr_err("cpts: event pool is empty\n");
-   return -1;
+   removed = cpts_event_list_clean_up(cpts);
+   if (!removed) {
+   dev_err(cpts->dev,
+   "cpts: event pool is empty\n");
+   return -1;
+   }
+   dev_dbg(cpts->dev,
+   "cpts: event pool cleaned up %d\n", removed);
}
+
event = list_first_entry(&cpts->pool, struct cpts_event, list);
event->tmo = jiffies + 2;
event->high = hi;
-- 
2.9.3



[PATCH 3/9] net: ethernet: ti: cpts: rework initialization/deinitialization

2016-09-14 Thread Grygorii Strashko
The current implementation of CPTS initialization and deinitialization
(represented by cpts_register()/cpts_unregister()) is pretty entangled
and has some issues, like:
- the PTP clock is registered before the spinlock protecting it and
before the timecounter and cyclecounter are initialized;
- the CPTS ref_clk is requested using the devm API while cpts_register()
is called from .ndo_open(), so additional checks are required;
- the CPTS ref_clk is prepared, but never unprepared;
- CPTS is not disabled even when unregistered.

Hence, make things simpler and fix the above issues by adding
cpts_create()/cpts_release(), which should be called from
.probe()/.remove() respectively, and move all static initialization
there. Clean up and update cpts_register()/cpts_unregister() so that the
PTP clock is registered last and unregistered first. In addition, this
change allows cpts.h to be cleaned up for the case when CPTS is disabled.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw.c |  24 
 drivers/net/ethernet/ti/cpts.c | 125 ++---
 drivers/net/ethernet/ti/cpts.h |  26 +++--
 3 files changed, 113 insertions(+), 62 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 9b900f0..dfd5707 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1406,9 +1406,7 @@ static int cpsw_ndo_open(struct net_device *ndev)
if (ret < 0)
goto err_cleanup;
 
-   if (cpts_register(cpsw->dev, cpsw->cpts,
- cpsw->data.cpts_clock_mult,
- cpsw->data.cpts_clock_shift))
+   if (cpts_register(cpsw->cpts))
dev_err(priv->dev, "error registering cpts device\n");
 
}
@@ -2551,6 +2549,7 @@ static int cpsw_probe(struct platform_device *pdev)
struct cpdma_params dma_params;
struct cpsw_ale_params  ale_params;
void __iomem*ss_regs;
+   void __iomem*cpts_regs;
struct resource *res, *ss_res;
const struct of_device_id   *of_id;
struct gpio_descs   *mode;
@@ -2575,12 +2574,6 @@ static int cpsw_probe(struct platform_device *pdev)
priv->dev  = &ndev->dev;
priv->msg_enable = netif_msg_init(debug_level, CPSW_DEBUG);
cpsw->rx_packet_max = max(rx_packet_max, 128);
-   cpsw->cpts = devm_kzalloc(&pdev->dev, sizeof(struct cpts), GFP_KERNEL);
-   if (!cpsw->cpts) {
-   dev_err(&pdev->dev, "error allocating cpts\n");
-   ret = -ENOMEM;
-   goto clean_ndev_ret;
-   }
 
mode = devm_gpiod_get_array_optional(&pdev->dev, "mode", GPIOD_OUT_LOW);
if (IS_ERR(mode)) {
@@ -2669,7 +2662,7 @@ static int cpsw_probe(struct platform_device *pdev)
switch (cpsw->version) {
case CPSW_VERSION_1:
cpsw->host_port_regs = ss_regs + CPSW1_HOST_PORT_OFFSET;
-   cpsw->cpts->reg  = ss_regs + CPSW1_CPTS_OFFSET;
+   cpts_regs   = ss_regs + CPSW1_CPTS_OFFSET;
cpsw->hw_stats   = ss_regs + CPSW1_HW_STATS;
dma_params.dmaregs   = ss_regs + CPSW1_CPDMA_OFFSET;
dma_params.txhdp = ss_regs + CPSW1_STATERAM_OFFSET;
@@ -2683,7 +2676,7 @@ static int cpsw_probe(struct platform_device *pdev)
case CPSW_VERSION_3:
case CPSW_VERSION_4:
cpsw->host_port_regs = ss_regs + CPSW2_HOST_PORT_OFFSET;
-   cpsw->cpts->reg  = ss_regs + CPSW2_CPTS_OFFSET;
+   cpts_regs   = ss_regs + CPSW2_CPTS_OFFSET;
cpsw->hw_stats   = ss_regs + CPSW2_HW_STATS;
dma_params.dmaregs   = ss_regs + CPSW2_CPDMA_OFFSET;
dma_params.txhdp = ss_regs + CPSW2_STATERAM_OFFSET;
@@ -2749,6 +2742,14 @@ static int cpsw_probe(struct platform_device *pdev)
goto clean_dma_ret;
}
 
+   cpsw->cpts = cpts_create(cpsw->dev, cpts_regs,
+cpsw->data.cpts_clock_mult,
+cpsw->data.cpts_clock_shift);
+   if (IS_ERR(cpsw->cpts)) {
+   ret = PTR_ERR(cpsw->cpts);
+   goto clean_ale_ret;
+   }
+
ndev->irq = platform_get_irq(pdev, 1);
if (ndev->irq < 0) {
dev_err(priv->dev, "error getting irq resource\n");
@@ -2857,6 +2858,7 @@ static int cpsw_remove(struct platform_device *pdev)
unregister_netdev(cpsw->slaves[1].ndev);
unregister_netdev(ndev);
 
+   cpts_release(cpsw->cpts);
cpsw_ale_destroy(cpsw->ale);
cpdma_ctlr_destroy(cpsw->dma);
of_platform_depopulate(&pdev->dev);
diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index aaab08e..a46478e 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts

[PATCH 5/9] net: ethernet: ti: cpts: add return value to tx and rx timestamp functions

2016-09-14 Thread Grygorii Strashko
From: WingMan Kwok 

The return values added to the tx and rx timestamp functions facilitate
timestamping by CPSW modules other than CPTS, such as the packet
accelerator on Keystone 2 devices.

Signed-off-by: WingMan Kwok 
Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpts.c | 16 ++--
 drivers/net/ethernet/ti/cpts.h | 11 +++
 2 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index 1ee64c6..970d4e2 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -302,34 +302,38 @@ static u64 cpts_find_ts(struct cpts *cpts, struct sk_buff *skb, int ev_type)
return ns;
 }
 
-void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
+int cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
 {
u64 ns;
struct skb_shared_hwtstamps *ssh;
 
if (!cpts || !cpts->rx_enable)
-   return;
+   return -EPERM;
ns = cpts_find_ts(cpts, skb, CPTS_EV_RX);
if (!ns)
-   return;
+   return -ENOENT;
ssh = skb_hwtstamps(skb);
memset(ssh, 0, sizeof(*ssh));
ssh->hwtstamp = ns_to_ktime(ns);
+
+   return 0;
 }
 
-void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
+int cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
 {
u64 ns;
struct skb_shared_hwtstamps ssh;
 
if (!cpts || !(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
-   return;
+   return -EPERM;
ns = cpts_find_ts(cpts, skb, CPTS_EV_TX);
if (!ns)
-   return;
+   return -ENOENT;
memset(&ssh, 0, sizeof(ssh));
ssh.hwtstamp = ns_to_ktime(ns);
skb_tstamp_tx(skb, &ssh);
+
+   return 0;
 }
 
 int cpts_register(struct cpts *cpts)
diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h
index a865193..47026ec 100644
--- a/drivers/net/ethernet/ti/cpts.h
+++ b/drivers/net/ethernet/ti/cpts.h
@@ -129,8 +129,8 @@ struct cpts {
struct cpts_event pool_data[CPTS_MAX_EVENTS];
 };
 
-void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb);
-void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb);
+int cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb);
+int cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb);
 int cpts_register(struct cpts *cpts);
 void cpts_unregister(struct cpts *cpts);
 struct cpts *cpts_create(struct device *dev, void __iomem *regs,
@@ -162,11 +162,14 @@ static inline bool cpts_is_tx_enabled(struct cpts *cpts)
 #else
 struct cpts;
 
-static inline void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
+static inline int cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
 {
+   return -EOPNOTSUPP;
 }
-static inline void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
+
+static inline int cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
 {
+   return -EOPNOTSUPP;
 }
 
 static inline
-- 
2.9.3



[PATCH 7/9] net: ethernet: ti: cpts: calc mult and shift from refclk freq

2016-09-14 Thread Grygorii Strashko
The cyclecounter mult and shift values can be calculated based on the
CPTS refclk frequency, and the timekeeping framework provides the
required algorithms and APIs.

Hence, calculate mult and shift from the CPTS refclk frequency when the
cpts_clock_shift and cpts_clock_mult properties are not provided in DT
(the calculation algorithm is borrowed from
__clocksource_update_freq_scale()). After this change the
cpts_clock_shift and cpts_clock_mult DT properties become optional.

Signed-off-by: Grygorii Strashko 
---
 Documentation/devicetree/bindings/net/cpsw.txt |  4 +-
 drivers/net/ethernet/ti/cpts.c | 56 +++---
 2 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
index 5ad439f..88f81c7 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -20,8 +20,6 @@ Required properties:
 - slaves   : Specifies number for slaves
 - active_slave : Specifies the slave to use for time stamping,
  ethtool and SIOCGMIIPHY
-- cpts_clock_mult  : Numerator to convert input clock ticks into nanoseconds
-- cpts_clock_shift : Denominator to convert input clock ticks into nanoseconds
 
 Optional properties:
 - ti,hwmods: Must be "cpgmac0"
@@ -35,6 +33,8 @@ Optional properties:
  For example in dra72x-evm, pcf gpio has to be
  driven low so that cpsw slave 0 and phy data
  lines are connected via mux.
+- cpts_clock_mult  : Numerator to convert input clock ticks into nanoseconds
+- cpts_clock_shift : Denominator to convert input clock ticks into nanoseconds
 
 
 Slave Properties:
diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index ff8bb85..8046a21 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -418,18 +418,60 @@ void cpts_unregister(struct cpts *cpts)
clk_disable(cpts->refclk);
 }
 
+static void cpts_calc_mult_shift(struct cpts *cpts)
+{
+   u64 maxsec;
+   u32 freq;
+   u32 mult;
+   u32 shift;
+   u64 ns;
+   u64 frac;
+
+   if (cpts->cc_mult || cpts->cc.shift)
+   return;
+
+   freq = clk_get_rate(cpts->refclk);
+
+   /* Calc the maximum number of seconds which we can run before
+* wrapping around.
+*/
+   maxsec = cpts->cc.mask;
+   do_div(maxsec, freq);
+   if (!maxsec)
+   maxsec = 1;
+   else if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
+   maxsec = 600;
+
+   clocks_calc_mult_shift(&mult, &shift, freq, NSEC_PER_SEC, maxsec);
+
+   cpts->cc_mult = mult;
+   cpts->cc.mult = mult;
+   cpts->cc.shift = shift;
+   /* Check calculations and inform if not precise */
+   frac = 0;
+   ns = cyclecounter_cyc2ns(&cpts->cc, freq, cpts->cc.mask, &frac);
+
+   dev_info(cpts->dev,
+"CPTS: ref_clk_freq:%u calc_mult:%u calc_shift:%u error:%lld nsec/sec\n",
+freq, cpts->cc_mult, cpts->cc.shift, (ns - NSEC_PER_SEC));
+}
+
 static int cpts_of_parse(struct cpts *cpts, struct device_node *node)
 {
int ret = -EINVAL;
u32 prop;
 
-   if (of_property_read_u32(node, "cpts_clock_mult", &prop))
-   goto  of_error;
-   cpts->cc_mult = prop;
+   cpts->cc_mult = 0;
+   if (!of_property_read_u32(node, "cpts_clock_mult", &prop))
+   cpts->cc_mult = prop;
 
-   if (of_property_read_u32(node, "cpts_clock_shift", &prop))
-   goto  of_error;
-   cpts->cc.shift = prop;
+   cpts->cc.shift = 0;
+   if (!of_property_read_u32(node, "cpts_clock_shift", &prop))
+   cpts->cc.shift = prop;
+
+   if ((cpts->cc_mult && !cpts->cc.shift) ||
+   (!cpts->cc_mult && cpts->cc.shift))
+   goto of_error;
 
return 0;
 
@@ -471,6 +513,8 @@ struct cpts *cpts_create(struct device *dev, void __iomem *regs,
cpts->cc.read = cpts_systim_read;
cpts->cc.mask = CLOCKSOURCE_MASK(32);
 
+   cpts_calc_mult_shift(cpts);
+
return cpts;
 }
 
-- 
2.9.3



[PATCH net] net: ethernet: mediatek: fix module loading automatically based on MODULE_DEVICE_TABLE

2016-09-14 Thread sean.wang
From: Sean Wang 

The device table is required to load modules based on
modaliases. After adding MODULE_DEVICE_TABLE, entries like the
one below will be added to modules.alias:
alias of:N*T*Cmediatek,mt7623-ethC* mtk_eth_soc

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index a6a9a2f..b44ff3c 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2053,6 +2053,7 @@ const struct of_device_id of_mtk_match[] = {
{ .compatible = "mediatek,mt7623-eth" },
{},
 };
+MODULE_DEVICE_TABLE(of, of_mtk_match);
 
 static struct platform_driver mtk_driver = {
.probe = mtk_probe,
-- 
1.9.1



[PATCH 18/19] stmmac: dwmac-sti: Remove obsolete STi platforms

2016-09-14 Thread Peter Griffin
This patch removes support for the STiH415/6 SoCs from the
dwmac-sti driver and DT binding doc, as support for these
platforms is being removed from the kernel. It also removes
the STiD127-related code, which has never actually been supported
upstream.

Signed-off-by: Peter Griffin 
Cc: 
Cc: 
Cc: 
---
 .../devicetree/bindings/net/sti-dwmac.txt  |  3 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c| 37 --
 2 files changed, 1 insertion(+), 39 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/sti-dwmac.txt b/Documentation/devicetree/bindings/net/sti-dwmac.txt
index d05c1e1..2031786 100644
--- a/Documentation/devicetree/bindings/net/sti-dwmac.txt
+++ b/Documentation/devicetree/bindings/net/sti-dwmac.txt
@@ -7,8 +7,7 @@ and what is needed on STi platforms to program the stmmac glue logic.
 The device node has following properties.
 
 Required properties:
- - compatible  : Can be "st,stih415-dwmac", "st,stih416-dwmac",
-   "st,stih407-dwmac", "st,stid127-dwmac".
+ - compatible  : Should be "st,stih407-dwmac".
 - st,syscon : Should be phandle/offset pair. The phandle to the syscon node which
encompases the glue register, and the offset of the control register.
  - st,gmac_en: this is to enable the gmac into a dedicated sysctl control
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c
index 58c05ac..fcbe374 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c
@@ -198,36 +198,6 @@ static void stih4xx_fix_retime_src(void *priv, u32 spd)
   stih4xx_tx_retime_val[src]);
 }
 
-static void stid127_fix_retime_src(void *priv, u32 spd)
-{
-   struct sti_dwmac *dwmac = priv;
-   u32 reg = dwmac->ctrl_reg;
-   u32 freq = 0;
-   u32 val = 0;
-
-   if (dwmac->interface == PHY_INTERFACE_MODE_MII) {
-   val = STID127_ETH_SEL_INTERNAL_NOTEXT_TXCLK;
-   } else if (dwmac->interface == PHY_INTERFACE_MODE_RMII) {
-   if (!dwmac->ext_phyclk) {
-   val = STID127_ETH_SEL_INTERNAL_NOTEXT_PHYCLK;
-   freq = DWMAC_50MHZ;
-   }
-   } else if (IS_PHY_IF_MODE_RGMII(dwmac->interface)) {
-   val = STID127_ETH_SEL_INTERNAL_NOTEXT_TXCLK;
-   if (spd == SPEED_1000)
-   freq = DWMAC_125MHZ;
-   else if (spd == SPEED_100)
-   freq = DWMAC_25MHZ;
-   else if (spd == SPEED_10)
-   freq = DWMAC_2_5MHZ;
-   }
-
-   if (dwmac->clk && freq)
-   clk_set_rate(dwmac->clk, freq);
-
-   regmap_update_bits(dwmac->regmap, reg, STID127_RETIME_SRC_MASK, val);
-}
-
 static int sti_dwmac_init(struct platform_device *pdev, void *priv)
 {
struct sti_dwmac *dwmac = priv;
@@ -372,14 +342,7 @@ static const struct sti_dwmac_of_data stih4xx_dwmac_data = {
.fix_retime_src = stih4xx_fix_retime_src,
 };
 
-static const struct sti_dwmac_of_data stid127_dwmac_data = {
-   .fix_retime_src = stid127_fix_retime_src,
-};
-
 static const struct of_device_id sti_dwmac_match[] = {
-   { .compatible = "st,stih415-dwmac", .data = &stih4xx_dwmac_data},
-   { .compatible = "st,stih416-dwmac", .data = &stih4xx_dwmac_data},
-   { .compatible = "st,stid127-dwmac", .data = &stid127_dwmac_data},
{ .compatible = "st,stih407-dwmac", .data = &stih4xx_dwmac_data},
{ }
 };
-- 
1.9.1



Re: [PATCH v2 0/6] Move runnable code (tests) from Documentation to selftests

2016-09-14 Thread Jonathan Corbet
On Tue, 13 Sep 2016 14:18:39 -0600
Shuah Khan  wrote:

> Move runnable code (tests) from Documentation to selftests and update
> Makefiles to work under selftests.

This all seems good to me.

Acked-by: Jonathan Corbet 

jon


Re: [PATCH 3/9] net: ethernet: ti: cpts: rework initialization/deinitialization

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 04:02:25PM +0300, Grygorii Strashko wrote:
> @@ -323,7 +307,7 @@ void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
>   u64 ns;
>   struct skb_shared_hwtstamps *ssh;
>  
> - if (!cpts->rx_enable)
> + if (!cpts || !cpts->rx_enable)
>   return;

This function is in the hot path, and you have added a pointless new
test.  Don't do that.

>   ns = cpts_find_ts(cpts, skb, CPTS_EV_RX);
>   if (!ns)
> @@ -338,7 +322,7 @@ void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
>   u64 ns;
>   struct skb_shared_hwtstamps ssh;
>  
> - if (!(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
> + if (!cpts || !(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
>   return;

Same here.

>   ns = cpts_find_ts(cpts, skb, CPTS_EV_TX);
>   if (!ns)
> @@ -348,53 +332,102 @@ void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
>   skb_tstamp_tx(skb, &ssh);
>  }
>  
> -int cpts_register(struct device *dev, struct cpts *cpts,
> -   u32 mult, u32 shift)
> +int cpts_register(struct cpts *cpts)
>  {
>   int err, i;
> - unsigned long flags;
>  
> - cpts->info = cpts_info;
> - cpts->clock = ptp_clock_register(&cpts->info, dev);
> - if (IS_ERR(cpts->clock)) {
> - err = PTR_ERR(cpts->clock);
> - cpts->clock = NULL;
> - return err;
> - }
> - spin_lock_init(&cpts->lock);
> -
> - cpts->cc.read = cpts_systim_read;
> - cpts->cc.mask = CLOCKSOURCE_MASK(32);
> - cpts->cc_mult = mult;
> - cpts->cc.mult = mult;
> - cpts->cc.shift = shift;
> + if (!cpts)
> + return -EINVAL;

Not hot path, but still silly.  The caller should never pass NULL.

Thanks,
Richard


Re: [PATCH] net/mlx4_en: fix off by one in error handling

2016-09-14 Thread Sebastian Ott
Hello Tariq,

On Wed, 14 Sep 2016, Tariq Toukan wrote:
> On 14/09/2016 2:09 PM, Sebastian Ott wrote:
> > If an error occurs in mlx4_init_eq_table the index used in the
> > err_out_unmap label is one too big which results in a panic in
> > mlx4_free_eq. This patch fixes the index in the error path.
> You are right, but your change below does not cover all cases.
> The full solution looks like this:
> 
> @@ -1260,7 +1260,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
>  eq);
> }
> if (err)
> -   goto err_out_unmap;
> +   goto err_out_unmap_excluded;

In this case a call to mlx4_create_eq failed. Do you really have to call
mlx4_free_eq for this index again? As far as I understood this code
mlx4_create_eq cleans up when it fails and thus there is no need for an
additional mlx4_free_eq call.

Regards,
Sebastian



Re: [PATCH 4/9] net: ethernet: ti: cpts: move dt props parsing to cpts driver

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 04:02:26PM +0300, Grygorii Strashko wrote:
> Move DT properties parsing into CPTS driver to simplify consumer's
> code and CPTS driver porting on other SoC in the future
> (like Keystone 2).

And just who is the consumer?
 
> Signed-off-by: Grygorii Strashko 
> ---
>  drivers/net/ethernet/ti/cpsw.c | 16 +---
>  drivers/net/ethernet/ti/cpsw.h |  2 --
>  drivers/net/ethernet/ti/cpts.c | 29 ++---
>  drivers/net/ethernet/ti/cpts.h |  5 +++--
>  4 files changed, 30 insertions(+), 22 deletions(-)

You have more (+) than (-).  I wouldn't call that a simplification.

Thanks,
Richard


Re: [PATCH] net/mlx4_en: fix off by one in error handling

2016-09-14 Thread Tariq Toukan

Hi Sebastian,

Thanks for this fix.

On 14/09/2016 2:09 PM, Sebastian Ott wrote:

If an error occurs in mlx4_init_eq_table the index used in the
err_out_unmap label is one too big which results in a panic in
mlx4_free_eq. This patch fixes the index in the error path.

You are right, but your change below does not cover all cases.
The full solution looks like this:

@@ -1260,7 +1260,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
 eq);
}
if (err)
-   goto err_out_unmap;
+   goto err_out_unmap_excluded;
}

if (dev->flags & MLX4_FLAG_MSI_X) {
@@ -1306,8 +1306,10 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
return 0;

 err_out_unmap:
-   while (i >= 0)
-   mlx4_free_eq(dev, &priv->eq_table.eq[i--]);
+   mlx4_free_eq(dev, &priv->eq_table.eq[i]);
+err_out_unmap_excluded:
+   while (i > 0)
+   mlx4_free_eq(dev, &priv->eq_table.eq[--i]);
 #ifdef CONFIG_RFS_ACCEL
for (i = 1; i <= dev->caps.num_ports; i++) {
if (mlx4_priv(dev)->port[i].rmap) {




Signed-off-by: Sebastian Ott 
---
  drivers/net/ethernet/mellanox/mlx4/eq.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index f613977..cf8f8a7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -1305,8 +1305,8 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
return 0;
  
  err_out_unmap:

-   while (i >= 0)
-   mlx4_free_eq(dev, &priv->eq_table.eq[i--]);
+   while (i > 0)
+   mlx4_free_eq(dev, &priv->eq_table.eq[--i]);
  #ifdef CONFIG_RFS_ACCEL
for (i = 1; i <= dev->caps.num_ports; i++) {
if (mlx4_priv(dev)->port[i].rmap) {
You can choose to submit again, or we can take it from here. Whatever you prefer.


Regards,
Tariq


Re: [PATCH 5/9] net: ethernet: ti: cpts: add return value to tx and rx timestamp functions

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 04:02:27PM +0300, Grygorii Strashko wrote:
> From: WingMan Kwok 
> 
> Added return values in tx and rx timestamp funcitons facilitate the
> possibililies of timestamping by CPSW modules other than CPTS, such as
> packet accelerator on Keystone 2 devices.

I'm sorry, this is totally bogus.
 
> Signed-off-by: WingMan Kwok 
> Signed-off-by: Grygorii Strashko 
> ---
>  drivers/net/ethernet/ti/cpts.c | 16 ++--
>  drivers/net/ethernet/ti/cpts.h | 11 +++
>  2 files changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
> index 1ee64c6..970d4e2 100644
> --- a/drivers/net/ethernet/ti/cpts.c
> +++ b/drivers/net/ethernet/ti/cpts.c
> @@ -302,34 +302,38 @@ static u64 cpts_find_ts(struct cpts *cpts, struct sk_buff *skb, int ev_type)
>   return ns;
>  }
>  
> -void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
> +int cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
>  {
>   u64 ns;
>   struct skb_shared_hwtstamps *ssh;
>  
>   if (!cpts || !cpts->rx_enable)
> - return;
> + return -EPERM;

EPERM?  Come on.

>   ns = cpts_find_ts(cpts, skb, CPTS_EV_RX);
>   if (!ns)
> - return;
> + return -ENOENT;
>   ssh = skb_hwtstamps(skb);
>   memset(ssh, 0, sizeof(*ssh));
>   ssh->hwtstamp = ns_to_ktime(ns);
> +
> + return 0;
>  }
>  
> -void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
> +int cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
>  {
>   u64 ns;
>   struct skb_shared_hwtstamps ssh;
>  
>   if (!cpts || !(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
> - return;
> + return -EPERM;
>   ns = cpts_find_ts(cpts, skb, CPTS_EV_TX);
>   if (!ns)
> - return;
> + return -ENOENT;
>   memset(&ssh, 0, sizeof(ssh));
>   ssh.hwtstamp = ns_to_ktime(ns);
>   skb_tstamp_tx(skb, &ssh);
> +
> + return 0;
>  }
>  
>  int cpts_register(struct cpts *cpts)
> diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h
> index a865193..47026ec 100644
> --- a/drivers/net/ethernet/ti/cpts.h
> +++ b/drivers/net/ethernet/ti/cpts.h
> @@ -129,8 +129,8 @@ struct cpts {
>   struct cpts_event pool_data[CPTS_MAX_EVENTS];
>  };
>  
> -void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb);
> -void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb);
> +int cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb);
> +int cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb);
>  int cpts_register(struct cpts *cpts);
>  void cpts_unregister(struct cpts *cpts);
>  struct cpts *cpts_create(struct device *dev, void __iomem *regs,
> @@ -162,11 +162,14 @@ static inline bool cpts_is_tx_enabled(struct cpts *cpts)
>  #else
>  struct cpts;
>  
> -static inline void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
> +static inline int cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
>  {
> + return -EOPNOTSUPP;

You are planning to check in the hot path if a compile time feature is
enabled?

Brilliant stuff.

>  }
> -static inline void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
> +
> +static inline int cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
>  {
> + return -EOPNOTSUPP;
>  }
>  
>  static inline
> -- 
> 2.9.3
> 

Thanks,
Richard


Re: [RFC PATCH 9/9] ethernet: sun8i-emac: add pm_runtime support

2016-09-14 Thread LABBE Corentin
On Mon, Sep 12, 2016 at 10:44:51PM +0200, Maxime Ripard wrote:
> Hi,
> 
> On Fri, Sep 09, 2016 at 02:45:17PM +0200, Corentin Labbe wrote:
> > This patch adds pm_runtime support to sun8i-emac.
> > For the moment, only basic support is added (the device is marked as
> > in use when the netdev is opened).
> > 
> > Signed-off-by: Corentin Labbe 
> > ---
> >  drivers/net/ethernet/allwinner/sun8i-emac.c | 62 
> > -
> >  1 file changed, 60 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/allwinner/sun8i-emac.c b/drivers/net/ethernet/allwinner/sun8i-emac.c
> > index 1c4bc80..cce886e 100644
> > --- a/drivers/net/ethernet/allwinner/sun8i-emac.c
> > +++ b/drivers/net/ethernet/allwinner/sun8i-emac.c
> > @@ -9,7 +9,6 @@
> >   * - MAC filtering
> >   * - Jumbo frame
> >   * - features rx-all (NETIF_F_RXALL_BIT)
> > - * - PM runtime
> >   */
> >  #include 
> >  #include 
> > @@ -27,6 +26,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -1301,11 +1301,18 @@ static int sun8i_emac_open(struct net_device *ndev)
> > int err;
> > u32 v;
> >  
> > +   err = pm_runtime_get_sync(priv->dev);
> > +   if (err) {
> > +   pm_runtime_put_noidle(priv->dev);
> > +   dev_err(priv->dev, "pm_runtime error: %d\n", err);
> > +   return err;
> > +   }
> > +
> > err = request_irq(priv->irq, sun8i_emac_dma_interrupt, 0,
> >   dev_name(priv->dev), ndev);
> > if (err) {
> > dev_err(priv->dev, "Cannot request IRQ: %d\n", err);
> > -   return err;
> > +   goto err_runtime;
> > }
> >  
> > /* Set interface mode (and configure internal PHY on H3) */
> > @@ -1395,6 +1402,8 @@ err_syscon:
> > sun8i_emac_unset_syscon(ndev);
> >  err_irq:
> > free_irq(priv->irq, ndev);
> > +err_runtime:
> > +   pm_runtime_put(priv->dev);
> > return err;
> >  }
> >  
> > @@ -1483,6 +1492,8 @@ static int sun8i_emac_stop(struct net_device *ndev)
> > dma_free_coherent(priv->dev, priv->nbdesc_tx * sizeof(struct dma_desc),
> >   priv->dd_tx, priv->dd_tx_phy);
> >  
> > +   pm_runtime_put(priv->dev);
> > +
> > return 0;
> >  }
> >  
> > @@ -2210,6 +2221,8 @@ static int sun8i_emac_probe(struct platform_device *pdev)
> > goto probe_err;
> > }
> >  
> > +   pm_runtime_enable(priv->dev);
> > +
> > return 0;
> >  
> >  probe_err:
> > @@ -2221,6 +2234,8 @@ static int sun8i_emac_remove(struct platform_device *pdev)
> >  {
> > struct net_device *ndev = platform_get_drvdata(pdev);
> >  
> > +   pm_runtime_disable(&pdev->dev);
> > +
> > unregister_netdev(ndev);
> > platform_set_drvdata(pdev, NULL);
> > free_netdev(ndev);
> > @@ -2228,6 +2243,47 @@ static int sun8i_emac_remove(struct platform_device *pdev)
> > return 0;
> >  }
> >  
> > +static int __maybe_unused sun8i_emac_suspend(struct platform_device *pdev, pm_message_t state)
> > +{
> > +   struct net_device *ndev = platform_get_drvdata(pdev);
> > +   struct sun8i_emac_priv *priv = netdev_priv(ndev);
> > +
> > +   napi_disable(&priv->napi);
> > +
> > +   if (netif_running(ndev))
> > +   netif_device_detach(ndev);
> > +
> > +   sun8i_emac_stop_tx(ndev);
> > +   sun8i_emac_stop_rx(ndev);
> > +
> > +   sun8i_emac_rx_clean(ndev);
> > +   sun8i_emac_tx_clean(ndev);
> > +
> > +   phy_stop(ndev->phydev);
> > +
> > +   return 0;
> > +}
> > +
> > +static int __maybe_unused sun8i_emac_resume(struct platform_device *pdev)
> > +{
> > +   struct net_device *ndev = platform_get_drvdata(pdev);
> > +   struct sun8i_emac_priv *priv = netdev_priv(ndev);
> > +
> > +   phy_start(ndev->phydev);
> > +
> > +   sun8i_emac_start_tx(ndev);
> > +   sun8i_emac_start_rx(ndev);
> > +
> > +   if (netif_running(ndev))
> > +   netif_device_attach(ndev);
> > +
> > +   netif_start_queue(ndev);
> > +
> > +   napi_enable(&priv->napi);
> > +
> > +   return 0;
> > +}
> 
> The main idea behind the runtime PM hooks is that they bring the
> device to a working state and shuts it down when it's not needed
> anymore.
> 

I expect that the first part of the patch (all the pm_runtime_xxx calls) brings that.
When the interface is not opened:
cat /sys/devices/platform/soc/1c3.ethernet/power/runtime_status 
suspended

> However, they shouldn't be called when the device is still in used, so
> all the mangling with NAPI, the phy and so on is irrelevant here, but
> the clocks, resets, for example, are.
> 

I do the same as other ethernet drivers for suspend/resume.

> >  static const struct of_device_id sun8i_emac_of_match_table[] = {
> > { .compatible = "allwinner,sun8i-a83t-emac",
> >   .data = &emac_variant_a83t },
> > @@ -2246,6 +2302,8 @@ static struct platform_driver sun8i_emac_driver = {
> > .name   = "sun8i-emac",
> > .of_match_table = sun8i_emac_of_match_table,
> > },
> > +   .suspend= sun8i_emac_suspend,
> > +   .res

Re: [PATCH 6/9] net: ethernet: ti: cpts: clean up event list if event pool is empty

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 04:02:28PM +0300, Grygorii Strashko wrote:
> From: WingMan Kwok 
> 
> When a CPTS user does not exit gracefully by disabling cpts
> timestamping and leaving a joined multicast group, the system
> continues to receive and timestamps the ptp packets which eventually
> occupy all the event list entries.  When this happns, the added code
> tries to remove some list entries which are expired.
> 
> Signed-off-by: WingMan Kwok 
> Signed-off-by: Grygorii Strashko 
> ---
>  drivers/net/ethernet/ti/cpts.c | 30 --
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
> index 970d4e2..ff8bb85 100644
> --- a/drivers/net/ethernet/ti/cpts.c
> +++ b/drivers/net/ethernet/ti/cpts.c
> @@ -57,22 +57,48 @@ static int cpts_fifo_pop(struct cpts *cpts, u32 *high, u32 *low)
>   return -1;
>  }
>  
> +static int cpts_event_list_clean_up(struct cpts *cpts)

Five words, that is quite a mouthful.  How about this instead?

static int cpts_purge_events(struct cpts *cpts);

> +{
> + struct list_head *this, *next;
> + struct cpts_event *event;
> + int removed = 0;
> +
> + list_for_each_safe(this, next, &cpts->events) {
> + event = list_entry(this, struct cpts_event, list);
> + if (event_expired(event)) {
> + list_del_init(&event->list);
> + list_add(&event->list, &cpts->pool);
> + ++removed;
> + }
> + }
> + return removed;
> +}
> +
>  /*
>   * Returns zero if matching event type was found.
>   */
>  static int cpts_fifo_read(struct cpts *cpts, int match)
>  {
>   int i, type = -1;
> + int removed;

No need for another variable, just change the return code above to

return removed ? 0 : -1;

and then you have ...

>   u32 hi, lo;
>   struct cpts_event *event;
>  
>   for (i = 0; i < CPTS_FIFO_DEPTH; i++) {
>   if (cpts_fifo_pop(cpts, &hi, &lo))
>   break;
> +
>   if (list_empty(&cpts->pool)) {
> - pr_err("cpts: event pool is empty\n");
> - return -1;
> + removed = cpts_event_list_clean_up(cpts);
> + if (!removed) {
> + dev_err(cpts->dev,
> + "cpts: event pool is empty\n");
> + return -1;
> + }

if (cpts_purge_events(cpts)) {
dev_err(cpts->dev, "cpts: event pool empty\n");
return -1;
}

Notice how I avoided the ugly line break?

> + dev_dbg(cpts->dev,
> + "cpts: event pool cleaned up %d\n", removed);
>   }
> +
>   event = list_first_entry(&cpts->pool, struct cpts_event, list);
>   event->tmo = jiffies + 2;
>   event->high = hi;
> -- 
> 2.9.3
> 

Thanks,
Richard


Re: [net v1] fib_rules: interface group matching

2016-09-14 Thread David Ahern
On 9/14/16 6:40 AM, Vincent Bernat wrote:
> When a user wants to assign a routing table to a group of incoming
> interfaces, the current solutions are:
> 
>  - one IP rule for each interface (scalability problems)
>  - use of fwmark and devgroup matcher (don't work with internal route
>lookups, used for example by RPF)
>  - use of VRF devices (more complex)

Why do you believe that? A VRF is a formalized grouping of interfaces that 
includes an API for locally generated traffic to specify which VRF/group to 
use. And, with the l3mdev rule you only need 1 rule for all VRFs regardless of 
the number which is the best solution to the scalability problem of adding 
rules per device/group/VRF.

What use case are you trying to solve?
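For reference, the VRF/l3mdev setup described above looks roughly like this (device names and the table number are illustrative; requires root and an iproute2/kernel with VRF support):

```shell
# Group interfaces under one VRF device.
ip link add vrf-blue type vrf table 10
ip link set dev vrf-blue up
ip link set dev eth1 master vrf-blue
ip link set dev eth2 master vrf-blue

# A single l3mdev rule covers every VRF, no matter how many exist.
ip rule add l3mdev pref 1000

# Routes for the grouped interfaces live in table 10.
ip route add default via 192.0.2.1 table 10
```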


Re: [PATCH 7/9] net: ethernet: ti: cpts: calc mult and shift from refclk freq

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 04:02:29PM +0300, Grygorii Strashko wrote:
> @@ -35,6 +33,8 @@ Optional properties:
> For example in dra72x-evm, pcf gpio has to be
> driven low so that cpsw slave 0 and phy data
> lines are connected via mux.
> +- cpts_clock_mult: Numerator to convert input clock ticks into nanoseconds
> +- cpts_clock_shift   : Denominator to convert input clock ticks into nanoseconds

You should explain to the reader how these will be calculated when the
properties are missing.

> diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
> index ff8bb85..8046a21 100644
> --- a/drivers/net/ethernet/ti/cpts.c
> +++ b/drivers/net/ethernet/ti/cpts.c
> @@ -418,18 +418,60 @@ void cpts_unregister(struct cpts *cpts)
>   clk_disable(cpts->refclk);
>  }
>  
> +static void cpts_calc_mult_shift(struct cpts *cpts)
> +{
> + u64 maxsec;
> + u32 freq;
> + u32 mult;
> + u32 shift;
> + u64 ns;
> + u64 frac;

Why so many new lines?  This isn't good style.  Please combine
variables of the same type into one line and sort the lists
alphabetically.

> + if (cpts->cc_mult || cpts->cc.shift)
> + return;
> +
> + freq = clk_get_rate(cpts->refclk);
> +
> + /* Calc the maximum number of seconds which we can run before
> +  * wrapping around.
> +  */
> + maxsec = cpts->cc.mask;
> + do_div(maxsec, freq);
> + if (!maxsec)
> + maxsec = 1;
> + else if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
> + maxsec = 600;
> +
> + clocks_calc_mult_shift(&mult, &shift, freq, NSEC_PER_SEC, maxsec);
> +
> + cpts->cc_mult = mult;
> + cpts->cc.mult = mult;
> + cpts->cc.shift = shift;
> + /* Check calculations and inform if not precise */

Contrary to this comment, you are not making any kind of judgment
about whether the calculations are precise or not.

> + frac = 0;
> + ns = cyclecounter_cyc2ns(&cpts->cc, freq, cpts->cc.mask, &frac);
> +
> + dev_info(cpts->dev,
> +  "CPTS: ref_clk_freq:%u calc_mult:%u calc_shift:%u error:%lld 
> nsec/sec\n",
> +  freq, cpts->cc_mult, cpts->cc.shift, (ns - NSEC_PER_SEC));
> +}
> +

Thanks,
Richard


Re: [net v1] fib_rules: interface group matching

2016-09-14 Thread Vincent Bernat
 ❦ 14 septembre 2016 16:15 CEST, David Ahern  :

>> When a user wants to assign a routing table to a group of incoming
>> interfaces, the current solutions are:
>> 
>>  - one IP rule for each interface (scalability problems)
>>  - use of fwmark and devgroup matcher (don't work with internal route
>>lookups, used for example by RPF)
>>  - use of VRF devices (more complex)
>
> Why do you believe that? A VRF is a formalized grouping of interfaces
> that includes an API for locally generated traffic to specify which
> VRF/group to use. And, with the l3mdev rule you only need 1 rule for
> all VRFs regardless of the number which is the best solution to the
> scalability problem of adding rules per device/group/VRF.
>
> What use case are trying to solve?

Local processes have to be made aware of the VRF by binding to the
pseudo-device. Some processes may be tricked by LD_PRELOAD but some
won't (like stuff written in Go). Maybe I should just find a better way
to bind a process to a VRF without its cooperation.
-- 
Instrument your programs.  Measure before making "efficiency" changes.
- The Elements of Programming Style (Kernighan & Plauger)


Re: [PATCH 8/9] net: ethernet: ti: cpts: fix overflow check period

2016-09-14 Thread Richard Cochran
On Wed, Sep 14, 2016 at 04:02:30PM +0300, Grygorii Strashko wrote:
> @@ -427,9 +427,6 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
>   u64 ns;
>   u64 frac;
>  
> - if (cpts->cc_mult || cpts->cc.shift)
> - return;
> -
>   freq = clk_get_rate(cpts->refclk);
>  
>   /* Calc the maximum number of seconds which we can run before

This hunk has nothing to do with $subject.

> @@ -442,11 +439,20 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
>   else if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
>   maxsec = 600;
>  
> + /* Calc overflow check period (maxsec / 2) */
> + cpts->ov_check_period = (HZ * maxsec) / 2;
> + dev_info(cpts->dev, "cpts: overflow check period %lu\n",
> +  cpts->ov_check_period);
> +
> + if (cpts->cc_mult || cpts->cc.shift)
> + return;
> +
>   clocks_calc_mult_shift(&mult, &shift, freq, NSEC_PER_SEC, maxsec);
>  
>   cpts->cc_mult = mult;
>   cpts->cc.mult = mult;
>   cpts->cc.shift = shift;
> +

Nor does this.

Thanks,
Richard


>   /* Check calculations and inform if not precise */
>   frac = 0;
>   ns = cyclecounter_cyc2ns(&cpts->cc, freq, cpts->cc.mask, &frac);
> diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h
> index 47026ec..e0e4a62b 100644
> --- a/drivers/net/ethernet/ti/cpts.h
> +++ b/drivers/net/ethernet/ti/cpts.h
> @@ -97,9 +97,6 @@ enum {
>   CPTS_EV_TX,   /* Ethernet Transmit Event */
>  };
>  
> -/* This covers any input clock up to about 500 MHz. */
> -#define CPTS_OVERFLOW_PERIOD (HZ * 8)
> -
>  #define CPTS_FIFO_DEPTH 16
>  #define CPTS_MAX_EVENTS 32
>  
> @@ -127,6 +124,7 @@ struct cpts {
>   struct list_head events;
>   struct list_head pool;
>   struct cpts_event pool_data[CPTS_MAX_EVENTS];
> + unsigned long ov_check_period;
>  };
>  
>  int cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb);
> -- 
> 2.9.3
> 


RE: [RFC v3 00/22] Landlock LSM: Unprivileged sandboxing

2016-09-14 Thread David Laight
From: Mickaël Salaün
> Sent: 14 September 2016 08:24
...
> ## Why is seccomp-filter not enough?
> 
> A seccomp filter can access raw syscall arguments, which means that it
> is not possible to filter according to pointed data such as a file path.
> As demonstrated by the first version of this patch series, filtering at
> the syscall level is complicated (e.g. need to take care of race
> conditions). This is mainly because the access control checkpoints of
> the kernel are not at this high level but more underneath, at the LSM
> hook level. The LSM hooks are designed to handle this kind of check.
> This series uses this approach to leverage the ability of unprivileged
> users to limit themselves.

You cannot validate file path parameters during syscall entry.
It can only be done after the user buffer has been read into kernel memory.
(i.e. you must only access the buffer once.)

This has nothing to do with where the kernel does any access checks,
and everything to do with the fact that another thread/process can
modify the buffer after you have validated it.

David



Re: [net v1] fib_rules: interface group matching

2016-09-14 Thread David Ahern
On 9/14/16 8:25 AM, Vincent Bernat wrote:
>  ❦ 14 septembre 2016 16:15 CEST, David Ahern  :
> 
>>> When a user wants to assign a routing table to a group of incoming
>>> interfaces, the current solutions are:
>>>
>>>  - one IP rule for each interface (scalability problems)
>>>  - use of fwmark and devgroup matcher (don't work with internal route
>>>lookups, used for example by RPF)
>>>  - use of VRF devices (more complex)
>>
>> Why do you believe that? A VRF is a formalized grouping of interfaces
>> that includes an API for locally generated traffic to specify which
>> VRF/group to use. And, with the l3mdev rule you only need 1 rule for
>> all VRFs regardless of the number which is the best solution to the
>> scalability problem of adding rules per device/group/VRF.
>>
>> What use case are trying to solve?
> 
> Local processes have to be made aware of the VRF by binding to the
> pseudo-device. Some processes may be tricked by LD_PRELOAD but some
> won't (like stuff written in Go). Maybe I should just find a better way
> to bind a process to a VRF without its cooperation.
> 

What API are you using for interface groups? How does an app tell the kernel to 
use interface group 1 versus group 2?


LD_PRELOAD and overloading socket is an ad-hoc hack at best with many holes - 
as you have found.

We (Cumulus Linux) are using this cgroups patch:
   http://www.mail-archive.com/netdev@vger.kernel.org/msg93408.html

I want something formal like the cgroups patch or even the first idea of adding 
a default sk_bound_dev_if to the task struct:

https://github.com/dsahern/linux/commit/b3e5ccc291505c8a503edb20ea2c2b5e86bed96f

Parent-to-child inheritance of the setting is a requirement, as is the setting getting applied to all IPv4/v6 sockets without action by the process itself.

Still have some work to do to get a solution into the kernel.


