Re: [PATCH 1/4] sparc: bpf_jit: Use kmalloc_array() in bpf_jit_compile()

2016-09-03 Thread Daniel Borkmann

On 09/03/2016 06:36 PM, SF Markus Elfring wrote:

From: Markus Elfring 
Date: Sat, 3 Sep 2016 17:10:20 +0200

A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".

This issue was detected by using the Coccinelle software.


When you talk about "issue", could you please explain what concrete
"issue" you were seeing?!

This particular multiplication here is guaranteed to never overflow,
so at best a minor cleanup if you will.

(Do you actually have a sparc to test out your changes?)
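For reference, the overflow guard that kmalloc_array() adds boils down to roughly the
following (a sketch of the slab.h helper, not the exact implementation):

  /* Sketch: return NULL if n * size would overflow, else fall back to kmalloc(). */
  static inline void *kmalloc_array_sketch(size_t n, size_t size, gfp_t flags)
  {
          if (size != 0 && n > SIZE_MAX / size)
                  return NULL;
          return kmalloc(n * size, flags);
  }

In bpf_jit_compile() the multiplication is bounded anyway (flen is the classic BPF
program length, capped at BPF_MAXINSNS), which is why it cannot overflow here.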


Signed-off-by: Markus Elfring 
---
  arch/sparc/net/bpf_jit_comp.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c
index a6d9204..ced1393 100644
--- a/arch/sparc/net/bpf_jit_comp.c
+++ b/arch/sparc/net/bpf_jit_comp.c
@@ -372,7 +372,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
if (!bpf_jit_enable)
return;

-   addrs = kmalloc(flen * sizeof(*addrs), GFP_KERNEL);
+   addrs = kmalloc_array(flen, sizeof(*addrs), GFP_KERNEL);
if (addrs == NULL)
return;






Re: [PATCH v2 net-next 2/6] bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type

2016-09-01 Thread Daniel Borkmann

On 09/01/2016 09:44 AM, Peter Zijlstra wrote:

On Wed, Aug 31, 2016 at 02:50:39PM -0700, Alexei Starovoitov wrote:

+static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+ int src_reg, int ctx_off,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
+{
+   struct bpf_insn *insn = insn_buf;
+
+   BUILD_BUG_ON(FIELD_SIZEOF(struct perf_sample_data, period) != sizeof(u64));
+   switch (ctx_off) {
+   case offsetof(struct bpf_perf_event_data, sample_period):
+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct bpf_perf_event_data_kern, data)),
+ dst_reg, src_reg,
+ offsetof(struct bpf_perf_event_data_kern, data));
+   *insn++ = BPF_LDX_MEM(BPF_DW, dst_reg, dst_reg,
+ offsetof(struct perf_sample_data, period));
+   break;


OK, so that deals with us moving the period field in the structure, and
break compile if we'd change its size or remove it outright (highly
unlikely). In that latter case we can change this code to simply return
a (u64)0 and things would continue to 'work'.

Did I understand that correctly?


Yes, if a program accesses the sample_period member of the struct bpf_perf_event_data
context, then the verifier rewrites this into the above two loads to eventually fetch
the struct perf_sample_data's period into the given target register. As you said,
should the period field change size (or get removed), compilation would break
so this doesn't go unnoticed and the code can be adapted along with it. In the
(hopefully very unlikely) case the member gets removed, it could be replaced with
loading 0 (or some other, better workaround to derive it, if possible).
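Put differently, a program statement like 'x = ctx->sample_period' ends up doing the
following two dereferences after the rewrite (pseudo-C sketch of what the emitted
BPF_LDX instructions do):

  /* The program sees struct bpf_perf_event_data *, the kernel backs it with
   * struct bpf_perf_event_data_kern *.
   */
  struct bpf_perf_event_data_kern *kctx = (void *) ctx;
  struct perf_sample_data *data;
  u64 period;

  data   = kctx->data;     /* 1st load: fetch the perf_sample_data pointer */
  period = data->period;   /* 2nd load (BPF_DW): fetch the 64-bit period   */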


Re: [PATCH net-next 3/6] bpf: perf_event progs should only use preallocated maps

2016-08-29 Thread Daniel Borkmann

On 08/27/2016 04:31 AM, Alexei Starovoitov wrote:

Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.

Signed-off-by: Alexei Starovoitov 


Acked-by: Daniel Borkmann 
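For reference, hash maps are preallocated by default unless BPF_F_NO_PREALLOC is set;
a minimal user-space sketch of creating such a map via the raw syscall (illustrative
only, error handling omitted):

  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int create_prealloc_hash_map(void)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.map_type    = BPF_MAP_TYPE_HASH;
          attr.key_size    = sizeof(int);
          attr.value_size  = sizeof(long long);
          attr.max_entries = 1024;
          attr.map_flags   = 0;   /* default: elements preallocated; BPF_F_NO_PREALLOC
                                   * would opt out, which is what perf_event programs
                                   * must not use */

          return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
  }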


Re: [PATCH net-next 2/6] bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type

2016-08-29 Thread Daniel Borkmann

On 08/27/2016 04:31 AM, Alexei Starovoitov wrote:

Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)

The program visible context meta structure is
struct bpf_perf_event_data {
 struct pt_regs regs;
  __u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
   ... ctx->sample_period ...
   ... ctx->regs.ip ...
}

The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.

Signed-off-by: Alexei Starovoitov 


Two things I noticed below, otherwise for BPF bits:

Acked-by: Daniel Borkmann 

[...]


+static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type,
+   enum bpf_reg_type *reg_type)
+{
+   if (off < 0 || off >= sizeof(struct bpf_perf_event_data))
+   return false;
+   if (type != BPF_READ)
+   return false;
+   if (off % size != 0)
+   return false;
+   if (off == offsetof(struct bpf_perf_event_data, sample_period) &&
+   size != sizeof(u64))
+   return false;
+   if (size != sizeof(long))
+   return false;


Wouldn't this one rather need to be:

if (off == offsetof(struct bpf_perf_event_data, sample_period)) {
if (size != sizeof(u64))
return false;
} else {
if (size != sizeof(long))
return false;
}

Otherwise on 32bit accessing sample_period might fail?


+   return true;
+}
+
+static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+ int src_reg, int ctx_off,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
+{
+   struct bpf_insn *insn = insn_buf;
+
+   switch (ctx_off) {
+   case offsetof(struct bpf_perf_event_data, sample_period):


Would be good to add a test as we usually have done:

BUILD_BUG_ON(FIELD_SIZEOF(struct perf_sample_data, period) != 8);


+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct bpf_perf_event_data_kern, data)),
+ dst_reg, src_reg,
+ offsetof(struct bpf_perf_event_data_kern, data));
+   *insn++ = BPF_LDX_MEM(BPF_DW, dst_reg, dst_reg,
+ offsetof(struct perf_sample_data, period));
+   break;
+   default:
+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct bpf_perf_event_data_kern, regs)),
+ dst_reg, src_reg,
+ offsetof(struct bpf_perf_event_data_kern, regs));
+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(sizeof(long)),
+ dst_reg, dst_reg, ctx_off);
+   break;
+   }
+   return insn - insn_buf;
+}
+


Re: [PATCH net-next 1/6] bpf: support 8-byte metafield access

2016-08-29 Thread Daniel Borkmann

On 08/27/2016 04:31 AM, Alexei Starovoitov wrote:

The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.

Signed-off-by: Alexei Starovoitov 


Acked-by: Daniel Borkmann 


Re: linux-next: manual merge of the net-next tree with the net tree

2016-08-15 Thread Daniel Borkmann

On 08/15/2016 02:35 AM, Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

   kernel/bpf/verifier.c

between commit:

   747ea55e4f78 ("bpf: fix bpf_skb_in_cgroup helper naming")

from the net tree and commit:

   60d20f9195b2 ("bpf: Add bpf_current_task_under_cgroup helper")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary.


Looks good to me, thanks!


Re: [PATCH 1/1 linux-next] Documentation/features/core: update HAVE_BPF_JIT

2016-08-12 Thread Daniel Borkmann

On 08/12/2016 11:38 PM, Fabian Frederick wrote:

Update documentation according to commit 606b5908
("bpf: split HAVE_BPF_JIT into cBPF and eBPF variant")

Signed-off-by: Fabian Frederick 


Thanks, Fabian!

Acked-by: Daniel Borkmann 


Re: [PATCH 2/2] net: sched: convert qdisc linked list to hashtable

2016-08-12 Thread Daniel Borkmann

On 08/12/2016 04:36 PM, Jiri Kosina wrote:

On Fri, 12 Aug 2016, Daniel Borkmann wrote:


I was thinking about something like the patch below (the reasoning being
that ->dev would be NULL only in cases of singletonish qdiscs) ...
wouldn't that also fix the issue you're seeing? Have to think it
through a little bit more ..


Ahh, so this has the same effect as previously observed with the other fix.


Thanks a lot for confirming that this fixes the panic. I still have to
think a little bit more about this though.


Perhaps it's just a dumping issue, but to the below clsact, there shouldn't
be pfifo_fast instances appearing.

# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
# tc qdisc add dev wlp2s0b1 clsact
# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact : parent :fff1
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1


Hmm, no immediate idea where those are coming from, we'll have to figure
it out. The mq device used here has 4 queues, right?


Yes, the first tc qdisc show is after boot and how it should normally
look, so 4 tx queues.

# ls /sys/class/net/wlp2s0b1/queues/
rx-0  tx-0  tx-1  tx-2  tx-3

When adding clsact, only the 'qdisc clsact' line should be extra. Given
the extra pfifo_fast ones look the same as above, I would suspect a htab
dumping issue, perhaps. I can debug a bit later tonight on this.


Re: [PATCH 2/2] net: sched: convert qdisc linked list to hashtable

2016-08-12 Thread Daniel Borkmann

On 08/12/2016 03:53 PM, Jiri Kosina wrote:

On Fri, 12 Aug 2016, Daniel Borkmann wrote:


This results in below panic. Tested reverting this patch and it fixes
the panic.

Did you test this also with ingress or clsact qdisc (just try adding
it to lo dev for example) ?


Hi Daniel,

thanks for the report. Hmm, I am pretty sure clsact worked for me, but
I'll recheck.


What happens is the following in qdisc_match_from_root():

[  995.422187] XXX qdisc:88025e4fc800 queue:880262759000 dev:880261cc2000 handle:
[  995.422200] XXX qdisc:81cf8100 queue:81cf8240 dev:   (null) handle:

I believe this is due to dev_ingress_queue_create() assigning the
global noop_qdisc instance as qdisc_sleeping, which later qdisc_lookup()
uses for qdisc_match_from_root().

But everything that uses things like noop_qdisc cannot work with the
new qdisc_match_from_root(), because qdisc_dev(root) will always trigger
NULL pointer dereference there. Reason is because the dev is always
NULL for noop, it's a singleton, see noop_qdisc and noop_netdev_queue
in sch_generic.c.

Now how to fix it? Creating separate noop instances each time it's set
would be quite a waste of memory. Even fuglier would be to hack a static
net device struct into sch_generic.c and let noop_netdev_queue point there
to get to the hash table. Or we could just not use qdisc_dev().


How about we actually extend a little bit the TCQ_F_BUILTIN special case
test in qdisc_match_from_root()?

After the change, the only way qdisc_dev() could be NULL should be the
TCQ_F_BUILTIN case, right?

I was thinking about something like the patch below (the reasoning being
that ->dev would be NULL only in cases of singletonish qdiscs) ...
wouldn't that also fix the issue you're seeing? Have to think it through a
little bit more ..


Ahh, so this has the same effect as previously observed with the other fix.

Perhaps it's just a dumping issue, but to the below clsact, there shouldn't
be pfifo_fast instances appearing.

# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
# tc qdisc add dev wlp2s0b1 clsact
# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact : parent :fff1
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1


diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 25aada7..1c9faed 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -260,6 +260,9 @@ static struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
  {
struct Qdisc *q;

+   if (!qdisc_dev(root))
+   return (root->handle == handle ? root : NULL);
+
if (!(root->flags & TCQ_F_BUILTIN) &&
root->handle == handle)
return root;


Thanks!





Re: [PATCH 2/2] net: sched: convert qdisc linked list to hashtable

2016-08-12 Thread Daniel Borkmann

Hi Jiri,

On 08/10/2016 11:05 AM, Jiri Kosina wrote:

From: Jiri Kosina 

Convert the per-device linked list into a hashtable. The primary
motivation for this change is that currently, we're not tracking all the
qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
performed over the linked list by qdisc_match_from_root() is rather
expensive.

The ultimate goal is to get rid of hidden qdiscs completely, which will
bring much more determinism in user experience.

Reviewed-by: Cong Wang 
Signed-off-by: Jiri Kosina 


This results in below panic. Tested reverting this patch and it fixes
the panic.

Did you test this also with ingress or clsact qdisc (just try adding
it to lo dev for example) ?

What happens is the following in qdisc_match_from_root():

[  995.422187] XXX qdisc:88025e4fc800 queue:880262759000 dev:880261cc2000 handle:
[  995.422200] XXX qdisc:81cf8100 queue:81cf8240 dev:   (null) handle:

I believe this is due to dev_ingress_queue_create() assigning the
global noop_qdisc instance as qdisc_sleeping, which later qdisc_lookup()
uses for qdisc_match_from_root().

But everything that uses things like noop_qdisc cannot work with the
new qdisc_match_from_root(), because qdisc_dev(root) will always trigger
NULL pointer dereference there. Reason is because the dev is always
NULL for noop, it's a singleton, see noop_qdisc and noop_netdev_queue
in sch_generic.c.

Now how to fix it? Creating separate noop instances each time it's set
would be quite a waste of memory. Even fuglier would be to hack a static
net device struct into sch_generic.c and let noop_netdev_queue point there
to get to the hash table. Or we could just not use qdisc_dev().

I've tried the below to hand in dev pointer instead of qdisc_dev(), but
I think this is not sound yet. Despite fixing the panic, I get something
weird like:

# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact : parent :fff1
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

Not yet sure whether this below, really quickly hacked patch is just buggy
(I guess so) or it's another side effect of the original patch.

If you have some cycles to take a look into fixing the panic, would be great.

Thanks

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 25aada7..c2c9799 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -256,7 +256,8 @@ int qdisc_set_default(const char *name)
  * Note: caller either uses rtnl or rcu_read_lock()
  */

-static struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
+static struct Qdisc *qdisc_match_from_root(struct net_device *dev,
+  struct Qdisc *root, u32 handle)
 {
struct Qdisc *q;

@@ -264,7 +265,7 @@ static struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
root->handle == handle)
return root;

-   hash_for_each_possible_rcu(qdisc_dev(root)->qdisc_hash, q, hash, handle) {
+   hash_for_each_possible_rcu(dev->qdisc_hash, q, hash, handle) {
if (q->handle == handle)
return q;
}
@@ -296,12 +297,12 @@ struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
 {
struct Qdisc *q;

-   q = qdisc_match_from_root(dev->qdisc, handle);
+   q = qdisc_match_from_root(dev, dev->qdisc, handle);
if (q)
goto out;

if (dev_ingress_queue(dev))
-   q = qdisc_match_from_root(
+   q = qdisc_match_from_root(dev,
dev_ingress_queue(dev)->qdisc_sleeping,
handle);
 out:
@@ -1430,8 +1431,8 @@ err_out:
return -EINVAL;
 }

-static int tc_dump_qdisc_root(struct Qdisc *root, struct sk_buff *skb,
- struct netlink_callback *cb,
+static int tc_dump_qdisc_root(struct net_device *dev, struct Qdisc *root,
+ struct sk_buff *skb, struct netlink_callback *cb,
  int *q_idx_p, int s_q_idx)
 {
int ret = 0, q_idx = *q_idx_p;
@@ -1451,7 +1452,7 @@ static int tc_dump_qdisc_root(struct Qdisc *root, struct sk_buff *skb,
goto done;
q_idx++;
}
-   hash_for_each(qdisc_dev(root)->qdisc_hash, b, q, hash) {
+   hash_for_each(dev->qdisc_hash, b, q, hash) {
if (q_idx < s_q_id

Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue

2016-08-04 Thread Daniel Borkmann

On 08/05/2016 12:54 AM, Andi Kleen wrote:

+1, I tried to bring this up here [1] in the last spin. I think only very
few changes would be needed, f.e. on the eBPF side to add a queue setting
helper function, which is probably a straightforward ~10loc patch; and with
regards to actually picking it up after clsact egress, we'd need to adapt
__netdev_pick_tx() slightly when CONFIG_XPS so it doesn't override it.


You're proposing to rewrite the whole net policy manager as EBPF and run
it in a crappy JITer? Is that a serious proposal? It just sounds crazy
to me.

Especially since we already have a perfectly good compiler and
programming language to write system code in.

EBPF is ok for temporal instrumentation (if you somehow can accept
its security challenges), but using it to replace core
kernel functionality (which network policy IMHO is) with some bizarre
JITed setup and multiple languages doesn't really make any sense.

Especially it doesn't make sense for anything with shared state,
which is the core part of network policy: it negotiates with multiple
users.

After all we're writing Linux here and not some research toy.


From what I read I guess you didn't really bother to look any deeper into
this bizarre "research toy" to double check some of your claims. One of the
things it's often deployed for by the way is defining policy. And the
suggestion here was merely to explore existing infrastructure around things
like tc and whether it already resolves at least a part of your net policy
manager's requirements (like queue selection) or whether existing infrastructure
can be extended with fewer complexity this way (as was mentioned with a new
cls module as one option).


Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue

2016-08-04 Thread Daniel Borkmann

On 08/04/2016 10:21 PM, John Fastabend wrote:

On 16-08-04 12:36 PM, kan.li...@intel.com wrote:

From: Kan Liang 

To achieve better network performance, the key step is to distribute the
packets to dedicated queues according to policy and system run time
status.

This patch provides an interface which can return the proper dedicated
queue for a socket/task. Then the packets of the socket/task will be
redirected to the dedicated queue for better network performance.

For selecting the proper queue, currently it uses round-robin algorithm
to find the available object from the given policy object list. The
algorithm is good enough for now. But it could be improved by some
adaptive algorithm later.

The selected object will be stored in hashtable. So it does not need to
go through the whole object list every time.

Signed-off-by: Kan Liang 
---
  include/linux/netpolicy.h |   5 ++
  net/core/netpolicy.c  | 136 ++
  2 files changed, 141 insertions(+)


There is a hook in the tx path now (recently added)

# ifdef CONFIG_NET_EGRESS
 if (static_key_false(&egress_needed)) {
 skb = sch_handle_egress(skb, &rc, dev);
 if (!skb)
 goto out;
 }
# endif

that allows pushing any policy you like for picking tx queues. It would
be better to use this mechanism. The hook runs 'tc' classifiers so
either write a new ./net/sch/cls_*.c for this or just use ebpf to stick
your policy in at runtime.

I'm out of the office for a few days but when I get back I can test that
it actually picks the selected queue in all cases; I know there was an
issue with some of the drivers using select_queue awhile back.


+1, I tried to bring this up here [1] in the last spin. I think only very
few changes would be needed, f.e. on the eBPF side to add a queue setting
helper function, which is probably a straightforward ~10loc patch; and with
regards to actually picking it up after clsact egress, we'd need to adapt
__netdev_pick_tx() slightly when CONFIG_XPS so it doesn't override it.

  [1] http://www.spinics.net/lists/netdev/msg386953.html
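To make the "~10loc patch" above concrete, a purely hypothetical sketch of such a
queue-setting helper could look as follows (the helper, its name and its proto are
made up here; nothing like this exists at this point):

  /* Hypothetical sketch only -- illustrates the rough size of the change. */
  static u64 bpf_skb_set_queue(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
  {
          struct sk_buff *skb = (struct sk_buff *)(long) r1;
          u32 queue = (u32) r2;

          if (skb->dev && queue >= skb->dev->real_num_tx_queues)
                  return -EINVAL;

          skb_set_queue_mapping(skb, queue);
          return 0;
  }

  static const struct bpf_func_proto bpf_skb_set_queue_proto = {
          .func           = bpf_skb_set_queue,
          .gpl_only       = false,
          .ret_type       = RET_INTEGER,
          .arg1_type      = ARG_PTR_TO_CTX,
          .arg2_type      = ARG_ANYTHING,
  };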


Re: [RFC 4/4] bpf: Restrict Checmate bpf programs to current kernel ABI

2016-08-04 Thread Daniel Borkmann

On 08/04/2016 11:52 AM, Daniel Borkmann wrote:

On 08/04/2016 09:12 AM, Sargun Dhillon wrote:

I think it makes sense to restrict Checmate to loading programs that have been
compiled with the current kernel ABI. We can further stabilize the ABI, and
perhaps lift this restriction later.

Signed-off-by: Sargun Dhillon 
---
  kernel/bpf/syscall.c | 2 +-
  samples/bpf/checmate1_kern.c | 3 ++-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..2a37b4d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -741,7 +741,7 @@ static int bpf_prog_load(union bpf_attr *attr)
  if (attr->insn_cnt >= BPF_MAXINSNS)
  return -EINVAL;

-if (type == BPF_PROG_TYPE_KPROBE &&
+if ((type & (BPF_PROG_TYPE_KPROBE | BPF_PROG_TYPE_CHECMATE)) &&
  attr->kern_version != LINUX_VERSION_CODE)


Btw, this check is correct, program types are not masks.


Sorry, I meant to write *not* correct, which was hopefully inferable from
the rest.


BPF_PROG_TYPE_KPROBE (== 2) and BPF_PROG_TYPE_CHECMATE (== 7) will now
require every type to have a version code ...


  return -EINVAL;



Re: [RFC 4/4] bpf: Restrict Checmate bpf programs to current kernel ABI

2016-08-04 Thread Daniel Borkmann

On 08/04/2016 09:12 AM, Sargun Dhillon wrote:

I think it makes sense to restrict Checmate to loading programs that have been
compiled with the current kernel ABI. We can further stabilize the ABI, and
perhaps lift this restriction later.

Signed-off-by: Sargun Dhillon 
---
  kernel/bpf/syscall.c | 2 +-
  samples/bpf/checmate1_kern.c | 3 ++-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..2a37b4d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -741,7 +741,7 @@ static int bpf_prog_load(union bpf_attr *attr)
if (attr->insn_cnt >= BPF_MAXINSNS)
return -EINVAL;

-   if (type == BPF_PROG_TYPE_KPROBE &&
+   if ((type & (BPF_PROG_TYPE_KPROBE | BPF_PROG_TYPE_CHECMATE)) &&
attr->kern_version != LINUX_VERSION_CODE)


Btw, this check is correct, program types are not masks.

BPF_PROG_TYPE_KPROBE (== 2) and BPF_PROG_TYPE_CHECMATE (== 7) will now
require every type to have a version code ...


return -EINVAL;

diff --git a/samples/bpf/checmate1_kern.c b/samples/bpf/checmate1_kern.c
index f78b66b..d4ec1fa 100644
--- a/samples/bpf/checmate1_kern.c
+++ b/samples/bpf/checmate1_kern.c
@@ -3,6 +3,7 @@
  #include 
  #include 
  #include "bpf_helpers.h"
+#include 

  SEC("checmate")
  int prog(struct checmate_ctx *ctx)
@@ -24,4 +25,4 @@ int prog(struct checmate_ctx *ctx)
  }

  char _license[] SEC("license") = "GPL";
-
+u32 _version SEC("version") = LINUX_VERSION_CODE;





Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM

2016-08-04 Thread Daniel Borkmann

Hi Sargun,

On 08/04/2016 09:11 AM, Sargun Dhillon wrote:
[...]

[It's a] minor LSM. My particular use case is one in which containers are being
dynamically deployed to machines by internal developers in a different group.

[...]

For many of these containers, the security policies can be fairly nuanced. One
particular one to take into account is network security. Often times,
administrators want to prevent ingress, and egress connectivity except from a
few select IPs. Egress filtering can be managed using net_cls, but without
modifying running software, it's non-trivial to attach a filter to all sockets
being created within a container. The inet_conn_request, socket_recvmsg,
socket_sock_rcv_skb hooks make this trivial to implement.


I'm not too familiar with LSMs, but afaik, when you install such policies they
are effectively global, right? How would you install/manage such policies per
container?

On a quick glance, this would then be the job of the BPF proglet from the global
hook, no? If yes, then the BPF contexts the BPF prog works with seem rather
quite limited ...

+struct checmate_file_open_ctx {
+   struct file *file;
+   const struct cred *cred;
+};
+
+struct checmate_task_create_ctx {
+   unsigned long clone_flags;
+};
+
+struct checmate_task_free_ctx {
+   struct task_struct *task;
+};
+
+struct checmate_socket_connect_ctx {
+   struct socket *sock;
+   struct sockaddr *address;
+   int addrlen;
+};

... or are you using bpf_probe_read() in some way to walk 'current' to retrieve
a namespace from there somehow? Like via nsproxy? But how you make sense of this
for defining a per container policy?

Do you see a way where we don't need to define so many different ctx each time?

My other concern from a security PoV is that when using things like bpf_probe_read()
we're dependent on kernel structs, and there's a risk that when people migrate such
policies, expectations break due to the underlying structs having changed. I see you've
addressed that in patch 4 to place a small stone in the way, which kinda works. It's
mostly a reminder that this is not a stable ABI.

Thanks,
Daniel


Re: [PATCH net-next v7 1/2] bpf: Add bpf_probe_write_user BPF helper to be called in tracers

2016-07-25 Thread Daniel Borkmann

On 07/25/2016 02:54 PM, Sargun Dhillon wrote:

This allows user memory to be written to during the course of a kprobe.
It shouldn't be used to implement any kind of security mechanism
because of TOC-TOU attacks, but rather to debug, divert, and
manipulate execution of semi-cooperative processes.


[...]

v7 looks good to me now as well, thanks a bunch Sargun!
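For illustration, a minimal kprobe program using the bpf_probe_write_user helper from
this series might look roughly like the below (attach point, probed argument and the
written bytes are illustrative assumptions, not the actual tracex7 sample):

  #include <uapi/linux/bpf.h>
  #include <linux/ptrace.h>
  #include <linux/version.h>
  #include "bpf_helpers.h"

  /* Sketch: overwrite the start of the user buffer passed to write(2). */
  SEC("kprobe/sys_write")
  int probe_write_demo(struct pt_regs *ctx)
  {
          void *ubuf = (void *) PT_REGS_PARM2(ctx);
          char patch[] = "bpf";

          /* may fail with -EPERM (kthread/KERNEL_DS) or -EFAULT */
          bpf_probe_write_user(ubuf, patch, sizeof(patch) - 1);
          return 0;
  }

  char _license[] SEC("license") = "GPL";
  u32 _version SEC("version") = LINUX_VERSION_CODE;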


Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)

2016-07-22 Thread Daniel Borkmann

On 07/22/2016 04:14 AM, Alexei Starovoitov wrote:

On Thu, Jul 21, 2016 at 06:09:17PM -0700, Sargun Dhillon wrote:

This allows user memory to be written to during the course of a kprobe.
It shouldn't be used to implement any kind of security mechanism
because of TOC-TOU attacks, but rather to debug, divert, and
manipulate execution of semi-cooperative processes.

Although it uses probe_kernel_write, we limit the address space
the probe can write into by checking the space with access_ok.
This is so the call doesn't sleep.

Given this feature is experimental, and has the risk of crashing
the system, we print a warning on invocation.

It was tested with the tracex7 program on x86-64.

Signed-off-by: Sargun Dhillon 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
---
  include/uapi/linux/bpf.h  | 12 
  kernel/bpf/verifier.c |  9 +
  kernel/trace/bpf_trace.c  | 37 +
  samples/bpf/bpf_helpers.h |  2 ++
  4 files changed, 60 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2b7076f..4536282 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -365,6 +365,18 @@ enum bpf_func_id {
 */
BPF_FUNC_get_current_task,

+   /**
+* bpf_probe_write(void *dst, void *src, int len)
+* safely attempt to write to a location
+* @dst: destination address in userspace
+* @src: source address on stack
+* @len: number of bytes to copy
+* Return:
+*   Returns number of bytes that could not be copied.
+*   On success, this will be zero


that is odd comment.
there are only three possible return values 0, -EFAULT, -EPERM


Agree.


+*/
+   BPF_FUNC_probe_write,
+
__BPF_FUNC_MAX_ID,
  };

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f72f23b..6785008 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1154,6 +1154,15 @@ static int check_call(struct verifier_env *env, int func_id)
return -EINVAL;
}

+   if (func_id == BPF_FUNC_probe_write) {
+   pr_warn_once("\n");
+   pr_warn_once("* bpf_probe_write: Experimental Feature in use *\n");
+   pr_warn_once("* bpf_probe_write: Feature may corrupt memory  *\n");
+   pr_warn_once("\n");
+   pr_notice_ratelimited("bpf_probe_write in use by: %.16s-%d",
+ current->comm, task_pid_nr(current));
+   }


I think a single line pr_notice_ratelimited() with 'feature may corrupt user memory'
will be enough.


Agree.


Also please move this to tracing specific part into bpf_trace.c
similar to bpf_get_trace_printk_proto() instead of verifier.c


Yes, sorry for not being too clear about it, this spot will then be
called by the verifier to fetch it for every function call. Meaning that
this could get printed multiple times for loading a single program, but
I think it's okay. If single line, I'd make that pr_warn_ratelimited(),
and probably something like ...

 "note: %s[%d] is installing a program with bpf_probe_write helper that may corrupt 
user memory!",
 current->comm, task_pid_nr(current)


+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   void *unsafe_ptr = (void *) (long) r1;
+   void *src = (void *) (long) r2;
+   int size = (int) r3;
+   struct task_struct *task = current;
+
+   /*


bpf_trace.c follows non-net comment style, so it's good here.
just distracting vs the rest of net style.


+* Ensure we're in a user context which it is safe for the helper
+* to run. This helper has no business in a kthread
+*
+* access_ok should prevent writing to non-user memory, but on
+* some architectures (nommu, etc...) access_ok isn't enough
+* So we check the current segment
+*/
+
+   if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD)))
+   return -EPERM;


Should we also add a check for !PF_EXITING ?
Like signals are not delivered to such tasks and I'm not sure
what would be the state of mm of it.


I agree, good point.

You can make that 'current->flags & (PF_KTHREAD | PF_EXITING)' and
we don't need the extra task var either.


+   if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
+   return -EPERM;
+   if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
+   return -EPERM;
+
+   return probe_kernel_write(unsafe_ptr, src, size);
+}
+
+static const struct bpf_func_proto bpf_probe_write_proto = {
+   .func   = bpf_probe_write,
+   .gpl_only   = true,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_ANYTHING,

[PATCH net-next] bpf, events: fix offset in skb copy handler

2016-07-21 Thread Daniel Borkmann
This patch fixes the __output_custom() routine we currently use with
bpf_skb_copy(). I missed that when len is larger than the size of the
current handle, we can issue multiple invocations of copy_func, and
__output_custom() advances destination but also source buffer by the
written amount of bytes. When we have __output_custom(), this is actually
wrong since in that case the source buffer points to a non-linear object,
in our case an skb, which the copy_func helper is supposed to walk.
Therefore, since this is non-linear we thus need to pass the offset into
the helper, so that copy_func can use it for extracting the data from
the source object.

Therefore, adjust the callback signatures properly and pass offset
into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
__DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things:
i) to pass in whether we should advance source buffer or not; this is
a compile-time constant condition, ii) to pass in the offset for
__output_custom(), which we do with help of __VA_ARGS__, so everything
can stay inlined as is currently. Both changes allow for adapting the
__output_* fast-path helpers w/o extra overhead.

Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h|  2 +-
 include/linux/perf_event.h |  2 +-
 kernel/events/internal.h   | 15 ++-
 net/core/filter.c  |  4 ++--
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 36da074..1113423 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -211,7 +211,7 @@ bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *f
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
-   unsigned long len);
+   unsigned long off, unsigned long len);
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e79e6c6..15e55b7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -70,7 +70,7 @@ struct perf_callchain_entry_ctx {
 };
 
 typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
-unsigned long len);
+unsigned long off, unsigned long len);
 
 struct perf_raw_frag {
union {
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 2417eb5..486fd78 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -123,18 +123,19 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
return rb->aux_nr_pages << PAGE_SHIFT;
 }
 
-#define __DEFINE_OUTPUT_COPY_BODY(memcpy_func) \
+#define __DEFINE_OUTPUT_COPY_BODY(advance_buf, memcpy_func, ...)   \
 {  \
unsigned long size, written;\
\
do {\
size= min(handle->size, len);   \
-   written = memcpy_func(handle->addr, buf, size); \
+   written = memcpy_func(__VA_ARGS__); \
written = size - written;   \
\
len -= written; \
handle->addr += written;\
-   buf += written; \
+   if (advance_buf)\
+   buf += written; \
handle->size -= written;\
if (!handle->size) {\
struct ring_buffer *rb = handle->rb;\
@@ -153,12 +154,16 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
 static inline unsigned long\
 func_name(struct perf_output_handle *handle,   \
  const void *buf, unsigned long len)   \
-__DEFINE_OUTPUT_COPY_BODY(memcpy_func)
+__DEFINE_OUTPUT_COPY_BODY(true, memcpy_func, handle->addr, buf, size)
 
 static inline unsigned long
 __output_custom(struct
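The net/core/filter.c part of the diff is cut off above; the adjusted bpf_skb_copy()
callback essentially just honours the new offset when walking the (possibly
non-linear) skb, roughly like this (reconstructed sketch, not the quoted hunk):

  static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
                                    unsigned long off, unsigned long len)
  {
          /* walk the skb starting at 'off' instead of always at 0 */
          void *ptr = skb_header_pointer((struct sk_buff *) skb, off, len, dst_buff);

          if (unlikely(!ptr))
                  return len;     /* nothing copied */
          if (ptr != dst_buff)
                  memcpy(dst_buff, ptr, len);

          return 0;               /* everything copied */
  }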

Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)

2016-07-21 Thread Daniel Borkmann

On 07/21/2016 12:47 PM, Sargun Dhillon wrote:

On Thu, Jul 21, 2016 at 01:00:51AM +0200, Daniel Borkmann wrote:

[...]

I don't really like couple of things, your ifdef CONFIG_MMU might not be
needed I think, couple of these checks seem redundant, (I'm not yet sure
about the task->mm != task->active_mm thingy), the helper should definitely
be gpl_only and ARG_PTR_TO_RAW_STACK is just buggy. Also, this should be
a bit analogue to bpf_probe_read we have. How about something roughly along
the lines of below diff (obviously needs extensive testing ...)? This
can still do all kind of ugly crap to the user process, but limited to
the cap_sys_admin to shoot himself in the foot.

* You're right about CONFIG_MMU. We don't need it, all of the nommu platforms
properly deal with it from my research.


The segment_eq() test should generically catch this from what I see.


It was always ARG_PTR_TO_STACK? Or are you saying ARG_PTR_TO_STACK is buggy and
we should make it ARG_PTR_TO_RAW_STACK?


No, in your patch, you had '+.arg2_type= ARG_PTR_TO_RAW_STACK,',
which is not correct as it means you don't need to initialize the memory
you pass in for your *src pointer. I believe you took this over from
probe_read(), but there it's correct. ARG_PTR_TO_STACK means the verifier
checks that it's initialized with data.


I originally named the function bpf_probe_write. Upon further thought I don't
think that makes sense. The reason is because bpf_probe_read is analogous to
probe_kernel_read. If we had bpf_probe_write, I think people might reason it to
be equivalent to probe_kernel_write, and be confused when they can't write to
kernel space.


I still think that bpf_probe_write is the more appropriate name, and that
the -EPERM are also better than -EINVAL. For user space, you'll have the
bpf_probe_read() / bpf_probe_write() pair you can use, which is the more
intuitive complement, also since people might already use bpf_probe_read(),
so they kind of can infer its meaning. It's just that the kernel doesn't
give you _permission_ to mess with kernel memory, hence due to not being
allowed -EPERM to make this absolutely clear to the user that this is illegal.
-EINVAL usually means one of the function arguments might be wrong, I think
-EPERM is a better / more clear fit in this case, imho.


I tried to make the external facing documentation close to copy_to_user. That's
how people should use it, not like _write. Therefore I think it makes sense to
keep that name.


But still, you *probe* to write somewhere to the process' address space,
so it can still fail with -EFAULT. Other than that, see comment above.


I added a check for (!task) -- it seems to be scattered throughout the eBPF
helper code. Alexei mentioned earlier that it can be NULL, but I'm not sure of
the case


Well, the test of unlikely(in_interrupt() || (task->flags & PF_KTHREAD))
will cover these cases. It makes sure that you're neither in hardirq (NULL
here) nor softirq and that you're not in a kthread.


RE: task->mm != task->active_mm -- There are a couple scenarios where kthreads
do this, and the only user call that should hit this is execve. There's only a
brief period where this can be true and I don't think it's worth dealing with
that case -- I'm not really sure you can plant a kprobe at the right site either


But if kthreads do this, wouldn't this already be covered by task->flags &
PF_KTHREAD test for such case?


Did some minimal testing with tracex7 and others.


Ok, great!


I was able to copy memory into other process's space while certain
syscalls were going on. I don't think that there are a reasonable set of
protections.


Right, it doesn't prevent that syscalls going on in parallel.


I'll do more testing. Any suggestions of what might break? I've looked at
writing to uninitialized memory, memory out of bounds, etc... Do you know of any
"weak spots"?


Well, you could write into text/data, stack, etc for the user process so
quite a bit. ;) Or do you mean wrt kernel space? If someone runs some binary
installing such a proglet, I think it would also make sense to print a message
(rate-limited) to the klog with current->comm, task_pid_nr(current) for the
process installing this, from verifier side I mean. Maybe makes more sense
than the print-once from the helper side.

Thanks,
Daniel


Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)

2016-07-20 Thread Daniel Borkmann

On 07/20/2016 11:58 AM, Sargun Dhillon wrote:
[...]

So, with that, what about the following:
It includes
-Desupporting no MMU platforms as we've deemed them incapable of being
  safe
-Checking that we're not in a kthread
-Checking that the active mm is the thread's mm
-A log message indicating the experimental nature of this helper

It does not include:
-A heuristic to determine if access_ok is broken, or if the platform
  didn't implement it. It seems all platforms with MMUs implement it today,
  and it seems clear that platforms should do something better than
  return 1, if they can


I don't really like couple of things, your ifdef CONFIG_MMU might not be
needed I think, couple of these checks seem redundant, (I'm not yet sure
about the task->mm != task->active_mm thingy), the helper should definitely
be gpl_only and ARG_PTR_TO_RAW_STACK is just buggy. Also, this should be
a bit analogue to bpf_probe_read we have. How about something roughly along
the lines of below diff (obviously needs extensive testing ...)? This
can still do all kind of ugly crap to the user process, but limited to
the cap_sys_admin to shoot himself in the foot.

 include/uapi/linux/bpf.h |  3 +++
 kernel/trace/bpf_trace.c | 30 ++
 2 files changed, 33 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2b7076f..4d339c6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -365,6 +365,9 @@ enum bpf_func_id {
 */
BPF_FUNC_get_current_task,

+   /* Doc goes here ... */
+   BPF_FUNC_probe_write,
+
__BPF_FUNC_MAX_ID,
 };

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a12bbd3..43a4386c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -81,6 +81,34 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
.arg3_type  = ARG_ANYTHING,
 };

+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   struct task_struct *task = current;
+   void *unsafe_ptr = (void *)(long) r1;
+   void *src = (void *)(long) r2;
+   int size = (int) r3;
+
+   if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD)))
+   return -EPERM;
+   if (segment_eq(get_fs(), KERNEL_DS))
+   return -EPERM;
+   if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
+   return -EPERM;
+
+   /* pr_warn_once() barks here ... */
+
+   return probe_kernel_write(unsafe_ptr, src, size);
+}
+
+static const struct bpf_func_proto bpf_probe_write_proto = {
+   .func   = bpf_probe_write,
+   .gpl_only   = true,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_ANYTHING,
+   .arg2_type  = ARG_PTR_TO_STACK,
+   .arg3_type  = ARG_CONST_STACK_SIZE,
+};
+
 /*
  * limited trace_printk()
  * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed
@@ -344,6 +372,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
return &bpf_map_delete_elem_proto;
case BPF_FUNC_probe_read:
return &bpf_probe_read_proto;
+   case BPF_FUNC_probe_write:
+   return &bpf_probe_write_proto;
case BPF_FUNC_ktime_get_ns:
return &bpf_ktime_get_ns_proto;
case BPF_FUNC_tail_call:
--
1.9.3


Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)

2016-07-20 Thread Daniel Borkmann

On 07/20/2016 05:02 AM, Alexei Starovoitov wrote:

On Wed, Jul 20, 2016 at 01:19:51AM +0200, Daniel Borkmann wrote:

On 07/19/2016 06:34 PM, Alexei Starovoitov wrote:

On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote:

+   return -EINVAL;
+
+   /* Is this a user address, or a kernel address? */
+   if (!access_ok(VERIFY_WRITE, to, size))
+   return -EINVAL;
+
+   return probe_kernel_write(to, from, size);


I'm still worried that this can lead to all kind of hard to find
bugs or races for user processes, if you make this writable to entire
user address space (which is the only thing that access_ok() checks
for). What if the BPF program has bugs and writes to wrong addresses
for example, introducing bugs in some other, non-intended processes
elsewhere? Can this be limited to syscalls only? And if so, to the
passed memory only?


my understanding that above code will write only to memory of current process,
so impact is contained and in that sense buggy kprobe program is no different
from buggy seccomp program.

Compared to seccomp, you might not notice that a race has happened,
in seccomp case you might have killed your process, which is visible.
But ok, in ptrace() case it might be similar issue perhaps ...

The asm-generic version does __access_ok(..) { return 1; } for nommu
case, I haven't checked closely enough whether there's actually an arch
that uses this, but for example arm nommu with only one addr space would
certainly result in access_ok() as 1, and then you could also go into
probe_kernel_write(), no?


good point. how arm nommu handles copy_to_user? if there is nommu


Should probably boil down to something similar as plain memcpy().


then there is no user/kernel mm ? Crazy archs.
I guess we have to disable this helper on all such archs.


Don't know that code well enough, but I believe the check would only
ensure in normal use-cases that user process doesn't fiddle with kernel
address space, but not necessarily guarantee that this really only
belongs to the process address space.


why? on x86 that's exactly what it does. access_ok=true means
it's a user space address and since we're in _this user context_
probe_kernel_write can only affect this user.


x86 code comments this with "note that, depending on architecture,
this function probably just checks that the pointer is in the user
space range - after calling this function, memory access functions may
still return -EFAULT".


Yes. I've read that comment too :)
Certainly not an expert, but the archs I've looked at, access_ok
has the same meaning as on x86. They check the space range to
make sure address doesn't belong to kernel.
Could I have missed something? Certainly. Please double check :)


Also, what happens in case of kernel thread?


my understanding if access_ok(addr)=true the addr will never point
to memory of kernel thread.


If you're coming from user context only, this should be true, it'll
check whether it's some user space pointer.


As it stands, it does ...

if (unlikely(in_interrupt()))
return -EINVAL;
if (unlikely(!task || !task->pid))
return -EINVAL;

So up to here, irq/sirq, NULL current and that current is not the 'idle'
process is being checked (still fail to see the point for the !task->pid,
I believe the intent here is different).

/* Is this a user address, or a kernel address? */
if (!access_ok(VERIFY_WRITE, to, size))
return -EINVAL;

Now here. What if it's a kernel thread? You'll have KERNEL_DS segment,
task->pid was non-zero as well for the kthread, so access_ok() will
pass and you can still execute probe_kernel_write() ...


I think user_addr_max() should be zero for kthread, but
worth checking for sure.


It's 0x, I did a quick test yesterday night with
creating a kthread, so access_ok() should pass for such case.


Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)

2016-07-19 Thread Daniel Borkmann

On 07/19/2016 06:34 PM, Alexei Starovoitov wrote:

On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote:

+   return -EINVAL;
+
+   /* Is this a user address, or a kernel address? */
+   if (!access_ok(VERIFY_WRITE, to, size))
+   return -EINVAL;
+
+   return probe_kernel_write(to, from, size);


I'm still worried that this can lead to all kind of hard to find
bugs or races for user processes, if you make this writable to entire
user address space (which is the only thing that access_ok() checks
for). What if the BPF program has bugs and writes to wrong addresses
for example, introducing bugs in some other, non-intended processes
elsewhere? Can this be limited to syscalls only? And if so, to the
passed memory only?


my understanding that above code will write only to memory of current process,
so impact is contained and in that sense buggy kprobe program is no different
from buggy seccomp program.


Compared to seccomp, you might not notice that a race has happened,
in seccomp case you might have killed your process, which is visible.
But ok, in ptrace() case it might be similar issue perhaps ...

The asm-generic version does __access_ok(..) { return 1; } for nommu
case, I haven't checked closely enough whether there's actually an arch
that uses this, but for example arm nommu with only one addr space would
certainly result in access_ok() as 1, and then you could also go into
probe_kernel_write(), no?

Don't know that code well enough, but I believe the check would only
ensure in normal use-cases that user process doesn't fiddle with kernel
address space, but not necessarily guarantee that this really only
belongs to the process address space.

x86 code comments this with "note that, depending on architecture,
this function probably just checks that the pointer is in the user
space range - after calling this function, memory access functions may
still return -EFAULT".

Also, what happens in case of kernel thread?

As it stands, it does ...

if (unlikely(in_interrupt()))
return -EINVAL;
if (unlikely(!task || !task->pid))
return -EINVAL;

So up to here, irq/sirq, NULL current and that current is not the 'idle'
process is being checked (still fail to see the point for the !task->pid,
I believe the intent here is different).

/* Is this a user address, or a kernel address? */
if (!access_ok(VERIFY_WRITE, to, size))
return -EINVAL;

Now here. What if it's a kernel thread? You'll have KERNEL_DS segment,
task->pid was non-zero as well for the kthread, so access_ok() will
pass and you can still execute probe_kernel_write() ...


Limiting this to syscalls will make it too limited.
I'm in favor of this change, because it allows us to experiment
with restartable sequences and lock-free algorithms that need ultrafast
access to cpuid without burdening the kernel with stable abi.


Have you played around with ptrace() to check whether you could
achieve similar functionality (was thinking about things like [1],
PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't
this be limited to a similar functionality for only the current task.
ptrace() utilizes helpers like access_process_vm(), maybe this can
similarly be adapted here, too (under the circumstances that sleeping
is not allowed)?


If we hack access_process_vm I think at the end it will look like
probe_kernel_write. Walking mm requires semaphore, so we would only
be able to do it in task_work and there we can do normal copy_to_user
just as well, but it will complicate the programs quite a bit, since
writes will be asynchronous and batched.
Looks like with access_ok+probe_write we can achieve the same
with a lot less code.


I believe it may not quite be the same as it currently stands. No
fundamental objection, just that this needs to be made "safe" to the
limits you state above yourself. ;)


As far as races between user and bpf program, yeah, if process
is multithreaded, the other threads may access the same mem that
bpf is writing to, but that's no different from reading.
For tracing we walk complex datastructures and pointers. They
can be changed by user space on the fly and bpf will see garbage.
Like we have uprobe+bpf that walks clang c++ internal datastructures
to figure out how long it takes clang to process .h headers.
There is a lot of fragility in the bpf script, yet it's pretty
useful to quickly analyze compile times.
I see bpf_copy_to_user to be hugely valuable too, not as a stable
interface, but rather as a tool to quickly debug and experiment.


Right, so maybe there should be a warn once to the dmesg log that this
is just experimental?

Thanks,
Daniel


Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)

2016-07-19 Thread Daniel Borkmann

Hi Sargun,

On 07/19/2016 11:32 AM, Sargun Dhillon wrote:

This allows user memory to be written to during the course of a kprobe.
It shouldn't be used to implement any kind of security mechanism
because of TOC-TOU attacks, but rather to debug, divert, and
manipulate execution of semi-cooperative processes.

Although it uses probe_kernel_write, we limit the address space
the probe can write into by checking the space with access_ok.
This call shouldn't sleep on any architectures based on review.

It was tested with the tracex7 program on x86-64.

Signed-off-by: Sargun Dhillon 
Cc: Alexei Starovoitov 

[...]

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ebfbb7d..45878f3 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -81,6 +81,35 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
.arg3_type  = ARG_ANYTHING,
  };

+static u64 bpf_copy_to_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   void *to = (void *) (long) r1;
+   void *from = (void *) (long) r2;
+   int  size = (int) r3;


Nit: you have two spaces here?


+   struct task_struct *task = current;
+
+   /* check if we're in a user context */
+   if (unlikely(in_interrupt()))
+   return -EINVAL;
+   if (unlikely(!task || !task->pid))


Why !task->pid is needed? Can you point to the code where this is
set up as such? I'd assume if that's the case, there's a generic
helper for it to determine that the task_struct is a kernel task?


+   return -EINVAL;
+
+   /* Is this a user address, or a kernel address? */
+   if (!access_ok(VERIFY_WRITE, to, size))
+   return -EINVAL;
+
+   return probe_kernel_write(to, from, size);


I'm still worried that this can lead to all kind of hard to find
bugs or races for user processes, if you make this writable to entire
user address space (which is the only thing that access_ok() checks
for). What if the BPF program has bugs and writes to wrong addresses
for example, introducing bugs in some other, non-intended processes
elsewhere? Can this be limited to syscalls only? And if so, to the
passed memory only?

Have you played around with ptrace() to check whether you could
achieve similar functionality (was thinking about things like [1],
PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't
this be limited to a similar functionality for only the current task.
ptrace() utilizes helpers like access_process_vm(), maybe this can
similarly be adapted here, too (under the circumstances that sleeping
is not allowed)?

  [1] https://code.google.com/archive/p/ptrace-parasite/
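For comparison, the ptrace() route mentioned above looks roughly like this from user
space (minimal sketch, error handling trimmed):

  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  /* Write one word into the target's address space at 'addr'. */
  static int poke_word(pid_t pid, void *addr, long val)
  {
          if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
                  return -1;
          waitpid(pid, NULL, 0);                          /* wait for the stop */
          if (ptrace(PTRACE_POKEDATA, pid, addr, (void *) val) == -1) {
                  ptrace(PTRACE_DETACH, pid, NULL, NULL);
                  return -1;
          }
          return ptrace(PTRACE_DETACH, pid, NULL, NULL);
  }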


+}
+
+static const struct bpf_func_proto bpf_copy_to_user_proto = {
+   .func = bpf_copy_to_user,
+   .gpl_only = false,
+   .ret_type = RET_INTEGER,
+   .arg1_type = ARG_ANYTHING,
+   .arg2_type = ARG_PTR_TO_RAW_STACK,
+   .arg3_type = ARG_CONST_STACK_SIZE,


Nit: please tab-align '=' like we have in the other *_proto cases in
bpf_trace.


+};
+
  /*
   * limited trace_printk()
   * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed


Thanks,
Daniel


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Daniel Borkmann

On 07/18/2016 08:30 PM, Liang, Kan wrote:

On 07/18/2016 08:55 AM, kan.li...@intel.com wrote:

[...]

On a higher level picture, why for example, a new cgroup in combination
with tc shouldn't be the ones resolving these policies on resource usage?


The NET policy doesn't support cgroup yet, but it's on my todo list.
The granularity for the device resource is per queue. The packet will be
redirected to the specific queue.
I'm not sure if cgroup with tc can do that.


Did you have a look at sch_mqprio, which can be used along with either
netprio cgroup or netcls cgroup plus tc on clsact's egress side to set
the priority for mqprio mappings from the application side? At least ixgbe,
i40e, fm10k have offload support for it, and a number of other nics do too. You
could also use cls_bpf for making the prio assignment if you need to
involve other meta data from the skb as well (like mark or prio derived from
sockets, etc). Maybe it doesn't cover all of what you need, but could be
a start to extend upon?
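As a rough illustration of the cls_bpf idea, a tiny tc/BPF classifier deriving the
priority (and thus the mqprio traffic class) from skb->mark could look like the
below sketch (the mark-to-prio mapping is made up, and it assumes skb->priority is
writable from a tc program):

  #include <uapi/linux/bpf.h>
  #include <uapi/linux/pkt_cls.h>
  #include "bpf_helpers.h"

  /* Sketch: map the lower bits of skb->mark to skb->priority, which the
   * mqprio prio->tc mapping can then use for queue selection.
   */
  SEC("classifier")
  int cls_prio_from_mark(struct __sk_buff *skb)
  {
          skb->priority = skb->mark & 0x7;
          return TC_ACT_OK;
  }

  char _license[] SEC("license") = "GPL";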

Thanks,
Daniel


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Daniel Borkmann

Hi Kan,

On 07/18/2016 08:55 AM, kan.li...@intel.com wrote:

From: Kan Liang 

It is a big challenge to get good network performance. First, the network
performance is not good with default system settings. Second, it is too
difficult to do automatic tuning for all possible workloads, since workloads
have different requirements. Some workloads may want high throughput. Some may
need low latency. Last but not least, there are lots of manual configurations.
Fine grained configuration is too difficult for users.

NET policy intends to simplify the network configuration and get good network
performance according to the hints (policies) applied by the user. It
provides some typical "policies" for the user which can be set per-socket, per-task
or per-device. The kernel will automatically figure out how to merge different
requests to get good network performance.
Net policy is designed for multiqueue network devices. This implementation is
only for Intel NICs using the i40e driver. But the concepts and generic code should
apply to other multiqueue NICs too.
Net policy is also a combination of generic policy manager code and some
ethtool callbacks (per queue coalesce setting, flow classification rules) to
configure the driver.
This series also supports CPU hotplug and device hotplug.

Here are some key Interfaces/APIs for NET policy.

/proc/net/netpolicy/$DEV/policy
User can set/get per device policy from /proc

/proc/$PID/net_policy
User can set/get per task policy from /proc
prctl(PR_SET_NETPOLICY, POLICY_NAME, NULL, NULL, NULL)
An alternative way to set/get per task policy is from prctl.

setsockopt(sockfd,SOL_SOCKET,SO_NETPOLICY,&policy,sizeof(int))
User can set/get per socket policy by setsockopt (see the usage sketch below)


int (*ndo_netpolicy_init)(struct net_device *dev,
  struct netpolicy_info *info);
Initialize device driver for NET policy

int (*ndo_get_irq_info)(struct net_device *dev,
struct netpolicy_dev_info *info);
Collect device irq information

int (*ndo_set_net_policy)(struct net_device *dev,
  enum netpolicy_name name);
Configure device according to policy name

netpolicy_register(struct netpolicy_reg *reg);
netpolicy_unregister(struct netpolicy_reg *reg);
NET policy API to register/unregister per task/socket net policy.
For each task/socket, an record will be created and inserted into an RCU
hash table.

netpolicy_pick_queue(struct netpolicy_reg *reg, bool is_rx);
NET policy API to find the proper queue for packet receiving and
transmitting.

netpolicy_set_rules(struct netpolicy_reg *reg, u32 queue_index,
 struct netpolicy_flow_spec *flow);
NET policy API to add flow director rules.
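
Usage from the application side could then look like the following
sketch (SO_NETPOLICY, PR_SET_NETPOLICY and NET_POLICY_LATENCY are only
defined by this RFC's patched uapi headers, not upstream):

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/socket.h>

static void set_latency_policy(int fd)
{
	int policy = NET_POLICY_LATENCY;	/* RFC-only identifier */

	/* per-socket policy */
	if (setsockopt(fd, SOL_SOCKET, SO_NETPOLICY, &policy, sizeof(policy)))
		perror("setsockopt(SO_NETPOLICY)");

	/* per-task policy */
	if (prctl(PR_SET_NETPOLICY, policy, 0, 0, 0))
		perror("prctl(PR_SET_NETPOLICY)");
}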

For using NET policy, the per-device policy must be set in advance. It will
automatically configure the system and re-organize the resources of the system
accordingly. For system configuration, in this series, it will disable irq
balance, set device queue irq affinity, and modify interrupt moderation. For
re-organizing the resources, the current implementation forces CPU and queue
irq into a 1:1 mapping. A 1:1 mapping group is also called a net policy object.
For each device policy, it maintains a policy list. Once the device policy is
applied, the objects will be inserted and tracked in that device policy list. The
policy list is only updated when cpu/device hotplug, queue number changes or
the device policy changes.
The user can use /proc, prctl and setsockopt to set per-task and per-socket
net policy. Once the policy is set, a related record will be inserted into an RCU
hash table. The record includes the ptr, the policy and the net policy object. The ptr
is the pointer address of the task/socket. The object will not be assigned until the
first packet is received/transmitted. The object is picked round-robin from the object
list. Once the object is determined, the following packets will be
redirected to the queue (object).
The object can be shared. The per-task or per-socket policy can be inherited.

Now NET policy supports four per device policies and three per task/socket
policies.
 - BULK policy: This policy is designed for high throughput. It can be
   applied to either per device policy or per task/socket policy.
 - CPU policy: This policy is designed for high throughput but lower CPU
   utilization. It can be applied to either per device policy or
   per task/socket policy.
 - LATENCY policy: This policy is designed for low latency. It can be
   applied to either per device policy or per task/socket policy.
 - MIX policy: This policy can only be applied to per device policy. This
   is designed for the case where miscellaneous types of workloads are running
   on the device.


I'm missing a bit of discussion on the existing facilities there are under
networking and why they cannot be adapted to support these kinds of hints?

On a h

[PATCH net-next v2 2/3] bpf, perf: split bpf_perf_event_output

2016-07-14 Thread Daniel Borkmann
Split the bpf_perf_event_output() helper as a preparation into
two parts. The new bpf_perf_event_output() will prepare the raw
record itself and test for unknown flags from BPF trace context,
where the __bpf_perf_event_output() does the core work. The
latter will be reused later on from bpf_event_output() directly.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 kernel/trace/bpf_trace.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 35ab1b2..c35883a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -233,26 +233,17 @@ static const struct bpf_func_proto 
bpf_perf_event_read_proto = {
.arg2_type  = ARG_ANYTHING,
 };
 
-static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+static __always_inline u64
+__bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
+   u64 flags, struct perf_raw_record *raw)
 {
-   struct pt_regs *regs = (struct pt_regs *) (long) r1;
-   struct bpf_map *map = (struct bpf_map *) (long) r2;
struct bpf_array *array = container_of(map, struct bpf_array, map);
unsigned int cpu = smp_processor_id();
u64 index = flags & BPF_F_INDEX_MASK;
-   void *data = (void *) (long) r4;
struct perf_sample_data sample_data;
struct bpf_event_entry *ee;
struct perf_event *event;
-   struct perf_raw_record raw = {
-   .frag = {
-   .size = size,
-   .data = data,
-   },
-   };
 
-   if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
-   return -EINVAL;
if (index == BPF_F_CURRENT_CPU)
index = cpu;
if (unlikely(index >= array->map.max_entries))
@@ -271,11 +262,29 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 
flags, u64 r4, u64 size)
return -EOPNOTSUPP;
 
perf_sample_data_init(&sample_data, 0, 0);
-   sample_data.raw = &raw;
+   sample_data.raw = raw;
perf_event_output(event, &sample_data, regs);
return 0;
 }
 
+static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+{
+   struct pt_regs *regs = (struct pt_regs *)(long) r1;
+   struct bpf_map *map  = (struct bpf_map *)(long) r2;
+   void *data = (void *)(long) r4;
+   struct perf_raw_record raw = {
+   .frag = {
+   .size = size,
+   .data = data,
+   },
+   };
+
+   if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
+   return -EINVAL;
+
+   return __bpf_perf_event_output(regs, map, flags, &raw);
+}
+
 static const struct bpf_func_proto bpf_perf_event_output_proto = {
.func   = bpf_perf_event_output,
.gpl_only   = true,
-- 
1.9.3



[PATCH net-next v2 3/3] bpf: avoid stack copy and use skb ctx for event output

2016-07-14 Thread Daniel Borkmann
This work addresses a couple of issues the bpf_skb_event_output()
helper currently has: i) We need two copies instead of just a
single one for the skb data when it should be part of a sample.
The data can be non-linear and thus needs to be extracted via
bpf_skb_load_bytes() helper first, and then copied once again
into the ring buffer slot. ii) Since bpf_skb_load_bytes()
currently needs to be used first, the helper needs to see a
constant size on the passed stack buffer to make sure BPF
verifier can do sanity checks on it during verification time.
Thus, just passing skb->len (or any other non-constant value)
wouldn't work, but changing bpf_skb_load_bytes() is also not
the proper solution, since the two copies are generally still
needed. iii) bpf_skb_load_bytes() is just for rather small
buffers like headers, since they need to sit on the limited
BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
this work improves the bpf_skb_event_output() helper to address
all 3 at once.

We can make use of the passed in skb context that we have in
the helper anyway, and use some of the reserved flag bits as
a length argument. The helper will use the new __output_custom()
facility from perf side with bpf_skb_copy() as callback helper
to walk and extract the data. It will pass the data for setup
to bpf_event_output(), which generates and pushes the raw record
with an additional frag part. The linear data used in the first
frag of the record serves as programmatically defined meta data
passed along with the appended sample.
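
From a tc/cls_bpf program the usage then roughly looks like the
following sketch, where the skb sample length sits in the upper bits
of the flags argument (event_map and the meta struct are just made-up
examples, assuming the usual samples/bpf helpers):

	struct {
		__u32 ifindex;
		__u32 len;
	} meta = {
		.ifindex = skb->ifindex,
		.len     = skb->len,
	};
	/* push the meta data plus the full skb payload, non-linear or not */
	__u64 flags = BPF_F_CURRENT_CPU | ((__u64)skb->len << 32);

	bpf_perf_event_output(skb, &event_map, flags, &meta, sizeof(meta));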

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |  7 ++-
 include/uapi/linux/bpf.h |  2 ++
 kernel/bpf/core.c|  6 --
 kernel/trace/bpf_trace.c | 33 +++--
 net/core/filter.c| 43 ++-
 5 files changed, 69 insertions(+), 22 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b3336b4..c13e92b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -209,7 +209,12 @@ u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 
r5);
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
 
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
-const struct bpf_func_proto *bpf_get_event_output_proto(void);
+
+typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
+   unsigned long len);
+
+u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
+void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
 
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 262a7e8..c4d9224 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -401,6 +401,8 @@ enum bpf_func_id {
 /* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */
 #define BPF_F_INDEX_MASK   0xULL
 #define BPF_F_CURRENT_CPU  BPF_F_INDEX_MASK
+/* BPF_FUNC_perf_event_output for sk_buff input context. */
+#define BPF_F_CTXLEN_MASK  (0xfULL << 32)
 
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d638062..03fd23d 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1054,9 +1054,11 @@ const struct bpf_func_proto * __weak 
bpf_get_trace_printk_proto(void)
return NULL;
 }
 
-const struct bpf_func_proto * __weak bpf_get_event_output_proto(void)
+u64 __weak
+bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
+void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy)
 {
-   return NULL;
+   return -ENOTSUPP;
 }
 
 /* Always built-in helper functions. */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c35883a..ebfbb7d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -298,29 +298,26 @@ static const struct bpf_func_proto 
bpf_perf_event_output_proto = {
 
 static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
 
-static u64 bpf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
+void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy)
 {
struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs);
+   struct perf_raw_frag frag = {
+   .copy   = ctx_copy,
+   .size   = ctx_size,
+   .data   = ctx,
+   };
+   struct perf_raw_record raw = {
+   .frag = {
+   .next   = ctx_size ? &frag : NULL,
+   .size   = meta_size,
+   .data   = meta,
+   },
+   };
 
perf_fetch_caller_regs(regs);
 
-   return bpf_perf_event_output

[PATCH net-next v2 0/3] BPF event output helper improvements

2016-07-14 Thread Daniel Borkmann
This set adds improvements to the BPF event output helper to
support non-linear data sampling, here specifically, for skb
context. For details please see individual patches. The set
is based against net-next tree.

v1 -> v2:
  - Integrated and adapted Peter's diff into patch 1, updated
the remaining ones accordingly. Thanks Peter!

Thanks a lot!

Daniel Borkmann (3):
  perf, events: add non-linear data support for raw records
  bpf, perf: split bpf_perf_event_output
  bpf: avoid stack copy and use skb ctx for event output

 arch/s390/kernel/perf_cpum_sf.c |  9 --
 arch/x86/events/amd/ibs.c   |  8 +++--
 include/linux/bpf.h |  7 -
 include/linux/perf_event.h  | 20 -
 include/uapi/linux/bpf.h|  2 ++
 kernel/bpf/core.c   |  6 ++--
 kernel/events/core.c| 66 -
 kernel/events/internal.h| 16 +++---
 kernel/trace/bpf_trace.c| 66 +++--
 net/core/filter.c   | 43 ++-
 10 files changed, 180 insertions(+), 63 deletions(-)

-- 
1.9.3



[PATCH net-next v2 1/3] perf, events: add non-linear data support for raw records

2016-07-14 Thread Daniel Borkmann
This patch adds support for non-linear data on raw records. It
extends raw records to have one or multiple fragments that will
be written linearly into the ring slot, where each fragment can
optionally have a custom callback handler to walk and extract
complex, possibly non-linear data.

If a callback handler is provided for a fragment, then the new
__output_custom() will be used instead of __output_copy() for
the perf_output_sample() part. perf_prepare_sample() does all
the size calculation only once, so perf_output_sample() doesn't
need to redo the same work anymore, meaning real_size and padding
will be cached in the raw record. The raw record becomes 32 bytes
in size without holes; to not increase it further and to avoid
doing unnecessary recalculations in fast-path, we can reuse
next pointer of the last fragment, idea here is borrowed from
ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
path for PERF_SAMPLE_RAW minimal.

This facility is needed for BPF's event output helper as a first
user that will, in a follow-up, add an additional perf_raw_frag
to its perf_raw_record in order to be able to more efficiently
dump skb context after a linear head meta data related to it.
skbs can be non-linear and thus need a custom output function to
dump buffers. Currently, the skb data needs to be copied twice;
with the help of __output_custom() this work only needs to be
done once. Future users could be things like XDP/BPF programs
that work on different context though and would thus also have
a different callback function.

The few users of raw records are adapted to initialize their frag
data from the raw record itself, no change in behavior for them.
The code is based upon a PoC diff provided by Peter Zijlstra [1].

  [1] http://thread.gmane.org/gmane.linux.network/421294

Suggested-by: Peter Zijlstra 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 Hi Peter, I've adapted the patch to your suggestion and also
 added the padding; all size calculation is only done once at
 perf_prepare_sample() time as well to avoid unnecessary work.
 Please let me know if you're good with this. Thanks a lot!

 arch/s390/kernel/perf_cpum_sf.c |  9 --
 arch/x86/events/amd/ibs.c   |  8 +++--
 include/linux/perf_event.h  | 20 -
 kernel/events/core.c| 66 -
 kernel/events/internal.h| 16 +++---
 kernel/trace/bpf_trace.c|  6 ++--
 6 files changed, 93 insertions(+), 32 deletions(-)

diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c
index a8e8321..92619cc 100644
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -979,12 +979,15 @@ static int perf_push_sample(struct perf_event *event, 
struct sf_raw_sample *sfr)
struct pt_regs regs;
struct perf_sf_sde_regs *sde_regs;
struct perf_sample_data data;
-   struct perf_raw_record raw;
+   struct perf_raw_record raw = {
+   .frag = {
+   .size = sfr->size,
+   .data = sfr,
+   },
+   };
 
/* Setup perf sample */
perf_sample_data_init(&data, 0, event->hw.last_period);
-   raw.size = sfr->size;
-   raw.data = sfr;
data.raw = &raw;
 
/* Setup pt_regs to look like an CPU-measurement external interrupt
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index feb90f6..72dea2f 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -655,8 +655,12 @@ fail:
}
 
if (event->attr.sample_type & PERF_SAMPLE_RAW) {
-   raw.size = sizeof(u32) + ibs_data.size;
-   raw.data = ibs_data.data;
+   raw = (struct perf_raw_record){
+   .frag = {
+   .size = sizeof(u32) + ibs_data.size,
+   .data = ibs_data.data,
+   },
+   };
data.raw = &raw;
}
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a827ce..e79e6c6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -69,9 +69,22 @@ struct perf_callchain_entry_ctx {
boolcontexts_maxed;
 };
 
+typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
+unsigned long len);
+
+struct perf_raw_frag {
+   union {
+   struct perf_raw_frag*next;
+   unsigned long   pad;
+   };
+   perf_copy_f copy;
+   void*data;
+   u32 size;
+} __packed;
+
 struct perf_raw_record {
+   struct perf_raw_fragfrag;
u32 size;
-   void*data;
 };
 
 /*
@@ -1283,6 +1296,11 @@ extern voi

Re: [PATCH net-next 1/3] perf, events: add non-linear data support for raw records

2016-07-13 Thread Daniel Borkmann

On 07/13/2016 06:40 PM, Peter Zijlstra wrote:

On Wed, Jul 13, 2016 at 04:08:55PM +0200, Daniel Borkmann wrote:

On 07/13/2016 03:42 PM, Peter Zijlstra wrote:


Ok so the nonlinear thing was it doing _two_ copies, one the regular
__output_copy() on raw->data and second the optional fragment thingy
using __output_custom().

Would something like this work instead?

It does the nonlinear thing and the custom copy function thing but
allows more than 2 fragments and allows each fragment to have a custom
copy.

It doesn't look obviously more expensive; it has the one ->copy branch
extra, but then it doesn't recompute the sizes.


Yes, that would work as well on a quick glance with diff just a bit
bigger, but more generic this way. Do you want me to adapt this into
the first patch?


Please.


One question below:




-   u64 zero = 0;



-   if (real_size - raw_size)
-   __output_copy(handle, &zero, real_size - 
raw_size);



We still need the zero padding here from above with the computed
raw->size, right?


Ah, yes, we need some __output*() in order to advance the handle offset.
We don't _need_ to copy the 0s, but I doubt __output_skip() is much
cheaper for these 1-3 bytes worth of data; we've already touched that
line anyway.


Okay, thanks for your input! I'll respin then.


Re: [PATCH net-next 1/3] perf, events: add non-linear data support for raw records

2016-07-13 Thread Daniel Borkmann

Hi Peter,

On 07/13/2016 03:42 PM, Peter Zijlstra wrote:


Ok so the nonlinear thing was it doing _two_ copies, one the regular
__output_copy() on raw->data and second the optional fragment thingy
using __output_custom().

Would something like this work instead?

It does the nonlinear thing and the custom copy function thing but
allows more than 2 fragments and allows each fragment to have a custom
copy.

It doesn't look obviously more expensive; it has the one ->copy branch
extra, but then it doesn't recompute the sizes.


Yes, that would work as well on a quick glance with diff just a bit
bigger, but more generic this way. Do you want me to adapt this into
the first patch?

One question below:


diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1fe22032f228..83e2a83e8db3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -69,9 +69,18 @@ struct perf_callchain_entry_ctx {
boolcontexts_maxed;
  };

+typedef unsigned long (*perf_copy_f)(void *dst, const void *src, unsigned long 
len);
+
+struct perf_raw_frag {
+   struct perf_raw_frag*next;
+   perf_copy_f copy;
+   void*data;
+   u32 size;
+} __packed;
+
  struct perf_raw_record {
+   struct perf_raw_fragfrag;
u32 size;
-   void*data;
  };

  /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fe8d49a56322..f7ad7d65317d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5617,16 +5617,21 @@ void perf_output_sample(struct perf_output_handle 
*handle,
}

if (sample_type & PERF_SAMPLE_RAW) {
-   if (data->raw) {
-   u32 raw_size = data->raw->size;
-   u32 real_size = round_up(raw_size + sizeof(u32),
-sizeof(u64)) - sizeof(u32);
-   u64 zero = 0;
-
-   perf_output_put(handle, real_size);
-   __output_copy(handle, data->raw->data, raw_size);
-   if (real_size - raw_size)
-   __output_copy(handle, &zero, real_size - 
raw_size);
+   struct perf_raw_record *raw = data->raw;
+
+   if (raw) {
+   struct perf_raw_frag *frag = &raw->frag;
+
+   perf_output_put(handle, raw->size);
+   do {
+   if (frag->copy) {
+   __output_custom(handle, frag->copy,
+   frag->data, frag->size);
+   } else {
+   __output_copy(handle, frag->data, 
frag->size);
+   }
+   frag = frag->next;
+   } while (frag);


We still need the zero padding here from above with the computed
raw->size, right?


} else {
struct {
u32 size;
@@ -5751,14 +5756,22 @@ void perf_prepare_sample(struct perf_event_header 
*header,


Thanks,
Daniel


Re: [PATCH net-next 1/3] perf, events: add non-linear data support for raw records

2016-07-13 Thread Daniel Borkmann

On 07/13/2016 02:10 PM, Peter Zijlstra wrote:

On Wed, Jul 13, 2016 at 11:24:13AM +0200, Daniel Borkmann wrote:

On 07/13/2016 09:52 AM, Peter Zijlstra wrote:

On Wed, Jul 13, 2016 at 12:36:17AM +0200, Daniel Borkmann wrote:

This patch adds support for non-linear data on raw records. It means
that for such data, the newly introduced __output_custom() helper will
be used instead of __output_copy(). __output_custom() will invoke
whatever custom callback is passed in via struct perf_raw_record_frag
to extract the data into the ring buffer slot.

To keep changes in perf_prepare_sample() and in perf_output_sample()
minimal, size/size_head split was added to perf_raw_record that call
sites fill out, so that two extra tests in fast-path can be avoided.

The few users of raw records are adapted to initialize their size_head
and frag data; no change in behavior for them. Later patch will extend
BPF side with a first user and callback for this facility, future users
could be things like XDP BPF programs (that work on different context
though and would thus have a different callback), etc.


Why? What problem are we solving?


I've tried to summarize it in patch 3/3,


Which is pretty useless if you're staring at this patch.


This currently has 3 issues we'd like to resolve:



i) We need two copies instead of just a single one for the skb data.
The data can be non-linear, see also skb_copy_bits() as an example for
walking/extracting it,


I'm not familiar enough with the network gunk to be able to read that.
But upto skb_walk_frags() it looks entirely linear to me.


Hm, fair enough. There are three parts: the skb can have a linear part
that is taken via skb->data, either in its entirety or with a non-linear
part appended to it, which can consist of pages sitting in the shared
info section (skb_shinfo(skb) -> frags[], nr_frags members) and which
gets linearized. In addition, appended after the frags[] data, there can
be further skbs chained to the 'root' skb that contain fragmented data.
All of this is what skb_copy_bits() copies, linearized, into the 'to' buffer.
So depending on the origin of the skb, its structure can be quite different
and skb_copy_bits() covers all the cases generically. Maybe [1] summarizes
it better if you want to familiarize yourself with how skbs work,
although some parts are not up to date anymore.

  [1] http://vger.kernel.org/~davem/skb_data.html
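
A flattening copy then basically boils down to, e.g.:

	char buf[256];

	/* copies from the linear area, page frags and frag_list alike */
	if (skb->len <= sizeof(buf) &&
	    skb_copy_bits(skb, 0, buf, skb->len) == 0) {
		/* buf now holds the skb data in linear form */
	}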


ii) for static verification reasons, the bpf_skb_load_bytes() helper
needs to see a constant size on the passed buffer to make sure BPF
verifier can do its sanity checks on it during verification time, so
just passing in skb->len (or any other non-constant value) wouldn't
work, but changing bpf_skb_load_bytes() is also not the real solution
since we still have two copies we'd like to avoid as well, and



iii) bpf_skb_load_bytes() is just for rather smaller buffers (e.g.
headers) since they need to sit on the limited eBPF stack anyway. The
set would improve the BPF helper to address all 3 at once.


Humm, maybe. Lemme go try and reverse engineer that patch, because I'm
not at all sure wth it's supposed to do, nor am I entirely sure this
clarified things :/


Re: [PATCH net-next 1/3] perf, events: add non-linear data support for raw records

2016-07-13 Thread Daniel Borkmann

Hi Peter,

On 07/13/2016 09:52 AM, Peter Zijlstra wrote:

On Wed, Jul 13, 2016 at 12:36:17AM +0200, Daniel Borkmann wrote:

This patch adds support for non-linear data on raw records. It means
that for such data, the newly introduced __output_custom() helper will
be used instead of __output_copy(). __output_custom() will invoke
whatever custom callback is passed in via struct perf_raw_record_frag
to extract the data into the ring buffer slot.

To keep changes in perf_prepare_sample() and in perf_output_sample()
minimal, size/size_head split was added to perf_raw_record that call
sites fill out, so that two extra tests in fast-path can be avoided.

The few users of raw records are adapted to initialize their size_head
and frag data; no change in behavior for them. Later patch will extend
BPF side with a first user and callback for this facility, future users
could be things like XDP BPF programs (that work on different context
though and would thus have a different callback), etc.


Why? What problem are we solving?


I've tried to summarize it in patch 3/3, there are a number of improvements
this set would provide: the current BPF_FUNC_perf_event_output helper that
can be used from tc/networking programs has the limitation that if data from
the skb should be part of the sample, that data first needs to be copied to
eBPF stack with the bpf_skb_load_bytes() helper, and only then passed into
bpf_event_output().

This currently has 3 issues we'd like to resolve: i) We need two copies instead
of just a single one for the skb data. The data can be non-linear, see also
skb_copy_bits() as an example for walking/extracting it, ii) for static
verification reasons, the bpf_skb_load_bytes() helper needs to see a constant
size on the passed buffer to make sure BPF verifier can do its sanity checks
on it during verification time, so just passing in skb->len (or any other
non-constant value) wouldn't work, but changing bpf_skb_load_bytes() is also
not the real solution since we still have two copies we'd like to avoid as well
and iii) bpf_skb_load_bytes() is just for rather smaller buffers (e.g. headers)
since they need to sit on the limited eBPF stack anyway. The set would improve
the BPF helper to address all 3 at once.
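
In other words, today a program has to do something along these lines
(sketch only; event_map would be a BPF_MAP_TYPE_PERF_EVENT_ARRAY):

	char buf[64];	/* verifier requires a constant size here */

	/* copy #1: extract the (possibly non-linear) skb bytes to stack */
	bpf_skb_load_bytes(skb, 0, buf, sizeof(buf));
	/* copy #2: copy the very same bytes again into the ring buffer */
	bpf_perf_event_output(skb, &event_map, BPF_F_CURRENT_CPU,
			      buf, sizeof(buf));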

Thanks,
Daniel


Re: [PATCH net-next 3/3] bpf: avoid stack copy and use skb ctx for event output

2016-07-12 Thread Daniel Borkmann

On 07/13/2016 01:25 AM, kbuild test robot wrote:

Hi,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Daniel-Borkmann/BPF-event-output-helper-improvements/20160713-065944
config: s390-allyesconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=s390

All warnings (new ones prefixed by >>):

kernel/trace/bpf_trace.c: In function 'bpf_perf_event_output':
kernel/trace/bpf_trace.c:284:1: warning: 'bpf_perf_event_output' uses 
dynamic stack allocation
 }
 ^
kernel/trace/bpf_trace.c: In function 'bpf_event_output':

kernel/trace/bpf_trace.c:319:1: warning: 'bpf_event_output' uses dynamic stack 
allocation

 }
 ^


Hmm, searching a bit on lkml, it seems these warnings on s390 are actually 
mostly
harmless I believe [1][2] ... looks like they are there to find structs sitting
on stack, for example, at least that's also what the currently existing one in 
the
above line (bpf_trace.c +284) appears to be about.

Thanks,
Daniel

  [1] http://lkml.iu.edu/hypermail/linux/kernel/1601.2/04074.html
  [2] https://lkml.org/lkml/2013/6/25/42


[PATCH net-next 1/3] perf, events: add non-linear data support for raw records

2016-07-12 Thread Daniel Borkmann
This patch adds support for non-linear data on raw records. It means
that for such data, the newly introduced __output_custom() helper will
be used instead of __output_copy(). __output_custom() will invoke
whatever custom callback is passed in via struct perf_raw_record_frag
to extract the data into the ring buffer slot.

To keep changes in perf_prepare_sample() and in perf_output_sample()
minimal, size/size_head split was added to perf_raw_record that call
sites fill out, so that two extra tests in fast-path can be avoided.

The few users of raw records are adapted to initialize their size_head
and frag data; no change in behavior for them. Later patch will extend
BPF side with a first user and callback for this facility, future users
could be things like XDP BPF programs (that work on different context
though and would thus have a different callback), etc.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 arch/s390/kernel/perf_cpum_sf.c |  2 ++
 arch/x86/events/amd/ibs.c   |  2 ++
 include/linux/perf_event.h  |  8 
 kernel/events/core.c| 13 ++---
 kernel/events/internal.h| 18 ++
 kernel/trace/bpf_trace.c|  1 +
 6 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c
index a8e8321..99c5952 100644
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -984,7 +984,9 @@ static int perf_push_sample(struct perf_event *event, 
struct sf_raw_sample *sfr)
/* Setup perf sample */
perf_sample_data_init(&data, 0, event->hw.last_period);
raw.size = sfr->size;
+   raw.size_head = raw.size;
raw.data = sfr;
+   raw.frag = NULL;
data.raw = &raw;
 
/* Setup pt_regs to look like an CPU-measurement external interrupt
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index feb90f6..9b27dff 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -656,7 +656,9 @@ fail:
 
if (event->attr.sample_type & PERF_SAMPLE_RAW) {
raw.size = sizeof(u32) + ibs_data.size;
+   raw.size_head = raw.size;
raw.data = ibs_data.data;
+   raw.frag = NULL;
data.raw = &raw;
}
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a827ce..bf08bdf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -69,9 +69,17 @@ struct perf_callchain_entry_ctx {
boolcontexts_maxed;
 };
 
+struct perf_raw_record_frag {
+   void*data;
+   unsigned long (*copy_cb)(void *dst, const void *src,
+unsigned long n);
+};
+
 struct perf_raw_record {
u32 size;
+   u32 size_head;
void*data;
+   struct perf_raw_record_frag *frag;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9c51ec3..3e1dd7a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5553,14 +5553,20 @@ void perf_output_sample(struct perf_output_handle 
*handle,
}
 
if (sample_type & PERF_SAMPLE_RAW) {
-   if (data->raw) {
-   u32 raw_size = data->raw->size;
+   struct perf_raw_record *raw = data->raw;
+
+   if (raw) {
+   u32 raw_size = raw->size;
u32 real_size = round_up(raw_size + sizeof(u32),
 sizeof(u64)) - sizeof(u32);
u64 zero = 0;
 
perf_output_put(handle, real_size);
-   __output_copy(handle, data->raw->data, raw_size);
+   __output_copy(handle, raw->data, raw->size_head);
+   if (raw->frag)
+   __output_custom(handle, raw->frag->copy_cb,
+   raw->frag->data,
+   raw->size - raw->size_head);
if (real_size - raw_size)
__output_copy(handle, &zero, real_size - 
raw_size);
} else {
@@ -7388,6 +7394,7 @@ void perf_tp_event(u16 event_type, u64 count, void 
*record, int entry_size,
 
struct perf_raw_record raw = {
.size = entry_size,
+   .size_head = entry_size,
.data = record,
};
 
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 05f9f6d..1b08d94 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -123,10 +123,7 @@ static inline unsigned long perf_aux_size(struct 
ring_buffer *rb)
return rb->

[PATCH net-next 2/3] bpf, perf: split bpf_perf_event_output

2016-07-12 Thread Daniel Borkmann
Split the bpf_perf_event_output() helper as a preparation into two
parts. The new bpf_perf_event_output() will prepare the raw record
itself and test for unknown flags from BPF trace context, where the
__bpf_perf_event_output() does the core work and will be reused later
as well.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 kernel/trace/bpf_trace.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 8540bd5..4d3d5b8 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -233,25 +233,17 @@ static const struct bpf_func_proto 
bpf_perf_event_read_proto = {
.arg2_type  = ARG_ANYTHING,
 };
 
-static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+static __always_inline u64
+__bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
+   u64 flags, struct perf_raw_record *raw)
 {
-   struct pt_regs *regs = (struct pt_regs *) (long) r1;
-   struct bpf_map *map = (struct bpf_map *) (long) r2;
struct bpf_array *array = container_of(map, struct bpf_array, map);
unsigned int cpu = smp_processor_id();
u64 index = flags & BPF_F_INDEX_MASK;
-   void *data = (void *) (long) r4;
struct perf_sample_data sample_data;
struct bpf_event_entry *ee;
struct perf_event *event;
-   struct perf_raw_record raw = {
-   .size = size,
-   .size_head = size,
-   .data = data,
-   };
 
-   if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
-   return -EINVAL;
if (index == BPF_F_CURRENT_CPU)
index = cpu;
if (unlikely(index >= array->map.max_entries))
@@ -270,11 +262,27 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 
flags, u64 r4, u64 size)
return -EOPNOTSUPP;
 
perf_sample_data_init(&sample_data, 0, 0);
-   sample_data.raw = &raw;
+   sample_data.raw = raw;
perf_event_output(event, &sample_data, regs);
return 0;
 }
 
+static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+{
+   struct perf_raw_record raw = {
+   .size   = size,
+   .size_head  = size,
+   .data   = (void *)(long) r4,
+   };
+
+   if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
+   return -EINVAL;
+
+   return __bpf_perf_event_output((struct pt_regs *)(long) r1,
+  (struct bpf_map *)(long) r2,
+  flags, &raw);
+}
+
 static const struct bpf_func_proto bpf_perf_event_output_proto = {
.func   = bpf_perf_event_output,
.gpl_only   = true,
-- 
1.9.3



[PATCH net-next 0/3] BPF event output helper improvements

2016-07-12 Thread Daniel Borkmann
This set adds improvements to the BPF event output helper to
support non-linear data sampling, here specifically, for skb
context. For details please see individual patches. The set
is based against net-next tree.

Thanks a lot!

Daniel Borkmann (3):
  perf, events: add non-linear data support for raw records
  bpf, perf: split bpf_perf_event_output
  bpf: avoid stack copy and use skb ctx for event output

 arch/s390/kernel/perf_cpum_sf.c |  2 ++
 arch/x86/events/amd/ibs.c   |  2 ++
 include/linux/bpf.h |  5 +++-
 include/linux/perf_event.h  |  8 ++
 include/uapi/linux/bpf.h|  2 ++
 kernel/bpf/core.c   |  8 --
 kernel/events/core.c| 13 +++--
 kernel/events/internal.h| 18 +---
 kernel/trace/bpf_trace.c| 64 ++---
 net/core/filter.c   | 43 ++-
 10 files changed, 125 insertions(+), 40 deletions(-)

-- 
1.9.3



[PATCH net-next 3/3] bpf: avoid stack copy and use skb ctx for event output

2016-07-12 Thread Daniel Borkmann
This work improves the bpf_skb_event_output() helper in two ways: i) it
avoids that users need to unnecessarily extract sampled skb data to
the stack first via bpf_skb_load_bytes() and then copy it once again into
the ring buffer slot, and ii) it avoids that only portions can be
sampled with bpf_skb_load_bytes() due to the stack limit.

Instead, we can make use of the passed in skb context that we have
in the helper anyway, and use some of the reserved flag bits as a
length argument. The helper will use the new __output_custom() facility
from perf with bpf_skb_copy_cb() as callback helper. It will pass
the data for setup to bpf_event_output(), which generates and pushes
the raw record. The linear data used in the non-frag part of the
record serves as custom / programmatically defined meta data passed
along with the appended sample.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |  5 -
 include/uapi/linux/bpf.h |  2 ++
 kernel/bpf/core.c|  8 ++--
 kernel/trace/bpf_trace.c | 33 +++--
 net/core/filter.c| 43 ++-
 5 files changed, 69 insertions(+), 22 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b3336b4..afd64c8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -209,8 +209,11 @@ u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 
r5);
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
 
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
-const struct bpf_func_proto *bpf_get_event_output_proto(void);
 
+u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 size_meta,
+void *ctx, u64 size_ctx,
+unsigned long (*ctx_copy_cb)(void *dst, const void *src,
+ unsigned long n));
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 262a7e8..c4d9224 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -401,6 +401,8 @@ enum bpf_func_id {
 /* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */
 #define BPF_F_INDEX_MASK   0xULL
 #define BPF_F_CURRENT_CPU  BPF_F_INDEX_MASK
+/* BPF_FUNC_perf_event_output for sk_buff input context. */
+#define BPF_F_CTXLEN_MASK  (0xfULL << 32)
 
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d638062..47a7054 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1054,9 +1054,13 @@ const struct bpf_func_proto * __weak 
bpf_get_trace_printk_proto(void)
return NULL;
 }
 
-const struct bpf_func_proto * __weak bpf_get_event_output_proto(void)
+u64 __weak
+bpf_event_output(struct bpf_map *map, u64 flags, void *meta,
+u64 size_meta, void *ctx, u64 size_ctx,
+unsigned long (*ctx_copy_cb)(void *dst, const void *src,
+ unsigned long n))
 {
-   return NULL;
+   return -ENOTSUPP;
 }
 
 /* Always built-in helper functions. */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4d3d5b8..9c076d1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -296,29 +296,26 @@ static const struct bpf_func_proto 
bpf_perf_event_output_proto = {
 
 static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
 
-static u64 bpf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 size_meta,
+void *ctx, u64 size_ctx,
+unsigned long (*ctx_copy_cb)(void *dst, const void *src,
+ unsigned long n))
 {
struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs);
+   struct perf_raw_record_frag frag = {
+   .data   = ctx,
+   .copy_cb= ctx_copy_cb,
+   };
+   struct perf_raw_record raw = {
+   .size   = size_meta + size_ctx,
+   .size_head  = size_meta,
+   .data   = meta,
+   .frag   = size_ctx ? &frag : NULL,
+   };
 
perf_fetch_caller_regs(regs);
 
-   return bpf_perf_event_output((long)regs, r2, flags, r4, size);
-}
-
-static const struct bpf_func_proto bpf_event_output_proto = {
-   .func   = bpf_event_output,
-   .gpl_only   = true,
-   .ret_type   = RET_INTEGER,
-   .arg1_type  = ARG_PTR_TO_CTX,
-   .arg2_type  = ARG_CONST_MAP_PTR,
-   .arg3_type  = ARG_ANYTHING,
-   .arg4_type  = ARG_PTR_TO_STACK,
-   .arg5_type  = ARG_CONST_STACK_SIZE,
-};
-
-const struct bpf_func_proto *bpf_get_event_output_proto(void)
-{

Re: [PATCH -next] bpf: make inode code explicitly non-modular

2016-07-11 Thread Daniel Borkmann

On 07/11/2016 06:51 PM, Paul Gortmaker wrote:

The Kconfig currently controlling compilation of this code is:

init/Kconfig:config BPF_SYSCALL
init/Kconfig:   bool "Enable bpf() system call"

...meaning that it currently is not being built as a module by anyone.

Let's remove the couple of traces of modular infrastructure use, so that
when reading the driver there is no doubt it is builtin-only.

Note that MODULE_ALIAS is a no-op for non-modular code.

We replace module.h with init.h since the file does use __init.

Cc: Alexei Starovoitov 
Cc: net...@vger.kernel.org
Signed-off-by: Paul Gortmaker 


(Patch is for net-next tree then.)

Acked-by: Daniel Borkmann 


Re: [PATCH net] udp: prevent bugcheck if filter truncates packet too much

2016-07-09 Thread Daniel Borkmann

On 07/09/2016 02:20 AM, Alexei Starovoitov wrote:

On Sat, Jul 09, 2016 at 01:31:40AM +0200, Eric Dumazet wrote:

On Fri, 2016-07-08 at 17:52 +0200, Michal Kubecek wrote:

If socket filter truncates an udp packet below the length of UDP header
in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
kernel is configured that way) can be easily enforced by an unprivileged
user which was reported as CVE-2016-6162. For a reproducer, see
http://seclists.org/oss-sec/2016/q3/8

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Reported-by: Marco Grassi 
Signed-off-by: Michal Kubecek 
---
  net/ipv4/udp.c | 2 ++
  net/ipv6/udp.c | 2 ++
  2 files changed, 4 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca5e8ea29538..4aed8fc23d32 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1583,6 +1583,8 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff 
*skb)

if (sk_filter(sk, skb))
goto drop;
+   if (unlikely(skb->len < sizeof(struct udphdr)))
+   goto drop;

udp_csum_pull_header(skb);
if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 005dc82c2138..acc09705618b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -620,6 +620,8 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff 
*skb)

if (sk_filter(sk, skb))
goto drop;
+   if (unlikely(skb->len < sizeof(struct udphdr)))
+   goto drop;

udp_csum_pull_header(skb);
if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {



Arg :(

Acked-by: Eric Dumazet 


this is incomplete fix. Please do not apply. See discussion at security@kernel


Ohh well, didn't see it earlier before starting the discussion at security@...

I'm okay if we take this for now as a quick band aid and find a better way
to deal with the underlying issue long-term so that it's /guaranteed/ that it
doesn't bite us any further in such fragile ways.


[PATCH net] bpf, perf: delay release of BPF prog after grace period

2016-06-27 Thread Daniel Borkmann
Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved
destruction of BPF program from free_event_rcu() callback to __free_event(),
which is problematic if used with tail calls: if prog A is attached as
trace event directly, but at the same time present in a tail call map used
by another trace event program elsewhere, then we need to delay destruction
via RCU grace period since it can still be in use by the program doing the
tail call (the prog first needs to be dropped from the tail call map, then
trace event with prog A attached destroyed, so we get immediate destruction).

Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
Cc: Jann Horn 
---
 include/linux/bpf.h  | 4 
 kernel/events/core.c | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8269caf..0de4de6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -264,6 +264,10 @@ static inline struct bpf_prog *bpf_prog_get(u32 ufd)
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
+
+static inline void bpf_prog_put_rcu(struct bpf_prog *prog)
+{
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 /* verifier prototypes for helper functions called from eBPF programs */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 274450e..d00c47b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7531,7 +7531,7 @@ static void perf_event_free_bpf_prog(struct perf_event 
*event)
prog = event->tp_event->prog;
if (prog) {
event->tp_event->prog = NULL;
-   bpf_prog_put(prog);
+   bpf_prog_put_rcu(prog);
}
 }
 
-- 
1.9.3



Re: [PATCH net-next v2 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY

2016-06-23 Thread Daniel Borkmann

On 06/23/2016 11:26 PM, Martin KaFai Lau wrote:

On Thu, Jun 23, 2016 at 11:42:31AM +0200, Daniel Borkmann wrote:

Hi Martin,

[ sorry to jump late in here, on pto currently ]

Thanks for reviewing.


Could you describe a bit more with regards to pinning maps and how this
should interact with cgroups? The two specialized array maps we have (tail
calls, perf events) have fairly complicated semantics for when to clean up
map slots (see commits c9da161c6517ba1, 3b1efb196eee45b2f0c4).

How is this managed with cgroups? Once a cgroup fd is placed into a map and
the user removes the cgroup, will this be prevented due to 'being busy', or
will the cgroup live further as long as a program is running with a cgroup
map entry (but the cgroup itself is not visible from user space in any way
anymore)?

Having a cgroup ptr stored in the bpf_map will not stop the user from
removing the cgroup (by rmdir /mnt/cgroup2/tc/test_cgrp).


Right.


The cgroup ptr stored in the bpf_map holds a refcnt, which answers the
second part.


Yep, clear.


The situation is similar to the netfilter usecase in
commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")


I presume it's a valid use case to pin a cgroup map, put fds into it and
remove the pinned file expecting to continue to match on it, right? So
lifetime is really until last prog using a cgroup map somewhere gets removed
(even if not accessible from user space anymore, meaning no prog has fd and
pinned file was removed).

Yes.

We are still hashing out how to set this up in production. However, the
situation is similar to removing the pinned file.


I presume you mean removing the last BPF program holding a reference on
the cgroup array map. (Any user space visibility like struct files given
from the anon inode and pinnings are tracked via uref, btw, which is
needed to break possible complex dependencies among tail called programs.)
But dropping cgroup ref at latest when the last map ref is dropped as you
currently do seems fine. It makes cgroup array maps effectively no different
from plain regular array maps.


We probably will not use tc and pin a bpf_map to do that.  Instead,
one process will set up everything (e.g. create the cgroup, populate the
cgroup map, attach the bpf prog to egress) and then go away.


Yep, that seems a valid case as well, both use cases (pinned and non-pinned)
should be fine with your code then.

Thanks,
Daniel


Re: [PATCH net-next v2 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY

2016-06-23 Thread Daniel Borkmann

On 06/23/2016 11:13 PM, Tejun Heo wrote:

Hello,

On Thu, Jun 23, 2016 at 11:42:31AM +0200, Daniel Borkmann wrote:

I presume it's a valid use case to pin a cgroup map, put fds into it and
remove the pinned file expecting to continue to match on it, right? So
lifetime is really until last prog using a cgroup map somewhere gets removed
(even if not accessible from user space anymore, meaning no prog has fd and
pinned file was removed).


Yeap, from what I can see, the cgroup will stay around (even if it
gets deleted) as long as the bpf rule using it is around and that's
completely fine from cgroup side.


Ok, thanks for confirming!


Thanks.


Re: [PATCH net-next v2 3/4] cgroup: bpf: Add bpf_skb_in_cgroup_proto

2016-06-23 Thread Daniel Borkmann

On 06/23/2016 06:54 PM, Martin KaFai Lau wrote:

On Thu, Jun 23, 2016 at 11:53:50AM +0200, Daniel Borkmann wrote:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 668e079..68753e0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1062,6 +1062,10 @@ static int check_map_func_compatibility(struct bpf_map 
*map, int func_id)
if (func_id != BPF_FUNC_get_stackid)
goto error;
break;
+   case BPF_MAP_TYPE_CGROUP_ARRAY:
+   if (func_id != BPF_FUNC_skb_in_cgroup)
+   goto error;
+   break;


I think the BPF_MAP_TYPE_CGROUP_ARRAY case should have been first here in
patch 2/4, but with unconditional goto error. And this one only adds the
'func_id != BPF_FUNC_skb_in_cgroup' test.

I am not sure I understand.  Can you elaborate? I am probably missing
something here.


If someone backports patch 2/4 as-is, but for some reason not 3/4, then you
could craft a program that calls f.e. bpf_map_update_elem() on a cgroup array
and would thus cause a NULL pointer deref, since verifier doesn't prevent it.
I'm just trying to say that it would probably make sense to add the above 'case
BPF_MAP_TYPE_CGROUP_ARRAY:' with an unconditional 'goto error' in patch 2/4
and extend upon it in patch 3/4 so the result looks like it does here, so that
each of the patches is fine/complete as a stand-alone change.


Re: [PATCH net-next v2 4/4] cgroup: bpf: Add an example to do cgroup checking in BPF

2016-06-23 Thread Daniel Borkmann

On 06/22/2016 11:17 PM, Martin KaFai Lau wrote:

test_cgrp2_array_pin.c:
A userland program that creates a bpf_map (BPF_MAP_TYPE_CGROUP_ARRAY),
populates/updates it with a cgroup2-backed fd and pins it to a
bpf-fs's file.  The pinned file can be loaded by tc and then used
by the bpf prog later.  This program can also update an existing pinned
array and it could be useful for debugging/testing purpose.

test_cgrp2_tc_kern.c:
A bpf prog which should be loaded by tc.  It is to demonstrate
the usage of bpf_skb_in_cgroup.

test_cgrp2_tc.sh:
A script that glues the test_cgrp2_array_pin.c and
test_cgrp2_tc_kern.c together.  The idea is like:
1. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY
with a cgroup fd
2. Load the test_cgrp2_tc_kern.o by tc
3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been
dropped because of a match on the cgroup

Most of the lines in test_cgrp2_tc.sh are boilerplate
to set up the cgroup/bpf-fs/net-devices/netns, etc.  It is
not bulletproof on errors but should work well enough and
give enough debug info if things do not go well.

Signed-off-by: Martin KaFai Lau 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Tejun Heo 
Acked-by: Alexei Starovoitov 


Btw, when no bpf fs is mounted, tc will already auto-mount it. I noticed in
your script, you do mount the fs manually. I guess it's okay to leave it like
this, but I hope users won't wrongly copy it assuming they /have/ to mount it
themselves.


Re: [PATCH net-next v2 3/4] cgroup: bpf: Add bpf_skb_in_cgroup_proto

2016-06-23 Thread Daniel Borkmann

On 06/22/2016 11:17 PM, Martin KaFai Lau wrote:

Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2.  It is similar to the
feature added in netfilter:
commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")

The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.

Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.
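
Usage from a cls_bpf program then boils down to something like the
following sketch (test_cgrp2_array being an already populated
BPF_MAP_TYPE_CGROUP_ARRAY; map definition omitted):

	int ret = bpf_skb_in_cgroup(skb, &test_cgrp2_array, 0);

	if (ret < 0)		/* no full socket, empty slot, bad index */
		return TC_ACT_OK;

	/* e.g. drop traffic that matches the cgroup2 subtree */
	return ret == 1 ? TC_ACT_SHOT : TC_ACT_OK;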

Signed-off-by: Martin KaFai Lau 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Tejun Heo 
Acked-by: Alexei Starovoitov 
---
  include/uapi/linux/bpf.h | 12 
  kernel/bpf/verifier.c|  8 
  net/core/filter.c| 40 
  3 files changed, 60 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ef4e386..bad309f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -314,6 +314,18 @@ enum bpf_func_id {
 */
BPF_FUNC_skb_get_tunnel_opt,
BPF_FUNC_skb_set_tunnel_opt,
+
+   /**
+* bpf_skb_in_cgroup(skb, map, index) - Check cgroup2 membership of skb
+* @skb: pointer to skb
+* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
+* @index: index of the cgroup in the bpf_map
+* Return:
+*   == 0 skb failed the cgroup2 descendant test
+*   == 1 skb succeeded the cgroup2 descendant test
+*< 0 error
+*/
+   BPF_FUNC_skb_in_cgroup,
__BPF_FUNC_MAX_ID,
  };

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 668e079..68753e0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1062,6 +1062,10 @@ static int check_map_func_compatibility(struct bpf_map 
*map, int func_id)
if (func_id != BPF_FUNC_get_stackid)
goto error;
break;
+   case BPF_MAP_TYPE_CGROUP_ARRAY:
+   if (func_id != BPF_FUNC_skb_in_cgroup)
+   goto error;
+   break;


I think the BPF_MAP_TYPE_CGROUP_ARRAY case should have been first here in
patch 2/4, but with unconditional goto error. And this one only adds the
'func_id != BPF_FUNC_skb_in_cgroup' test.


default:
break;
}
@@ -1081,6 +1085,10 @@ static int check_map_func_compatibility(struct bpf_map 
*map, int func_id)
if (map->map_type != BPF_MAP_TYPE_STACK_TRACE)
goto error;
break;
+   case BPF_FUNC_skb_in_cgroup:
+   if (map->map_type != BPF_MAP_TYPE_CGROUP_ARRAY)
+   goto error;
+   break;
default:
break;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index df6860c..a16f7d2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2024,6 +2024,42 @@ bpf_get_skb_set_tunnel_proto(enum bpf_func_id which)
}
  }

+#ifdef CONFIG_CGROUPS
+static u64 bpf_skb_in_cgroup(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   struct sk_buff *skb = (struct sk_buff *)(long)r1;
+   struct bpf_map *map = (struct bpf_map *)(long)r2;
+   struct bpf_array *array = container_of(map, struct bpf_array, map);
+   struct cgroup *cgrp;
+   struct sock *sk;
+   u32 i = (u32)r3;
+
+   WARN_ON_ONCE(!rcu_read_lock_held());


I think the WARN_ON_ONCE() test can be removed altogether. There are many
other functions without it. We really rely on the RCU read lock being held for
BPF programs (otherwise it would be horribly broken). F.e. it's kinda silly
that for some map updates/lookups we even go through this WARN_ON_ONCE() test
twice in the fast-path (once from the generic eBPF helper function
and then once again from the actual implementation, since it could also be
called from a syscall). The actual invocation points are not that many, and we
can make sure that the related call sites hold the RCU read lock.

Rest looks good to me, thanks.


+   sk = skb->sk;
+   if (!sk || !sk_fullsock(sk))
+   return -ENOENT;
+
+   if (unlikely(i >= array->map.max_entries))
+   return -E2BIG;
+
+   cgrp = READ_ONCE(array->ptrs[i]);
+   if (unlikely(!cgrp))
+   return -ENOENT;
+
+   return cgroup_is_descendant(sock_cgroup_ptr(&sk->sk_cgrp_data), cgrp);
+}
+
+static const struct bpf_func_proto bpf_skb_in_cgroup_proto = {
+   .func   = bpf_skb_in_cgroup,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_CONST_MAP_PTR,
+   .arg3_type  = ARG_ANYTHING,
+};
+#endif
+
  static const struct bpf_func_proto *
  sk_filter_func_proto(enum bpf_func_id func_id)
  {
@@ -2086,6 +2122,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_ge

Re: [PATCH net-next v2 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY

2016-06-23 Thread Daniel Borkmann

Hi Martin,

[ sorry to jump late in here, on pto currently ]

On 06/22/2016 11:17 PM, Martin KaFai Lau wrote:

Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations.
To update an element, the caller is expected to obtain a cgroup2 backed
fd by open(cgroup2_dir) and then update the array with that fd.

Signed-off-by: Martin KaFai Lau 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Tejun Heo 
Acked-by: Alexei Starovoitov 


Could you describe a bit more with regards to pinning maps and how this
should interact with cgroups? The two specialized array maps we have (tail
calls, perf events) have fairly complicated semantics for when to clean up
map slots (see commits c9da161c6517ba1, 3b1efb196eee45b2f0c4).

How is this managed with cgroups? Once a cgroup fd is placed into a map and
the user removes the cgroup, will this be prevented due to 'being busy', or
will the cgroup live on as long as a program is running with a cgroup map
entry (but the cgroup itself is not visible from user space in any way
anymore)?

I presume it's a valid use case to pin a cgroup map, put fds into it and
remove the pinned file expecting to continue to match on it, right? So the
lifetime really lasts until the last prog using the cgroup map somewhere gets
removed (even if it's not accessible from user space anymore, meaning no prog
holds an fd and the pinned file was removed).

I assume that using struct file here doesn't make sense (commit e03e7ee34fdd1c3)
either, right?
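
For illustration, here is a minimal user space sketch of the update path the
changelog describes (helper names are made up, the map is assumed to have been
created as BPF_MAP_TYPE_CGROUP_ARRAY with 4 byte keys/values elsewhere, and
__NR_bpf is assumed to be provided by the installed headers):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int bpf_map_update(int map_fd, const void *key, const void *value,
                          __u64 flags)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key    = (__u64)(unsigned long)key;
        attr.value  = (__u64)(unsigned long)value;
        attr.flags  = flags;

        return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

static int cgroup_array_set(int map_fd, __u32 slot, const char *cgrp2_dir)
{
        int cgrp_fd = open(cgrp2_dir, O_RDONLY | O_DIRECTORY);

        if (cgrp_fd < 0)
                return -1;
        /* the value written is the cgroup2 fd itself; the kernel resolves
         * it into a cgroup pointer via the map's map_fd_get_ptr() hook
         */
        return bpf_map_update(map_fd, &slot, &cgrp_fd, BPF_ANY);
}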

[...]

+#ifdef CONFIG_CGROUPS
+static void *cgroup_fd_array_get_ptr(struct bpf_map *map,
+struct file *map_file /* not used */,
+int fd)
+{
+   return cgroup_get_from_fd(fd);
+}
+
+static void cgroup_fd_array_put_ptr(void *ptr)
+{
+   /* cgroup_put free cgrp after a rcu grace period */
+   cgroup_put(ptr);


Yeah, as long as this respects freeing after RCU grace period, it's fine
like this ...


+}
+
+static void cgroup_fd_array_free(struct bpf_map *map)
+{
+   bpf_fd_array_map_clear(map);
+   fd_array_map_free(map);
+}
+
+static const struct bpf_map_ops cgroup_array_ops = {
+   .map_alloc = fd_array_map_alloc,
+   .map_free = cgroup_fd_array_free,
+   .map_get_next_key = array_map_get_next_key,
+   .map_lookup_elem = fd_array_map_lookup_elem,
+   .map_delete_elem = fd_array_map_delete_elem,
+   .map_fd_get_ptr = cgroup_fd_array_get_ptr,
+   .map_fd_put_ptr = cgroup_fd_array_put_ptr,
+};
+
+static struct bpf_map_type_list cgroup_array_type __read_mostly = {
+   .ops = &cgroup_array_ops,
+   .type = BPF_MAP_TYPE_CGROUP_ARRAY,
+};
+
+static int __init register_cgroup_array_map(void)
+{
+   bpf_register_map_type(&cgroup_array_type);
+   return 0;
+}
+late_initcall(register_cgroup_array_map);
+#endif
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c23a4e93..cac13f1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -393,7 +393,8 @@ static int map_update_elem(union bpf_attr *attr)
} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
err = bpf_percpu_array_update(map, key, value, attr->flags);
} else if (map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY ||
-  map->map_type == BPF_MAP_TYPE_PROG_ARRAY) {
+  map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
+  map->map_type == BPF_MAP_TYPE_CGROUP_ARRAY) {
rcu_read_lock();
err = bpf_fd_array_map_update_elem(map, f.file, key, value,
   attr->flags);





Re: [PATCH 0/4] net-next: mediatek: IRQ cleanups, fixes and grouping

2016-06-16 Thread Daniel Borkmann

On 06/16/2016 11:44 AM, John Crispin wrote:

On 16/06/2016 07:20, David Miller wrote:

From: John Crispin 
Date: Wed, 15 Jun 2016 16:58:46 +0200


This series contains 2 small code cleanups that are leftovers from the
MIPS support. There is also a small fix that adds proper locking to the
code accessing the IRQ registers. Without this fix we saw deadlocks caused
by the last patch of the series, which adds IRQ grouping. The grouping
feature allows us to use different IRQs for TX and RX. By doing so we can
use affinity to let the SoC handle the IRQs on different cores.


This patch series doesn't apply cleanly to the net-next tree, I get rejects
on patch #4.


It depends on the series with the 11 fixes that I sent last week, which is
however in the net tree and not the net-next tree (I also still had the DQL
hack in my tree). If I resend this at the start of the next rc1, would you
merge it into the net tree so that it becomes part of v4.8?


Well, it would rather require a merge of -net into -net-next then. But
that would be useful indeed.

Cheers,
Daniel


Re: [PATCH net-next 1/3] arm64: bpf: implement bpf_tail_call() helper

2016-06-06 Thread Daniel Borkmann

On 06/06/2016 06:56 AM, Z Lim wrote:
[...]

How about the attached patch? Fixes compilation error on build
!CONFIG_BPF_SYSCALL.

Also, should this patch be sent to net or net-next (along with this series)?


Looks good, feel free to add:

Acked-by: Daniel Borkmann 

I think net-next along with your series should be fine since the issue
first appeared there. Thanks, Zi!


Re: [PATCH net-next 1/3] arm64: bpf: implement bpf_tail_call() helper

2016-06-05 Thread Daniel Borkmann

On 06/05/2016 01:46 AM, kbuild test robot wrote:

Hi,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Zi-Shen-Lim/arm64-bpf-implement-bpf_tail_call-helper/20160605-060435
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=arm64

All errors (new ones prefixed by >>):

In file included from arch/arm64/net/bpf_jit_comp.c:21:0:
include/linux/bpf.h: In function 'bpf_prog_get':

include/linux/bpf.h:235:9: error: implicit declaration of function 'ERR_PTR' 
[-Werror=implicit-function-declaration]

  return ERR_PTR(-EOPNOTSUPP);
 ^
include/linux/bpf.h:235:9: warning: return makes pointer from integer 
without a cast [-Wint-conversion]
In file included from include/linux/rwsem.h:17:0,
 from include/linux/mm_types.h:10,
 from include/linux/sched.h:27,
 from arch/arm64/include/asm/compat.h:25,
 from arch/arm64/include/asm/stat.h:23,
 from include/linux/stat.h:5,
 from include/linux/compat.h:12,
 from include/linux/filter.h:10,
 from arch/arm64/net/bpf_jit_comp.c:22:
include/linux/err.h: At top level:

include/linux/err.h:23:35: error: conflicting types for 'ERR_PTR'

 static inline void * __must_check ERR_PTR(long error)
   ^
In file included from arch/arm64/net/bpf_jit_comp.c:21:0:
include/linux/bpf.h:235:9: note: previous implicit declaration of 'ERR_PTR' 
was here
  return ERR_PTR(-EOPNOTSUPP);
 ^
cc1: some warnings being treated as errors


Looks like including linux/bpf.h at the very beginning causes issues when bpf
syscall is disabled. We should probably just include linux/err.h from bpf.h.


Re: linux-next: manual merge of the net-next tree with the arm64 tree

2016-05-17 Thread Daniel Borkmann

On 05/17/2016 03:38 PM, Catalin Marinas wrote:

On Tue, May 17, 2016 at 09:12:34AM +0200, Daniel Borkmann wrote:

On 05/17/2016 09:03 AM, Geert Uytterhoeven wrote:
[...]

Someone's not gonna be happy with commit 606b5908 ("bpf: split
HAVE_BPF_JIT into cBPF and eBPF variant") breaking the sort order again...


Wasn't aware of that. Maybe I'm missing something, but there appears
to be no consistent ordering throughout ...

[...]
 select HAVE_PERF_REGS
 select HAVE_PERF_USER_STACK_DUMP
 select HAVE_RCU_TABLE_FREE
 select HAVE_SYSCALL_TRACEPOINTS
 select IOMMU_DMA if IOMMU_SUPPORT
 select IRQ_DOMAIN
 select IRQ_FORCED_THREADING
[...]
 select RTC_LIB
 select SPARSE_IRQ
 select SYSCTL_EXCEPTION_TRACE
 select HAVE_CONTEXT_TRACKING
 select HAVE_ARM_SMCCC
[...]


We keep fixing them as we merge other stuff. For example, latest
mainline has commit 8ee708792e1c ("arm64: Kconfig: remove redundant
HAVE_ARCH_TRANSPARENT_HUGEPAGE definition") which also fixes up the
Kconfig order.


Understood, thanks for the clarification (and sorry for the sort order
issue).


Re: [PATCH v2 net-next] bpf: arm64: remove callee-save registers use for tmp registers

2016-05-17 Thread Daniel Borkmann

On 05/17/2016 01:36 AM, Yang Shi wrote:

In the current implementation of the ARM64 eBPF JIT, R23 and R24 are used as
tmp registers, which are callee-saved registers. This leads to a variable size
of the JIT prologue and epilogue. The latest blinding constant change prefers
a constant size of prologue and epilogue. The AAPCS reserves R9 ~ R15 as temp
registers which need not be saved/restored during a function call. So, replace
R23 and R24 with R10 and R11, and remove the tmp_used flag to save 2
instructions for some jited BPF programs.

CC: Daniel Borkmann 
Acked-by: Zi Shen Lim 
Signed-off-by: Yang Shi 


LGTM, thanks!

Acked-by: Daniel Borkmann 


Re: linux-next: manual merge of the net-next tree with the arm64 tree

2016-05-17 Thread Daniel Borkmann

On 05/17/2016 09:03 AM, Geert Uytterhoeven wrote:
[...]

Someone's not gonna be happy with commit 606b5908 ("bpf: split
HAVE_BPF_JIT into cBPF and eBPF variant") breaking the sort order again...


Wasn't aware of that. Maybe I'm missing something, but there appears
to be no consistent ordering throughout ...

[...]
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
select HAVE_SYSCALL_TRACEPOINTS
select IOMMU_DMA if IOMMU_SUPPORT
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
[...]
select RTC_LIB
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
select HAVE_CONTEXT_TRACKING
select HAVE_ARM_SMCCC
[...]


Re: linux-next: manual merge of the net-next tree with the arm64 tree

2016-05-16 Thread Daniel Borkmann

On 05/17/2016 02:24 AM, Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

   arch/arm64/Kconfig

between commit:

   8ee708792e1c ("arm64: Kconfig: remove redundant HAVE_ARCH_TRANSPARENT_HUGEPAGE 
definition")

from the arm64 tree and commit:

   606b5908 ("bpf: split HAVE_BPF_JIT into cBPF and eBPF variant")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.


Diff looks good, thanks!


Re: [PATCH] tools: bpf_jit_disasm: check for klogctl failure

2016-05-05 Thread Daniel Borkmann

On 05/06/2016 12:39 AM, Colin King wrote:

From: Colin Ian King 

klogctl can fail and return -ve len, so check for this and
return NULL to avoid passing a (size_t)-1 to malloc.

Signed-off-by: Colin Ian King 


[ would be nice to get Cc'ed in future ... ]

Acked-by: Daniel Borkmann 
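
For reference, a rough user space sketch of the kind of check described above
(not the actual patch; the function name is made up and the syslog command
values are spelled out so the snippet is self-contained):

#include <stdlib.h>
#include <sys/klog.h>

static char *get_kernel_log(int *out_len)
{
        int len = klogctl(10, NULL, 0);         /* SYSLOG_ACTION_SIZE_BUFFER */
        char *buf;

        if (len < 0)                            /* klogctl() can fail ... */
                return NULL;                    /* ... don't malloc((size_t)-1) */

        buf = malloc(len);
        if (!buf)
                return NULL;

        len = klogctl(3, buf, len);             /* SYSLOG_ACTION_READ_ALL */
        if (len < 0) {
                free(buf);
                return NULL;
        }

        *out_len = len;
        return buf;
}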


[PATCH net-next v2 1/2] bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output

2016-04-18 Thread Daniel Borkmann
Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has
per-CPU ring buffers and the eBPF program pushes the data into the current
CPU's ring buffer which saves us an extra helper function call in eBPF.
Also, make sure to properly reserve the remaining flags which are not used.

Signed-off-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 4 
 kernel/trace/bpf_trace.c | 7 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70eda5a..b7b0fb1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -347,6 +347,10 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
 
+/* BPF_FUNC_perf_event_output flags. */
+#define BPF_F_INDEX_MASK   0xffffffffULL
+#define BPF_F_CURRENT_CPU  BPF_F_INDEX_MASK
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6855878..6bfe55c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -225,11 +225,12 @@ static const struct bpf_func_proto 
bpf_perf_event_read_proto = {
.arg2_type  = ARG_ANYTHING,
 };
 
-static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
+static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
 {
struct pt_regs *regs = (struct pt_regs *) (long) r1;
struct bpf_map *map = (struct bpf_map *) (long) r2;
struct bpf_array *array = container_of(map, struct bpf_array, map);
+   u64 index = flags & BPF_F_INDEX_MASK;
void *data = (void *) (long) r4;
struct perf_sample_data sample_data;
struct perf_event *event;
@@ -239,6 +240,10 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 
index, u64 r4, u64 size)
.data = data,
};
 
+   if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
+   return -EINVAL;
+   if (index == BPF_F_CURRENT_CPU)
+   index = raw_smp_processor_id();
if (unlikely(index >= array->map.max_entries))
return -E2BIG;
 
-- 
1.9.3
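
As an aside, a rough sketch of the user space side this assumes (per-CPU ring
buffers): one software event of type PERF_COUNT_SW_BPF_OUTPUT is opened per
CPU and its fd is stored into the perf event array map at the slot matching
the CPU id, so that BPF_F_CURRENT_CPU resolves to the right buffer. Names are
illustrative and not part of the patch:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_bpf_output_event(int cpu)
{
        struct perf_event_attr attr = {
                .size        = sizeof(attr),
                .type        = PERF_TYPE_SOFTWARE,
                .config      = PERF_COUNT_SW_BPF_OUTPUT,
                .sample_type = PERF_SAMPLE_RAW,
        };

        /* one event per CPU; the returned fd is then written into the
         * BPF_MAP_TYPE_PERF_EVENT_ARRAY at key == cpu via bpf(2)'s
         * BPF_MAP_UPDATE_ELEM, and the ring is mmap()ed for reading
         */
        return syscall(__NR_perf_event_open, &attr, -1 /* pid */, cpu,
                       -1 /* group_fd */, 0);
}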



[PATCH net-next v2 2/2] bpf: add event output helper for notifications/sampling/logging

2016-04-18 Thread Daniel Borkmann
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").

The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.

User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.

While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.

Adding this facility requires minimal effort; it reuses the helper introduced
in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and we get
its functionality for free by overloading its BPF_FUNC_ identifier for cls/act
programs. ctx is currently unused, but will be made use of in the future. An
example will be added to iproute2's BPF example files.

Signed-off-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |  2 ++
 kernel/bpf/core.c|  7 +++
 kernel/trace/bpf_trace.c | 27 +++
 net/core/filter.c|  2 ++
 4 files changed, 38 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5fb3c61..f63afdc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -169,7 +169,9 @@ u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 
r5);
 u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 void bpf_fd_array_map_clear(struct bpf_map *map);
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
+
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
+const struct bpf_func_proto *bpf_get_event_output_proto(void);
 
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index be0abf6..e4248fe 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -764,14 +764,21 @@ const struct bpf_func_proto bpf_map_delete_elem_proto 
__weak;
 const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
 const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
 const struct bpf_func_proto bpf_ktime_get_ns_proto __weak;
+
 const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak;
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
+
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
return NULL;
 }
 
+const struct bpf_func_proto * __weak bpf_get_event_output_proto(void)
+{
+   return NULL;
+}
+
 /* Always built-in helper functions. */
 const struct bpf_func_proto bpf_tail_call_proto = {
.func   = NULL,
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6bfe55c..75cb154 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -277,6 +277,33 @@ static const struct bpf_func_proto 
bpf_perf_event_output_proto = {
.arg5_type  = ARG_CONST_STACK_SIZE,
 };
 
+static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
+
+static u64 bpf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+{
+   struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs);
+
+   perf_fetch_caller_regs(regs);
+
+   return bpf_perf_event_output((long)regs, r2, flags, r4, size);
+}
+
+static const struct bpf_func_proto bpf_event_output_proto = {
+   .func   = bpf_event_output,
+   .gpl_only   = true,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_CONST_MAP_PTR,
+   .arg3_type  = ARG_ANYTHING,
+   .arg4_type  = ARG_PTR_TO_STACK,
+   .arg5_type  = ARG_CONST_STACK_SIZE,
+};
+
+const struct bpf_func_proto *bpf_get_event_output_proto(void)
+{
+   return &bpf_event_output_proto;
+}
+
 static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id 
func_id)
 {
switch (func_id) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 5d2ac2b..218e5d

[PATCH net-next v2 0/2] BPF updates

2016-04-18 Thread Daniel Borkmann
This minor set adds a new helper bpf_event_output() for eBPF cls/act
program types which allows to pass events to user space applications.
For details, please see individual patches.

Thanks!

v1 -> v2:
  - Address compile issue in patch 2 found by the kbuild bot
  - Rest as is

Daniel Borkmann (2):
  bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output
  bpf: add event output helper for notifications/sampling/logging

 include/linux/bpf.h  |  2 ++
 include/uapi/linux/bpf.h |  4 
 kernel/bpf/core.c|  7 +++
 kernel/trace/bpf_trace.c | 34 +-
 net/core/filter.c|  2 ++
 5 files changed, 48 insertions(+), 1 deletion(-)

-- 
1.9.3



Re: [PATCH net-next 2/2] bpf: add event output helper for notifications/sampling/logging

2016-04-18 Thread Daniel Borkmann

On 04/18/2016 01:55 AM, kbuild test robot wrote:

Hi Daniel,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Daniel-Borkmann/bpf-trace-add-BPF_F_CURRENT_CPU-flag-for-bpf_perf_event_output/20160418-063147
config: m68k-allyesconfig (attached as .config)
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=m68k

All errors (new ones prefixed by >>):

net/core/filter.c: In function 'bpf_skb_event_output':

net/core/filter.c:1599:2: error: implicit declaration of function 
'perf_fetch_caller_regs' [-Werror=implicit-function-declaration]

  perf_fetch_caller_regs(regs);
  ^
cc1: some warnings being treated as errors

vim +/perf_fetch_caller_regs +1599 net/core/filter.c

   1593 static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
   1594 
   1595 static u64 bpf_skb_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 
size)
   1596 {
   1597 struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs);
   1598 

1599perf_fetch_caller_regs(regs);

   1600 
   1601 return bpf_perf_event_output((long)regs, r2, flags, r4, size);
   1602 }


Sorry about that, missed this one. Will fix it up in v2!

Thanks,
Daniel


[PATCH net-next 0/2] BPF updates

2016-04-17 Thread Daniel Borkmann
This minor set adds a new helper bpf_skb_event_output() for eBPF cls/act
program types which allows to pass events to user space applications. For
details, please see individual patches.

Thanks!

Daniel Borkmann (2):
  bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output
  bpf: add event output helper for notifications/sampling/logging

 include/linux/bpf.h  |  2 ++
 include/uapi/linux/bpf.h |  4 
 kernel/trace/bpf_trace.c |  7 ++-
 net/core/filter.c| 30 ++
 4 files changed, 42 insertions(+), 1 deletion(-)

-- 
1.9.3



[PATCH net-next 1/2] bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output

2016-04-17 Thread Daniel Borkmann
Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has
per-CPU ring buffers and the eBPF program pushes the data into the current
CPU's ring buffer which saves us an extra helper function call in eBPF.
Also, make sure to properly reserve the remaining flags which are not used.

Signed-off-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 4 
 kernel/trace/bpf_trace.c | 7 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70eda5a..b7b0fb1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -347,6 +347,10 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
 
+/* BPF_FUNC_perf_event_output flags. */
+#define BPF_F_INDEX_MASK   0xffffffffULL
+#define BPF_F_CURRENT_CPU  BPF_F_INDEX_MASK
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6855878..6bfe55c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -225,11 +225,12 @@ static const struct bpf_func_proto 
bpf_perf_event_read_proto = {
.arg2_type  = ARG_ANYTHING,
 };
 
-static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
+static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
 {
struct pt_regs *regs = (struct pt_regs *) (long) r1;
struct bpf_map *map = (struct bpf_map *) (long) r2;
struct bpf_array *array = container_of(map, struct bpf_array, map);
+   u64 index = flags & BPF_F_INDEX_MASK;
void *data = (void *) (long) r4;
struct perf_sample_data sample_data;
struct perf_event *event;
@@ -239,6 +240,10 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 
index, u64 r4, u64 size)
.data = data,
};
 
+   if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
+   return -EINVAL;
+   if (index == BPF_F_CURRENT_CPU)
+   index = raw_smp_processor_id();
if (unlikely(index >= array->map.max_entries))
return -E2BIG;
 
-- 
1.9.3



[PATCH net-next 2/2] bpf: add event output helper for notifications/sampling/logging

2016-04-17 Thread Daniel Borkmann
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").

The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.

User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.

While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.

Adding this facility requires minimal effort; it reuses the helper introduced
in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and we get
its functionality for free by overloading its BPF_FUNC_ identifier for cls/act
programs. ctx is currently unused, but will be made use of in the future. An
example will be added to iproute2's BPF example files.

Signed-off-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |  2 ++
 kernel/trace/bpf_trace.c |  2 +-
 net/core/filter.c| 30 ++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5fb3c61..2116924 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -167,6 +167,8 @@ struct bpf_array {
 
 u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
 u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size);
+
 void bpf_fd_array_map_clear(struct bpf_map *map);
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6bfe55c..7fe2f22 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -225,7 +225,7 @@ static const struct bpf_func_proto 
bpf_perf_event_read_proto = {
.arg2_type  = ARG_ANYTHING,
 };
 
-static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
 {
struct pt_regs *regs = (struct pt_regs *) (long) r1;
struct bpf_map *map = (struct bpf_map *) (long) r2;
diff --git a/net/core/filter.c b/net/core/filter.c
index 5d2ac2b..3521252 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1584,6 +1585,33 @@ static const struct bpf_func_proto bpf_csum_diff_proto = 
{
.arg5_type  = ARG_ANYTHING,
 };
 
+u64 __weak bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+{
+   return -EACCES;
+}
+
+static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
+
+static u64 bpf_skb_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
+{
+   struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs);
+
+   perf_fetch_caller_regs(regs);
+
+   return bpf_perf_event_output((long)regs, r2, flags, r4, size);
+}
+
+static const struct bpf_func_proto bpf_skb_event_output_proto = {
+   .func   = bpf_skb_event_output,
+   .gpl_only   = true,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_CONST_MAP_PTR,
+   .arg3_type  = ARG_ANYTHING,
+   .arg4_type  = ARG_PTR_TO_STACK,
+   .arg5_type  = ARG_CONST_STACK_SIZE,
+};
+
 static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 flags, u64 r4, u64 r5)
 {
struct sk_buff *skb = (struct sk_buff *) (long) r1, *skb2;
@@ -2039,6 +2067,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_redirect_proto;
case BPF_FUNC_get_route_realm:
return &bpf_get_route_realm_proto;
+   case BPF_FUNC_perf_event_output:
+   return &bpf_skb_event_output_proto;
default:
return sk_filter_func_proto(func_id);
}
-- 
1.9.3
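
For illustration, a minimal restricted C sketch of a cls/act program using the
helper as described in the changelog above. The wrapper and macro definitions
are spelled out inline so the snippet is self-contained; in practice they come
from iproute2's or samples/bpf's BPF headers. Map and section names are
illustrative, and BPF_F_CURRENT_CPU assumes headers with this series applied:

#include <linux/bpf.h>

#define SEC(name) __attribute__((section(name), used))

static int (*bpf_perf_event_output)(void *ctx, void *map,
                                    unsigned long long flags,
                                    void *data, int size) =
        (void *) BPF_FUNC_perf_event_output;

struct bpf_map_def {
        unsigned int type;
        unsigned int key_size;
        unsigned int value_size;
        unsigned int max_entries;
};

struct bpf_map_def SEC("maps") event_map = {
        .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size    = sizeof(int),
        .value_size  = sizeof(int),
        .max_entries = 64,      /* >= number of possible CPUs */
};

SEC("classifier")
int cls_sample(struct __sk_buff *skb)
{
        struct {
                unsigned int len;
                unsigned int proto;
        } raw = {
                .len   = skb->len,
                .proto = skb->protocol,
        };

        /* push the raw data into the current CPU's ring buffer slot */
        bpf_perf_event_output(skb, &event_map, BPF_F_CURRENT_CPU,
                              &raw, sizeof(raw));
        return 0;       /* verdict handling is out of scope for this sketch */
}

char _license[] SEC("license") = "GPL";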



Re: [PATCH] sctp: Fix error handling for switch statement case in the function sctp_cmd_interprete

2016-04-05 Thread Daniel Borkmann

On 04/05/2016 11:36 PM, Bastien Philbert wrote:

This fixes error handling for the switch statement case
SCTP_CMD_SEND_PKT by making the error value of the call
to sctp_packet_transmit equal the variable error due to
this function being able to fail with an error code. In


What actual issue have you observed that you fix?


addition allow the call to sctp_ootb_pkt_free afterwards
to free up the no longer in use sctp packet even if the
call to the function sctp_packet_transmit fails in order
to avoid a memory leak here for not freeing the sctp


Not sure how this relates to your code?


Signed-off-by: Bastien Philbert 
---
  net/sctp/sm_sideeffect.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index 7fe56d0..f3a8b58 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -1434,7 +1434,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
case SCTP_CMD_SEND_PKT:
/* Send a full packet to our peer.  */
packet = cmd->obj.packet;
-   sctp_packet_transmit(packet, gfp);
+   error = sctp_packet_transmit(packet, gfp);
sctp_ootb_pkt_free(packet);
break;






Re: [PATCH net 4/4] lib/test_bpf: Add additional BPF_ADD tests

2016-04-05 Thread Daniel Borkmann

On 04/05/2016 12:02 PM, Naveen N. Rao wrote:

Some of these tests proved useful with the powerpc eBPF JIT port due to
sign-extended 16-bit immediate loads. Though some of these aspects get
covered in other tests, it is better to have explicit tests so as to
quickly tag the precise problem.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Signed-off-by: Naveen N. Rao 


Thanks for adding these!

Acked-by: Daniel Borkmann 


Re: [PATCH net 3/4] lib/test_bpf: Add test to check for result of 32-bit add that overflows

2016-04-05 Thread Daniel Borkmann

On 04/05/2016 12:02 PM, Naveen N. Rao wrote:

BPF_ALU32 and BPF_ALU64 tests for adding two 32-bit values that results in
32-bit overflow.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Signed-off-by: Naveen N. Rao 


Acked-by: Daniel Borkmann 


Re: [PATCH net 2/4] lib/test_bpf: Add tests for unsigned BPF_JGT

2016-04-05 Thread Daniel Borkmann

On 04/05/2016 12:02 PM, Naveen N. Rao wrote:

Unsigned Jump-if-Greater-Than.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Signed-off-by: Naveen N. Rao 


Acked-by: Daniel Borkmann 


Re: [PATCH net 1/4] lib/test_bpf: Fix JMP_JSET tests

2016-04-05 Thread Daniel Borkmann

On 04/05/2016 12:02 PM, Naveen N. Rao wrote:

JMP_JSET tests incorrectly used BPF_JNE. Fix the same.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Signed-off-by: Naveen N. Rao 


Acked-by: Daniel Borkmann 


---
  lib/test_bpf.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 27a7a26..e76fa4d 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -4303,7 +4303,7 @@ static struct bpf_test tests[] = {
.u.insns_int = {
BPF_ALU32_IMM(BPF_MOV, R0, 0),
BPF_LD_IMM64(R1, 3),
-   BPF_JMP_IMM(BPF_JNE, R1, 2, 1),
+   BPF_JMP_IMM(BPF_JSET, R1, 2, 1),
BPF_EXIT_INSN(),
BPF_ALU32_IMM(BPF_MOV, R0, 1),
BPF_EXIT_INSN(),
@@ -4317,7 +4317,7 @@ static struct bpf_test tests[] = {
.u.insns_int = {
BPF_ALU32_IMM(BPF_MOV, R0, 0),
BPF_LD_IMM64(R1, 3),
-   BPF_JMP_IMM(BPF_JNE, R1, 0xffffffff, 1),
+   BPF_JMP_IMM(BPF_JSET, R1, 0xffffffff, 1),
BPF_EXIT_INSN(),
BPF_ALU32_IMM(BPF_MOV, R0, 1),
BPF_EXIT_INSN(),
@@ -4474,7 +4474,7 @@ static struct bpf_test tests[] = {
BPF_ALU32_IMM(BPF_MOV, R0, 0),
BPF_LD_IMM64(R1, 3),
BPF_LD_IMM64(R2, 2),
-   BPF_JMP_REG(BPF_JNE, R1, R2, 1),
+   BPF_JMP_REG(BPF_JSET, R1, R2, 1),
BPF_EXIT_INSN(),
BPF_ALU32_IMM(BPF_MOV, R0, 1),
BPF_EXIT_INSN(),
@@ -4489,7 +4489,7 @@ static struct bpf_test tests[] = {
BPF_ALU32_IMM(BPF_MOV, R0, 0),
BPF_LD_IMM64(R1, 3),
BPF_LD_IMM64(R2, 0xffffffff),
-   BPF_JMP_REG(BPF_JNE, R1, R2, 1),
+   BPF_JMP_REG(BPF_JSET, R1, R2, 1),
BPF_EXIT_INSN(),
BPF_ALU32_IMM(BPF_MOV, R0, 1),
BPF_EXIT_INSN(),





Re: [RFC PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF

2016-04-01 Thread Daniel Borkmann

On 04/01/2016 08:10 PM, Alexei Starovoitov wrote:

On 4/1/16 2:58 AM, Naveen N. Rao wrote:

PPC64 eBPF JIT compiler. Works for both ABIv1 and ABIv2.

Enable with:
echo 1 > /proc/sys/net/core/bpf_jit_enable
or
echo 2 > /proc/sys/net/core/bpf_jit_enable

... to see the generated JIT code. This can further be processed with
tools/net/bpf_jit_disasm.

With CONFIG_TEST_BPF=m and 'modprobe test_bpf':
test_bpf: Summary: 291 PASSED, 0 FAILED, [234/283 JIT'ed]

... on both ppc64 BE and LE.

The details of the approach are documented through various comments in
the code, as are the TODOs. Some of the prominent TODOs include
implementing BPF tail calls and skb loads.

Cc: Matt Evans 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Alexei Starovoitov 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Signed-off-by: Naveen N. Rao 
---
  arch/powerpc/include/asm/ppc-opcode.h |  19 +-
  arch/powerpc/net/Makefile |   4 +
  arch/powerpc/net/bpf_jit.h|  66 ++-
  arch/powerpc/net/bpf_jit64.h  |  58 +++
  arch/powerpc/net/bpf_jit_comp64.c | 828 ++
  5 files changed, 973 insertions(+), 2 deletions(-)
  create mode 100644 arch/powerpc/net/bpf_jit64.h
  create mode 100644 arch/powerpc/net/bpf_jit_comp64.c

...

-#ifdef CONFIG_PPC64
+#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2)


impressive stuff!


+1, awesome to see another one!


Everything nicely documented. Could you add few words for the above
condition as well ?
Or may be a new macro, since it occurs many times?
What are these _CALL_ELF == 2 and != 2 conditions mean? ppc ABIs ?
Will there ever be v3 ?


Minor TODO would also be to convert to use bpf_jit_binary_alloc() and
bpf_jit_binary_free() API for the image, which is done by other eBPF
jits, too.


So far most of the bpf jits were going via net-next tree, but if
in this case no changes to the core is necessary then I guess it's fine
to do it via powerpc tree. What's your plan?





Re: [PATCH 2/4] samples/bpf: Use llc in PATH, rather than a hardcoded value

2016-03-31 Thread Daniel Borkmann

On 03/31/2016 07:46 PM, Alexei Starovoitov wrote:

On 3/31/16 4:25 AM, Naveen N. Rao wrote:

While at it, fix some typos in the comment.

Cc: Alexei Starovoitov 
Cc: David S. Miller 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Signed-off-by: Naveen N. Rao 
---
  samples/bpf/Makefile | 11 ---
  1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 502c9fc..88bc5a0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -76,16 +76,13 @@ HOSTLOADLIBES_offwaketime += -lelf
  HOSTLOADLIBES_spintest += -lelf
  HOSTLOADLIBES_map_perf_test += -lelf -lrt

-# point this to your LLVM backend with bpf support
-LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
-
-# asm/sysreg.h inline assmbly used by it is incompatible with llvm.
-# But, ehere is not easy way to fix it, so just exclude it since it is
+# asm/sysreg.h - inline assembly used by it is incompatible with llvm.
+# But, there is no easy way to fix it, so just exclude it since it is
  # useless for BPF samples.
  $(obj)/%.o: $(src)/%.c
  clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
  -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
--O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
+-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=obj -o $@
  clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
  -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
--O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s
+-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=asm -o $@.s


that was a workaround when clang/llvm didn't have bpf support.
Now clang 3.7 and 3.8 have bpf built-in, so make sense to remove
manual calls to llc completely.
Just use 'clang -target bpf -O2 -D... -c $< -o $@'


+1, the clang part in that Makefile should also more correctly be called
with '-target bpf' as it turns out (despite llc with '-march=bpf' ...).
Better to use clang directly as suggested by Alexei.


Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!

2016-03-30 Thread Daniel Borkmann

On 03/30/2016 02:24 PM, Michal Kubecek wrote:

On Wed, Mar 30, 2016 at 01:33:44PM +0200, Daniel Borkmann wrote:

On 03/30/2016 11:42 AM, Michal Kubecek wrote:


I'm just not sure checking if we hold the right lock depending on caller
is worth the extra complexity. After all, what is really needed is to
hold _some_ lock guaranteeing sk_attach_prog() and sk_detach_filter()
are safe so that just changing the condition in both to

   sock_owned_by_user(sk) || lockdep_rtnl_is_held()


It would certainly silence it, but would be less accurate in terms of lock
proving as opposed to the diff above. E.g. rntl could be held elsewhere,
while someone attaches a socket filter w/o having locked the socket (currently
not the case, but it would kind of defeat the purpose of 
rcu_dereference_protected()
here). Was thinking about using a extra socket flag to indicate it's
externally managed, but it's not really worth wasting sk's flags bit
space just for this corner case.


Originally my reasoning was that to actually hide a locking issue from
lockdep, this would have to happen every time we get down into the
function which is unlikely. But thinking about it again, this code path
is not so frequent and the fuzzers tend to do strange things so that it
could really happen.


In this case actually nothing too fancy, it just seems that filters on tap devices
might not really be used by anyone (the issue is already a couple of years old).


Sasha/Jiri, could you test the patch with your testcases? I received it
corrupted (strange leading whitespaces) so I better add cleaned up
version below.


Tested this last night on my machine with PROVE_RCU + PROVE_RCU_REPEATEDLY
enabled; it can easily be triggered with a simple ioctl(tun_fd,
TUN{ATTACH,DETACH}FILTER, ...) on a tap device, and the patch now silences
it. Sorry for the whitespace damage (should have just attached it), I'll send
it later today.

Thanks,
Daniel


Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!

2016-03-30 Thread Daniel Borkmann

On 03/30/2016 11:42 AM, Michal Kubecek wrote:

On Tue, Mar 29, 2016 at 04:39:43PM +0200, Daniel Borkmann wrote:



diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950617c3..7417d7c20bab 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned 
long arg)
  static void tun_detach_filter(struct tun_struct *tun, int n)
  {
  int i;
-struct tun_file *tfile;

  for (i = 0; i < n; i++) {
-tfile = rtnl_dereference(tun->tfiles[i]);
-sk_detach_filter(tfile->socket.sk);
+struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk;
+
+lock_sock(sk);
+sk_detach_filter(sk);
+release_sock(sk);
  }

  tun->filter_attached = false;



In tun case, the control path for tun_attach_filter() and tun_detach_filter()
is under RTNL lock (held in __tun_chr_ioctl()).

So in the BPF core the rcu_dereference_protected(sk->sk_filter,
sock_owned_by_user(sk)) looks like a false positive in this specific use case
to me, which we should probably just silence.

Running the filter via sk_filter() in tun device happens under rcu_read_lock(),
so the dereference and assignment pair seems okay to me.

Was wondering whether we should convert this to unattached BPF filter, but this
would break with existing expectations from sk_filter() (e.g. security modules).


If we want to silence it, could be something like the below (only 
compile-tested):

  drivers/net/tun.c  |  8 +---
  include/linux/filter.h |  4 
  net/core/filter.c  | 33 +
  3 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950..510e90a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -622,7 +622,8 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file, bool skip_filte

/* Re-attach the filter to persist device */
if (!skip_filter && (tun->filter_attached == true)) {
-   err = sk_attach_filter(&tun->fprog, tfile->socket.sk);
+   err = __sk_attach_filter(&tun->fprog, tfile->socket.sk,
+lockdep_rtnl_is_held());
if (!err)
goto out;
}
@@ -1822,7 +1823,7 @@ static void tun_detach_filter(struct tun_struct *tun, int 
n)

for (i = 0; i < n; i++) {
tfile = rtnl_dereference(tun->tfiles[i]);
-   sk_detach_filter(tfile->socket.sk);
+   __sk_detach_filter(tfile->socket.sk, lockdep_rtnl_is_held());
}

tun->filter_attached = false;
@@ -1835,7 +1836,8 @@ static int tun_attach_filter(struct tun_struct *tun)

for (i = 0; i < tun->numqueues; i++) {
tfile = rtnl_dereference(tun->tfiles[i]);
-   ret = sk_attach_filter(&tun->fprog, tfile->socket.sk);
+   ret = __sk_attach_filter(&tun->fprog, tfile->socket.sk,
+lockdep_rtnl_is_held());
if (ret) {
tun_detach_filter(tun, i);
return ret;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 43aa1f8..a51a536 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -465,10 +465,14 @@ int bpf_prog_create_from_user(struct bpf_prog **pfp, 
struct sock_fprog *fprog,
  void bpf_prog_destroy(struct bpf_prog *fp);

  int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+int __sk_attach_filter(struct sock_fprog *fprog, struct sock *sk,
+  bool locked);
  int sk_attach_bpf(u32 ufd, struct sock *sk);
  int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk);
  int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk);
  int sk_detach_filter(struct sock *sk);
+int __sk_detach_filter(struct sock *sk, bool locked);
+
  int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
  unsigned int len);

diff --git a/net/core/filter.c b/net/core/filter.c
index 2429918..02f2f6c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1149,7 +1149,8 @@ void bpf_prog_destroy(struct bpf_prog *fp)
  }
  EXPORT_SYMBOL_GPL(bpf_prog_destroy);

-static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
+static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk,
+   bool locked)
  {
struct sk_filter *fp, *old_fp;

@@ -1165,10 +1166,8 @@ static int __sk_attach_prog(struct bpf_prog *prog, 
struct sock *sk)
return -ENOMEM;
}

-   old_fp = rcu_dereference_protected(sk->sk_filter,
-  sock_owned_by_user(sk));
+   old_fp = rcu_dereference_protected(sk->sk_filter, locked);
rcu_assign_pointer(sk->sk_filter, fp);
-
if (old_fp)
sk_filt

Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!

2016-03-29 Thread Daniel Borkmann

On 03/29/2016 03:55 PM, Daniel Borkmann wrote:

[ dropping my old email address ]

On 03/29/2016 02:58 PM, Michal Kubecek wrote:

On Mon, Feb 22, 2016 at 10:31:33AM -0500, Sasha Levin wrote:


I've hit the following warning while fuzzing with trinity inside a kvmtool guest
running the latest -next kernel:

[ 1343.104588] ===
[ 1343.104591] [ INFO: suspicious RCU usage. ]
[ 1343.104619] 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 Not 
tainted
[ 1343.104624] ---
[ 1343.104635] net/core/filter.c:2115 suspicious rcu_dereference_protected() 
usage!
[ 1343.104641]
[ 1343.104641] other info that might help us debug this:
[ 1343.104641]
[ 1343.104650]
[ 1343.104650] rcu_scheduler_active = 1, debug_locks = 0
[ 1343.104660] 1 lock held by syz-executor/17916:
[ 1343.104784] #0: (rtnl_mutex){+.+.+.}, at: rtnl_lock (net/core/rtnetlink.c:71)
[ 1343.104789]
[ 1343.104789] stack backtrace:
[ 1343.104820] CPU: 1 PID: 17916 Comm: trinity-c8 Not tainted 
4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978
[ 1343.104868]  110036968f44 8801b4b47aa8 a23d9a9d 
0001
[ 1343.104891]  fbfff5c2a630 41b58ab3 adb3a8f2 
a23d9905
[ 1343.104914]   8801b5419b40 fbfff7596522 
0001
[ 1343.104919] Call Trace:
[ 1343.104985] dump_stack (lib/dump_stack.c:53)
[ 1343.105060] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4282)
[ 1343.105093] sk_detach_filter (net/core/filter.c:2114 (discriminator 5))
[ 1343.105193] tun_detach_filter (drivers/net/tun.c:1808 (discriminator 7))
[ 1343.105238] __tun_chr_ioctl (drivers/net/tun.c:2133)
[ 1343.105370] tun_chr_ioctl (drivers/net/tun.c:2161)
[ 1343.105407] do_vfs_ioctl (fs/ioctl.c:44 fs/ioctl.c:674)
[ 1343.105506] SyS_ioctl (fs/ioctl.c:689 fs/ioctl.c:680)
[ 1343.105542] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:200)


Looks like sk_detach_filter() wants the socket to be owned, which
tun_detach_filter() does not do, unlike sock_setsockopt(). Could you
check if the patch below helps?

I'm also not really sure if it is safe to ignore return value of
sk_detach_filter() and just sets tun->filter_attached to false - but
it's not really clear what should be done if one of the calls fails
after some succeeded.


Wrt return value, afaik SOCK_FILTER_LOCKED cannot be set for tun devs, so we
should be okay.


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950617c3..7417d7c20bab 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned 
long arg)
  static void tun_detach_filter(struct tun_struct *tun, int n)
  {
  int i;
-struct tun_file *tfile;

  for (i = 0; i < n; i++) {
-tfile = rtnl_dereference(tun->tfiles[i]);
-sk_detach_filter(tfile->socket.sk);
+struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk;
+
+lock_sock(sk);
+sk_detach_filter(sk);
+release_sock(sk);
  }

  tun->filter_attached = false;



In tun case, the control path for tun_attach_filter() and tun_detach_filter()
is under RTNL lock (held in __tun_chr_ioctl()).

So in the BPF core the rcu_dereference_protected(sk->sk_filter,
sock_owned_by_user(sk)) looks like a false positive in this specific use case
to me, which we should probably just silence.

Running the filter via sk_filter() in tun device happens under rcu_read_lock(),
so the dereference and assignment pair seems okay to me.

Was wondering whether we should convert this to unattached BPF filter, but this
would break with existing expectations from sk_filter() (e.g. security modules).


If we want to silence it, could be something like the below (only 
compile-tested):

 drivers/net/tun.c  |  8 +---
 include/linux/filter.h |  4 
 net/core/filter.c  | 33 +
 3 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950..510e90a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -622,7 +622,8 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file, bool skip_filte

/* Re-attach the filter to persist device */
if (!skip_filter && (tun->filter_attached == true)) {
-   err = sk_attach_filter(&tun->fprog, tfile->socket.sk);
+   err = __sk_attach_filter(&tun->fprog, tfile->socket.sk,
+lockdep_rtnl_is_held());
if (!err)
goto out;
}
@@ -1822,7 +1823,7 @@ static void tun_detach_filter(struct tun_struct *tun, int 
n)

for (i = 0; i < n; i++) {
tfile = rtnl_dereference(tun->tfiles[i]);
-   sk_detach_filter(tfile->socket.sk);
+   __sk_detach_filter(tfile->socket.sk, lockdep_rtnl_is_held());
  

Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!

2016-03-29 Thread Daniel Borkmann

[ dropping my old email address ]

On 03/29/2016 02:58 PM, Michal Kubecek wrote:

On Mon, Feb 22, 2016 at 10:31:33AM -0500, Sasha Levin wrote:


I've hit the following warning while fuzzing with trinity inside a kvmtool guest
running the latest -next kernel:

[ 1343.104588] ===
[ 1343.104591] [ INFO: suspicious RCU usage. ]
[ 1343.104619] 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 Not 
tainted
[ 1343.104624] ---
[ 1343.104635] net/core/filter.c:2115 suspicious rcu_dereference_protected() 
usage!
[ 1343.104641]
[ 1343.104641] other info that might help us debug this:
[ 1343.104641]
[ 1343.104650]
[ 1343.104650] rcu_scheduler_active = 1, debug_locks = 0
[ 1343.104660] 1 lock held by syz-executor/17916:
[ 1343.104784] #0: (rtnl_mutex){+.+.+.}, at: rtnl_lock (net/core/rtnetlink.c:71)
[ 1343.104789]
[ 1343.104789] stack backtrace:
[ 1343.104820] CPU: 1 PID: 17916 Comm: trinity-c8 Not tainted 
4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978
[ 1343.104868]  110036968f44 8801b4b47aa8 a23d9a9d 
0001
[ 1343.104891]  fbfff5c2a630 41b58ab3 adb3a8f2 
a23d9905
[ 1343.104914]   8801b5419b40 fbfff7596522 
0001
[ 1343.104919] Call Trace:
[ 1343.104985] dump_stack (lib/dump_stack.c:53)
[ 1343.105060] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4282)
[ 1343.105093] sk_detach_filter (net/core/filter.c:2114 (discriminator 5))
[ 1343.105193] tun_detach_filter (drivers/net/tun.c:1808 (discriminator 7))
[ 1343.105238] __tun_chr_ioctl (drivers/net/tun.c:2133)
[ 1343.105370] tun_chr_ioctl (drivers/net/tun.c:2161)
[ 1343.105407] do_vfs_ioctl (fs/ioctl.c:44 fs/ioctl.c:674)
[ 1343.105506] SyS_ioctl (fs/ioctl.c:689 fs/ioctl.c:680)
[ 1343.105542] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:200)


Looks like sk_detach_filter() wants the socket to be owned, which
tun_detach_filter() does not do, unlike sock_setsockopt(). Could you
check if the patch below helps?

I'm also not really sure if it is safe to ignore return value of
sk_detach_filter() and just sets tun->filter_attached to false - but
it's not really clear what should be done if one of the calls fails
after some succeeded.


Wrt return value, afaik SOCK_FILTER_LOCKED cannot be set for tun devs, so we
should be okay.


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950617c3..7417d7c20bab 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned 
long arg)
  static void tun_detach_filter(struct tun_struct *tun, int n)
  {
int i;
-   struct tun_file *tfile;

for (i = 0; i < n; i++) {
-   tfile = rtnl_dereference(tun->tfiles[i]);
-   sk_detach_filter(tfile->socket.sk);
+   struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk;
+
+   lock_sock(sk);
+   sk_detach_filter(sk);
+   release_sock(sk);
}

tun->filter_attached = false;



In tun case, the control path for tun_attach_filter() and tun_detach_filter()
is under RTNL lock (held in __tun_chr_ioctl()).

So in the BPF core the rcu_dereference_protected(sk->sk_filter,
sock_owned_by_user(sk)) looks like a false positive in this specific use case
to me, which we should probably just silence.

Running the filter via sk_filter() in tun device happens under rcu_read_lock(),
so the dereference and assignment pair seems okay to me.

Was wondering whether we should convert this to unattached BPF filter, but this
would break with existing expectations from sk_filter() (e.g. security modules).


Re: [PATCH] bpf: doc: "neg" opcode has no operands

2016-03-29 Thread Daniel Borkmann

On 03/28/2016 11:56 PM, Kees Cook wrote:

From: Dave Anderson 

Fixes a copy-paste-o in the BPF opcode table: "neg" takes no arguments
and thus has no addressing modes.

Signed-off-by: Dave Anderson 
Signed-off-by: Kees Cook 


Acked-by: Daniel Borkmann 


Re: [PATCH net-next] bpf: avoid copying junk bytes in bpf_get_current_comm()

2016-03-11 Thread Daniel Borkmann

On 03/11/2016 06:20 PM, Alexei Starovoitov wrote:

On 3/11/16 2:24 AM, Daniel Borkmann wrote:

On 03/10/2016 05:02 AM, Alexei Starovoitov wrote:

Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee zero-terminated string.
bpf verifier checks that output buffer is zero-initialized,


Sorry for late reply, more below:


so even for short task names the output buffer doesn't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.

Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid,
comm accessors")
Reported-by: Tobias Waldekranz 
Signed-off-by: Alexei Starovoitov 

[...]

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4504ca66118d..50da680c479f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -166,7 +166,7 @@ static u64 bpf_get_current_comm(u64 r1, u64 size,
u64 r3, u64 r4, u64 r5)
  if (!task)
  return -EINVAL;

-memcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));
+strlcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));


If I see this correctly, __set_task_comm() makes sure comm is always zero
terminated, so that seems good, but isn't it already sufficient when
switching
to strlcpy() to simply use:

 strlcpy(buf, task->comm, size);

The min_t() seems like unnecessary work to me; why do we still need it? size
is guaranteed to be > 0 through the eBPF verifier, so strlcpy() should take
care of the rest.


that's one clever optimization. yep. we can drop min_t.
btw I wanted to add memset to __set_task_comm, keep memcpy in
bpf_get_current_comm and optimize perf_event_comm_event
(which is doing memset+strlcpy and can be replaced with memcpy),
but figured that such a 'fix' is not suitable for stable.
I guess we can do that in the next cycle? strlen is not cheap.
Especially since it turned out that bpf_get_current_comm() is
used very often in the hot path in bcc/tools.


Would strscpy() help in this case (see 30035e45753b ("string: provide
strscpy()"))?


Also for the next cycle I'm planning to extend verifier to
allow uninitialized stack to be passed to functions like
bpf_get_current_comm() and they would have to zero it in
error cases. Then we can save few more cycles from the programs.


That would be useful also for other helpers indeed.


Re: [PATCH net-next] bpf: avoid copying junk bytes in bpf_get_current_comm()

2016-03-11 Thread Daniel Borkmann

On 03/10/2016 05:02 AM, Alexei Starovoitov wrote:

Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee zero-terminated string.
bpf verifier checks that output buffer is zero-initialized,


Sorry for late reply, more below:


so even for short task names the output buffer doesn't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.

Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm 
accessors")
Reported-by: Tobias Waldekranz 
Signed-off-by: Alexei Starovoitov 

[...]

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4504ca66118d..50da680c479f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -166,7 +166,7 @@ static u64 bpf_get_current_comm(u64 r1, u64 size, u64 r3, 
u64 r4, u64 r5)
if (!task)
return -EINVAL;

-   memcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));
+   strlcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));


If I see this correctly, __set_task_comm() makes sure comm is always zero
terminated, so that seems good, but isn't it already sufficient when switching
to strlcpy() to simply use:

strlcpy(buf, task->comm, size);

The min_t() seems like unnecessary work to me; why do we still need it? size
is guaranteed to be > 0 through the eBPF verifier, so strlcpy() should take
care of the rest.
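
I.e., as a sketch mirroring the hunk quoted above, the helper body under that
assumption would shrink to something like (not necessarily the final patch):

static u64 bpf_get_current_comm(u64 r1, u64 size, u64 r3, u64 r4, u64 r5)
{
        struct task_struct *task = current;
        char *buf = (char *) (long) r1;

        if (!task)
                return -EINVAL;

        /* task->comm is always NUL terminated and size is > 0 per the
         * verifier, so strlcpy() both bounds the copy and terminates buf
         */
        strlcpy(buf, task->comm, size);
        return 0;
}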

Thanks,
Daniel


Re: linux-next: Tree for Mar 9 (net: bpf)

2016-03-09 Thread Daniel Borkmann

On 03/09/2016 06:18 PM, Randy Dunlap wrote:

On 03/09/16 09:07, Randy Dunlap wrote:

[...]

Does it handle where net/Kconfig symbol NET selects BPF unconditionally?


I got the patch from git and tested it.  Works.  Thanks.


Ahh, perfect, thanks a lot!


Re: linux-next: Tree for Mar 9 (net: bpf)

2016-03-09 Thread Daniel Borkmann

On 03/09/2016 06:07 PM, Randy Dunlap wrote:

On 03/09/16 08:48, Daniel Borkmann wrote:

On 03/09/2016 05:44 PM, Randy Dunlap wrote:

On 03/08/16 21:38, Stephen Rothwell wrote:

Hi all,

Changes since 20160308:


on x86_64:

../net/core/filter.c: In function 'bpf_skb_get_tunnel_opt':
../net/core/filter.c:1824:2: error: implicit declaration of function 
'ip_tunnel_info_opts_get' [-Werror=implicit-function-declaration]
ip_tunnel_info_opts_get(to, info);
^
../net/core/filter.c: In function 'bpf_skb_set_tunnel_opt':
../net/core/filter.c:1918:2: error: implicit declaration of function 
'ip_tunnel_info_opts_set' [-Werror=implicit-function-declaration]
ip_tunnel_info_opts_set(info, from, size);
^
Full randconfig file is attached.


Already fixed with net-next commit e28e87ed474c ("ip_tunnel, bpf:
ip_tunnel_info_opts_{get, set} depends on CONFIG_INET").

[...]

Is that patch on the netdev mailing list or patchwork?
I can't find it.


It is in patchwork archive or also here:

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e28e87ed474c5a0b378c66fb85efc8e487f4f63f

Was reported first by Fengguang's build bot i.e. his config had !CONFIG_INET
like in your case, so should resolve the issue.

Thanks,
Daniel


Re: linux-next: Tree for Mar 9 (net: bpf)

2016-03-09 Thread Daniel Borkmann

On 03/09/2016 05:44 PM, Randy Dunlap wrote:

On 03/08/16 21:38, Stephen Rothwell wrote:

Hi all,

Changes since 20160308:


on x86_64:

../net/core/filter.c: In function 'bpf_skb_get_tunnel_opt':
../net/core/filter.c:1824:2: error: implicit declaration of function 
'ip_tunnel_info_opts_get' [-Werror=implicit-function-declaration]
   ip_tunnel_info_opts_get(to, info);
   ^
../net/core/filter.c: In function 'bpf_skb_set_tunnel_opt':
../net/core/filter.c:1918:2: error: implicit declaration of function 
'ip_tunnel_info_opts_set' [-Werror=implicit-function-declaration]
   ip_tunnel_info_opts_set(info, from, size);
   ^
Full randconfig file is attached.


Already fixed with net-next commit e28e87ed474c ("ip_tunnel, bpf:
ip_tunnel_info_opts_{get, set} depends on CONFIG_INET").

Thanks,
Daniel


Re: [PATCH net-next 3/9] bpf: pre-allocate hash map elements

2016-03-07 Thread Daniel Borkmann

On 03/07/2016 02:58 AM, Alexei Starovoitov wrote:
[...]

---
  include/linux/bpf.h  |   1 +
  include/uapi/linux/bpf.h |   3 +
  kernel/bpf/hashtab.c | 264 ++-
  kernel/bpf/syscall.c |   2 +-
  4 files changed, 196 insertions(+), 74 deletions(-)


Shouldn't all other map types (like array) need something like this as well,
to reserve map_flags for their future flags?

  if (attr->map_flags)
return ERR_PTR(-EINVAL);


diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4b070827200d..c81efb10bbb5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -37,6 +37,7 @@ struct bpf_map {
u32 key_size;
u32 value_size;
u32 max_entries;
+   u32 map_flags;


Just naming this 'flags' doesn't work due to the anonymous struct inside that
union, right? :/


u32 pages;
struct user_struct *user;
const struct bpf_map_ops *ops;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6496f98d3d68..5eeb2ca9441e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,12 +101,15 @@ enum bpf_prog_type {
  #define BPF_NOEXIST   1 /* create new element if it didn't exist */
  #define BPF_EXIST 2 /* update existing element */

+#define BPF_F_NO_PREALLOC  (1ULL << 0)


Nit: Should better be (1U << 0) as map_flags are of __u32.


  union bpf_attr {
struct { /* anonymous struct used by BPF_MAP_CREATE command */
__u32   map_type;   /* one of enum bpf_map_type */
__u32   key_size;   /* size of key in bytes */
__u32   value_size; /* size of value in bytes */
__u32   max_entries;/* max number of entries in a map */
+   __u32   map_flags;  /* prealloc or not */
};

struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */


Thanks,
Daniel


Re: [PATCH net-next 2/9] bpf: introduce percpu_freelist

2016-03-07 Thread Daniel Borkmann

On 03/07/2016 02:58 AM, Alexei Starovoitov wrote:

Introduce simple percpu_freelist to keep single list of elements
spread across per-cpu singly linked lists.

/* push element into the list */
void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);

/* pop element from the list */
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);

The object is pushed onto the current cpu's list.
Pop first tries to get an object from the current cpu's list;
if that is empty, it goes to a neighbour cpu's list.

For bpf program usage pattern the collision rate is very low,
since programs push and pop the objects typically on the same cpu.

Signed-off-by: Alexei Starovoitov 


These bits and their usage in combination with preallocation of objects
in patch 3/9 look very useful to me!

This code seems generic enough and doesn't contain any BPF specifics,
other subsystems could potentially utilize it as well, I'd suggest that
this should better be placed under lib/ so it's exposed/visible for other
developers too.

You can still add 'F:' entries into the MAINTAINERS file to make sure
patches also hit netdev:

BPF (Safe dynamic programs and tools)
[...]
F:  kernel/bpf/
F:  lib/percpu_freelist.c
F:  include/linux/percpu_freelist.h

When BPF_SYSCALL is enabled, it would then just select these library bits
via Kconfig. The lib bits themselves can be a hidden Kconfig entry.

Thanks,
Daniel
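
As a side note for readers without the patch at hand, the push/pop behaviour
described in the commit message (push to the local cpu, pop locally first and
only then steal from a neighbour) maps onto a structure roughly like the one
below. This is a sketch of the idea, not the code from the patch, and it leaves
out the irq handling the real thing needs:

struct pcpu_freelist_node {
	struct pcpu_freelist_node *next;
};

struct pcpu_freelist_head {
	struct pcpu_freelist_node *first;
	raw_spinlock_t lock;
};

struct pcpu_freelist {
	struct pcpu_freelist_head __percpu *freelist;
};

/* push the element onto the current cpu's list */
void pcpu_freelist_push(struct pcpu_freelist *s, struct pcpu_freelist_node *node)
{
	struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);

	raw_spin_lock(&head->lock);
	node->next = head->first;
	head->first = node;
	raw_spin_unlock(&head->lock);
}

/* pop from the current cpu's list, falling back to the other cpus */
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
{
	struct pcpu_freelist_head *head;
	struct pcpu_freelist_node *node;
	int orig_cpu, cpu;

	orig_cpu = cpu = raw_smp_processor_id();
	do {
		head = per_cpu_ptr(s->freelist, cpu);
		raw_spin_lock(&head->lock);
		node = head->first;
		if (node)
			head->first = node->next;
		raw_spin_unlock(&head->lock);
		if (node)
			return node;
		cpu = cpumask_next(cpu, cpu_possible_mask);
		if (cpu >= nr_cpu_ids)
			cpu = 0;
	} while (cpu != orig_cpu);

	return NULL;
}

The nice property is that in the common case both operations only touch the
local cpu's lock, so contention only shows up once a cpu's list runs empty.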


Re: [PATCH net-next 1/9] bpf: prevent kprobe+bpf deadlocks

2016-03-07 Thread Daniel Borkmann

On 03/07/2016 02:58 AM, Alexei Starovoitov wrote:

If a kprobe is placed within the update or delete hash map helpers
that hold the bucket spinlock, and the triggered bpf program tries to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending the existing recursion prevention mechanism.

Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.

Signed-off-by: Alexei Starovoitov 


LGTM

Acked-by: Daniel Borkmann 
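
For context, the recursion prevention being extended here is essentially a
per-cpu guard around the syscall-side map update/delete path; a rough sketch of
the pattern (function and variable names are illustrative, not a quote of the
patch):

static DEFINE_PER_CPU(int, bpf_prog_active);

static int syscall_map_update(struct bpf_map *map, void *key, void *value,
			      u64 flags)
{
	int err;

	preempt_disable();
	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
		/* a bpf program (e.g. attached to a kprobe) is already
		 * running on this cpu and may hold the bucket lock, so
		 * back off instead of deadlocking
		 */
		err = -EBUSY;
		goto out;
	}
	rcu_read_lock();
	err = map->ops->map_update_elem(map, key, value, flags);
	rcu_read_unlock();
out:
	__this_cpu_dec(bpf_prog_active);
	preempt_enable();
	return err;
}

The tracing side does the mirror-image check before invoking the program, so
whichever side enters second simply backs off.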


Re: linux-next: manual merge of the net-next tree with the net tree

2016-03-03 Thread Daniel Borkmann

On 03/04/2016 03:09 AM, Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

   drivers/net/vxlan.c

between commit:

   4024fcf70556 ("vxlan: fix missing options_len update on RX with collect 
metadata")

from the net tree and commit:

   3288af0892e3 ("vxlan: move GBP header parsing to a separate function")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).


Looks good, thanks!


Re: linux-next: manual merge of the net-next tree with the net tree

2016-02-25 Thread Daniel Borkmann

On 02/26/2016 01:13 AM, Stephen Rothwell wrote:
[...]

I fixed it up (see below) and can carry the fix as necessary (no action
is required).


Looks good to me, thanks Stephen!

Best,
Daniel


Re: [PATCH v2] bpf: grab rcu read lock for bpf_percpu_hash_update

2016-02-19 Thread Daniel Borkmann

On 02/19/2016 07:53 PM, Sasha Levin wrote:

bpf_percpu_hash_update() expects rcu lock to be held and warns if it's not,
which pointed out a missing rcu read lock.

Fixes: 15a07b338 ("bpf: add lookup/update support for per-cpu hash and array 
maps")
Signed-off-by: Sasha Levin 


LGTM, patch is against net-next tree.

Acked-by: Daniel Borkmann 
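
For the archive, the fix essentially amounts to taking the expected read-side
section around the call in the syscall path, roughly (sketch, variable names
are illustrative):

	rcu_read_lock();
	err = bpf_percpu_hash_update(map, key, value, attr->flags);
	rcu_read_unlock();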


Re: [RFC][PATCH 00/10] Add trace event support to eBPF

2016-02-18 Thread Daniel Borkmann

On 02/18/2016 10:27 PM, Tom Zanussi wrote:

On Tue, 2016-02-16 at 20:51 -0800, Alexei Starovoitov wrote:

On Tue, Feb 16, 2016 at 04:35:27PM -0600, Tom Zanussi wrote:

On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:

[...]

Take a look at all the tools written on top of it:
https://github.com/iovisor/bcc/tree/master/tools


That's great, but it's all out-of-tree.  Supporting out-of-tree users
has never been justification for merging in-kernel code (or for blocking
it from being merged).


huh? perf is the only in-tree user space project.
All others tools and libraries are out-of-tree and that makes sense.


What about all the other things under tools/?


Actually would be great to merge bcc with perf eventually, but choice
of C++ isn't going to make it easy. The only real difference
between perf+bpf and bcc is that bcc integrates clang/llvm
as a library whereas perf+bpf deals with elf files and standalone compiler.
There are pros and cons for both and it's great that both are actively
growing and gaining user traction.


Why worry about merging bcc with perf?  Why not a tools/bcc?


It would indeed be great to have bcc internals merged (or rewritten) into
perf to some degree mid-term. tools/bcc doesn't make much sense to me,
as it really should be perf, where the rest of the eBPF front-end logic
that Wang et al initially integrated already resides. So efforts could
consolidate from that side.

The user could be given a choice whether his use-case is to load the
object file, or directly pass in a C file where perf does the rest. And
f.e. Brendan's tools could ship natively as perf "built-ins" that users
can go and try out from there directly and/or use the code as a starting
point for their own proglets.

Cheers,
Daniel


Re: [PATCH] vsprintf: do not append unset Scope ID to IPv6

2016-02-03 Thread Daniel Borkmann

On 02/03/2016 10:47 PM, Joe Perches wrote:

On Wed, 2016-02-03 at 22:14 +0100, Jason A. Donenfeld wrote:

The idea here is to be able to printk a sockaddr_in6, and have it show
something that looks like what the user would naturally pass to
getaddrinfo(3), which is entirely complete.

However, I could be convinced that this kind of behavior belongs in
it's own flag. Maybe I'll cook up a flag for that instead.


I think that'd be best.


Agreed.


Maybe using something like %pISG for this that
would optionally show these flow and scope values
only when non-zero.

Something like:


Looks good to me, having this as a single generic option seems
even cleaner.

Thanks!


Re: IRe: [PATCH] vsprintf: flowinfo in IPv6 is optional too

2016-02-03 Thread Daniel Borkmann

On 02/03/2016 06:56 PM, Joe Perches wrote:

On Wed, 2016-02-03 at 13:13 +0100, Jason A. Donenfeld wrote:

Signed-off-by: Jason A. Donenfeld 
---
  lib/vsprintf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 1b1b1c8..85e6645 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1189,7 +1189,7 @@ char *ip6_addr_string_sa(char *buf, char *end, const 
struct sockaddr_in6 *sa,
*p++ = ':';
p = number(p, pend, ntohs(sa->sin6_port), spec);
}
-   if (have_f) {
+   if (have_f && (sa->sin6_flowinfo & IPV6_FLOWINFO_MASK)) {
*p++ = '/';
p = number(p, pend, ntohl(sa->sin6_flowinfo &
  IPV6_FLOWINFO_MASK), spec);


Why does this matter at all?

The format string "%pIS[...]f" is not used currently in the kernel.

If one were to call out this 'f' qualifier to %pIS, wouldn't it
be better to show /0 than elide the / output completely?


+1

Another possibility, also with regard to your other patch, would be a
different format string char option instead of 'f'/'s' that would then
give a developer both options to choose from. Dunno if it's really
worth it, but if you have a use case that definitely needs it, then it'd
probably be better. Fewer surprises during debugging, at least.


Re: [PATCH] vsprintf: do not append unset Scope ID to IPv6

2016-02-03 Thread Daniel Borkmann

On 02/03/2016 11:41 AM, Jason A. Donenfeld wrote:

The sockaddr_in6 has the sin6_scope_id parameter. This contains the
netdev index of the output device. When set to 0, sin6_scope_id is
considered to be "unset" -- it has no Scope ID (see RFC4007). When it is
set to >0, it has a Scope ID.

[...]

So, here, either it has Scope ID, in which case it's returned, or it
doesn't need a Scope ID, in which case 0 - "no scope id" - is returned.

Therefore, vsprintf should treat the 0 Scope ID the same way the rest of
the v6 stack and the RFC treats it: 0 Scope ID means no Scope ID. And if
there is no Scope ID, there is nothing to be printed.


I don't see an "issue" here. I.e. if the developer requested to append 's'
to the specifier, then it's expected to dump whatever is in the scope id, even
if that is 0, meaning "unset" scope.

It follows the principle of least surprise -- I could imagine that, if it were
not dumped, a developer would check the source code to find out whether he
made a mistake or it's expected behavior, just to be sure. That's probably
what I would do.

Is this causing a _real_ issue somewhere (I couldn't parse one out of your
commit message other than 'it would be nice to have')?


So, this patch only prints the Scope ID when one exists, and does not
print a Scope ID when one does not exist.

It turns out, too, that this is what developers want. The most common
purpose of %pISpfsc is to show "what is in that sockaddr so that I can
have some idea of how to connect to it?"

Signed-off-by: Jason A. Donenfeld 
---
  lib/vsprintf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 48ff9c3..1b1b1c8 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1194,7 +1194,7 @@ char *ip6_addr_string_sa(char *buf, char *end, const 
struct sockaddr_in6 *sa,
p = number(p, pend, ntohl(sa->sin6_flowinfo &
  IPV6_FLOWINFO_MASK), spec);
}
-   if (have_s) {
+   if (have_s && sa->sin6_scope_id) {
*p++ = '%';
p = number(p, pend, sa->sin6_scope_id, spec);
}
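
As a usage note, the %pISpfsc form mentioned in the commit message is used
along these lines (illustrative sketch only):

static void show_peer(const struct sockaddr_in6 *sin6)
{
	/* dumps address, port, flowinfo and scope id; with the patch applied
	 * the scope id would only be appended when it is non-zero
	 */
	pr_debug("peer: %pISpfsc\n", sin6);
}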





Re: [PATCH] af_packet: Raw socket destruction warning fix

2016-01-25 Thread Daniel Borkmann

On 01/21/2016 12:40 PM, Maninder Singh wrote:

The other sock_put() in packet_release() drops the final ref and calls into
sk_free(), which drops the one ref on sk_wmem_alloc taken at init time. Since you
got into __sk_free() via the sock_wfree() destructor, your socket must have invoked
packet_release() prior to this (perhaps the kernel destroying the process).

What kernel do you use?


The issue is occurring on 3.10.58.


[ sorry for late reply ]

What driver are you using (is it in-tree)? Can you reproduce the same issue
with the latest -net kernel, for example (or a 'reasonably' recent one like 4.3
or 4.4)? There have been quite a few changes in err queue handling (which also
accounts rmem) as well. How reliably can you trigger the issue? Does it trigger
with a completely different in-tree network driver as well in your tests? It
would be useful to track/debug sk_rmem_alloc increases/decreases to see from
which path new rmem is being charged in the time between packet_release() and
packet_sock_destruct() for that socket ...


The driver calls dev_kfree_skb_any->dev_kfree_skb_irq,
which adds the buffer to the completion queue to be freed and raises the softirq NET_TX_SOFTIRQ:

net_tx_action->__kfree_skb->skb_release_all->skb_release_head_state->sock_wfree->
__sk_free->packet_sock_destruct

Also, purging of the receive queue has been taken care of in other protocols.


Re: net: GPF in netlink_getsockbyportid

2016-01-23 Thread Daniel Borkmann

On 01/23/2016 08:25 PM, Florian Westphal wrote:

Dmitry Vyukov  wrote:

[ CC nf-devel, not sure if its nfnetlink fault or NETLINK_MMAP ]


The following program causes GPF in netlink_getsockbyportid:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include 
#include 
#include 
#include 
#include 

int main()
{
   syscall(SYS_mmap, 0x2000ul, 0xe65000ul, 0x3ul, 0x32ul,
  0xul, 0x0ul);
   int fd = syscall(SYS_socket, 0x10ul, 0x803ul, 0xcul, 0, 0, 0);
   *(uint32_t*)0x20e64000 = (uint32_t)0x28;
   *(uint32_t*)0x20e64004 = (uint32_t)0x10;
   *(uint64_t*)0x20e64008 = (uint64_t)0x0;
   *(uint64_t*)0x20e64010 = (uint64_t)0x3;
   *(uint64_t*)0x20e64018 = (uint64_t)0xfff;
   *(uint16_t*)0x20e64020 = (uint16_t)0x5;
   syscall(SYS_write, fd, 0x20e64000ul, 0x28ul, 0, 0, 0);
   return 0;
}


CONFIG_NETLINK_MMAP and nfnetlink batching strike in unison :-/

root cause is in nfnetlink_rcv_batch():

296 replay:
297 status = 0;
298
299 skb = netlink_skb_clone(oskb, GFP_KERNEL);

The clone op doesn't copy oskb->sk, so we oops in
__netlink_alloc_skb -> netlink_getsockbyportid() when nfnetlink_rcv_batch
tries to send netlink ack.


If oskb is indeed the mmap'ed netlink skb, then it's not even allowed
to call into skb_clone(), as that would access skb shared info data which
can be controlled by the user space mmap buffer; iirc, we had that in
the past with nlmon, where skb_clone() was accidentally used.


Re: clang --target=bpf missing on f23 was: Re: [PATCH 1/2] perf test: Add libbpf relocation checker

2016-01-22 Thread Daniel Borkmann

On 01/22/2016 06:35 PM, Adam Jackson wrote:

On Fri, 2016-01-22 at 14:22 -0300, Arnaldo Carvalho de Melo wrote:


the 'bpf' target for clang is being used together with perf to
build scriptlets into object code that then gets uploaded to the kernel
via sys_bpf(), was the decision not to include 'bpf' just an accident?


I wouldn't call it a "decision", that would imply intent. The main
reason I explicitly list targets for llvm is to limit the CPU backends
to arches Fedora actually runs on (which itself is because I really
only care about llvmpipe, and am only touching llvm because it's in my
way). Had no idea there was a bpf backend, so never thought to enable
it.

llvm-3.7.0-4.fc2{3,4} are building now with the bpf backend enabled,
I'll create an update for F23 when it's built.


Awesome, thanks!


Re: [PATCH 31/33] bpf: Add __bpf_prog_run() to stacktool whitelist

2016-01-21 Thread Daniel Borkmann

On 01/21/2016 11:49 PM, Josh Poimboeuf wrote:

stacktool reports the following false positive warnings:

   stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x5c: sibling call from 
callable instruction with changed frame pointer
   stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x60: function has 
unreachable instruction
   stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x64: function has 
unreachable instruction
   [...]

It's confused by the following dynamic jump instruction in
__bpf_prog_run():

   jmp *(%r12,%rax,8)

which corresponds to the following line in the C code:

   goto *jumptable[insn->code];

There's no way for stacktool to deterministically find all possible
branch targets for a dynamic jump, so it can't verify this code.

In this case the jumps all stay within the function, and there's nothing
unusual going on related to the stack, so we can whitelist the function.

Signed-off-by: Josh Poimboeuf 
Cc: Alexei Starovoitov 
Cc: net...@vger.kernel.org


Fine by me:

Acked-by: Daniel Borkmann 
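
For readers unfamiliar with the construct, the dynamic jump stacktool stumbles
over comes from the interpreter's computed-goto dispatch. A minimal illustration
of the pattern (not the interpreter's actual jump table):

static int run(const unsigned char *ops, int len)
{
	/* gcc's labels-as-values extension: the jump target is read from a
	 * table at run time, so a static tool cannot enumerate the possible
	 * successors of the indirect jmp this compiles into
	 */
	static const void *jumptable[] = {
		[0] = &&do_nop,
		[1] = &&do_exit,
	};
	int pc = 0;

	goto *jumptable[ops[pc]];
do_nop:
	if (++pc < len)
		goto *jumptable[ops[pc]];
do_exit:
	return pc;
}

All targets stay within the function and the stack is untouched, which is why
whitelisting is safe here.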


Re: [PATCH] net: filter: make JITs zero A for SKF_AD_ALU_XOR_X

2016-01-05 Thread Daniel Borkmann

On 01/05/2016 05:03 PM, Rabin Vincent wrote:

On Tue, Jan 05, 2016 at 08:00:45AM -0800, Eric Dumazet wrote:

On Tue, 2016-01-05 at 16:23 +0100, Rabin Vincent wrote:

The SKF_AD_ALU_XOR_X ancillary is not like the other ancillary data
instructions since it XORs A with X while all the others replace A with
some loaded value.  All the BPF JITs fail to clear A if this is used as
the first instruction in a filter.


Is x86_64 part of this 'All' subset ? ;)


No, because it's an eBPF JIT.


Correct, filter conversion to eBPF clears it already.


Re: [PATCH] net: filter: make JITs zero A for SKF_AD_ALU_XOR_X

2016-01-05 Thread Daniel Borkmann

On 01/05/2016 04:23 PM, Rabin Vincent wrote:

The SKF_AD_ALU_XOR_X ancillary is not like the other ancillary data
instructions since it XORs A with X while all the others replace A with
some loaded value.  All the BPF JITs fail to clear A if this is used as
the first instruction in a filter.  This was found using american fuzzy
lop.

Add a helper to determine if A needs to be cleared given the first
instruction in a filter, and use this in the JITs.  Except for ARM, the
rest have only been compile-tested.

Fixes: 3480593131e0 ("net: filter: get rid of BPF_S_* enum")
Signed-off-by: Rabin Vincent 


Excellent catch, thanks a lot! The fix looks good to me and should
go to -net tree.

Acked-by: Daniel Borkmann 

If you're interested, feel free to add a small test case for the
SKF_AD_ALU_XOR_X issue to lib/test_bpf.c for -net-next tree. Thanks!
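
For the record, the helper the commit message describes boils down to inspecting
the first classic BPF instruction, roughly along these lines (a sketch of the
idea, not necessarily the exact code that was merged):

static bool bpf_needs_clear_a(const struct sock_filter *first)
{
	switch (first->code) {
	case BPF_RET | BPF_K:
	case BPF_LD | BPF_W | BPF_LEN:
		/* the first insn either ignores or overwrites A,
		 * no need to zero it
		 */
		return false;
	case BPF_LD | BPF_W | BPF_ABS:
	case BPF_LD | BPF_H | BPF_ABS:
	case BPF_LD | BPF_B | BPF_ABS:
		/* ancillary loads overwrite A, except for ALU_XOR_X which
		 * XORs the (otherwise uninitialized) A with X
		 */
		return first->k == SKF_AD_OFF + SKF_AD_ALU_XOR_X;
	default:
		/* be conservative for everything else */
		return true;
	}
}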


Re: [PATCH v2 3/3] bpf: hash: use per-bucket spinlock

2015-12-29 Thread Daniel Borkmann

On 12/29/2015 03:40 PM, Ming Lei wrote:

Both htab_map_update_elem() and htab_map_delete_elem() can be
called from an eBPF program, and they may be in the kernel hot path,
so it isn't efficient to use a per-hashtable lock in these two
helpers.

The per-hashtable spinlock is used for protecting bucket's
hlist, and per-bucket lock is just enough. This patch converts
the per-hashtable lock into per-bucket spinlock, so that
contention can be decreased a lot.

Signed-off-by: Ming Lei 


Looks better, thanks!

Acked-by: Daniel Borkmann 


Re: [PATCH v1 3/3] bpf: hash: use per-bucket spinlock

2015-12-28 Thread Daniel Borkmann

On 12/28/2015 01:55 PM, Ming Lei wrote:

Both htab_map_update_elem() and htab_map_delete_elem() can be
called from an eBPF program, and they may be in the kernel hot path,
so it isn't efficient to use a per-hashtable lock in these two
helpers.

The per-hashtable spinlock is used for protecting bucket's
hlist, and per-bucket lock is just enough. This patch converts
the per-hashtable lock into per-bucket spinlock, so that
contention can be decreased a lot.

Signed-off-by: Ming Lei 
---
  kernel/bpf/hashtab.c | 46 ++
  1 file changed, 30 insertions(+), 16 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d857fcb..67222a9 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -14,10 +14,14 @@
  #include 
  #include 

+struct bucket {
+   struct hlist_head head;
+   raw_spinlock_t lock;
+};
+
  struct bpf_htab {
struct bpf_map map;
-   struct hlist_head *buckets;
-   raw_spinlock_t lock;
+   struct bucket *buckets;
atomic_t count; /* number of elements in this hashtable */
u32 n_buckets;  /* number of hash buckets */
u32 elem_size;  /* size of each element in bytes */
@@ -88,24 +92,25 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
/* make sure page count doesn't overflow */
goto free_htab;


When adapting memory accounting and allocation sizes below where you replace
sizeof(struct hlist_head) with sizeof(struct bucket), is there a reason why
you don't update the overflow checks along with it?

[...]
/* prevent zero size kmalloc and check for u32 overflow */
if (htab->n_buckets == 0 ||
htab->n_buckets > U32_MAX / sizeof(struct hlist_head))
goto free_htab;

if ((u64) htab->n_buckets * sizeof(struct hlist_head) +
(u64) htab->elem_size * htab->map.max_entries >=
U32_MAX - PAGE_SIZE)
/* make sure page count doesn't overflow */
goto free_htab;
[...]
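
i.e. presumably those two checks would need to switch over to sizeof(struct
bucket) as well, roughly (sketch):

	/* prevent zero size kmalloc and check for u32 overflow */
	if (htab->n_buckets == 0 ||
	    htab->n_buckets > U32_MAX / sizeof(struct bucket))
		goto free_htab;

	if ((u64) htab->n_buckets * sizeof(struct bucket) +
	    (u64) htab->elem_size * htab->map.max_entries >=
	    U32_MAX - PAGE_SIZE)
		/* make sure page count doesn't overflow */
		goto free_htab;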


-   htab->map.pages = round_up(htab->n_buckets * sizeof(struct hlist_head) +
+   htab->map.pages = round_up(htab->n_buckets * sizeof(struct bucket) +
   htab->elem_size * htab->map.max_entries,
   PAGE_SIZE) >> PAGE_SHIFT;

err = -ENOMEM;
-   htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct 
hlist_head),
+   htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct bucket),
  GFP_USER | __GFP_NOWARN);

if (!htab->buckets) {
-   htab->buckets = vmalloc(htab->n_buckets * sizeof(struct 
hlist_head));
+   htab->buckets = vmalloc(htab->n_buckets * sizeof(struct 
bucket));
if (!htab->buckets)
goto free_htab;
}


[...]

Thanks,
Daniel


Re: [PATCH 3/3] bpf: hash: use per-bucket spinlock

2015-12-28 Thread Daniel Borkmann

On 12/26/2015 10:31 AM, Ming Lei wrote:

From: Ming Lei 

Both htab_map_update_elem() and htab_map_delete_elem() can be
called from an eBPF program, and they may be in the kernel hot path,
so it isn't efficient to use a per-hashtable lock in these two
helpers.

The per-hashtable spinlock is used just for protecting bucket's
hlist, and per-bucket lock should be enough. This patch converts
the per-hashtable lock into per-bucket spinlock, so that contention
can be decreased a lot.

Signed-off-by: Ming Lei 
---
  kernel/bpf/hashtab.c | 35 +--
  1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d857fcb..66bad7a 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -14,9 +14,14 @@
  #include 
  #include 

+struct bucket {
+   struct hlist_head head;
+   raw_spinlock_t lock;
+};
+
  struct bpf_htab {
struct bpf_map map;
-   struct hlist_head *buckets;
+   struct bucket *buckets;
raw_spinlock_t lock;


Shouldn't the lock member be removed from here?


atomic_t count; /* number of elements in this hashtable */
u32 n_buckets;  /* number of hash buckets */
@@ -88,24 +93,25 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
/* make sure page count doesn't overflow */
goto free_htab;

-   htab->map.pages = round_up(htab->n_buckets * sizeof(struct hlist_head) +
+   htab->map.pages = round_up(htab->n_buckets * sizeof(struct bucket) +
   htab->elem_size * htab->map.max_entries,
   PAGE_SIZE) >> PAGE_SHIFT;

err = -ENOMEM;
-   htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct 
hlist_head),
+   htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct bucket),
  GFP_USER | __GFP_NOWARN);

if (!htab->buckets) {
-   htab->buckets = vmalloc(htab->n_buckets * sizeof(struct 
hlist_head));
+   htab->buckets = vmalloc(htab->n_buckets * sizeof(struct 
bucket));
if (!htab->buckets)
goto free_htab;
}

-   for (i = 0; i < htab->n_buckets; i++)
-   INIT_HLIST_HEAD(&htab->buckets[i]);
+   for (i = 0; i < htab->n_buckets; i++) {
+   INIT_HLIST_HEAD(&htab->buckets[i].head);
+   raw_spin_lock_init(&htab->buckets[i].lock);
+   }

-   raw_spin_lock_init(&htab->lock);
atomic_set(&htab->count, 0);

return &htab->map;
@@ -120,11 +126,16 @@ static inline u32 htab_map_hash(const void *key, u32 
key_len)
return jhash(key, key_len, 0);
  }

-static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
  {
return &htab->buckets[hash & (htab->n_buckets - 1)];
  }

+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+   return &__select_bucket(htab, hash)->head;
+}
+
  static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
 void *key, u32 key_size)
  {
@@ -227,6 +238,7 @@ static int htab_map_update_elem(struct bpf_map *map, void 
*key, void *value,
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
struct htab_elem *l_new, *l_old;
struct hlist_head *head;
+   struct bucket *b;
unsigned long flags;
u32 key_size;
int ret;
@@ -248,7 +260,8 @@ static int htab_map_update_elem(struct bpf_map *map, void 
*key, void *value,
memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);

l_new->hash = htab_map_hash(l_new->key, key_size);
-   head = select_bucket(htab, l_new->hash);
+   b = __select_bucket(htab, l_new->hash);
+   head = &b->head;

/* bpf_map_update_elem() can be called in_irq() */
raw_spin_lock_irqsave(&htab->lock, flags);


I am a bit confused on this one, though.

The raw spin lock is still per hashtable (htab->lock) and has not been
converted in this patch to the per-bucket one (as opposed to what the commit
message says), so this patch seems to go in the right direction, but is a bit
incomplete? Shouldn't the above f.e. take b->lock, etc?
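
Concretely, once the per-hashtable lock goes away, the update path would
presumably need to serialize on the bucket it just selected, along these lines
(fragment for illustration only):

	b = __select_bucket(htab, l_new->hash);
	head = &b->head;

	/* bpf_map_update_elem() can be called in_irq() */
	raw_spin_lock_irqsave(&b->lock, flags);
	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
	/* existing insert/replace logic unchanged */
	raw_spin_unlock_irqrestore(&b->lock, flags);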


@@ -299,6 +312,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void 
*key)
  {
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
struct hlist_head *head;
+   struct bucket *b;
struct htab_elem *l;
unsigned long flags;
u32 hash, key_size;
@@ -309,7 +323,8 @@ static int htab_map_delete_elem(struct bpf_map *map, void 
*key)
key_size = map->key_size;

hash = htab_map_hash(key, key_size);
-   head = select_bucket(htab, hash);
+   b = __select_bucket(htab, hash);
+   head = &b->head;

raw_spin_lock_irqsave(&htab->

Re: [PATCH 2/3] bpf: hash: move select_bucket() out of htab's spinlock

2015-12-28 Thread Daniel Borkmann

On 12/26/2015 10:31 AM, Ming Lei wrote:

The spinlock is just used for protecting the per-bucket
hlist, so it isn't needed for selecting the bucket.

Signed-off-by: Ming Lei 


Acked-by: Daniel Borkmann 


Re: [PATCH 1/3] bpf: hash: use atomic count

2015-12-28 Thread Daniel Borkmann

On 12/26/2015 10:31 AM, Ming Lei wrote:

Preparing for removing the global per-hashtable lock, so
the counter needs to be defined as atomic_t first.

Signed-off-by: Ming Lei 


Acked-by: Daniel Borkmann 

