Re: [PATCH 1/4] sparc: bpf_jit: Use kmalloc_array() in bpf_jit_compile()
On 09/03/2016 06:36 PM, SF Markus Elfring wrote:

From: Markus Elfring
Date: Sat, 3 Sep 2016 17:10:20 +0200

A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed. Thus use
the corresponding function "kmalloc_array".

This issue was detected by using the Coccinelle software.

When you talk about an "issue", could you please explain what concrete
"issue" you were seeing?! This particular multiplication here is
guaranteed to never overflow, so at best it's a minor cleanup, if you
will. (Do you actually have a sparc to test your changes on?)

Signed-off-by: Markus Elfring
---
 arch/sparc/net/bpf_jit_comp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c
index a6d9204..ced1393 100644
--- a/arch/sparc/net/bpf_jit_comp.c
+++ b/arch/sparc/net/bpf_jit_comp.c
@@ -372,7 +372,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
 	if (!bpf_jit_enable)
 		return;
 
-	addrs = kmalloc(flen * sizeof(*addrs), GFP_KERNEL);
+	addrs = kmalloc_array(flen, sizeof(*addrs), GFP_KERNEL);
 	if (addrs == NULL)
 		return;
Re: [PATCH v2 net-next 2/6] bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
On 09/01/2016 09:44 AM, Peter Zijlstra wrote:

On Wed, Aug 31, 2016 at 02:50:39PM -0700, Alexei Starovoitov wrote:

+static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+				      int src_reg, int ctx_off,
+				      struct bpf_insn *insn_buf,
+				      struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	BUILD_BUG_ON(FIELD_SIZEOF(struct perf_sample_data, period) != sizeof(u64));
+
+	switch (ctx_off) {
+	case offsetof(struct bpf_perf_event_data, sample_period):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct bpf_perf_event_data_kern, data)),
+				      dst_reg, src_reg,
+				      offsetof(struct bpf_perf_event_data_kern, data));
+		*insn++ = BPF_LDX_MEM(BPF_DW, dst_reg, dst_reg,
+				      offsetof(struct perf_sample_data, period));
+		break;

OK, so that deals with us moving the period field in the structure, and
breaks the compile if we'd change its size or remove it outright (highly
unlikely). In that latter case we can change this code to simply return
a (u64)0 and things would continue to 'work'. Did I understand that
correctly?

Yes, if a program accesses the sample_period member of the struct
bpf_perf_event_data context, then the verifier rewrites this into the
above two loads to eventually fetch struct perf_sample_data's period
into the given target register. As you said, should the period field
change size (or get removed), compilation would break so this doesn't
go unnoticed and the code can be adapted along with it. In the
(hopefully very unlikely) case the member gets removed, it could be
replaced with loading 0 (or some other, better workaround to derive it,
if possible).
Re: [PATCH net-next 3/6] bpf: perf_event progs should only use preallocated maps
On 08/27/2016 04:31 AM, Alexei Starovoitov wrote: Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use preallocated hash maps, since doing memory allocation in overflow_handler can crash depending on where nmi got triggered. Signed-off-by: Alexei Starovoitov Acked-by: Daniel Borkmann
Re: [PATCH net-next 2/6] bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
On 08/27/2016 04:31 AM, Alexei Starovoitov wrote:

Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to HW
and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h).

The program-visible context meta structure is

 struct bpf_perf_event_data {
     struct pt_regs regs;
     __u64 sample_period;
 };

which is accessible directly from the program:

 int bpf_prog(struct bpf_perf_event_data *ctx)
 {
     ... ctx->sample_period ...
     ... ctx->regs.ip ...
 }

The bpf verifier rewrites the accesses into the kernel-internal struct
bpf_perf_event_data_kern, which allows changing struct perf_sample_data
without affecting bpf programs. New fields can be added to the end of
struct bpf_perf_event_data in the future.

Signed-off-by: Alexei Starovoitov

Two things I noticed below; otherwise, for the BPF bits:

Acked-by: Daniel Borkmann

[...]

+static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type,
+				    enum bpf_reg_type *reg_type)
+{
+	if (off < 0 || off >= sizeof(struct bpf_perf_event_data))
+		return false;
+	if (type != BPF_READ)
+		return false;
+	if (off % size != 0)
+		return false;
+	if (off == offsetof(struct bpf_perf_event_data, sample_period) &&
+	    size != sizeof(u64))
+		return false;
+	if (size != sizeof(long))
+		return false;

Wouldn't this one rather need to be:

	if (off == offsetof(struct bpf_perf_event_data, sample_period)) {
		if (size != sizeof(u64))
			return false;
	} else {
		if (size != sizeof(long))
			return false;
	}

Otherwise, on 32 bit, accessing sample_period might fail?

+	return true;
+}
+
+static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+				      int src_reg, int ctx_off,
+				      struct bpf_insn *insn_buf,
+				      struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct bpf_perf_event_data, sample_period):

Would be good to add a test as we usually have done:

	BUILD_BUG_ON(FIELD_SIZEOF(struct perf_sample_data, period) != 8);

+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct bpf_perf_event_data_kern, data)),
+				      dst_reg, src_reg,
+				      offsetof(struct bpf_perf_event_data_kern, data));
+		*insn++ = BPF_LDX_MEM(BPF_DW, dst_reg, dst_reg,
+				      offsetof(struct perf_sample_data, period));
+		break;
+	default:
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct bpf_perf_event_data_kern, regs)),
+				      dst_reg, src_reg,
+				      offsetof(struct bpf_perf_event_data_kern, regs));
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(sizeof(long)),
+				      dst_reg, dst_reg, ctx_off);
+		break;
+	}
+	return insn - insn_buf;
+}
+
Re: [PATCH net-next 1/6] bpf: support 8-byte metafield access
On 08/27/2016 04:31 AM, Alexei Starovoitov wrote: The verifier supported only 4-byte metafields in struct __sk_buff and struct xdp_md. The metafields in upcoming struct bpf_perf_event are 8-byte to match register width in struct pt_regs. Teach verifier to recognize 8-byte metafield access. The patch doesn't affect safety of sockets and xdp programs. They check for 4-byte only ctx access before these conditions are hit. Signed-off-by: Alexei Starovoitov Acked-by: Daniel Borkmann
Re: linux-next: manual merge of the net-next tree with the net tree
On 08/15/2016 02:35 AM, Stephen Rothwell wrote: Hi all, Today's linux-next merge of the net-next tree got a conflict in: kernel/bpf/verifier.c between commit: 747ea55e4f78 ("bpf: fix bpf_skb_in_cgroup helper naming") from the net tree and commit: 60d20f9195b2 ("bpf: Add bpf_current_task_under_cgroup helper") from the net-next tree. I fixed it up (see below) and can carry the fix as necessary. Looks good to me, thanks!
Re: [PATCH 1/1 linux-next] Documentation/features/core: update HAVE_BPF_JIT
On 08/12/2016 11:38 PM, Fabian Frederick wrote: Update documentation according to commit 606b5908 ("bpf: split HAVE_BPF_JIT into cBPF and eBPF variant") Signed-off-by: Fabian Frederick Thanks, Fabian! Acked-by: Daniel Borkmann
Re: [PATCH 2/2] net: sched: convert qdisc linked list to hashtable
On 08/12/2016 04:36 PM, Jiri Kosina wrote:

On Fri, 12 Aug 2016, Daniel Borkmann wrote:

I was thinking about something like the patch below (the reason being
that ->dev would be NULL only in cases of singletonish qdiscs) ...
wouldn't that also fix the issue you're seeing? Have to think it
through a little bit more ..

Ahh, so this has the same effect as previously observed with the other
fix.

Thanks a lot for confirming that this fixes the panic. I still have to
think a little bit more about this though.

Perhaps it's just a dumping issue, but with the below clsact addition,
there shouldn't be pfifo_fast instances appearing.

# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
# tc qdisc add dev wlp2s0b1 clsact
# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

Hmm, no immediate idea where those are coming from; we'll have to
figure it out. The mq device used here has 4 queues, right?

Yes, the first tc qdisc show is after boot and how it should normally
look, so 4 tx queues.

# ls /sys/class/net/wlp2s0b1/queues/
rx-0  tx-0  tx-1  tx-2  tx-3

When adding clsact, only the 'qdisc clsact' line should be extra. Given
the extra pfifo_fast ones look the same as above, I would suspect a htab
dumping issue, perhaps. I can debug a bit later tonight on this.
Re: [PATCH 2/2] net: sched: convert qdisc linked list to hashtable
On 08/12/2016 03:53 PM, Jiri Kosina wrote:

On Fri, 12 Aug 2016, Daniel Borkmann wrote:

This results in below panic. Tested reverting this patch and it fixes
the panic. Did you test this also with ingress or clsact qdisc (just
try adding it to lo dev, for example)?

Hi Daniel, thanks for the report. Hmm, I am pretty sure clsact worked
for me, but I'll recheck.

What happens is the following in qdisc_match_from_root():

[  995.422187] XXX qdisc:88025e4fc800 queue:880262759000 dev:880261cc2000 handle:
[  995.422200] XXX qdisc:81cf8100 queue:81cf8240 dev:          (null) handle:

I believe this is due to dev_ingress_queue_create() assigning the global
noop_qdisc instance as qdisc_sleeping, which qdisc_lookup() later uses
for qdisc_match_from_root(). But everything that uses things like
noop_qdisc cannot work with the new qdisc_match_from_root(), because
qdisc_dev(root) will always trigger a NULL pointer dereference there.
The reason is that dev is always NULL for noop; it's a singleton, see
noop_qdisc and noop_netdev_queue in sch_generic.c.

Now how to fix it? Creating separate noop instances each time it's set
would be quite a waste of memory. Even fuglier would be to hack a static
net device struct into sch_generic.c and let noop_netdev_queue point
there to get to the hash table. Or we just don't use qdisc_dev().

How about we actually extend a little bit the TCQ_F_BUILTIN special-case
test in qdisc_match_from_root()? After the change, the only way
qdisc_dev() could be NULL should be the TCQ_F_BUILTIN case, right?

I was thinking about something like the patch below (the reason being
that ->dev would be NULL only in cases of singletonish qdiscs) ...
wouldn't that also fix the issue you're seeing? Have to think it
through a little bit more ..

Ahh, so this has the same effect as previously observed with the other
fix. Perhaps it's just a dumping issue, but with the below clsact
addition, there shouldn't be pfifo_fast instances appearing.

# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
# tc qdisc add dev wlp2s0b1 clsact
# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 25aada7..1c9faed 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -260,6 +260,9 @@ static struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
 {
 	struct Qdisc *q;
 
+	if (!qdisc_dev(root))
+		return (root->handle == handle ? root : NULL);
+
 	if (!(root->flags & TCQ_F_BUILTIN) &&
 	    root->handle == handle)
 		return root;

Thanks!
Re: [PATCH 2/2] net: sched: convert qdisc linked list to hashtable
Hi Jiri, On 08/10/2016 11:05 AM, Jiri Kosina wrote: From: Jiri Kosina Convert the per-device linked list into a hashtable. The primary motivation for this change is that currently, we're not tracking all the qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup performed over the linked list by qdisc_match_from_root() is rather expensive. The ultimate goal is to get rid of hidden qdiscs completely, which will bring much more determinism in user experience. Reviewed-by: Cong Wang Signed-off-by: Jiri Kosina This results in below panic. Tested reverting this patch and it fixes the panic. Did you test this also with ingress or clsact qdisc (just try adding it to lo dev for example) ? What happens is the following in qdisc_match_from_root(): [ 995.422187] XXX qdisc:88025e4fc800 queue:880262759000 dev:880261cc2000 handle: [ 995.422200] XXX qdisc:81cf8100 queue:81cf8240 dev: (null) handle: I believe this is due to dev_ingress_queue_create() assigning the global noop_qdisc instance as qdisc_sleeping, which later qdisc_lookup() uses for qdisc_match_from_root(). But everything that uses things like noop_qdisc cannot work with the new qdisc_match_from_root(), because qdisc_dev(root) will always trigger NULL pointer dereference there. Reason is because the dev is always NULL for noop, it's a singleton, see noop_qdisc and noop_netdev_queue in sch_generic.c. Now how to fix it? Creating separate noop instances each time it's set would be quite a waste of memory. Even fuglier would be to hack a static net device struct into sch_generic.c and let noop_netdev_queue point there to get to the hash table. Or we just not use qdisc_dev(). I've tried the below to hand in dev pointer instead of qdisc_dev(), but I think this is not sound yet. 
Despite fixing the panic, I get something weird like:

# tc qdisc show dev wlp2s0b1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

Not yet sure whether this below, really quickly hacked patch is just
buggy (I guess so) or it's another side effect of the original patch.
If you have some cycles to take a look into fixing the panic, would be
great. Thanks!

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 25aada7..c2c9799 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -256,7 +256,8 @@ int qdisc_set_default(const char *name)
  * Note: caller either uses rtnl or rcu_read_lock()
  */
 
-static struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
+static struct Qdisc *qdisc_match_from_root(struct net_device *dev,
+					   struct Qdisc *root, u32 handle)
 {
 	struct Qdisc *q;
 
@@ -264,7 +265,7 @@ static struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
 	    root->handle == handle)
 		return root;
 
-	hash_for_each_possible_rcu(qdisc_dev(root)->qdisc_hash, q, hash, handle) {
+	hash_for_each_possible_rcu(dev->qdisc_hash, q, hash, handle) {
 		if (q->handle == handle)
 			return q;
 	}
@@ -296,12 +297,12 @@ struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
 {
 	struct Qdisc *q;
 
-	q = qdisc_match_from_root(dev->qdisc, handle);
+	q = qdisc_match_from_root(dev, dev->qdisc, handle);
 	if (q)
 		goto out;
 
 	if (dev_ingress_queue(dev))
-		q = qdisc_match_from_root(
+		q = qdisc_match_from_root(dev,
 					  dev_ingress_queue(dev)->qdisc_sleeping,
 					  handle);
 out:
@@ -1430,8 +1431,8 @@ err_out:
 	return -EINVAL;
 }
 
-static int tc_dump_qdisc_root(struct Qdisc *root, struct sk_buff *skb,
-			      struct netlink_callback *cb,
+static int tc_dump_qdisc_root(struct net_device *dev, struct Qdisc *root,
+			      struct sk_buff *skb, struct netlink_callback *cb,
 			      int *q_idx_p, int s_q_idx)
 {
 	int ret = 0, q_idx = *q_idx_p;
@@ -1451,7 +1452,7 @@ static int tc_dump_qdisc_root(struct Qdisc *root, struct sk_buff *skb,
 		goto done;
 	q_idx++;
 	}
-	hash_for_each(qdisc_dev(root)->qdisc_hash, b, q, hash) {
+	hash_for_each(dev->qdisc_hash, b, q, hash) {
 		if (q_idx < s_q_id
Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue
On 08/05/2016 12:54 AM, Andi Kleen wrote: +1, I tried to bring this up here [1] in the last spin. I think only very few changes would be needed, f.e. on eBPF side to add a queue setting helper function which is probably straight forward ~10loc patch; and with regards to actually picking it up after clsact egress, we'd need to adapt __netdev_pick_tx() slightly when CONFIG_XPS so it doesn't override it. You're proposing to rewrite the whole net policy manager as EBPF and run it in a crappy JITer? Is that a serious proposal? It just sounds crazy to me. Especially since we already have a perfectly good compiler and programming language to write system code in. EBPF is ok for temporal instrumentation (if you somehow can accept its security challenges), but using it to replace core kernel functionality (which network policy IMHO is) with some bizarre JITed setup and multiple languages doesn't really make any sense. Especially it doesn't make sense for anything with shared state, which is the core part of network policy: it negotiates with multiple users. After all we're writing Linux here and not some research toy. From what I read I guess you didn't really bother to look any deeper into this bizarre "research toy" to double check some of your claims. One of the things it's often deployed for by the way is defining policy. And the suggestion here was merely to explore existing infrastructure around things like tc and whether it already resolves at least a part of your net policy manager's requirements (like queue selection) or whether existing infrastructure can be extended with fewer complexity this way (as was mentioned with a new cls module as one option).
Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue
On 08/04/2016 10:21 PM, John Fastabend wrote:

On 16-08-04 12:36 PM, kan.li...@intel.com wrote:

From: Kan Liang

To achieve better network performance, the key step is to distribute
the packets to dedicated queues according to policy and system run time
status. This patch provides an interface which can return the proper
dedicated queue for a socket/task. The packets of the socket/task will
then be redirected to the dedicated queue for better network
performance.

For selecting the proper queue, currently it uses a round-robin
algorithm to find the available object from the given policy object
list. The algorithm is good enough for now, but it could be improved
by some adaptive algorithm later. The selected object will be stored in
a hashtable, so it does not need to go through the whole object list
every time.

Signed-off-by: Kan Liang
---
 include/linux/netpolicy.h |   5 ++
 net/core/netpolicy.c      | 136 ++
 2 files changed, 141 insertions(+)

There is a hook in the tx path now (recently added)

# ifdef CONFIG_NET_EGRESS
	if (static_key_false(&egress_needed)) {
		skb = sch_handle_egress(skb, &rc, dev);
		if (!skb)
			goto out;
	}
# endif

that allows pushing any policy you like for picking tx queues. It would
be better to use this mechanism. The hook runs 'tc' classifiers, so
either write a new ./net/sched/cls_*.c for this or just use ebpf to
stick your policy in at runtime.

I'm out of the office for a few days, but when I get back I can test
that it actually picks the selected queue in all cases. I know there
was an issue with some of the drivers using select_queue a while back.

+1, I tried to bring this up here [1] in the last spin. I think only
very few changes would be needed, f.e. on the eBPF side to add a queue
setting helper function, which is probably a straight-forward ~10 LOC
patch; and with regards to actually picking it up after clsact egress,
we'd need to adapt __netdev_pick_tx() slightly when CONFIG_XPS so it
doesn't override it.

[1] http://www.spinics.net/lists/netdev/msg386953.html
Re: [RFC 4/4] bpf: Restrict Checmate bpf programs to current kernel ABI
On 08/04/2016 11:52 AM, Daniel Borkmann wrote: On 08/04/2016 09:12 AM, Sargun Dhillon wrote: I think it makes sense to restrict Checmate to loading programs that have been compiled with the current kernel ABI. We can further stabilize the ABI, and perhaps lift this restriction later. Signed-off-by: Sargun Dhillon --- kernel/bpf/syscall.c | 2 +- samples/bpf/checmate1_kern.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 228f962..2a37b4d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -741,7 +741,7 @@ static int bpf_prog_load(union bpf_attr *attr) if (attr->insn_cnt >= BPF_MAXINSNS) return -EINVAL; -if (type == BPF_PROG_TYPE_KPROBE && +if ((type & (BPF_PROG_TYPE_KPROBE | BPF_PROG_TYPE_CHECMATE)) && attr->kern_version != LINUX_VERSION_CODE) Btw, this check is correct, program types are not masks. Sorry, I meant to write *not* correct, which was hopefully inferable from the rest. BPF_PROG_TYPE_KPROBE (== 2) and BPF_PROG_TYPE_CHECMATE (== 7) will now require every type to have a version code ... return -EINVAL;
Re: [RFC 4/4] bpf: Restrict Checmate bpf programs to current kernel ABI
On 08/04/2016 09:12 AM, Sargun Dhillon wrote:

I think it makes sense to restrict Checmate to loading programs that
have been compiled with the current kernel ABI. We can further
stabilize the ABI, and perhaps lift this restriction later.

Signed-off-by: Sargun Dhillon
---
 kernel/bpf/syscall.c         | 2 +-
 samples/bpf/checmate1_kern.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..2a37b4d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -741,7 +741,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 	if (attr->insn_cnt >= BPF_MAXINSNS)
 		return -EINVAL;
 
-	if (type == BPF_PROG_TYPE_KPROBE &&
+	if ((type & (BPF_PROG_TYPE_KPROBE | BPF_PROG_TYPE_CHECMATE)) &&
 	    attr->kern_version != LINUX_VERSION_CODE)

Btw, this check is correct, program types are not masks.
BPF_PROG_TYPE_KPROBE (== 2) and BPF_PROG_TYPE_CHECMATE (== 7) will now
require every type to have a version code ...

 		return -EINVAL;

diff --git a/samples/bpf/checmate1_kern.c b/samples/bpf/checmate1_kern.c
index f78b66b..d4ec1fa 100644
--- a/samples/bpf/checmate1_kern.c
+++ b/samples/bpf/checmate1_kern.c
@@ -3,6 +3,7 @@
 #include
 #include
 #include "bpf_helpers.h"
+#include
 
 SEC("checmate")
 int prog(struct checmate_ctx *ctx)
@@ -24,4 +25,4 @@ int prog(struct checmate_ctx *ctx)
 }
 
 char _license[] SEC("license") = "GPL";
-
+u32 _version SEC("version") = LINUX_VERSION_CODE;
Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM
Hi Sargun,

On 08/04/2016 09:11 AM, Sargun Dhillon wrote:

[...] [It's a] minor LSM. My particular use case is one in which
containers are being dynamically deployed to machines by internal
developers in a different group. [...] For many of these containers,
the security policies can be fairly nuanced. One particular one to take
into account is network security. Often times, administrators want to
prevent ingress and egress connectivity except from a few select IPs.
Egress filtering can be managed using net_cls, but without modifying
running software, it's non-trivial to attach a filter to all sockets
being created within a container. The inet_conn_request,
socket_recvmsg, socket_sock_rcv_skb hooks make this trivial to
implement.

I'm not too familiar with LSMs, but afaik, when you install such
policies, they are effectively global, right? How would you
install/manage such policies per container? On a quick glance, this
would then be the job of the BPF proglet from the global hook, no? If
yes, then the BPF contexts the BPF prog works with seem rather quite
limited ...

+struct checmate_file_open_ctx {
+	struct file *file;
+	const struct cred *cred;
+};
+
+struct checmate_task_create_ctx {
+	unsigned long clone_flags;
+};
+
+struct checmate_task_free_ctx {
+	struct task_struct *task;
+};
+
+struct checmate_socket_connect_ctx {
+	struct socket *sock;
+	struct sockaddr *address;
+	int addrlen;
+};

... or are you using bpf_probe_read() in some way to walk 'current' to
retrieve a namespace from there somehow? Like via nsproxy? But how do
you make sense of this for defining a per-container policy? Do you see
a way where we don't need to define so many different ctx each time?

My other concern from a security PoV is that when using things like
bpf_probe_read() we're dependent on kernel structs, and there's a risk
that when people migrate such policies, expectations break due to
underlying structs having changed.
I see you've addressed that in patch 4 to place a small stone in the way, yeah kinda works. It's mostly a reminder that this is not stable ABI. Thanks, Daniel
Re: [PATCH net-next v7 1/2] bpf: Add bpf_probe_write_user BPF helper to be called in tracers
On 07/25/2016 02:54 PM, Sargun Dhillon wrote: This allows user memory to be written to during the course of a kprobe. It shouldn't be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. [...] v7 looks good to me now as well, thanks a bunch Sargun!
Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)
On 07/22/2016 04:14 AM, Alexei Starovoitov wrote:

On Thu, Jul 21, 2016 at 06:09:17PM -0700, Sargun Dhillon wrote:

This allows user memory to be written to during the course of a kprobe.
It shouldn't be used to implement any kind of security mechanism
because of TOC-TOU attacks, but rather to debug, divert, and manipulate
execution of semi-cooperative processes. Although it uses
probe_kernel_write, we limit the address space the probe can write into
by checking the space with access_ok. This is so the call doesn't
sleep. Given this feature is experimental and has the risk of crashing
the system, we print a warning on invocation. It was tested with the
tracex7 program on x86-64.

Signed-off-by: Sargun Dhillon
Cc: Alexei Starovoitov
Cc: Daniel Borkmann
---
 include/uapi/linux/bpf.h  | 12
 kernel/bpf/verifier.c     |  9 +
 kernel/trace/bpf_trace.c  | 37 +
 samples/bpf/bpf_helpers.h |  2 ++
 4 files changed, 60 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2b7076f..4536282 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -365,6 +365,18 @@ enum bpf_func_id {
 	 */
 	BPF_FUNC_get_current_task,
 
+	/**
+	 * bpf_probe_write(void *dst, void *src, int len)
+	 * safely attempt to write to a location
+	 * @dst: destination address in userspace
+	 * @src: source address on stack
+	 * @len: number of bytes to copy
+	 * Return:
+	 * Returns number of bytes that could not be copied.
+	 * On success, this will be zero

that is an odd comment. there are only three possible return values:
0, -EFAULT, -EPERM

Agree.

+	 */
+	BPF_FUNC_probe_write,
+
 	__BPF_FUNC_MAX_ID,
 };

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f72f23b..6785008 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1154,6 +1154,15 @@ static int check_call(struct verifier_env *env, int func_id)
 		return -EINVAL;
 	}
 
+	if (func_id == BPF_FUNC_probe_write) {
+		pr_warn_once("\n");
+		pr_warn_once("* bpf_probe_write: Experimental Feature in use *\n");
+		pr_warn_once("* bpf_probe_write: Feature may corrupt memory  *\n");
+		pr_warn_once("\n");
+		pr_notice_ratelimited("bpf_probe_write in use by: %.16s-%d",
+				      current->comm, task_pid_nr(current));
+	}

I think a single line pr_notice_ratelimited() with 'feature may corrupt
user memory' will be enough.

Agree.

Also please move this tracing-specific part into bpf_trace.c, similar
to bpf_get_trace_printk_proto(), instead of verifier.c.

Yes, sorry for not being too clear about it; this spot will then be
called by the verifier to fetch it for every function call. Meaning
that this could get printed multiple times for loading a single
program, but I think it's okay. If single line, I'd make that
pr_warn_ratelimited(), and probably something like ...

 "note: %s[%d] is installing a program with bpf_probe_write helper that
  may corrupt user memory!", current->comm, task_pid_nr(current)

+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	void *unsafe_ptr = (void *) (long) r1;
+	void *src = (void *) (long) r2;
+	int size = (int) r3;
+	struct task_struct *task = current;
+
+	/*

bpf_trace.c follows non-net comment style, so it's good here. Just
distracting vs the rest of net style.

+	 * Ensure we're in a user context which it is safe for the helper
+	 * to run. This helper has no business in a kthread.
+	 *
+	 * access_ok should prevent writing to non-user memory, but on
+	 * some architectures (nommu, etc...) access_ok isn't enough,
+	 * so we check the current segment.
+	 */
+
+	if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD)))
+		return -EPERM;

Should we also add a check for !PF_EXITING? Like signals are not
delivered to such tasks, and I'm not sure what the state of its mm
would be.

I agree, good point. You can make that 'current->flags & (PF_KTHREAD |
PF_EXITING)' and we don't need the extra task var either.

+	if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
+		return -EPERM;
+	if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
+		return -EPERM;
+
+	return probe_kernel_write(unsafe_ptr, src, size);
+}
+
+static const struct bpf_func_proto bpf_probe_write_proto = {
+	.func		= bpf_probe_write,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
[PATCH net-next] bpf, events: fix offset in skb copy handler
This patch fixes the __output_custom() routine we currently use with bpf_skb_copy(). I missed that when len is larger than the size of the current handle, we can issue multiple invocations of copy_func, and __output_custom() advances destination but also source buffer by the written amount of bytes. When we have __output_custom(), this is actually wrong since in that case the source buffer points to a non-linear object, in our case an skb, which the copy_func helper is supposed to walk. Therefore, since this is non-linear we thus need to pass the offset into the helper, so that copy_func can use it for extracting the data from the source object. Therefore, adjust the callback signatures properly and pass offset into the skb_header_pointer() invoked from bpf_skb_copy() callback. The __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things: i) to pass in whether we should advance source buffer or not; this is a compile-time constant condition, ii) to pass in the offset for __output_custom(), which we do with help of __VA_ARGS__, so everything can stay inlined as is currently. Both changes allow for adapting the __output_* fast-path helpers w/o extra overhead. 
Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output") Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records") Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- include/linux/bpf.h| 2 +- include/linux/perf_event.h | 2 +- kernel/events/internal.h | 15 ++- net/core/filter.c | 4 ++-- 4 files changed, 14 insertions(+), 9 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 36da074..1113423 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -211,7 +211,7 @@ bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *f const struct bpf_func_proto *bpf_get_trace_printk_proto(void); typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src, - unsigned long len); + unsigned long off, unsigned long len); u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index e79e6c6..15e55b7 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -70,7 +70,7 @@ struct perf_callchain_entry_ctx { }; typedef unsigned long (*perf_copy_f)(void *dst, const void *src, -unsigned long len); +unsigned long off, unsigned long len); struct perf_raw_frag { union { diff --git a/kernel/events/internal.h b/kernel/events/internal.h index 2417eb5..486fd78 100644 --- a/kernel/events/internal.h +++ b/kernel/events/internal.h @@ -123,18 +123,19 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb) return rb->aux_nr_pages << PAGE_SHIFT; } -#define __DEFINE_OUTPUT_COPY_BODY(memcpy_func) \ +#define __DEFINE_OUTPUT_COPY_BODY(advance_buf, memcpy_func, ...) 
\ { \ unsigned long size, written;\ \ do {\ size= min(handle->size, len); \ - written = memcpy_func(handle->addr, buf, size); \ + written = memcpy_func(__VA_ARGS__); \ written = size - written; \ \ len -= written; \ handle->addr += written;\ - buf += written; \ + if (advance_buf)\ + buf += written; \ handle->size -= written;\ if (!handle->size) {\ struct ring_buffer *rb = handle->rb;\ @@ -153,12 +154,16 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb) static inline unsigned long\ func_name(struct perf_output_handle *handle, \ const void *buf, unsigned long len) \ -__DEFINE_OUTPUT_COPY_BODY(memcpy_func) +__DEFINE_OUTPUT_COPY_BODY(true, memcpy_func, handle->addr, buf, size) static inline unsigned long __output_custom(struct
Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)
On 07/21/2016 12:47 PM, Sargun Dhillon wrote: On Thu, Jul 21, 2016 at 01:00:51AM +0200, Daniel Borkmann wrote: [...] I don't really like couple of things, your ifdef CONFIG_MMU might not be needed I think, couple of these checks seem redundant, (I'm not yet sure about the task->mm != task->active_mm thingy), the helper should definitely be gpl_only and ARG_PTR_TO_RAW_STACK is just buggy. Also, this should be a bit analogue to bpf_probe_read we have. How about something roughly along the lines of below diff (obviously needs extensive testing ...)? This can still do all kind of ugly crap to the user process, but limited to the cap_sys_admin to shoot himself in the foot. * You're right about CONFIG_MMU. We don't need it, all of the nommu platforms properly deal with it from my research. The segment_eq() test should generically catch this from what I see. It was always ARG_PTR_TO_STACK? Or are you saying ARG_PTR_TO_STACK is buggy and we should make it ARG_PTR_TO_RAW_STACK? No, in your patch, you had '+.arg2_type= ARG_PTR_TO_RAW_STACK,', which is not correct as it means you don't need to initialize the memory you pass in for your *src pointer. I believe you took this over from probe_read(), but there it's correct. ARG_PTR_TO_STACK means the verifier checks that it's initialized with data. I originally named the function bpf_probe_write. Upon further thought I don't think that makes sense. The reason is because bpf_probe_read is analogous to probe_kernel_read. If we had bpf_probe_write, I think people might reason it to be equivalent to probe_kernel_write, and be confused when they can't write to kernel space. I still think that bpf_probe_write is the more appropriate name, and that the -EPERM are also better than -EINVAL. For user space, you'll have the bpf_probe_read() / bpf_probe_write() pair you can use, which is the more intuitive complement, also since people might already use bpf_probe_read(), so they kind of can infer its meaning. 
It's just that the kernel doesn't give you _permission_ to mess with kernel memory, hence, due to not being allowed, -EPERM makes this absolutely clear to the user that this is illegal. -EINVAL usually means one of the function arguments might be wrong, I think -EPERM is a better / clearer fit in this case, imho. I tried to make the external facing documentation close to copy_to_user. That's how people should use it, not like _write. Therefore I think it makes sense to keep that name. But still, you *probe* to write somewhere to the process' address space, so it can still fail with -EFAULT. Other than that, see comment above. I added a check for (!task) -- it seems to be scattered throughout the eBPF helper code. Alexei mentioned earlier that it can be NULL, but I'm not sure of the case. Well, the test of unlikely(in_interrupt() || (task->flags & PF_KTHREAD)) will cover these cases. It makes sure that you're neither in hardirq (NULL here) nor softirq and that you're not in a kthread. RE: task->mm != task->active_mm -- There are a couple of scenarios where kthreads do this, and the only user call that should hit this is execve. There's only a brief period where this can be true and I don't think it's worth dealing with that case -- I'm not really sure you can plant a kprobe at the right site either. But if kthreads do this, wouldn't this already be covered by the task->flags & PF_KTHREAD test for such a case? Did some minimal testing with tracex7 and others. Ok, great! I was able to copy memory into another process's address space while certain syscalls were going on. I don't think there is a reasonable set of protections. Right, it doesn't prevent syscalls going on in parallel. I'll do more testing. Any suggestions of what might break? I've looked at writing to uninitialized memory, memory out of bounds, etc... Do you know of any "weak spots"? Well, you could write into text/data, stack, etc. for the user process, so quite a bit. ;) Or do you mean wrt kernel space?
If someone runs some binary installing such a proglet, I think it would also make sense to print a message (rate-limited) to the klog with current->comm, task_pid_nr(current) for the process installing this, from verifier side I mean. Maybe makes more sense than the print-once from the helper side. Thanks, Daniel
Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)
On 07/20/2016 11:58 AM, Sargun Dhillon wrote: [...] So, with that, what about the following: It includes -Desupporting no-MMU platforms as we've deemed them incapable of being safe -Checking that we're not in a kthread -Checking that the active mm is the thread's mm -A log message indicating the experimental nature of this helper It does not include: -A heuristic to determine if access_ok is broken, or if the platform didn't implement it. It seems all platforms with MMUs implement it today, and it seems clear that platforms should do something better than return 1 if they can. I don't really like a couple of things, your ifdef CONFIG_MMU might not be needed I think, a couple of these checks seem redundant, (I'm not yet sure about the task->mm != task->active_mm thingy), the helper should definitely be gpl_only and ARG_PTR_TO_RAW_STACK is just buggy. Also, this should be a bit analogous to the bpf_probe_read we have. How about something roughly along the lines of below diff (obviously needs extensive testing ...)? This can still do all kinds of ugly crap to the user process, but limited to cap_sys_admin to shoot himself in the foot. include/uapi/linux/bpf.h | 3 +++ kernel/trace/bpf_trace.c | 30 ++ 2 files changed, 33 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 2b7076f..4d339c6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -365,6 +365,9 @@ enum bpf_func_id { */ BPF_FUNC_get_current_task, + /* Doc goes here ...
*/ + BPF_FUNC_probe_write, + __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index a12bbd3..43a4386c 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -81,6 +81,34 @@ static const struct bpf_func_proto bpf_probe_read_proto = { .arg3_type = ARG_ANYTHING, }; +static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct task_struct *task = current; + void *unsafe_ptr = (void *)(long) r1; + void *src = (void *)(long) r2; + int size = (int) r3; + + if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD))) + return -EPERM; + if (segment_eq(get_fs(), KERNEL_DS)) + return -EPERM; + if (!access_ok(VERIFY_WRITE, unsafe_ptr, size)) + return -EPERM; + + /* pr_warn_once() barks here ... */ + + return probe_kernel_write(unsafe_ptr, src, size); +} + +static const struct bpf_func_proto bpf_probe_write_proto = { + .func = bpf_probe_write, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_PTR_TO_STACK, + .arg3_type = ARG_CONST_STACK_SIZE, +}; + /* * limited trace_printk() * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed @@ -344,6 +372,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id) return &bpf_map_delete_elem_proto; case BPF_FUNC_probe_read: return &bpf_probe_read_proto; + case BPF_FUNC_probe_write: + return &bpf_probe_write_proto; case BPF_FUNC_ktime_get_ns: return &bpf_ktime_get_ns_proto; case BPF_FUNC_tail_call: -- 1.9.3
Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)
On 07/20/2016 05:02 AM, Alexei Starovoitov wrote: On Wed, Jul 20, 2016 at 01:19:51AM +0200, Daniel Borkmann wrote: On 07/19/2016 06:34 PM, Alexei Starovoitov wrote: On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote: + return -EINVAL; + + /* Is this a user address, or a kernel address? */ + if (!access_ok(VERIFY_WRITE, to, size)) + return -EINVAL; + + return probe_kernel_write(to, from, size); I'm still worried that this can lead to all kinds of hard-to-find bugs or races for user processes, if you make this writable to the entire user address space (which is the only thing that access_ok() checks for). What if the BPF program has bugs and writes to wrong addresses for example, introducing bugs in some other, non-intended processes elsewhere? Can this be limited to syscalls only? And if so, to the passed memory only? my understanding that above code will write only to memory of current process, so impact is contained and in that sense a buggy kprobe program is no different from a buggy seccomp program. Compared to seccomp, you might not notice that a race has happened, in the seccomp case you might have killed your process, which is visible. But ok, in the ptrace() case it might be a similar issue perhaps ... The asm-generic version does __access_ok(..) { return 1; } for the nommu case, I haven't checked closely enough whether there's actually an arch that uses this, but for example arm nommu with only one addr space would certainly result in access_ok() as 1, and then you could also go into probe_kernel_write(), no? good point. how does arm nommu handle copy_to_user? if there is nommu Should probably boil down to something similar to a plain memcpy(). then there is no user/kernel mm ? Crazy archs. I guess we have to disable this helper on all such archs.
Don't know that code well enough, but I believe the check would only ensure in normal use-cases that the user process doesn't fiddle with kernel address space, but not necessarily guarantee that this really only belongs to the process address space. why? on x86 that's exactly what it does. access_ok=true means it's a user space address and since we're in _this user context_ probe_kernel_write can only affect this user. x86 code comments this with "note that, depending on architecture, this function probably just checks that the pointer is in the user space range - after calling this function, memory access functions may still return -EFAULT". Yes. I've read that comment too :) Certainly not an expert, but on the archs I've looked at, access_ok has the same meaning as on x86. They check the space range to make sure the address doesn't belong to the kernel. Could I have missed something? Certainly. Please double check :) Also, what happens in the case of a kernel thread? my understanding is if access_ok(addr)=true the addr will never point to memory of a kernel thread. If you're coming from user context only, this should be true, it'll check whether it's some user space pointer. As it stands, it does ... if (unlikely(in_interrupt())) return -EINVAL; if (unlikely(!task || !task->pid)) return -EINVAL; So up to here, irq/sirq, NULL current and that current is not the 'idle' process is being checked (still fail to see the point of the !task->pid, I believe the intent here is different). /* Is this a user address, or a kernel address? */ if (!access_ok(VERIFY_WRITE, to, size)) return -EINVAL; Now here. What if it's a kernel thread? You'll have the KERNEL_DS segment, task->pid was non-zero as well for the kthread, so access_ok() will pass and you can still execute probe_kernel_write() ... I think user_addr_max() should be zero for a kthread, but worth checking for sure. It's 0x, I did a quick test yesterday night with creating a kthread, so access_ok() should pass for such a case.
Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)
On 07/19/2016 06:34 PM, Alexei Starovoitov wrote: On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote: + return -EINVAL; + + /* Is this a user address, or a kernel address? */ + if (!access_ok(VERIFY_WRITE, to, size)) + return -EINVAL; + + return probe_kernel_write(to, from, size); I'm still worried that this can lead to all kinds of hard-to-find bugs or races for user processes, if you make this writable to the entire user address space (which is the only thing that access_ok() checks for). What if the BPF program has bugs and writes to wrong addresses for example, introducing bugs in some other, non-intended processes elsewhere? Can this be limited to syscalls only? And if so, to the passed memory only? my understanding that above code will write only to memory of current process, so impact is contained and in that sense a buggy kprobe program is no different from a buggy seccomp program. Compared to seccomp, you might not notice that a race has happened, in the seccomp case you might have killed your process, which is visible. But ok, in the ptrace() case it might be a similar issue perhaps ... The asm-generic version does __access_ok(..) { return 1; } for the nommu case, I haven't checked closely enough whether there's actually an arch that uses this, but for example arm nommu with only one addr space would certainly result in access_ok() as 1, and then you could also go into probe_kernel_write(), no? Don't know that code well enough, but I believe the check would only ensure in normal use-cases that the user process doesn't fiddle with kernel address space, but not necessarily guarantee that this really only belongs to the process address space. x86 code comments this with "note that, depending on architecture, this function probably just checks that the pointer is in the user space range - after calling this function, memory access functions may still return -EFAULT". Also, what happens in the case of a kernel thread? As it stands, it does ...
if (unlikely(in_interrupt())) return -EINVAL; if (unlikely(!task || !task->pid)) return -EINVAL; So up to here, irq/sirq, NULL current and that current is not the 'idle' process is being checked (still fail to see the point of the !task->pid, I believe the intent here is different). /* Is this a user address, or a kernel address? */ if (!access_ok(VERIFY_WRITE, to, size)) return -EINVAL; Now here. What if it's a kernel thread? You'll have the KERNEL_DS segment, task->pid was non-zero as well for the kthread, so access_ok() will pass and you can still execute probe_kernel_write() ... Limiting this to syscalls will make it too limited. I'm in favor of this change, because it allows us to experiment with restartable sequences and lock-free algorithms that need ultrafast access to cpuid without burdening the kernel with a stable abi. Have you played around with ptrace() to check whether you could achieve similar functionality (was thinking about things like [1], PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't this be limited to a similar functionality for only the current task. ptrace() utilizes helpers like access_process_vm(), maybe this can similarly be adapted here, too (under the circumstances that sleeping is not allowed)? If we hack access_process_vm I think at the end it will look like probe_kernel_write. Walking mm requires a semaphore, so we would only be able to do it in task_work and there we can do a normal copy_to_user just as well, but it will complicate the programs quite a bit, since writes will be asynchronous and batched. Looks like with access_ok+probe_write we can achieve the same with a lot less code. I believe it may not quite be the same as it currently stands. No fundamental objection, just that this needs to be made "safe" to the limits you state above yourself.
;) As far as races between user and bpf program, yeah, if process is multithreaded, the other threads may access the same mem that bpf is writing to, but that's no different from reading. For tracing we walk complex datastructures and pointers. They can be changed by user space on the fly and bpf will see garbage. Like we have uprobe+bpf that walks clang c++ internal datastructures to figure out how long it takes clang to process .h headers. There is a lot of fragility in the bpf script, yet it's pretty useful to quickly analyze compile times. I see bpf_copy_to_user to be hugely valuable too, not as a stable interface, but rather as a tool to quickly debug and experiment. Right, so maybe there should be a warn once to the dmesg log that this is just experimental? Thanks, Daniel
Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes)
Hi Sargun, On 07/19/2016 11:32 AM, Sargun Dhillon wrote: This allows user memory to be written to during the course of a kprobe. It shouldn't be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. Although it uses probe_kernel_write, we limit the address space the probe can write into by checking the space with access_ok. This call shouldn't sleep on any architectures based on review. It was tested with the tracex7 program on x86-64. Signed-off-by: Sargun Dhillon Cc: Alexei Starovoitov [...] diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ebfbb7d..45878f3 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -81,6 +81,35 @@ static const struct bpf_func_proto bpf_probe_read_proto = { .arg3_type = ARG_ANYTHING, }; +static u64 bpf_copy_to_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + void *to = (void *) (long) r1; + void *from = (void *) (long) r2; + int size = (int) r3; Nit: you have two spaces here? + struct task_struct *task = current; + + /* check if we're in a user context */ + if (unlikely(in_interrupt())) + return -EINVAL; + if (unlikely(!task || !task->pid)) Why !task->pid is needed? Can you point to the code where this is set up as such? I'd assume if that's the case, there's a generic helper for it to determine that the task_struct is a kernel task? + return -EINVAL; + + /* Is this a user address, or a kernel address? */ + if (!access_ok(VERIFY_WRITE, to, size)) + return -EINVAL; + + return probe_kernel_write(to, from, size); I'm still worried that this can lead to all kind of hard to find bugs or races for user processes, if you make this writable to entire user address space (which is the only thing that access_ok() checks for). What if the BPF program has bugs and writes to wrong addresses for example, introducing bugs in some other, non-intended processes elsewhere? 
Can this be limited to syscalls only? And if so, to the passed memory only? Have you played around with ptrace() to check whether you could achieve similar functionality (was thinking about things like [1], PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't this be limited to a similar functionality for only the current task. ptrace() utilizes helpers like access_process_vm(), maybe this can similarly be adapted here, too (under the circumstances that sleeping is not allowed)? [1] https://code.google.com/archive/p/ptrace-parasite/ +} + +static const struct bpf_func_proto bpf_copy_to_user_proto = { + .func = bpf_copy_to_user, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_PTR_TO_RAW_STACK, + .arg3_type = ARG_CONST_STACK_SIZE, Nit: please tab-align '=' like we have in the other *_proto cases in bpf_trace. +}; + /* * limited trace_printk() * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed Thanks, Daniel
Re: [RFC PATCH 00/30] Kernel NET policy
On 07/18/2016 08:30 PM, Liang, Kan wrote: On 07/18/2016 08:55 AM, kan.li...@intel.com wrote: [...] On a higher level, why shouldn't, for example, a new cgroup in combination with tc be the one resolving these policies on resource usage? The NET policy doesn't support cgroup yet, but it's on my todo list. The granularity for the device resource is per queue. The packet will be redirected to the specific queue. I'm not sure if cgroup with tc can do that. Did you have a look at sch_mqprio, which can be used along with either the netprio cgroup or the netcls cgroup plus tc on clsact's egress side to set the priority for mqprio mappings from the application side? At least ixgbe, i40e and fm10k have offload support for it, as do a number of other NICs. You could also use cls_bpf for making the prio assignment if you also need to involve other metadata from the skb (like mark or prio derived from sockets, etc). Maybe it doesn't cover all of what you need, but it could be a start to extend upon? Thanks, Daniel
Re: [RFC PATCH 00/30] Kernel NET policy
Hi Kan, On 07/18/2016 08:55 AM, kan.li...@intel.com wrote: From: Kan Liang It is a big challenge to get good network performance. First, the network performance is not good with default system settings. Second, it is too difficult to do automatic tuning for all possible workloads, since workloads have different requirements. Some workloads may want high throughput. Some may need low latency. Last but not least, there are lots of manual configurations. Fine-grained configuration is too difficult for users. NET policy intends to simplify the network configuration and get good network performance according to the hints (policy) applied by the user. It provides some typical "policies" for the user, which can be set per-socket, per-task or per-device. The kernel will automatically figure out how to merge different requests to get good network performance. Net policy is designed for multiqueue network devices. This implementation is only for Intel NICs using the i40e driver. But the concepts and generic code should apply to other multiqueue NICs too. Net policy is also a combination of generic policy manager code and some ethtool callbacks (per queue coalesce setting, flow classification rules) to configure the driver. This series also supports CPU hotplug and device hotplug. Here are some key interfaces/APIs for NET policy. /proc/net/netpolicy/$DEV/policy User can set/get per device policy from /proc /proc/$PID/net_policy User can set/get per task policy from /proc prctl(PR_SET_NETPOLICY, POLICY_NAME, NULL, NULL, NULL) An alternative way to set/get per task policy is from prctl.
setsockopt(sockfd,SOL_SOCKET,SO_NETPOLICY,&policy,sizeof(int)) User can set/get per socket policy by setsockopt int (*ndo_netpolicy_init)(struct net_device *dev, struct netpolicy_info *info); Initialize device driver for NET policy int (*ndo_get_irq_info)(struct net_device *dev, struct netpolicy_dev_info *info); Collect device irq information int (*ndo_set_net_policy)(struct net_device *dev, enum netpolicy_name name); Configure device according to policy name netpolicy_register(struct netpolicy_reg *reg); netpolicy_unregister(struct netpolicy_reg *reg); NET policy API to register/unregister per task/socket net policy. For each task/socket, a record will be created and inserted into an RCU hash table. netpolicy_pick_queue(struct netpolicy_reg *reg, bool is_rx); NET policy API to find the proper queue for packet receiving and transmitting. netpolicy_set_rules(struct netpolicy_reg *reg, u32 queue_index, struct netpolicy_flow_spec *flow); NET policy API to add flow director rules. For using NET policy, the per-device policy must be set in advance. It will automatically configure the system and reorganize the system's resources accordingly. For system configuration, in this series, it will disable irq balance, set device queue irq affinity, and modify interrupt moderation. For resource reorganization, the current implementation forces a 1:1 mapping between CPUs and queue irqs. A 1:1 mapping group is also called a net policy object. For each device policy, it maintains a policy list. Once the device policy is applied, the objects will be inserted and tracked in that device policy list. The policy list is only updated on cpu/device hotplug, queue number changes or device policy changes. The user can use /proc, prctl and setsockopt to set per-task and per-socket net policy. Once the policy is set, a related record will be inserted into the RCU hash table. The record includes the ptr, policy and net policy object. The ptr is the pointer address of the task/socket.
The object will not be assigned until the first packet receive/transmit. The object is picked round-robin from the object list. Once the object is determined, the following packets will be redirected to that queue (object). The object can be shared. The per-task or per-socket policy can be inherited. Now NET policy supports four per device policies and three per task/socket policies. - BULK policy: This policy is designed for high throughput. It can be applied to either per device policy or per task/socket policy. - CPU policy: This policy is designed for high throughput but lower CPU utilization. It can be applied to either per device policy or per task/socket policy. - LATENCY policy: This policy is designed for low latency. It can be applied to either per device policy or per task/socket policy. - MIX policy: This policy can only be applied to per device policy. This is designed for the case in which miscellaneous types of workloads are running on the device. I'm missing a bit of discussion on the existing facilities there are under networking and why they cannot be adapted to support these kinds of hints? On a h
[PATCH net-next v2 2/3] bpf, perf: split bpf_perf_event_output
As a preparation, split the bpf_perf_event_output() helper into two parts. The new bpf_perf_event_output() will prepare the raw record itself and test for unknown flags from the BPF trace context, while __bpf_perf_event_output() does the core work. The latter will be reused later on from bpf_event_output() directly. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 35 ++- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 35ab1b2..c35883a 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -233,26 +233,17 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = { .arg2_type = ARG_ANYTHING, }; -static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +static __always_inline u64 +__bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, + u64 flags, struct perf_raw_record *raw) { - struct pt_regs *regs = (struct pt_regs *) (long) r1; - struct bpf_map *map = (struct bpf_map *) (long) r2; struct bpf_array *array = container_of(map, struct bpf_array, map); unsigned int cpu = smp_processor_id(); u64 index = flags & BPF_F_INDEX_MASK; - void *data = (void *) (long) r4; struct perf_sample_data sample_data; struct bpf_event_entry *ee; struct perf_event *event; - struct perf_raw_record raw = { - .frag = { - .size = size, - .data = data, - }, - }; - if (unlikely(flags & ~(BPF_F_INDEX_MASK))) - return -EINVAL; if (index == BPF_F_CURRENT_CPU) index = cpu; if (unlikely(index >= array->map.max_entries)) @@ -271,11 +262,29 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) return -EOPNOTSUPP; perf_sample_data_init(&sample_data, 0, 0); - sample_data.raw = &raw; + sample_data.raw = raw; perf_event_output(event, &sample_data, regs); return 0; } +static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +{ + struct pt_regs *regs = (struct pt_regs *)(long) r1; + struct 
bpf_map *map = (struct bpf_map *)(long) r2; + void *data = (void *)(long) r4; + struct perf_raw_record raw = { + .frag = { + .size = size, + .data = data, + }, + }; + + if (unlikely(flags & ~(BPF_F_INDEX_MASK))) + return -EINVAL; + + return __bpf_perf_event_output(regs, map, flags, &raw); +} + static const struct bpf_func_proto bpf_perf_event_output_proto = { .func = bpf_perf_event_output, .gpl_only = true, -- 1.9.3
[PATCH net-next v2 3/3] bpf: avoid stack copy and use skb ctx for event output
This work addresses a couple of issues bpf_skb_event_output() helper currently has: i) We need two copies instead of just a single one for the skb data when it should be part of a sample. The data can be non-linear and thus needs to be extracted via bpf_skb_load_bytes() helper first, and then copied once again into the ring buffer slot. ii) Since bpf_skb_load_bytes() currently needs to be used first, the helper needs to see a constant size on the passed stack buffer to make sure BPF verifier can do sanity checks on it during verification time. Thus, just passing skb->len (or any other non-constant value) wouldn't work, but changing bpf_skb_load_bytes() is also not the proper solution, since the two copies are generally still needed. iii) bpf_skb_load_bytes() is just for rather small buffers like headers, since they need to sit on the limited BPF stack anyway. Instead of working around in bpf_skb_load_bytes(), this work improves the bpf_skb_event_output() helper to address all 3 at once. We can make use of the passed in skb context that we have in the helper anyway, and use some of the reserved flag bits as a length argument. The helper will use the new __output_custom() facility from perf side with bpf_skb_copy() as callback helper to walk and extract the data. It will pass the data for setup to bpf_event_output(), which generates and pushes the raw record with an additional frag part. The linear data used in the first frag of the record serves as programmatically defined meta data passed along with the appended sample. 
Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- include/linux/bpf.h | 7 ++- include/uapi/linux/bpf.h | 2 ++ kernel/bpf/core.c| 6 -- kernel/trace/bpf_trace.c | 33 +++-- net/core/filter.c| 43 ++- 5 files changed, 69 insertions(+), 22 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index b3336b4..c13e92b 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -209,7 +209,12 @@ u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp); const struct bpf_func_proto *bpf_get_trace_printk_proto(void); -const struct bpf_func_proto *bpf_get_event_output_proto(void); + +typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src, + unsigned long len); + +u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, +void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy); #ifdef CONFIG_BPF_SYSCALL DECLARE_PER_CPU(int, bpf_prog_active); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 262a7e8..c4d9224 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -401,6 +401,8 @@ enum bpf_func_id { /* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */ #define BPF_F_INDEX_MASK 0xffffffffULL #define BPF_F_CURRENT_CPU BPF_F_INDEX_MASK +/* BPF_FUNC_perf_event_output for sk_buff input context. */ +#define BPF_F_CTXLEN_MASK (0xfffffULL << 32) /* user accessible mirror of in-kernel sk_buff. 
* new fields can only be added to the end of this structure diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index d638062..03fd23d 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -1054,9 +1054,11 @@ const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) return NULL; } -const struct bpf_func_proto * __weak bpf_get_event_output_proto(void) +u64 __weak +bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, +void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy) { - return NULL; + return -ENOTSUPP; } /* Always built-in helper functions. */ diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index c35883a..ebfbb7d 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -298,29 +298,26 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = { static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs); -static u64 bpf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, +void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy) { struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs); + struct perf_raw_frag frag = { + .copy = ctx_copy, + .size = ctx_size, + .data = ctx, + }; + struct perf_raw_record raw = { + .frag = { + .next = ctx_size ? &frag : NULL, + .size = meta_size, + .data = meta, + }, + }; perf_fetch_caller_regs(regs); - return bpf_perf_event_output
[PATCH net-next v2 0/3] BPF event output helper improvements
This set adds improvements to the BPF event output helper to support non-linear data sampling, here specifically, for skb context. For details please see individual patches. The set is based against net-next tree. v1 -> v2: - Integrated and adapted Peter's diff into patch 1, updated the remaining ones accordingly. Thanks Peter! Thanks a lot! Daniel Borkmann (3): perf, events: add non-linear data support for raw records bpf, perf: split bpf_perf_event_output bpf: avoid stack copy and use skb ctx for event output arch/s390/kernel/perf_cpum_sf.c | 9 -- arch/x86/events/amd/ibs.c | 8 +++-- include/linux/bpf.h | 7 - include/linux/perf_event.h | 20 - include/uapi/linux/bpf.h| 2 ++ kernel/bpf/core.c | 6 ++-- kernel/events/core.c| 66 - kernel/events/internal.h| 16 +++--- kernel/trace/bpf_trace.c| 66 +++-- net/core/filter.c | 43 ++- 10 files changed, 180 insertions(+), 63 deletions(-) -- 1.9.3
[PATCH net-next v2 1/3] perf, events: add non-linear data support for raw records
This patch adds support for non-linear data on raw records. It extends raw records to have one or multiple fragments that will be written linearly into the ring slot, where each fragment can optionally have a custom callback handler to walk and extract complex, possibly non-linear data. If a callback handler is provided for a fragment, then the new __output_custom() will be used instead of __output_copy() for the perf_output_sample() part. perf_prepare_sample() does all the size calculation only once, so perf_output_sample() doesn't need to redo the same work anymore, meaning real_size and padding will be cached in the raw record. The raw record becomes 32 bytes in size without holes; to not increase it further and to avoid doing unnecessary recalculations in the fast path, we can reuse the next pointer of the last fragment; the idea here is borrowed from ZERO_OR_NULL_PTR() and should keep the perf_output_sample() path for PERF_SAMPLE_RAW minimal. This facility is needed for BPF's event output helper as a first user that will, in a follow-up, add an additional perf_raw_frag to its perf_raw_record in order to be able to more efficiently dump skb context after linear head meta data related to it. skbs can be non-linear and thus need a custom output function to dump buffers. Currently, the skb data needs to be copied twice; with the help of __output_custom() this work only needs to be done once. Future users could be things like XDP/BPF programs that work on a different context, though, and would thus also have a different callback function. The few users of raw records are adapted to initialize their frag data from the raw record itself; no change in behavior for them. The code is based upon a PoC diff provided by Peter Zijlstra [1]. 
[1] http://thread.gmane.org/gmane.linux.network/421294 Suggested-by: Peter Zijlstra Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- Hi Peter, I've adapted the patch to your suggestion and also added the padding; all size calculation is only done once at perf_prepare_sample() time as well to avoid unnecessary work. Please let me know if you're good with this. Thanks a lot! arch/s390/kernel/perf_cpum_sf.c | 9 -- arch/x86/events/amd/ibs.c | 8 +++-- include/linux/perf_event.h | 20 - kernel/events/core.c| 66 - kernel/events/internal.h| 16 +++--- kernel/trace/bpf_trace.c| 6 ++-- 6 files changed, 93 insertions(+), 32 deletions(-) diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c index a8e8321..92619cc 100644 --- a/arch/s390/kernel/perf_cpum_sf.c +++ b/arch/s390/kernel/perf_cpum_sf.c @@ -979,12 +979,15 @@ static int perf_push_sample(struct perf_event *event, struct sf_raw_sample *sfr) struct pt_regs regs; struct perf_sf_sde_regs *sde_regs; struct perf_sample_data data; - struct perf_raw_record raw; + struct perf_raw_record raw = { + .frag = { + .size = sfr->size, + .data = sfr, + }, + }; /* Setup perf sample */ perf_sample_data_init(&data, 0, event->hw.last_period); - raw.size = sfr->size; - raw.data = sfr; data.raw = &raw; /* Setup pt_regs to look like an CPU-measurement external interrupt diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c index feb90f6..72dea2f 100644 --- a/arch/x86/events/amd/ibs.c +++ b/arch/x86/events/amd/ibs.c @@ -655,8 +655,12 @@ fail: } if (event->attr.sample_type & PERF_SAMPLE_RAW) { - raw.size = sizeof(u32) + ibs_data.size; - raw.data = ibs_data.data; + raw = (struct perf_raw_record){ + .frag = { + .size = sizeof(u32) + ibs_data.size, + .data = ibs_data.data, + }, + }; data.raw = &raw; } diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 1a827ce..e79e6c6 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -69,9 +69,22 @@ struct 
perf_callchain_entry_ctx { boolcontexts_maxed; }; +typedef unsigned long (*perf_copy_f)(void *dst, const void *src, +unsigned long len); + +struct perf_raw_frag { + union { + struct perf_raw_frag*next; + unsigned long pad; + }; + perf_copy_f copy; + void*data; + u32 size; +} __packed; + struct perf_raw_record { + struct perf_raw_fragfrag; u32 size; - void*data; }; /* @@ -1283,6 +1296,11 @@ extern voi
Re: [PATCH net-next 1/3] perf, events: add non-linear data support for raw records
On 07/13/2016 06:40 PM, Peter Zijlstra wrote: On Wed, Jul 13, 2016 at 04:08:55PM +0200, Daniel Borkmann wrote: On 07/13/2016 03:42 PM, Peter Zijlstra wrote: Ok so the nonlinear thing was it doing _two_ copies, one the regular __output_copy() on raw->data and second the optional fragment thingy using __output_custom(). Would something like this work instead? It does the nonlinear thing and the custom copy function thing but allows more than 2 fragments and allows each fragment to have a custom copy. It doesn't look obviously more expensive; it has the one ->copy branch extra, but then it doesn't recompute the sizes. Yes, that would work as well on a quick glance with diff just a bit bigger, but more generic this way. Do you want me to adapt this into the first patch? Please. One question below: - u64 zero = 0; - if (real_size - raw_size) - __output_copy(handle, &zero, real_size - raw_size); We still need the zero padding here from above with the computed raw->size, right? Ah, yes, we need some __output*() in order to advance the handle offset. We don't _need_ to copy the 0s, but I doubt __output_skip() is much cheaper for these 1-3 bytes worth of data; we've already touched that line anyway. Okay, thanks for your input! I'll respin then.
Re: [PATCH net-next 1/3] perf, events: add non-linear data support for raw records
Hi Peter, On 07/13/2016 03:42 PM, Peter Zijlstra wrote: Ok so the nonlinear thing was it doing _two_ copies, one the regular __output_copy() on raw->data and second the optional fragment thingy using __output_custom(). Would something like this work instead? It does the nonlinear thing and the custom copy function thing but allows more than 2 fragments and allows each fragment to have a custom copy. It doesn't look obviously more expensive; it has the one ->copy branch extra, but then it doesn't recompute the sizes. Yes, that would work as well on a quick glance with diff just a bit bigger, but more generic this way. Do you want me to adapt this into the first patch? One question below: diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 1fe22032f228..83e2a83e8db3 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -69,9 +69,18 @@ struct perf_callchain_entry_ctx { boolcontexts_maxed; }; +typedef unsigned long (*perf_copy_f)(void *dst, const void *src, unsigned long len); + +struct perf_raw_frag { + struct perf_raw_frag*next; + perf_copy_f copy; + void*data; + u32 size; +} __packed; + struct perf_raw_record { + struct perf_raw_fragfrag; u32 size; - void*data; }; /* diff --git a/kernel/events/core.c b/kernel/events/core.c index fe8d49a56322..f7ad7d65317d 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5617,16 +5617,21 @@ void perf_output_sample(struct perf_output_handle *handle, } if (sample_type & PERF_SAMPLE_RAW) { - if (data->raw) { - u32 raw_size = data->raw->size; - u32 real_size = round_up(raw_size + sizeof(u32), -sizeof(u64)) - sizeof(u32); - u64 zero = 0; - - perf_output_put(handle, real_size); - __output_copy(handle, data->raw->data, raw_size); - if (real_size - raw_size) - __output_copy(handle, &zero, real_size - raw_size); + struct perf_raw_record *raw = data->raw; + + if (raw) { + struct perf_raw_frag *frag = &raw->frag; + + perf_output_put(handle, raw->size); + do { + if (frag->copy) { + 
__output_custom(handle, frag->copy, + frag->data, frag->size); + } else { + __output_copy(handle, frag->data, frag->size); + } + frag = frag->next; + } while (frag); We still need the zero padding here from above with the computed raw->size, right? } else { struct { u32 size; @@ -5751,14 +5756,22 @@ void perf_prepare_sample(struct perf_event_header *header, Thanks, Daniel
Re: [PATCH net-next 1/3] perf, events: add non-linear data support for raw records
On 07/13/2016 02:10 PM, Peter Zijlstra wrote: On Wed, Jul 13, 2016 at 11:24:13AM +0200, Daniel Borkmann wrote: On 07/13/2016 09:52 AM, Peter Zijlstra wrote: On Wed, Jul 13, 2016 at 12:36:17AM +0200, Daniel Borkmann wrote: This patch adds support for non-linear data on raw records. It means that for such data, the newly introduced __output_custom() helper will be used instead of __output_copy(). __output_custom() will invoke whatever custom callback is passed in via struct perf_raw_record_frag to extract the data into the ring buffer slot. To keep changes in perf_prepare_sample() and in perf_output_sample() minimal, size/size_head split was added to perf_raw_record that call sites fill out, so that two extra tests in fast-path can be avoided. The few users of raw records are adapted to initialize their size_head and frag data; no change in behavior for them. Later patch will extend BPF side with a first user and callback for this facility, future users could be things like XDP BPF programs (that work on different context though and would thus have a different callback), etc. Why? What problem are we solving? I've tried to summarize it in patch 3/3, Which is pretty useless if you're staring at this patch. This currently has 3 issues we'd like to resolve: i) We need two copies instead of just a single one for the skb data. The data can be non-linear, see also skb_copy_bits() as an example for walking/extracting it, I'm not familiar enough with the network gunk to be able to read that. But upto skb_walk_frags() it looks entirely linear to me. 
Hm, fair enough. There are three parts: an skb can have a linear part, which is taken via skb->data, either in its entirety, or there can be a non-linear part appended to that, consisting of pages that sit in the shared info section (skb_shinfo(skb)->frags[], nr_frags members) and that will be linearized; and in addition, appended after the frags[] data, there can be further skbs chained to the 'root' skb that contain fragmented data. All of this is what skb_copy_bits() copies, linearized, into the 'to' buffer. So depending on the origin of the skb, its structure can be quite different, and skb_copy_bits() covers all the cases generically. Maybe [1] summarizes it better if you want to familiarize yourself with how skbs work, although some parts are not up to date anymore. [1] http://vger.kernel.org/~davem/skb_data.html ii) For static verification reasons, the bpf_skb_load_bytes() helper needs to see a constant size on the passed buffer to make sure the BPF verifier can do its sanity checks on it during verification time, so just passing in skb->len (or any other non-constant value) wouldn't work; changing bpf_skb_load_bytes() is also not the real solution, since we still have the two copies we'd like to avoid, and iii) bpf_skb_load_bytes() is just for rather small buffers (e.g. headers) since they need to sit on the limited eBPF stack anyway. The set would improve the BPF helper to address all three at once. Humm, maybe. Lemme go try and reverse engineer that patch, because I'm not at all sure wth it's supposed to do, nor am I entirely sure this clarified things :/
Re: [PATCH net-next 1/3] perf, events: add non-linear data support for raw records
Hi Peter, On 07/13/2016 09:52 AM, Peter Zijlstra wrote: On Wed, Jul 13, 2016 at 12:36:17AM +0200, Daniel Borkmann wrote: This patch adds support for non-linear data on raw records. It means that for such data, the newly introduced __output_custom() helper will be used instead of __output_copy(). __output_custom() will invoke whatever custom callback is passed in via struct perf_raw_record_frag to extract the data into the ring buffer slot. To keep changes in perf_prepare_sample() and in perf_output_sample() minimal, size/size_head split was added to perf_raw_record that call sites fill out, so that two extra tests in fast-path can be avoided. The few users of raw records are adapted to initialize their size_head and frag data; no change in behavior for them. Later patch will extend BPF side with a first user and callback for this facility, future users could be things like XDP BPF programs (that work on different context though and would thus have a different callback), etc. Why? What problem are we solving? I've tried to summarize it in patch 3/3, there are a number of improvements this set would provide: the current BPF_FUNC_perf_event_output helper that can be used from tc/networking programs has the limitation that if data from the skb should be part of the sample, that data first needs to be copied to eBPF stack with the bpf_skb_load_bytes() helper, and only then passed into bpf_event_output(). This currently has 3 issues we'd like to resolve: i) We need two copies instead of just a single one for the skb data. 
The data can be non-linear, see also skb_copy_bits() as an example for walking/extracting it, ii) for static verification reasons, the bpf_skb_load_bytes() helper needs to see a constant size on the passed buffer to make sure the BPF verifier can do its sanity checks on it during verification time, so just passing in skb->len (or any other non-constant value) wouldn't work; changing bpf_skb_load_bytes() is also not the real solution since we still have two copies we'd like to avoid as well, and iii) bpf_skb_load_bytes() is just for rather small buffers (e.g. headers) since they need to sit on the limited eBPF stack anyway. The set would improve the BPF helper to address all three at once. Thanks, Daniel
Re: [PATCH net-next 3/3] bpf: avoid stack copy and use skb ctx for event output
On 07/13/2016 01:25 AM, kbuild test robot wrote: Hi, [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Daniel-Borkmann/BPF-event-output-helper-improvements/20160713-065944 config: s390-allyesconfig (attached as .config) compiler: s390x-linux-gnu-gcc (Debian 5.3.1-8) 5.3.1 20160205 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=s390 All warnings (new ones prefixed by >>): kernel/trace/bpf_trace.c: In function 'bpf_perf_event_output': kernel/trace/bpf_trace.c:284:1: warning: 'bpf_perf_event_output' uses dynamic stack allocation } ^ kernel/trace/bpf_trace.c: In function 'bpf_event_output': kernel/trace/bpf_trace.c:319:1: warning: 'bpf_event_output' uses dynamic stack allocation } ^ Hmm, searching a bit on lkml, it seems these warnings on s390 are actually mostly harmless I believe [1][2] ... looks like they are there to find structs sitting on stack, for example, at least that's also what the currently existing one in the above line (bpf_trace.c +284) appears to be about. Thanks, Daniel [1] http://lkml.iu.edu/hypermail/linux/kernel/1601.2/04074.html [2] https://lkml.org/lkml/2013/6/25/42
[PATCH net-next 1/3] perf, events: add non-linear data support for raw records
This patch adds support for non-linear data on raw records. It means that for such data, the newly introduced __output_custom() helper will be used instead of __output_copy(). __output_custom() will invoke whatever custom callback is passed in via struct perf_raw_record_frag to extract the data into the ring buffer slot. To keep changes in perf_prepare_sample() and in perf_output_sample() minimal, size/size_head split was added to perf_raw_record that call sites fill out, so that two extra tests in fast-path can be avoided. The few users of raw records are adapted to initialize their size_head and frag data; no change in behavior for them. Later patch will extend BPF side with a first user and callback for this facility, future users could be things like XDP BPF programs (that work on different context though and would thus have a different callback), etc. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- arch/s390/kernel/perf_cpum_sf.c | 2 ++ arch/x86/events/amd/ibs.c | 2 ++ include/linux/perf_event.h | 8 kernel/events/core.c| 13 ++--- kernel/events/internal.h| 18 ++ kernel/trace/bpf_trace.c| 1 + 6 files changed, 37 insertions(+), 7 deletions(-) diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c index a8e8321..99c5952 100644 --- a/arch/s390/kernel/perf_cpum_sf.c +++ b/arch/s390/kernel/perf_cpum_sf.c @@ -984,7 +984,9 @@ static int perf_push_sample(struct perf_event *event, struct sf_raw_sample *sfr) /* Setup perf sample */ perf_sample_data_init(&data, 0, event->hw.last_period); raw.size = sfr->size; + raw.size_head = raw.size; raw.data = sfr; + raw.frag = NULL; data.raw = &raw; /* Setup pt_regs to look like an CPU-measurement external interrupt diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c index feb90f6..9b27dff 100644 --- a/arch/x86/events/amd/ibs.c +++ b/arch/x86/events/amd/ibs.c @@ -656,7 +656,9 @@ fail: if (event->attr.sample_type & PERF_SAMPLE_RAW) { raw.size = sizeof(u32) + ibs_data.size; + 
raw.size_head = raw.size; raw.data = ibs_data.data; + raw.frag = NULL; data.raw = &raw; } diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 1a827ce..bf08bdf 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -69,9 +69,17 @@ struct perf_callchain_entry_ctx { boolcontexts_maxed; }; +struct perf_raw_record_frag { + void*data; + unsigned long (*copy_cb)(void *dst, const void *src, +unsigned long n); +}; + struct perf_raw_record { u32 size; + u32 size_head; void*data; + struct perf_raw_record_frag *frag; }; /* diff --git a/kernel/events/core.c b/kernel/events/core.c index 9c51ec3..3e1dd7a 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5553,14 +5553,20 @@ void perf_output_sample(struct perf_output_handle *handle, } if (sample_type & PERF_SAMPLE_RAW) { - if (data->raw) { - u32 raw_size = data->raw->size; + struct perf_raw_record *raw = data->raw; + + if (raw) { + u32 raw_size = raw->size; u32 real_size = round_up(raw_size + sizeof(u32), sizeof(u64)) - sizeof(u32); u64 zero = 0; perf_output_put(handle, real_size); - __output_copy(handle, data->raw->data, raw_size); + __output_copy(handle, raw->data, raw->size_head); + if (raw->frag) + __output_custom(handle, raw->frag->copy_cb, + raw->frag->data, + raw->size - raw->size_head); if (real_size - raw_size) __output_copy(handle, &zero, real_size - raw_size); } else { @@ -7388,6 +7394,7 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size, struct perf_raw_record raw = { .size = entry_size, + .size_head = entry_size, .data = record, }; diff --git a/kernel/events/internal.h b/kernel/events/internal.h index 05f9f6d..1b08d94 100644 --- a/kernel/events/internal.h +++ b/kernel/events/internal.h @@ -123,10 +123,7 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb) return rb->
[PATCH net-next 2/3] bpf, perf: split bpf_perf_event_output
Split the bpf_perf_event_output() helper into two parts as a preparation. The new bpf_perf_event_output() will prepare the raw record itself and test for unknown flags from the BPF trace context, while __bpf_perf_event_output() does the core work and will be reused later as well. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 32 1 file changed, 20 insertions(+), 12 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 8540bd5..4d3d5b8 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -233,25 +233,17 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = { .arg2_type = ARG_ANYTHING, }; -static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +static __always_inline u64 +__bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, + u64 flags, struct perf_raw_record *raw) { - struct pt_regs *regs = (struct pt_regs *) (long) r1; - struct bpf_map *map = (struct bpf_map *) (long) r2; struct bpf_array *array = container_of(map, struct bpf_array, map); unsigned int cpu = smp_processor_id(); u64 index = flags & BPF_F_INDEX_MASK; - void *data = (void *) (long) r4; struct perf_sample_data sample_data; struct bpf_event_entry *ee; struct perf_event *event; - struct perf_raw_record raw = { - .size = size, - .size_head = size, - .data = data, - }; - if (unlikely(flags & ~(BPF_F_INDEX_MASK))) - return -EINVAL; if (index == BPF_F_CURRENT_CPU) index = cpu; if (unlikely(index >= array->map.max_entries)) @@ -270,11 +262,27 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) return -EOPNOTSUPP; perf_sample_data_init(&sample_data, 0, 0); - sample_data.raw = &raw; + sample_data.raw = raw; perf_event_output(event, &sample_data, regs); return 0; } +static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +{ + struct perf_raw_record raw = { + .size = size, + .size_head = size, + .data = (void *)(long) r4, 
+ }; + + if (unlikely(flags & ~(BPF_F_INDEX_MASK))) + return -EINVAL; + + return __bpf_perf_event_output((struct pt_regs *)(long) r1, + (struct bpf_map *)(long) r2, + flags, &raw); +} + static const struct bpf_func_proto bpf_perf_event_output_proto = { .func = bpf_perf_event_output, .gpl_only = true, -- 1.9.3
[PATCH net-next 0/3] BPF event output helper improvements
This set adds improvements to the BPF event output helper to support non-linear data sampling, here specifically, for skb context. For details please see individual patches. The set is based against net-next tree. Thanks a lot! Daniel Borkmann (3): perf, events: add non-linear data support for raw records bpf, perf: split bpf_perf_event_output bpf: avoid stack copy and use skb ctx for event output arch/s390/kernel/perf_cpum_sf.c | 2 ++ arch/x86/events/amd/ibs.c | 2 ++ include/linux/bpf.h | 5 +++- include/linux/perf_event.h | 8 ++ include/uapi/linux/bpf.h| 2 ++ kernel/bpf/core.c | 8 -- kernel/events/core.c| 13 +++-- kernel/events/internal.h| 18 +--- kernel/trace/bpf_trace.c| 64 ++--- net/core/filter.c | 43 ++- 10 files changed, 125 insertions(+), 40 deletions(-) -- 1.9.3
[PATCH net-next 3/3] bpf: avoid stack copy and use skb ctx for event output
This work improves the bpf_skb_event_output() helper in two ways: i) it avoids that users unnecessarily have to extract sampled skb data to the stack first via bpf_skb_load_bytes() and then copy it once again into the ring buffer slot, and ii) it avoids that only portions can be sampled with bpf_skb_load_bytes() due to the stack limit. Instead, we can make use of the passed in skb context that we have in the helper anyway, and use some of the reserved flag bits as a length argument. The helper will use the new __output_custom() facility from perf with bpf_skb_copy_cb() as a callback helper. It will pass the data for setup to bpf_event_output(), which generates and pushes the raw record. The linear data used in the non-frag part of the record serves as custom / programmatically defined meta data passed along with the appended sample. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- include/linux/bpf.h | 5 - include/uapi/linux/bpf.h | 2 ++ kernel/bpf/core.c| 8 ++-- kernel/trace/bpf_trace.c | 33 +++++---- net/core/filter.c| 43 ++- 5 files changed, 69 insertions(+), 22 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index b3336b4..afd64c8 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -209,8 +209,11 @@ u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp); const struct bpf_func_proto *bpf_get_trace_printk_proto(void); -const struct bpf_func_proto *bpf_get_event_output_proto(void); +u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 size_meta, +void *ctx, u64 size_ctx, +unsigned long (*ctx_copy_cb)(void *dst, const void *src, + unsigned long n)); #ifdef CONFIG_BPF_SYSCALL DECLARE_PER_CPU(int, bpf_prog_active); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 262a7e8..c4d9224 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -401,6 +401,8 @@ enum bpf_func_id { /* BPF_FUNC_perf_event_output and 
BPF_FUNC_perf_event_read flags. */ #define BPF_F_INDEX_MASK 0xULL #define BPF_F_CURRENT_CPU BPF_F_INDEX_MASK +/* BPF_FUNC_perf_event_output for sk_buff input context. */ +#define BPF_F_CTXLEN_MASK (0xfULL << 32) /* user accessible mirror of in-kernel sk_buff. * new fields can only be added to the end of this structure diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index d638062..47a7054 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -1054,9 +1054,13 @@ const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) return NULL; } -const struct bpf_func_proto * __weak bpf_get_event_output_proto(void) +u64 __weak +bpf_event_output(struct bpf_map *map, u64 flags, void *meta, +u64 size_meta, void *ctx, u64 size_ctx, +unsigned long (*ctx_copy_cb)(void *dst, const void *src, + unsigned long n)) { - return NULL; + return -ENOTSUPP; } /* Always built-in helper functions. */ diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 4d3d5b8..9c076d1 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -296,29 +296,26 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = { static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs); -static u64 bpf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 size_meta, +void *ctx, u64 size_ctx, +unsigned long (*ctx_copy_cb)(void *dst, const void *src, + unsigned long n)) { struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs); + struct perf_raw_record_frag frag = { + .data = ctx, + .copy_cb= ctx_copy_cb, + }; + struct perf_raw_record raw = { + .size = size_meta + size_ctx, + .size_head = size_meta, + .data = meta, + .frag = size_ctx ? 
&frag : NULL, + }; perf_fetch_caller_regs(regs); - return bpf_perf_event_output((long)regs, r2, flags, r4, size); -} - -static const struct bpf_func_proto bpf_event_output_proto = { - .func = bpf_event_output, - .gpl_only = true, - .ret_type = RET_INTEGER, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_CONST_MAP_PTR, - .arg3_type = ARG_ANYTHING, - .arg4_type = ARG_PTR_TO_STACK, - .arg5_type = ARG_CONST_STACK_SIZE, -}; - -const struct bpf_func_proto *bpf_get_event_output_proto(void) -{
Re: [PATCH -next] bpf: make inode code explicitly non-modular
On 07/11/2016 06:51 PM, Paul Gortmaker wrote: The Kconfig currently controlling compilation of this code is: init/Kconfig:config BPF_SYSCALL init/Kconfig: bool "Enable bpf() system call" ...meaning that it currently is not being built as a module by anyone. Lets remove the couple traces of modular infrastructure use, so that when reading the driver there is no doubt it is builtin-only. Note that MODULE_ALIAS is a no-op for non-modular code. We replace module.h with init.h since the file does use __init. Cc: Alexei Starovoitov Cc: net...@vger.kernel.org Signed-off-by: Paul Gortmaker (Patch is for net-next tree then.) Acked-by: Daniel Borkmann
Re: [PATCH net] udp: prevent bugcheck if filter truncates packet too much
On 07/09/2016 02:20 AM, Alexei Starovoitov wrote: On Sat, Jul 09, 2016 at 01:31:40AM +0200, Eric Dumazet wrote: On Fri, 2016-07-08 at 17:52 +0200, Michal Kubecek wrote: If socket filter truncates an udp packet below the length of UDP header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if kernel is configured that way) can be easily enforced by an unprivileged user which was reported as CVE-2016-6162. For a reproducer, see http://seclists.org/oss-sec/2016/q3/8 Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Marco Grassi Signed-off-by: Michal Kubecek --- net/ipv4/udp.c | 2 ++ net/ipv6/udp.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index ca5e8ea29538..4aed8fc23d32 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1583,6 +1583,8 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) if (sk_filter(sk, skb)) goto drop; + if (unlikely(skb->len < sizeof(struct udphdr))) + goto drop; udp_csum_pull_header(skb); if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) { diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 005dc82c2138..acc09705618b 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -620,6 +620,8 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) if (sk_filter(sk, skb)) goto drop; + if (unlikely(skb->len < sizeof(struct udphdr))) + goto drop; udp_csum_pull_header(skb); if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) { Arg :( Acked-by: Eric Dumazet this is incomplete fix. Please do not apply. See discussion at security@kernel Ohh well, didn't see it earlier before starting the discussion at security@... I'm okay if we take this for now as a quick band aid and find a better way how to deal with the underlying issue long-term so that it's /guaranteed/ that it doesn't bite us any further in such fragile ways.
[PATCH net] bpf, perf: delay release of BPF prog after grace period
Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved destruction of BPF program from free_event_rcu() callback to __free_event(), which is problematic if used with tail calls: if prog A is attached as trace event directly, but at the same time present in a tail call map used by another trace event program elsewhere, then we need to delay destruction via RCU grace period since it can still be in use by the program doing the tail call (the prog first needs to be dropped from the tail call map, then trace event with prog A attached destroyed, so we get immediate destruction). Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister") Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Cc: Jann Horn --- include/linux/bpf.h | 4 kernel/events/core.c | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 8269caf..0de4de6 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -264,6 +264,10 @@ static inline struct bpf_prog *bpf_prog_get(u32 ufd) static inline void bpf_prog_put(struct bpf_prog *prog) { } + +static inline void bpf_prog_put_rcu(struct bpf_prog *prog) +{ +} #endif /* CONFIG_BPF_SYSCALL */ /* verifier prototypes for helper functions called from eBPF programs */ diff --git a/kernel/events/core.c b/kernel/events/core.c index 274450e..d00c47b 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7531,7 +7531,7 @@ static void perf_event_free_bpf_prog(struct perf_event *event) prog = event->tp_event->prog; if (prog) { event->tp_event->prog = NULL; - bpf_prog_put(prog); + bpf_prog_put_rcu(prog); } } -- 1.9.3
Re: [PATCH net-next v2 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY
On 06/23/2016 11:26 PM, Martin KaFai Lau wrote: On Thu, Jun 23, 2016 at 11:42:31AM +0200, Daniel Borkmann wrote: Hi Martin, [ sorry to jump late in here, on pto currently ] Thanks for reviewing. Could you describe a bit more with regards to pinning maps and how this should interact with cgroups? The two specialized array maps we have (tail calls, perf events) have fairly complicated semantics for when to clean up map slots (see commits c9da161c6517ba1, 3b1efb196eee45b2f0c4). How is this managed with cgroups? Once a cgroup fd is placed into a map and the user removes the cgroup, will this be prevented due to 'being busy', or will the cgroup live further as long as a program is running with a cgroup map entry (but the cgroup itself is not visible from user space in any way anymore)? Having a cgroup ptr stored in the bpf_map will not stop the user from removing the cgroup (by rmdir /mnt/cgroup2/tc/test_cgrp). Right. The cgroup ptr stored in the bpf_map holds a refcnt which answers the second part. Yep, clear. The situation is similar to the netfilter use case in commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match"). I presume it's a valid use case to pin a cgroup map, put fds into it and remove the pinned file expecting to continue to match on it, right? So lifetime is really until the last prog using a cgroup map somewhere gets removed (even if not accessible from user space anymore, meaning no prog has the fd and the pinned file was removed). Yes. We are still hashing out how to set this up in production. However, the situation is similar to removing the pinned file. I presume you mean removing the last BPF program holding a reference on the cgroup array map. (Any user space visibility like struct files given from the anon inode and pinnings are tracked via uref, btw, which is needed to break possible complex dependencies among tail called programs.) But dropping the cgroup ref at latest when the last map ref is dropped as you currently do seems fine. 
It makes cgroup array maps effectively no different from plain regular array maps. We probably will not use tc and pin a bpf_map to do that. Instead, one process will set up everything (e.g. create the cgroup, populate the cgroup map, load the bpf to egress) and then go away. Yep, that seems a valid case as well, both use cases (pinned and non-pinned) should be fine with your code then. Thanks, Daniel
Re: [PATCH net-next v2 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY
On 06/23/2016 11:13 PM, Tejun Heo wrote: Hello, On Thu, Jun 23, 2016 at 11:42:31AM +0200, Daniel Borkmann wrote: I presume it's a valid use case to pin a cgroup map, put fds into it and remove the pinned file expecting to continue to match on it, right? So lifetime is really until last prog using a cgroup map somewhere gets removed (even if not accessible from user space anymore, meaning no prog has fd and pinned file was removed). Yeap, from what I can see, the cgroup will stay around (even if it gets deleted) as long as the bpf rule using it is around and that's completely fine from cgroup side. Ok, thanks for confirming! Thanks.
Re: [PATCH net-next v2 3/4] cgroup: bpf: Add bpf_skb_in_cgroup_proto
On 06/23/2016 06:54 PM, Martin KaFai Lau wrote: On Thu, Jun 23, 2016 at 11:53:50AM +0200, Daniel Borkmann wrote: diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 668e079..68753e0 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1062,6 +1062,10 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id) if (func_id != BPF_FUNC_get_stackid) goto error; break; + case BPF_MAP_TYPE_CGROUP_ARRAY: + if (func_id != BPF_FUNC_skb_in_cgroup) + goto error; + break; I think the BPF_MAP_TYPE_CGROUP_ARRAY case should have been first here in patch 2/4, but with unconditional goto error. And this one only adds the 'func_id != BPF_FUNC_skb_in_cgroup' test. I am not sure I understand. Can you elaborate? I am probably missing something here. If someone backports patch 2/4 as-is, but for some reason not 3/4, then you could craft a program that calls f.e. bpf_map_update_elem() on a cgroup array and would thus cause a NULL pointer deref, since the verifier doesn't prevent it. I'm just trying to say that it would probably make sense to add the above 'case BPF_MAP_TYPE_CGROUP_ARRAY:' with an unconditional 'goto error' in patch 2/4 and extend upon it in patch 3/4 so the result looks like it does here, so that each patch is fine/complete as a stand-alone.
Re: [PATCH net-next v2 4/4] cgroup: bpf: Add an example to do cgroup checking in BPF
On 06/22/2016 11:17 PM, Martin KaFai Lau wrote: test_cgrp2_array_pin.c: A userland program that creates a bpf_map (BPF_MAP_TYPE_CGROUP_ARRAY), populates/updates it with a cgroup2 backed fd and pins it to a bpf-fs file. The pinned file can be loaded by tc and then used by the bpf prog later. This program can also update an existing pinned array and it could be useful for debugging/testing purposes. test_cgrp2_tc_kern.c: A bpf prog which should be loaded by tc. It is to demonstrate the usage of bpf_skb_in_cgroup. test_cgrp2_tc.sh: A script that glues test_cgrp2_array_pin.c and test_cgrp2_tc_kern.c together. The idea is like: 1. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY with a cgroup fd 2. Load the test_cgrp2_tc_kern.o by tc 3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been dropped because of a match on the cgroup Most of the lines in test_cgrp2_tc.sh are boilerplate to set up the cgroup/bpf-fs/net-devices/netns...etc. It is not bulletproof on errors but should work well enough and give enough debug info if things do not go well. Signed-off-by: Martin KaFai Lau Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Tejun Heo Acked-by: Alexei Starovoitov Btw, when no bpf fs is mounted, tc will already auto-mount it. I noticed in your script, you do mount the fs manually. I guess it's okay to leave it like this, but I hope users won't wrongly copy it assuming they /have/ to mount it themselves.
Re: [PATCH net-next v2 3/4] cgroup: bpf: Add bpf_skb_in_cgroup_proto
On 06/22/2016 11:17 PM, Martin KaFai Lau wrote: Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk belongs to a descendant of a cgroup2. It is similar to the feature added in netfilter: commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match") The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY which will be used by the bpf_skb_in_cgroup. Modifications to the bpf verifier are to ensure BPF_MAP_TYPE_CGROUP_ARRAY and bpf_skb_in_cgroup() are always used together. Signed-off-by: Martin KaFai Lau Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Tejun Heo Acked-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 12 kernel/bpf/verifier.c| 8 net/core/filter.c| 40 3 files changed, 60 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index ef4e386..bad309f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -314,6 +314,18 @@ enum bpf_func_id { */ BPF_FUNC_skb_get_tunnel_opt, BPF_FUNC_skb_set_tunnel_opt, + + /** +* bpf_skb_in_cgroup(skb, map, index) - Check cgroup2 membership of skb +* @skb: pointer to skb +* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type +* @index: index of the cgroup in the bpf_map +* Return: +* == 0 skb failed the cgroup2 descendant test +* == 1 skb succeeded the cgroup2 descendant test +*< 0 error +*/ + BPF_FUNC_skb_in_cgroup, __BPF_FUNC_MAX_ID, }; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 668e079..68753e0 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1062,6 +1062,10 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id) if (func_id != BPF_FUNC_get_stackid) goto error; break; + case BPF_MAP_TYPE_CGROUP_ARRAY: + if (func_id != BPF_FUNC_skb_in_cgroup) + goto error; + break; I think the BPF_MAP_TYPE_CGROUP_ARRAY case should have been first here in patch 2/4, but with unconditional goto error. And this one only adds the 'func_id != BPF_FUNC_skb_in_cgroup' test. 
default: break; } @@ -1081,6 +1085,10 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id) if (map->map_type != BPF_MAP_TYPE_STACK_TRACE) goto error; break; + case BPF_FUNC_skb_in_cgroup: + if (map->map_type != BPF_MAP_TYPE_CGROUP_ARRAY) + goto error; + break; default: break; } diff --git a/net/core/filter.c b/net/core/filter.c index df6860c..a16f7d2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2024,6 +2024,42 @@ bpf_get_skb_set_tunnel_proto(enum bpf_func_id which) } } +#ifdef CONFIG_CGROUPS +static u64 bpf_skb_in_cgroup(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct sk_buff *skb = (struct sk_buff *)(long)r1; + struct bpf_map *map = (struct bpf_map *)(long)r2; + struct bpf_array *array = container_of(map, struct bpf_array, map); + struct cgroup *cgrp; + struct sock *sk; + u32 i = (u32)r3; + + WARN_ON_ONCE(!rcu_read_lock_held()); I think the WARN_ON_ONCE() test can be removed all-together. There are many other functions without it. We really rely on RCU read-lock being held for BPF programs (otherwise it would be horribly broken). F.e. it's kinda silly that for some map update/lookups we even have this WARN_ON_ONCE() test twice we go through in the fast-path (once from the generic eBPF helper function and then once again from the actual implementation since it could also be called from syscall). The actual invocation points are not that many and we can make sure that related call sites hold RCU read lock. Rest looks good to me, thanks. 
+ sk = skb->sk; + if (!sk || !sk_fullsock(sk)) + return -ENOENT; + + if (unlikely(i >= array->map.max_entries)) + return -E2BIG; + + cgrp = READ_ONCE(array->ptrs[i]); + if (unlikely(!cgrp)) + return -ENOENT; + + return cgroup_is_descendant(sock_cgroup_ptr(&sk->sk_cgrp_data), cgrp); +} + +static const struct bpf_func_proto bpf_skb_in_cgroup_proto = { + .func = bpf_skb_in_cgroup, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_ANYTHING, +}; +#endif + static const struct bpf_func_proto * sk_filter_func_proto(enum bpf_func_id func_id) { @@ -2086,6 +2122,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id) return &bpf_ge
Re: [PATCH net-next v2 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY
Hi Martin, [ sorry to jump late in here, on pto currently ] On 06/22/2016 11:17 PM, Martin KaFai Lau wrote: Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations. To update an element, the caller is expected to obtain a cgroup2 backed fd by open(cgroup2_dir) and then update the array with that fd. Signed-off-by: Martin KaFai Lau Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Tejun Heo Acked-by: Alexei Starovoitov Could you describe a bit more with regards to pinning maps and how this should interact with cgroups? The two specialized array maps we have (tail calls, perf events) have fairly complicated semantics for when to clean up map slots (see commits c9da161c6517ba1, 3b1efb196eee45b2f0c4). How is this managed with cgroups? Once a cgroup fd is placed into a map and the user removes the cgroup, will this be prevented due to 'being busy', or will the cgroup live further as long as a program is running with a cgroup map entry (but the cgroup itself is not visible from user space in any way anymore)? I presume it's a valid use case to pin a cgroup map, put fds into it and remove the pinned file expecting to continue to match on it, right? So lifetime is really until last prog using a cgroup map somewhere gets removed (even if not accessible from user space anymore, meaning no prog has fd and pinned file was removed). I assume that using struct file here doesn't make sense (commit e03e7ee34fdd1c3) either, right? [...] +#ifdef CONFIG_CGROUPS +static void *cgroup_fd_array_get_ptr(struct bpf_map *map, +struct file *map_file /* not used */, +int fd) +{ + return cgroup_get_from_fd(fd); +} + +static void cgroup_fd_array_put_ptr(void *ptr) +{ + /* cgroup_put free cgrp after a rcu grace period */ + cgroup_put(ptr); Yeah, as long as this respects freeing after RCU grace period, it's fine like this ... 
+} + +static void cgroup_fd_array_free(struct bpf_map *map) +{ + bpf_fd_array_map_clear(map); + fd_array_map_free(map); +} + +static const struct bpf_map_ops cgroup_array_ops = { + .map_alloc = fd_array_map_alloc, + .map_free = cgroup_fd_array_free, + .map_get_next_key = array_map_get_next_key, + .map_lookup_elem = fd_array_map_lookup_elem, + .map_delete_elem = fd_array_map_delete_elem, + .map_fd_get_ptr = cgroup_fd_array_get_ptr, + .map_fd_put_ptr = cgroup_fd_array_put_ptr, +}; + +static struct bpf_map_type_list cgroup_array_type __read_mostly = { + .ops = &cgroup_array_ops, + .type = BPF_MAP_TYPE_CGROUP_ARRAY, +}; + +static int __init register_cgroup_array_map(void) +{ + bpf_register_map_type(&cgroup_array_type); + return 0; +} +late_initcall(register_cgroup_array_map); +#endif diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index c23a4e93..cac13f1 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -393,7 +393,8 @@ static int map_update_elem(union bpf_attr *attr) } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { err = bpf_percpu_array_update(map, key, value, attr->flags); } else if (map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || - map->map_type == BPF_MAP_TYPE_PROG_ARRAY) { + map->map_type == BPF_MAP_TYPE_PROG_ARRAY || + map->map_type == BPF_MAP_TYPE_CGROUP_ARRAY) { rcu_read_lock(); err = bpf_fd_array_map_update_elem(map, f.file, key, value, attr->flags);
Re: [PATCH 0/4] net-next: mediatek: IRQ cleanups, fixes and grouping
On 06/16/2016 11:44 AM, John Crispin wrote: On 16/06/2016 07:20, David Miller wrote: From: John Crispin Date: Wed, 15 Jun 2016 16:58:46 +0200 This series contains 2 small code cleanups that are leftovers from the MIPS support. There is also a small fix that adds proper locking to the code accessing the IRQ registers. Without this fix we saw deadlocks caused by the last patch of the series, which adds IRQ grouping. The grouping feature allows us to use different IRQs for TX and RX. By doing so we can use affinity to let the SoC handle the IRQs on different cores. This patch series doesn't apply cleanly to the net-next tree, I get rejects on patch #4. it depends on the series with the 11 fixes that i sent last week which is however in the net tree and not the next tree (i also still had the DQL hack in my tree). if i resend this at the start of the next rc1 would you merge it into the net tree so that it becomes part of v4.8 ? Well, it would rather require a merge of -net into -net-next then. But that would be useful indeed. Cheers, Daniel
Re: [PATCH net-next 1/3] arm64: bpf: implement bpf_tail_call() helper
On 06/06/2016 06:56 AM, Z Lim wrote: [...] How about the attached patch? Fixes compilation error on build !CONFIG_BPF_SYSCALL. Also, should this patch be sent to net or net-next (along with this series)? Looks good, feel free to add: Acked-by: Daniel Borkmann I think net-next along with your series should be fine since the issue first appeared there. Thanks, Zi!
Re: [PATCH net-next 1/3] arm64: bpf: implement bpf_tail_call() helper
On 06/05/2016 01:46 AM, kbuild test robot wrote: Hi, [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Zi-Shen-Lim/arm64-bpf-implement-bpf_tail_call-helper/20160605-060435 config: arm64-defconfig (attached as .config) compiler: aarch64-linux-gnu-gcc (Debian 5.3.1-8) 5.3.1 20160205 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=arm64 All errors (new ones prefixed by >>): In file included from arch/arm64/net/bpf_jit_comp.c:21:0: include/linux/bpf.h: In function 'bpf_prog_get': include/linux/bpf.h:235:9: error: implicit declaration of function 'ERR_PTR' [-Werror=implicit-function-declaration] return ERR_PTR(-EOPNOTSUPP); ^ include/linux/bpf.h:235:9: warning: return makes pointer from integer without a cast [-Wint-conversion] In file included from include/linux/rwsem.h:17:0, from include/linux/mm_types.h:10, from include/linux/sched.h:27, from arch/arm64/include/asm/compat.h:25, from arch/arm64/include/asm/stat.h:23, from include/linux/stat.h:5, from include/linux/compat.h:12, from include/linux/filter.h:10, from arch/arm64/net/bpf_jit_comp.c:22: include/linux/err.h: At top level: include/linux/err.h:23:35: error: conflicting types for 'ERR_PTR' static inline void * __must_check ERR_PTR(long error) ^ In file included from arch/arm64/net/bpf_jit_comp.c:21:0: include/linux/bpf.h:235:9: note: previous implicit declaration of 'ERR_PTR' was here return ERR_PTR(-EOPNOTSUPP); ^ cc1: some warnings being treated as errors Looks like including linux/bpf.h at the very beginning causes issues when bpf syscall is disabled. We should probably just include linux/err.h from bpf.h.
Re: linux-next: manual merge of the net-next tree with the arm64 tree
On 05/17/2016 03:38 PM, Catalin Marinas wrote: On Tue, May 17, 2016 at 09:12:34AM +0200, Daniel Borkmann wrote: On 05/17/2016 09:03 AM, Geert Uytterhoeven wrote: [...] Someone's not gonna be happy with commit 606b5908 ("bpf: split HAVE_BPF_JIT into cBPF and eBPF variant") breaking the sort order again... Wasn't aware of that. Maybe I'm missing something, but there appears to be no throughout consistent ordering ... [...] select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP select HAVE_RCU_TABLE_FREE select HAVE_SYSCALL_TRACEPOINTS select IOMMU_DMA if IOMMU_SUPPORT select IRQ_DOMAIN select IRQ_FORCED_THREADING [...] select RTC_LIB select SPARSE_IRQ select SYSCTL_EXCEPTION_TRACE select HAVE_CONTEXT_TRACKING select HAVE_ARM_SMCCC [...] We keep fixing them as we merge other stuff. For example, latest mainline has commit 8ee708792e1c ("arm64: Kconfig: remove redundant HAVE_ARCH_TRANSPARENT_HUGEPAGE definition") which also fixes up the Kconfig order. Understood, thanks for the clarification (and sorry for the sort order issue).
Re: [PATCH v2 net-next] bpf: arm64: remove callee-save registers use for tmp registers
On 05/17/2016 01:36 AM, Yang Shi wrote: In the current implementation of the ARM64 eBPF JIT, R23 and R24 are used for tmp registers, which are callee-saved registers. This leads to a variable size of JIT prologue and epilogue. The latest blinding constant change prefers a constant size of prologue and epilogue. AAPCS reserves R9 ~ R15 for temp registers which need not be saved/restored during function calls. So, replace R23 and R24 with R10 and R11, and remove the tmp_used flag to save 2 instructions for some jited BPF programs. CC: Daniel Borkmann Acked-by: Zi Shen Lim Signed-off-by: Yang Shi LGTM, thanks! Acked-by: Daniel Borkmann
Re: linux-next: manual merge of the net-next tree with the arm64 tree
On 05/17/2016 09:03 AM, Geert Uytterhoeven wrote: [...] Someone's not gonna be happy with commit 606b5908 ("bpf: split HAVE_BPF_JIT into cBPF and eBPF variant") breaking the sort order again... Wasn't aware of that. Maybe I'm missing something, but there appears to be no throughout consistent ordering ... [...] select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP select HAVE_RCU_TABLE_FREE select HAVE_SYSCALL_TRACEPOINTS select IOMMU_DMA if IOMMU_SUPPORT select IRQ_DOMAIN select IRQ_FORCED_THREADING [...] select RTC_LIB select SPARSE_IRQ select SYSCTL_EXCEPTION_TRACE select HAVE_CONTEXT_TRACKING select HAVE_ARM_SMCCC [...]
Re: linux-next: manual merge of the net-next tree with the arm64 tree
On 05/17/2016 02:24 AM, Stephen Rothwell wrote: Hi all, Today's linux-next merge of the net-next tree got a conflict in: arch/arm64/Kconfig between commit: 8ee708792e1c ("arm64: Kconfig: remove redundant HAVE_ARCH_TRANSPARENT_HUGEPAGE definition") from the arm64 tree and commit: 606b5908 ("bpf: split HAVE_BPF_JIT into cBPF and eBPF variant") from the net-next tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. Diff looks good, thanks!
Re: [PATCH] tools: bpf_jit_disasm: check for klogctl failure
On 05/06/2016 12:39 AM, Colin King wrote: From: Colin Ian King klogctl can fail and return -ve len, so check for this and return NULL to avoid passing a (size_t)-1 to malloc. Signed-off-by: Colin Ian King [ would be nice to get Cc'ed in future ... ] Acked-by: Daniel Borkmann
[PATCH net-next v2 1/2] bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output
Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has per-CPU ring buffers and the eBPF program pushes the data into the current CPU's ring buffer which saves us an extra helper function call in eBPF. Also, make sure to properly reserve the remaining flags which are not used. Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 4 kernel/trace/bpf_trace.c | 7 ++- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 70eda5a..b7b0fb1 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -347,6 +347,10 @@ enum bpf_func_id { #define BPF_F_ZERO_CSUM_TX (1ULL << 1) #define BPF_F_DONT_FRAGMENT(1ULL << 2) +/* BPF_FUNC_perf_event_output flags. */ +#define BPF_F_INDEX_MASK 0xULL +#define BPF_F_CURRENT_CPU BPF_F_INDEX_MASK + /* user accessible mirror of in-kernel sk_buff. * new fields can only be added to the end of this structure */ diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 6855878..6bfe55c 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -225,11 +225,12 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = { .arg2_type = ARG_ANYTHING, }; -static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size) +static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) { struct pt_regs *regs = (struct pt_regs *) (long) r1; struct bpf_map *map = (struct bpf_map *) (long) r2; struct bpf_array *array = container_of(map, struct bpf_array, map); + u64 index = flags & BPF_F_INDEX_MASK; void *data = (void *) (long) r4; struct perf_sample_data sample_data; struct perf_event *event; @@ -239,6 +240,10 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size) .data = data, }; + if (unlikely(flags & ~(BPF_F_INDEX_MASK))) + return -EINVAL; + if (index == BPF_F_CURRENT_CPU) + index = raw_smp_processor_id(); if (unlikely(index >= 
array->map.max_entries)) return -E2BIG; -- 1.9.3
[PATCH net-next v2 2/2] bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events to user space applications. For networking, this can be f.e. for sampling, debugging, logging purposes or pushing of arbitrary wake-up events. The idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example"). The eBPF program utilizes a perf event array map that user space populates with fds from perf_event_open(), the eBPF program calls into the helper f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw)) so that the raw data is pushed into the fd f.e. at the map index of the current CPU. User space can poll/mmap/etc on this and has a data channel for receiving events that can be post-processed. The nice thing is that since the eBPF program and user space application making use of it are tightly coupled, they can define their own arbitrary raw data format and what/when they want to push. While f.e. packet headers could be one part of the meta data that is being pushed, this is not a substitute for things like packet sockets as whole packet is not being pushed and push is only done in a single direction. Intention is more of a generically usable, efficient event pipe to applications. Workflow is that tc can pin the map and applications can attach themselves e.g. after cls/act setup to one or multiple map slots, demuxing is done by the eBPF program. Adding this facility is with minimal effort, it reuses the helper introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and we get its functionality for free by overloading its BPF_FUNC_ identifier for cls/act programs, ctx is currently unused, but will be made use of in future. Example will be added to iproute2's BPF example files. 
Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 2 ++ kernel/bpf/core.c| 7 +++ kernel/trace/bpf_trace.c | 27 +++ net/core/filter.c| 2 ++ 4 files changed, 38 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 5fb3c61..f63afdc 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -169,7 +169,9 @@ u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5); u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); void bpf_fd_array_map_clear(struct bpf_map *map); bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp); + const struct bpf_func_proto *bpf_get_trace_printk_proto(void); +const struct bpf_func_proto *bpf_get_event_output_proto(void); #ifdef CONFIG_BPF_SYSCALL DECLARE_PER_CPU(int, bpf_prog_active); diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index be0abf6..e4248fe 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -764,14 +764,21 @@ const struct bpf_func_proto bpf_map_delete_elem_proto __weak; const struct bpf_func_proto bpf_get_prandom_u32_proto __weak; const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak; const struct bpf_func_proto bpf_ktime_get_ns_proto __weak; + const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak; const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak; const struct bpf_func_proto bpf_get_current_comm_proto __weak; + const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) { return NULL; } +const struct bpf_func_proto * __weak bpf_get_event_output_proto(void) +{ + return NULL; +} + /* Always built-in helper functions. 
*/ const struct bpf_func_proto bpf_tail_call_proto = { .func = NULL, diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 6bfe55c..75cb154 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -277,6 +277,33 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = { .arg5_type = ARG_CONST_STACK_SIZE, }; +static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs); + +static u64 bpf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +{ + struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs); + + perf_fetch_caller_regs(regs); + + return bpf_perf_event_output((long)regs, r2, flags, r4, size); +} + +static const struct bpf_func_proto bpf_event_output_proto = { + .func = bpf_event_output, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_STACK, + .arg5_type = ARG_CONST_STACK_SIZE, +}; + +const struct bpf_func_proto *bpf_get_event_output_proto(void) +{ + return &bpf_event_output_proto; +} + static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id) { switch (func_id) { diff --git a/net/core/filter.c b/net/core/filter.c index 5d2ac2b..218e5d
[PATCH net-next v2 0/2] BPF updates
This minor set adds a new helper bpf_event_output() for eBPF cls/act program types which allows to pass events to user space applications. For details, please see individual patches. Thanks! v1 -> v2: - Address kbuild bot found compile issue in patch 2 - Rest as is Daniel Borkmann (2): bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output bpf: add event output helper for notifications/sampling/logging include/linux/bpf.h | 2 ++ include/uapi/linux/bpf.h | 4 kernel/bpf/core.c| 7 +++ kernel/trace/bpf_trace.c | 34 +- net/core/filter.c| 2 ++ 5 files changed, 48 insertions(+), 1 deletion(-) -- 1.9.3
Re: [PATCH net-next 2/2] bpf: add event output helper for notifications/sampling/logging
On 04/18/2016 01:55 AM, kbuild test robot wrote: Hi Daniel, [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Daniel-Borkmann/bpf-trace-add-BPF_F_CURRENT_CPU-flag-for-bpf_perf_event_output/20160418-063147 config: m68k-allyesconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=m68k All errors (new ones prefixed by >>): net/core/filter.c: In function 'bpf_skb_event_output': net/core/filter.c:1599:2: error: implicit declaration of function 'perf_fetch_caller_regs' [-Werror=implicit-function-declaration] perf_fetch_caller_regs(regs); ^ cc1: some warnings being treated as errors vim +/perf_fetch_caller_regs +1599 net/core/filter.c 1593 static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs); 1594 1595 static u64 bpf_skb_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) 1596 { 1597 struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs); 1598 1599perf_fetch_caller_regs(regs); 1600 1601 return bpf_perf_event_output((long)regs, r2, flags, r4, size); 1602 } Sorry about that, missed this one. Will fix it up in v2! Thanks, Daniel
[PATCH net-next 0/2] BPF updates
This minor set adds a new helper bpf_skb_event_output() for eBPF cls/act program types which allows to pass events to user space applications. For details, please see individual patches. Thanks! Daniel Borkmann (2): bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output bpf: add event output helper for notifications/sampling/logging include/linux/bpf.h | 2 ++ include/uapi/linux/bpf.h | 4 kernel/trace/bpf_trace.c | 7 ++- net/core/filter.c| 30 ++ 4 files changed, 42 insertions(+), 1 deletion(-) -- 1.9.3
[PATCH net-next 1/2] bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output
Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has per-CPU ring buffers and the eBPF program pushes the data into the current CPU's ring buffer, which saves us an extra helper function call in eBPF. Also, make sure to properly reserve the remaining flags which are not used.

Signed-off-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
---
 include/uapi/linux/bpf.h | 4
 kernel/trace/bpf_trace.c | 7 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70eda5a..b7b0fb1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -347,6 +347,10 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX	(1ULL << 1)
 #define BPF_F_DONT_FRAGMENT	(1ULL << 2)

+/* BPF_FUNC_perf_event_output flags. */
+#define BPF_F_INDEX_MASK	0xULL
+#define BPF_F_CURRENT_CPU	BPF_F_INDEX_MASK
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6855878..6bfe55c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -225,11 +225,12 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };

-static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
+static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
 {
 	struct pt_regs *regs = (struct pt_regs *) (long) r1;
 	struct bpf_map *map = (struct bpf_map *) (long) r2;
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	u64 index = flags & BPF_F_INDEX_MASK;
 	void *data = (void *) (long) r4;
 	struct perf_sample_data sample_data;
 	struct perf_event *event;
@@ -239,6 +240,10 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
 		.data = data,
 	};

+	if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
+		return -EINVAL;
+	if (index == BPF_F_CURRENT_CPU)
+		index = raw_smp_processor_id();
 	if (unlikely(index >= array->map.max_entries))
 		return -E2BIG;
--
1.9.3
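The flag handling above can be modeled in plain userspace C. This is a sketch of the index-selection logic only, not the kernel code: `select_index()` is a hypothetical helper, the mask value is assumed to cover the low 32 bits, and `current_cpu` stands in for `raw_smp_processor_id()`.

```c
#include <stdint.h>

/* Assumed here: the index mask covers the low 32 bits of the flags
 * argument, and an all-ones index means "use the current CPU". */
#define BPF_F_INDEX_MASK	0xffffffffULL
#define BPF_F_CURRENT_CPU	BPF_F_INDEX_MASK

static long select_index(uint64_t flags, uint32_t current_cpu,
			 uint32_t max_entries)
{
	uint64_t index = flags & BPF_F_INDEX_MASK;

	if (flags & ~BPF_F_INDEX_MASK)
		return -22;		/* -EINVAL: reserved flag bits set */
	if (index == BPF_F_CURRENT_CPU)
		index = current_cpu;	/* raw_smp_processor_id() stand-in */
	if (index >= max_entries)
		return -7;		/* -E2BIG: beyond the map bounds */
	return (long)index;
}
```

Reserving the upper flag bits now is what keeps them available for future extensions without breaking existing programs.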
[PATCH net-next 2/2] bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events to user space applications. For networking, this can be f.e. for sampling, debugging, logging purposes or pushing of arbitrary wake-up events. The idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example").

The eBPF program utilizes a perf event array map that user space populates with fds from perf_event_open(); the eBPF program calls into the helper f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw)) so that the raw data is pushed into the fd f.e. at the map index of the current CPU. User space can poll/mmap/etc on this and thus has a data channel for receiving events that can be post-processed.

The nice thing is that since the eBPF program and the user space application making use of it are tightly coupled, they can define their own arbitrary raw data format and what/when they want to push. While f.e. packet headers could be one part of the meta data that is being pushed, this is not a substitute for things like packet sockets, as the whole packet is not being pushed and the push is only done in a single direction. The intention is more of a generically usable, efficient event pipe to applications.

The workflow is that tc can pin the map and applications can attach themselves e.g. after cls/act setup to one or multiple map slots; demuxing is done by the eBPF program. Adding this facility takes minimal effort: it reuses the helper introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and we get its functionality for free by overloading its BPF_FUNC_ identifier for cls/act programs. ctx is currently unused, but will be made use of in the future. An example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 2 ++ kernel/trace/bpf_trace.c | 2 +- net/core/filter.c| 30 ++ 3 files changed, 33 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 5fb3c61..2116924 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -167,6 +167,8 @@ struct bpf_array { u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5); u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); +u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size); + void bpf_fd_array_map_clear(struct bpf_map *map); bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp); const struct bpf_func_proto *bpf_get_trace_printk_proto(void); diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 6bfe55c..7fe2f22 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -225,7 +225,7 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = { .arg2_type = ARG_ANYTHING, }; -static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) { struct pt_regs *regs = (struct pt_regs *) (long) r1; struct bpf_map *map = (struct bpf_map *) (long) r2; diff --git a/net/core/filter.c b/net/core/filter.c index 5d2ac2b..3521252 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include #include @@ -1584,6 +1585,33 @@ static const struct bpf_func_proto bpf_csum_diff_proto = { .arg5_type = ARG_ANYTHING, }; +u64 __weak bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +{ + return -EACCES; +} + +static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs); + +static u64 bpf_skb_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size) +{ + struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs); + + perf_fetch_caller_regs(regs); + + return bpf_perf_event_output((long)regs, 
r2, flags, r4, size); +} + +static const struct bpf_func_proto bpf_skb_event_output_proto = { + .func = bpf_skb_event_output, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_STACK, + .arg5_type = ARG_CONST_STACK_SIZE, +}; + static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 flags, u64 r4, u64 r5) { struct sk_buff *skb = (struct sk_buff *) (long) r1, *skb2; @@ -2039,6 +2067,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id) return &bpf_redirect_proto; case BPF_FUNC_get_route_realm: return &bpf_get_route_realm_proto; + case BPF_FUNC_perf_event_output: + return &bpf_skb_event_output_proto; default: return sk_filter_func_proto(func_id); } -- 1.9.3
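The `__weak bpf_perf_event_output()` stub in the patch is a classic weak-symbol fallback: net/core/filter.c compiles even without the tracing code, and the strong definition wins at link time when tracing is built in. A minimal userspace sketch of that linker behavior, with `event_output()` as a stand-in name:

```c
/* Weak default: used only if no strong definition of event_output()
 * exists in any other linked translation unit. In the kernel patch,
 * the strong definition lives in kernel/trace/bpf_trace.c. */
long __attribute__((weak)) event_output(long ctx)
{
	(void)ctx;
	return -13;	/* -EACCES: helper not available in this build */
}
```

With only this translation unit linked, calls resolve to the weak stub and return the error code; linking a second file that defines a strong `event_output()` would silently override it.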
Re: [PATCH] sctp: Fix error handling for switch statement case in the function sctp_cmd_interpreter
On 04/05/2016 11:36 PM, Bastien Philbert wrote: This fixes error handling for the switch statement case SCTP_CMD_SEND_PKT by making the error value of the call to sctp_packet_transmit equal the variable error due to this function being able to fail with an error code. In

What actual issue have you observed that you fix?

addition allow the call to sctp_ootb_pkt_free afterwards to free up the no longer in use sctp packet even if the call to the function sctp_packet_transmit fails in order to avoid a memory leak here for not freeing the sctp

Not sure how this relates to your code?

Signed-off-by: Bastien Philbert
---
 net/sctp/sm_sideeffect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index 7fe56d0..f3a8b58 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -1434,7 +1434,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
 		case SCTP_CMD_SEND_PKT:
 			/* Send a full packet to our peer. */
 			packet = cmd->obj.packet;
-			sctp_packet_transmit(packet, gfp);
+			error = sctp_packet_transmit(packet, gfp);
 			sctp_ootb_pkt_free(packet);
 			break;
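The pattern the patch aims for can be shown in isolation: record the transmit return value, but free the packet on both the success and error paths so a failure cannot leak it. This is a hypothetical userspace model, with `transmit` standing in for sctp_packet_transmit() and `free()` for sctp_ootb_pkt_free():

```c
#include <stdlib.h>

static int fake_transmit_ok(void *pkt)   { (void)pkt; return 0; }
static int fake_transmit_fail(void *pkt) { (void)pkt; return -5; }

/* Capture the error code, then free unconditionally. */
static int transmit_and_free(int (*transmit)(void *), void *pkt)
{
	int error = transmit(pkt);	/* may fail; keep the code */

	free(pkt);			/* freed on both paths */
	return error;
}
```

The reviewer's question stands, though: capturing the error only matters if the surrounding switch statement actually acts on `error` afterwards.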
Re: [PATCH net 4/4] lib/test_bpf: Add additional BPF_ADD tests
On 04/05/2016 12:02 PM, Naveen N. Rao wrote: Some of these tests proved useful with the powerpc eBPF JIT port due to sign-extended 16-bit immediate loads. Though some of these aspects get covered in other tests, it is better to have explicit tests so as to quickly tag the precise problem. Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Cc: Michael Ellerman Cc: Paul Mackerras Signed-off-by: Naveen N. Rao Thanks for adding these! Acked-by: Daniel Borkmann
Re: [PATCH net 3/4] lib/test_bpf: Add test to check for result of 32-bit add that overflows
On 04/05/2016 12:02 PM, Naveen N. Rao wrote: BPF_ALU32 and BPF_ALU64 tests for adding two 32-bit values that result in 32-bit overflow. Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Cc: Michael Ellerman Cc: Paul Mackerras Signed-off-by: Naveen N. Rao Acked-by: Daniel Borkmann
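What these tests pin down can be stated in plain C: a 32-bit add wraps modulo 2^32 (and, in eBPF, the result is zero-extended into the upper half of the 64-bit register), while a 64-bit add carries into bit 32 — exactly the distinction a JIT with sign-extended 16-bit immediates can get wrong. A sketch of the two semantics:

```c
#include <stdint.h>

/* BPF_ALU32 add: truncate to 32 bits, then zero-extend. */
static uint64_t alu32_add(uint32_t a, uint32_t b)
{
	return (uint64_t)(uint32_t)(a + b);
}

/* BPF_ALU64 add: full 64-bit arithmetic, carry survives. */
static uint64_t alu64_add(uint64_t a, uint64_t b)
{
	return a + b;
}
```

For 0x80000000 + 0x80000000, the ALU32 form must yield 0 while the ALU64 form must yield 0x100000000; a JIT that mixes the two fails one of the tests.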
Re: [PATCH net 2/4] lib/test_bpf: Add tests for unsigned BPF_JGT
On 04/05/2016 12:02 PM, Naveen N. Rao wrote: Unsigned Jump-if-Greater-Than. Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Cc: Michael Ellerman Cc: Paul Mackerras Signed-off-by: Naveen N. Rao Acked-by: Daniel Borkmann
Re: [PATCH net 1/4] lib/test_bpf: Fix JMP_JSET tests
On 04/05/2016 12:02 PM, Naveen N. Rao wrote: JMP_JSET tests incorrectly used BPF_JNE. Fix the same. Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Cc: Michael Ellerman Cc: Paul Mackerras Signed-off-by: Naveen N. Rao Acked-by: Daniel Borkmann --- lib/test_bpf.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/test_bpf.c b/lib/test_bpf.c index 27a7a26..e76fa4d 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -4303,7 +4303,7 @@ static struct bpf_test tests[] = { .u.insns_int = { BPF_ALU32_IMM(BPF_MOV, R0, 0), BPF_LD_IMM64(R1, 3), - BPF_JMP_IMM(BPF_JNE, R1, 2, 1), + BPF_JMP_IMM(BPF_JSET, R1, 2, 1), BPF_EXIT_INSN(), BPF_ALU32_IMM(BPF_MOV, R0, 1), BPF_EXIT_INSN(), @@ -4317,7 +4317,7 @@ static struct bpf_test tests[] = { .u.insns_int = { BPF_ALU32_IMM(BPF_MOV, R0, 0), BPF_LD_IMM64(R1, 3), - BPF_JMP_IMM(BPF_JNE, R1, 0x, 1), + BPF_JMP_IMM(BPF_JSET, R1, 0x, 1), BPF_EXIT_INSN(), BPF_ALU32_IMM(BPF_MOV, R0, 1), BPF_EXIT_INSN(), @@ -4474,7 +4474,7 @@ static struct bpf_test tests[] = { BPF_ALU32_IMM(BPF_MOV, R0, 0), BPF_LD_IMM64(R1, 3), BPF_LD_IMM64(R2, 2), - BPF_JMP_REG(BPF_JNE, R1, R2, 1), + BPF_JMP_REG(BPF_JSET, R1, R2, 1), BPF_EXIT_INSN(), BPF_ALU32_IMM(BPF_MOV, R0, 1), BPF_EXIT_INSN(), @@ -4489,7 +4489,7 @@ static struct bpf_test tests[] = { BPF_ALU32_IMM(BPF_MOV, R0, 0), BPF_LD_IMM64(R1, 3), BPF_LD_IMM64(R2, 0x), - BPF_JMP_REG(BPF_JNE, R1, R2, 1), + BPF_JMP_REG(BPF_JSET, R1, R2, 1), BPF_EXIT_INSN(), BPF_ALU32_IMM(BPF_MOV, R0, 1), BPF_EXIT_INSN(),
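The fixed opcodes differ in their branch condition, which is why the bug went unnoticed: BPF_JSET is a bit-test (branch if `dst & src` is non-zero), while BPF_JNE branches if `dst != src`. With the test's operands dst = 3 and src = 2 both conditions happen to be true, so the JNE-based tests passed anyway. A plain-C model of the two conditions:

```c
#include <stdbool.h>
#include <stdint.h>

/* BPF_JSET: taken when the operands share at least one set bit. */
static bool jset_taken(uint64_t dst, uint64_t src)
{
	return (dst & src) != 0;
}

/* BPF_JNE: taken when the operands differ. */
static bool jne_taken(uint64_t dst, uint64_t src)
{
	return dst != src;
}
```

The two only diverge for operands like dst = 4, src = 2, where JNE branches but JSET does not — values the original tests never used.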
Re: [RFC PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF
On 04/01/2016 08:10 PM, Alexei Starovoitov wrote: On 4/1/16 2:58 AM, Naveen N. Rao wrote: PPC64 eBPF JIT compiler. Works for both ABIv1 and ABIv2. Enable with: echo 1 > /proc/sys/net/core/bpf_jit_enable or echo 2 > /proc/sys/net/core/bpf_jit_enable ... to see the generated JIT code. This can further be processed with tools/net/bpf_jit_disasm. With CONFIG_TEST_BPF=m and 'modprobe test_bpf': test_bpf: Summary: 291 PASSED, 0 FAILED, [234/283 JIT'ed] ... on both ppc64 BE and LE. The details of the approach are documented through various comments in the code, as are the TODOs. Some of the prominent TODOs include implementing BPF tail calls and skb loads.

Cc: Matt Evans Cc: Michael Ellerman Cc: Paul Mackerras Cc: Alexei Starovoitov Cc: "David S. Miller" Cc: Ananth N Mavinakayanahalli Signed-off-by: Naveen N. Rao
---
 arch/powerpc/include/asm/ppc-opcode.h | 19 +-
 arch/powerpc/net/Makefile | 4 +
 arch/powerpc/net/bpf_jit.h | 66 ++-
 arch/powerpc/net/bpf_jit64.h | 58 +++
 arch/powerpc/net/bpf_jit_comp64.c | 828 ++
 5 files changed, 973 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/net/bpf_jit64.h
 create mode 100644 arch/powerpc/net/bpf_jit_comp64.c
...
-#ifdef CONFIG_PPC64
+#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2)

impressive stuff!

+1, awesome to see another one! Everything nicely documented. Could you add a few words for the above condition as well? Or maybe a new macro, since it occurs many times?

What do these _CALL_ELF == 2 and != 2 conditions mean? ppc ABIs? Will there ever be v3?

Minor TODO would also be to convert to use bpf_jit_binary_alloc() and bpf_jit_binary_free() API for the image, which is done by other eBPF jits, too. So far most of the bpf jits were going via net-next tree, but if in this case no changes to the core are necessary then I guess it's fine to do it via powerpc tree. What's your plan?
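For context on the `_CALL_ELF` question: on powerpc64, the compiler predefines `_CALL_ELF == 2` for the ELFv2 ABI (the default on little-endian), while its absence or any other value means the older ELFv1 ABI with function descriptors. The named macro Daniel suggests could look like the sketch below — `PPC64_ELF_ABI_v1` is an illustrative name, not something from the posted patch:

```c
/* _CALL_ELF == 2 -> ELFv2 ABI; otherwise (or undefined) -> ELFv1,
 * which uses function descriptors and needs TOC setup at call sites. */
#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2)
#define PPC64_ELF_ABI_v1 1
#endif

static int uses_function_descriptors(void)
{
#ifdef PPC64_ELF_ABI_v1
	return 1;	/* ELFv1: entry goes through a function descriptor */
#else
	return 0;	/* ELFv2, or not a ppc64 kernel build at all */
#endif
}
```

Centralizing the condition in one macro means the many JIT call-site checks stay readable and a hypothetical future ABI revision only needs one change.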
Re: [PATCH 2/4] samples/bpf: Use llc in PATH, rather than a hardcoded value
On 03/31/2016 07:46 PM, Alexei Starovoitov wrote: On 3/31/16 4:25 AM, Naveen N. Rao wrote: While at it, fix some typos in the comment. Cc: Alexei Starovoitov Cc: David S. Miller Cc: Ananth N Mavinakayanahalli Cc: Michael Ellerman Signed-off-by: Naveen N. Rao --- samples/bpf/Makefile | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 502c9fc..88bc5a0 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -76,16 +76,13 @@ HOSTLOADLIBES_offwaketime += -lelf HOSTLOADLIBES_spintest += -lelf HOSTLOADLIBES_map_perf_test += -lelf -lrt -# point this to your LLVM backend with bpf support -LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc - -# asm/sysreg.h inline assmbly used by it is incompatible with llvm. -# But, ehere is not easy way to fix it, so just exclude it since it is +# asm/sysreg.h - inline assembly used by it is incompatible with llvm. +# But, there is no easy way to fix it, so just exclude it since it is # useless for BPF samples. $(obj)/%.o: $(src)/%.c clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ --O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@ +-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=obj -o $@ clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \ -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \ --O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s +-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=asm -o $@.s that was a workaround when clang/llvm didn't have bpf support. Now clang 3.7 and 3.8 have bpf built-in, so make sense to remove manual calls to llc completely. Just use 'clang -target bpf -O2 -D... -c $< -o $@' +1, the clang part in that Makefile should also more correctly be called with '-target bpf' as it turns out (despite llc with '-march=bpf' ...). Better to use clang directly as suggested by Alexei.
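Following Alexei's suggestion to the end, the two-stage clang-to-llc pipe can collapse into a single clang invocation. A sketch of what the Makefile rule might become, assuming clang >= 3.7 with the built-in bpf target — this is illustrative, not the committed rule:

```make
# Let clang emit a BPF object directly instead of piping LLVM IR
# through llc; '-target bpf' selects the built-in BPF backend.
$(obj)/%.o: $(src)/%.c
	clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
		-D__KERNEL__ -D__ASM_SYSREG_H \
		-Wno-unused-value -Wno-pointer-sign \
		-O2 -target bpf -c $< -o $@
```

Besides dropping the llc dependency, `-target bpf` also makes clang itself aware of the target, which Daniel notes is more correct than compiling for the host and only switching targets at the llc stage.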
Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!
On 03/30/2016 02:24 PM, Michal Kubecek wrote: On Wed, Mar 30, 2016 at 01:33:44PM +0200, Daniel Borkmann wrote: On 03/30/2016 11:42 AM, Michal Kubecek wrote: I'm just not sure checking if we hold the right lock depending on caller is worth the extra complexity. After all, what is really needed is to hold _some_ lock guaranteeing sk_attach_prog() and sk_detach_filter() are safe, so just changing the condition in both to sock_owned_by_user(sk) || lockdep_rtnl_is_held()

It would certainly silence it, but would be less accurate in terms of lock proving as opposed to the diff above. E.g. rtnl could be held elsewhere, while someone attaches a socket filter w/o having locked the socket (currently not the case, but it would kind of defeat the purpose of rcu_dereference_protected() here). Was thinking about using an extra socket flag to indicate it's externally managed, but it's not really worth wasting sk's flags bit space just for this corner case.

Originally my reasoning was that to actually hide a locking issue from lockdep, this would have to happen every time we get down into the function, which is unlikely. But thinking about it again, this code path is not so frequent and the fuzzers tend to do strange things, so it could really happen. In this case actually nothing too fancy, it just seems that filters on tap devices might not really be used by anyone (the issue is already a couple of years old). Sasha/Jiri, could you test the patch with your testcases? I received it corrupted (strange leading whitespaces) so I better add a cleaned-up version below.

Tested this yesterday night on my machine with PROVE_RCU + PROVE_RCU_REPEATEDLY enabled, and it can easily be triggered with a simple ioctl(tun_fd, TUN{ATTACH,DETACH}FILTER, ...) on a tap device, and the patch now silences it. Sorry for the white space damage (should have just attached it), I'll send it later today. Thanks, Daniel
Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!
On 03/30/2016 11:42 AM, Michal Kubecek wrote: On Tue, Mar 29, 2016 at 04:39:43PM +0200, Daniel Borkmann wrote: diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950617c3..7417d7c20bab 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) static void tun_detach_filter(struct tun_struct *tun, int n) { int i; -struct tun_file *tfile; for (i = 0; i < n; i++) { -tfile = rtnl_dereference(tun->tfiles[i]); -sk_detach_filter(tfile->socket.sk); +struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk; + +lock_sock(sk); +sk_detach_filter(sk); +release_sock(sk); } tun->filter_attached = false; In tun case, the control path for tun_attach_filter() and tun_detach_filter() is under RTNL lock (held in __tun_chr_ioctl()). So in the BPF core the rcu_dereference_protected(, sock_owned_by_user(sk)) looks like a false positive in this specific use case to me, that we should probably just silence. Running the filter via sk_filter() in tun device happens under rcu_read_lock(), so the dereference and assignment pair seems okay to me. Was wondering whether we should convert this to unattached BPF filter, but this would break with existing expectations from sk_filter() (e.g. security modules). 
If we want to silence it, could be something like the below (only compile-tested): drivers/net/tun.c | 8 +--- include/linux/filter.h | 4 net/core/filter.c | 33 + 3 files changed, 30 insertions(+), 15 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950..510e90a 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -622,7 +622,8 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte /* Re-attach the filter to persist device */ if (!skip_filter && (tun->filter_attached == true)) { - err = sk_attach_filter(&tun->fprog, tfile->socket.sk); + err = __sk_attach_filter(&tun->fprog, tfile->socket.sk, +lockdep_rtnl_is_held()); if (!err) goto out; } @@ -1822,7 +1823,7 @@ static void tun_detach_filter(struct tun_struct *tun, int n) for (i = 0; i < n; i++) { tfile = rtnl_dereference(tun->tfiles[i]); - sk_detach_filter(tfile->socket.sk); + __sk_detach_filter(tfile->socket.sk, lockdep_rtnl_is_held()); } tun->filter_attached = false; @@ -1835,7 +1836,8 @@ static int tun_attach_filter(struct tun_struct *tun) for (i = 0; i < tun->numqueues; i++) { tfile = rtnl_dereference(tun->tfiles[i]); - ret = sk_attach_filter(&tun->fprog, tfile->socket.sk); + ret = __sk_attach_filter(&tun->fprog, tfile->socket.sk, +lockdep_rtnl_is_held()); if (ret) { tun_detach_filter(tun, i); return ret; diff --git a/include/linux/filter.h b/include/linux/filter.h index 43aa1f8..a51a536 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -465,10 +465,14 @@ int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog, void bpf_prog_destroy(struct bpf_prog *fp); int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk); +int __sk_attach_filter(struct sock_fprog *fprog, struct sock *sk, + bool locked); int sk_attach_bpf(u32 ufd, struct sock *sk); int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk); int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk); int sk_detach_filter(struct sock *sk); +int 
__sk_detach_filter(struct sock *sk, bool locked); + int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned int len); diff --git a/net/core/filter.c b/net/core/filter.c index 2429918..02f2f6c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1149,7 +1149,8 @@ void bpf_prog_destroy(struct bpf_prog *fp) } EXPORT_SYMBOL_GPL(bpf_prog_destroy); -static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk) +static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk, + bool locked) { struct sk_filter *fp, *old_fp; @@ -1165,10 +1166,8 @@ static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk) return -ENOMEM; } - old_fp = rcu_dereference_protected(sk->sk_filter, - sock_owned_by_user(sk)); + old_fp = rcu_dereference_protected(sk->sk_filter, locked); rcu_assign_pointer(sk->sk_filter, fp); - if (old_fp) sk_filt
Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!
On 03/29/2016 03:55 PM, Daniel Borkmann wrote: [ dropping my old email address ] On 03/29/2016 02:58 PM, Michal Kubecek wrote: On Mon, Feb 22, 2016 at 10:31:33AM -0500, Sasha Levin wrote: I've hit the following warning while fuzzing with trinity inside a kvmtool guest running the latest -next kernel: [ 1343.104588] === [ 1343.104591] [ INFO: suspicious RCU usage. ] [ 1343.104619] 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 Not tainted [ 1343.104624] --- [ 1343.104635] net/core/filter.c:2115 suspicious rcu_dereference_protected() usage! [ 1343.104641] [ 1343.104641] other info that might help us debug this: [ 1343.104641] [ 1343.104650] [ 1343.104650] rcu_scheduler_active = 1, debug_locks = 0 [ 1343.104660] 1 lock held by syz-executor/17916: [ 1343.104784] #0: (rtnl_mutex){+.+.+.}, at: rtnl_lock (net/core/rtnetlink.c:71) [ 1343.104789] [ 1343.104789] stack backtrace: [ 1343.104820] CPU: 1 PID: 17916 Comm: trinity-c8 Not tainted 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 [ 1343.104868] 110036968f44 8801b4b47aa8 a23d9a9d 0001 [ 1343.104891] fbfff5c2a630 41b58ab3 adb3a8f2 a23d9905 [ 1343.104914] 8801b5419b40 fbfff7596522 0001 [ 1343.104919] Call Trace: [ 1343.104985] dump_stack (lib/dump_stack.c:53) [ 1343.105060] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4282) [ 1343.105093] sk_detach_filter (net/core/filter.c:2114 (discriminator 5)) [ 1343.105193] tun_detach_filter (drivers/net/tun.c:1808 (discriminator 7)) [ 1343.105238] __tun_chr_ioctl (drivers/net/tun.c:2133) [ 1343.105370] tun_chr_ioctl (drivers/net/tun.c:2161) [ 1343.105407] do_vfs_ioctl (fs/ioctl.c:44 fs/ioctl.c:674) [ 1343.105506] SyS_ioctl (fs/ioctl.c:689 fs/ioctl.c:680) [ 1343.105542] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:200) Looks like sk_detach_filter() wants the socket to be owned which neither tun_detach_filter() does not do, unlike sock_setsockopt(). Could you check if the patch below helps? 
I'm also not really sure if it is safe to ignore return value of sk_detach_filter() and just sets tun->filter_attached to false - but it's not really clear what should be done if one of the calls fails after some succeeded. Wrt return value, afaik SOCK_FILTER_LOCKED cannot be set for tun devs, so we should be okay. diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950617c3..7417d7c20bab 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) static void tun_detach_filter(struct tun_struct *tun, int n) { int i; -struct tun_file *tfile; for (i = 0; i < n; i++) { -tfile = rtnl_dereference(tun->tfiles[i]); -sk_detach_filter(tfile->socket.sk); +struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk; + +lock_sock(sk); +sk_detach_filter(sk); +release_sock(sk); } tun->filter_attached = false; In tun case, the control path for tun_attach_filter() and tun_detach_filter() is under RTNL lock (held in __tun_chr_ioctl()). So in the BPF core the rcu_dereference_protected(, sock_owned_by_user(sk)) looks like a false positive in this specific use case to me, that we should probably just silence. Running the filter via sk_filter() in tun device happens under rcu_read_lock(), so the dereference and assignment pair seems okay to me. Was wondering whether we should convert this to unattached BPF filter, but this would break with existing expectations from sk_filter() (e.g. security modules). 
If we want to silence it, could be something like the below (only compile-tested): drivers/net/tun.c | 8 +--- include/linux/filter.h | 4 net/core/filter.c | 33 + 3 files changed, 30 insertions(+), 15 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950..510e90a 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -622,7 +622,8 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte /* Re-attach the filter to persist device */ if (!skip_filter && (tun->filter_attached == true)) { - err = sk_attach_filter(&tun->fprog, tfile->socket.sk); + err = __sk_attach_filter(&tun->fprog, tfile->socket.sk, +lockdep_rtnl_is_held()); if (!err) goto out; } @@ -1822,7 +1823,7 @@ static void tun_detach_filter(struct tun_struct *tun, int n) for (i = 0; i < n; i++) { tfile = rtnl_dereference(tun->tfiles[i]); - sk_detach_filter(tfile->socket.sk); + __sk_detach_filter(tfile->socket.sk, lockdep_rtnl_is_held());
Re: bpf: net/core/filter.c:2115 suspicious rcu_dereference_protected() usage!
[ dropping my old email address ] On 03/29/2016 02:58 PM, Michal Kubecek wrote: On Mon, Feb 22, 2016 at 10:31:33AM -0500, Sasha Levin wrote: I've hit the following warning while fuzzing with trinity inside a kvmtool guest running the latest -next kernel: [ 1343.104588] === [ 1343.104591] [ INFO: suspicious RCU usage. ] [ 1343.104619] 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 Not tainted [ 1343.104624] --- [ 1343.104635] net/core/filter.c:2115 suspicious rcu_dereference_protected() usage! [ 1343.104641] [ 1343.104641] other info that might help us debug this: [ 1343.104641] [ 1343.104650] [ 1343.104650] rcu_scheduler_active = 1, debug_locks = 0 [ 1343.104660] 1 lock held by syz-executor/17916: [ 1343.104784] #0: (rtnl_mutex){+.+.+.}, at: rtnl_lock (net/core/rtnetlink.c:71) [ 1343.104789] [ 1343.104789] stack backtrace: [ 1343.104820] CPU: 1 PID: 17916 Comm: trinity-c8 Not tainted 4.5.0-rc4-next-20160219-sasha-00026-g7978205-dirty #2978 [ 1343.104868] 110036968f44 8801b4b47aa8 a23d9a9d 0001 [ 1343.104891] fbfff5c2a630 41b58ab3 adb3a8f2 a23d9905 [ 1343.104914] 8801b5419b40 fbfff7596522 0001 [ 1343.104919] Call Trace: [ 1343.104985] dump_stack (lib/dump_stack.c:53) [ 1343.105060] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4282) [ 1343.105093] sk_detach_filter (net/core/filter.c:2114 (discriminator 5)) [ 1343.105193] tun_detach_filter (drivers/net/tun.c:1808 (discriminator 7)) [ 1343.105238] __tun_chr_ioctl (drivers/net/tun.c:2133) [ 1343.105370] tun_chr_ioctl (drivers/net/tun.c:2161) [ 1343.105407] do_vfs_ioctl (fs/ioctl.c:44 fs/ioctl.c:674) [ 1343.105506] SyS_ioctl (fs/ioctl.c:689 fs/ioctl.c:680) [ 1343.105542] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:200) Looks like sk_detach_filter() wants the socket to be owned which neither tun_detach_filter() does not do, unlike sock_setsockopt(). Could you check if the patch below helps? 
I'm also not really sure if it is safe to ignore return value of sk_detach_filter() and just sets tun->filter_attached to false - but it's not really clear what should be done if one of the calls fails after some succeeded. Wrt return value, afaik SOCK_FILTER_LOCKED cannot be set for tun devs, so we should be okay. diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950617c3..7417d7c20bab 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1818,11 +1818,13 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) static void tun_detach_filter(struct tun_struct *tun, int n) { int i; - struct tun_file *tfile; for (i = 0; i < n; i++) { - tfile = rtnl_dereference(tun->tfiles[i]); - sk_detach_filter(tfile->socket.sk); + struct sock *sk = rtnl_dereference(tun->tfiles[i])->socket.sk; + + lock_sock(sk); + sk_detach_filter(sk); + release_sock(sk); } tun->filter_attached = false; In tun case, the control path for tun_attach_filter() and tun_detach_filter() is under RTNL lock (held in __tun_chr_ioctl()). So in the BPF core the rcu_dereference_protected(, sock_owned_by_user(sk)) looks like a false positive in this specific use case to me, that we should probably just silence. Running the filter via sk_filter() in tun device happens under rcu_read_lock(), so the dereference and assignment pair seems okay to me. Was wondering whether we should convert this to unattached BPF filter, but this would break with existing expectations from sk_filter() (e.g. security modules).
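The shape of the API change in the diff — callers pass the lock condition they can actually prove instead of the callee hard-coding one — can be modeled in a few lines of plain C. This is a sketch of the pattern only; `deref_protected()` is a hypothetical stand-in for `rcu_dereference_protected(ptr, cond)` and the assert for the lockdep splat:

```c
#include <assert.h>
#include <stdbool.h>

/* The callee no longer assumes which lock protects the pointer; the
 * caller supplies its proof: sock_owned_by_user(sk) from setsockopt(),
 * or lockdep_rtnl_is_held() from the tun ioctl path. */
static void *deref_protected(void *ptr, bool locked)
{
	assert(locked);		/* lockdep splat stand-in: cond must hold */
	return ptr;
}
```

This is why the `__sk_attach_filter(..., lockdep_rtnl_is_held())` variants are more accurate than widening the condition to `sock_owned_by_user(sk) || lockdep_rtnl_is_held()` inside the callee: each call site asserts only the lock it genuinely holds.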
Re: [PATCH] bpf: doc: "neg" opcode has no operands
On 03/28/2016 11:56 PM, Kees Cook wrote: From: Dave Anderson Fixes a copy-paste-o in the BPF opcode table: "neg" takes no arguments and thus has no addressing modes. Signed-off-by: Dave Anderson Signed-off-by: Kees Cook Acked-by: Daniel Borkmann
Re: [PATCH net-next] bpf: avoid copying junk bytes in bpf_get_current_comm()
On 03/11/2016 06:20 PM, Alexei Starovoitov wrote: On 3/11/16 2:24 AM, Daniel Borkmann wrote: On 03/10/2016 05:02 AM, Alexei Starovoitov wrote: Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but the result is typically passed to print("%s", buf) and extra bytes after zero don't cause any harm. In bpf the result of bpf_get_current_comm() is used as the part of map key and was causing spurious hash map mismatches. Use strlcpy() to guarantee zero-terminated string. bpf verifier checks that output buffer is zero-initialized, Sorry for late reply, more below: so even for short task names the output buffer don't have junk bytes. Note it's not a security concern, since kprobe+bpf is root only. Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors") Reported-by: Tobias Waldekranz Signed-off-by: Alexei Starovoitov [...] diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 4504ca66118d..50da680c479f 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -166,7 +166,7 @@ static u64 bpf_get_current_comm(u64 r1, u64 size, u64 r3, u64 r4, u64 r5) if (!task) return -EINVAL; -memcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm))); +strlcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm))); If I see this correctly, __set_task_comm() makes sure comm is always zero terminated, so that seems good, but isn't it already sufficient when switching to strlcpy() to simply use: strlcpy(buf, task->comm, size); The min_t() seems unnecessary work to me, why do we still need it? size is guaranteed to be > 0 through the eBPF verifier, so strlcpy() should take care of the rest. that's one clever optimization. yep. we can drop min_t. btw I wanted to add memset to __set_task_comm, keep memcpy in bpf_get_current_comm and optimize perf_event_comm_event (which doing: memset+strlcpy and can be replaced with memcpy), but figured that such 'fix' is not suitable for stable. I guess we can do in the next cycle? 
strlen is not cheap. Especially since it turned out that bpf_get_current_comm() is used very often in the hot path in bcc/tools. Would strscpy() help in this case (see 30035e45753b ("string: provide strscpy()"))? Also for the next cycle I'm planning to extend verifier to allow uninitialized stack to be passed to functions like bpf_get_current_comm() and they would have to zero it in error cases. Then we can save few more cycles from the programs. That would be useful also for other helpers indeed.
Re: [PATCH net-next] bpf: avoid copying junk bytes in bpf_get_current_comm()
On 03/10/2016 05:02 AM, Alexei Starovoitov wrote: Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but the result is typically passed to print("%s", buf) and extra bytes after zero don't cause any harm. In bpf the result of bpf_get_current_comm() is used as the part of map key and was causing spurious hash map mismatches. Use strlcpy() to guarantee zero-terminated string. bpf verifier checks that output buffer is zero-initialized, Sorry for late reply, more below: so even for short task names the output buffer don't have junk bytes. Note it's not a security concern, since kprobe+bpf is root only. Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors") Reported-by: Tobias Waldekranz Signed-off-by: Alexei Starovoitov [...] diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 4504ca66118d..50da680c479f 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -166,7 +166,7 @@ static u64 bpf_get_current_comm(u64 r1, u64 size, u64 r3, u64 r4, u64 r5) if (!task) return -EINVAL; - memcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm))); + strlcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm))); If I see this correctly, __set_task_comm() makes sure comm is always zero terminated, so that seems good, but isn't it already sufficient when switching to strlcpy() to simply use: strlcpy(buf, task->comm, size); The min_t() seems unnecessary work to me, why do we still need it? size is guaranteed to be > 0 through the eBPF verifier, so strlcpy() should take care of the rest. Thanks, Daniel
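The spurious map-key mismatch the fix targets is easy to reproduce in userspace: with memcpy(), whatever bytes sit after comm's terminating NUL become part of the key, so two identical task names can hash differently. A self-contained demonstration, using strncpy() into a zeroed buffer as a userspace approximation of the strlcpy() fix:

```c
#include <string.h>

/* Two comm buffers holding the same string "bash", but one carries
 * junk bytes after the NUL (as a reused 16-byte buffer would). */
static int memcpy_keys_match(void)
{
	char comm1[16] = "bash";
	char comm2[16] = "bash\0junkjunk";
	char k1[16], k2[16];

	memcpy(k1, comm1, 16);		/* junk after NUL copied too */
	memcpy(k2, comm2, 16);
	return memcmp(k1, k2, 16) == 0;	/* 0: spurious key mismatch */
}

static int strlcpy_keys_match(void)
{
	char comm1[16] = "bash";
	char comm2[16] = "bash\0junkjunk";
	char k1[16] = {0}, k2[16] = {0};	/* zeroed, as the verifier guarantees */

	strncpy(k1, comm1, sizeof(k1) - 1);	/* stops at the NUL */
	strncpy(k2, comm2, sizeof(k2) - 1);
	return memcmp(k1, k2, 16) == 0;		/* 1: keys now equal */
}
```

The same reasoning explains the min_t() discussion: since the verifier guarantees size > 0 and a zeroed buffer, the length-limited copy alone is sufficient.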
Re: linux-next: Tree for Mar 9 (net: bpf)
On 03/09/2016 06:18 PM, Randy Dunlap wrote: On 03/09/16 09:07, Randy Dunlap wrote: [...] Does it handle where net/Kconfig symbol NET selects BPF unconditionally? I got the patch from git and tested it. Works. Thanks. Ahh, perfect, thanks a lot!
Re: linux-next: Tree for Mar 9 (net: bpf)
On 03/09/2016 06:07 PM, Randy Dunlap wrote: On 03/09/16 08:48, Daniel Borkmann wrote: On 03/09/2016 05:44 PM, Randy Dunlap wrote: On 03/08/16 21:38, Stephen Rothwell wrote: Hi all, Changes since 20160308: on x86_64: ../net/core/filter.c: In function 'bpf_skb_get_tunnel_opt': ../net/core/filter.c:1824:2: error: implicit declaration of function 'ip_tunnel_info_opts_get' [-Werror=implicit-function-declaration] ip_tunnel_info_opts_get(to, info); ^ ../net/core/filter.c: In function 'bpf_skb_set_tunnel_opt': ../net/core/filter.c:1918:2: error: implicit declaration of function 'ip_tunnel_info_opts_set' [-Werror=implicit-function-declaration] ip_tunnel_info_opts_set(info, from, size); ^ Full randconfig file is attached. Already fixed with net-next commit e28e87ed474c ("ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET"). [...] Is that patch on the netdev mailing list or patchwork? I can't find it. It is in patchwork archive or also here: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e28e87ed474c5a0b378c66fb85efc8e487f4f63f Was reported first by Fengguang's build bot i.e. his config had !CONFIG_INET like in your case, so should resolve the issue. Thanks, Daniel
Re: linux-next: Tree for Mar 9 (net: bpf)
On 03/09/2016 05:44 PM, Randy Dunlap wrote: On 03/08/16 21:38, Stephen Rothwell wrote: Hi all, Changes since 20160308: on x86_64: ../net/core/filter.c: In function 'bpf_skb_get_tunnel_opt': ../net/core/filter.c:1824:2: error: implicit declaration of function 'ip_tunnel_info_opts_get' [-Werror=implicit-function-declaration] ip_tunnel_info_opts_get(to, info); ^ ../net/core/filter.c: In function 'bpf_skb_set_tunnel_opt': ../net/core/filter.c:1918:2: error: implicit declaration of function 'ip_tunnel_info_opts_set' [-Werror=implicit-function-declaration] ip_tunnel_info_opts_set(info, from, size); ^ Full randconfig file is attached. Already fixed with net-next commit e28e87ed474c ("ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET"). Thanks, Daniel
Re: [PATCH net-next 3/9] bpf: pre-allocate hash map elements
On 03/07/2016 02:58 AM, Alexei Starovoitov wrote: [...] --- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 3 + kernel/bpf/hashtab.c | 264 ++- kernel/bpf/syscall.c | 2 +- 4 files changed, 196 insertions(+), 74 deletions(-) Shouldn't all other map types (like array) need something like this as well to reserve this for their future flags? if (attr->map_flags) return ERR_PTR(-EINVAL); diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 4b070827200d..c81efb10bbb5 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -37,6 +37,7 @@ struct bpf_map { u32 key_size; u32 value_size; u32 max_entries; + u32 map_flags; Just naming this 'flags' doesn't work due to the anonymous struct inside that union, right? :/ u32 pages; struct user_struct *user; const struct bpf_map_ops *ops; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 6496f98d3d68..5eeb2ca9441e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -101,12 +101,15 @@ enum bpf_prog_type { #define BPF_NOEXIST 1 /* create new element if it didn't exist */ #define BPF_EXIST 2 /* update existing element */ +#define BPF_F_NO_PREALLOC (1ULL << 0) Nit: Should better be (1U << 0) as map_flags are of __u32. union bpf_attr { struct { /* anonymous struct used by BPF_MAP_CREATE command */ __u32 map_type; /* one of enum bpf_map_type */ __u32 key_size; /* size of key in bytes */ __u32 value_size; /* size of value in bytes */ __u32 max_entries;/* max number of entries in a map */ + __u32 map_flags; /* prealloc or not */ }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ Thanks, Daniel
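A minimal sketch of the reserved-flags check suggested in this review. check_map_flags() is a hypothetical name; only the BPF_F_NO_PREALLOC constant mirrors the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define BPF_F_NO_PREALLOC	(1U << 0)

/* A map type that understands only BPF_F_NO_PREALLOC rejects every
 * other bit at creation time, so the remaining bits of map_flags stay
 * reserved for future flags instead of being silently ignored. */
static int check_map_flags(uint32_t map_flags)
{
	if (map_flags & ~BPF_F_NO_PREALLOC)
		return -EINVAL;
	return 0;
}
```

Map types that take no flags at all would use the stricter form `if (attr->map_flags) return ERR_PTR(-EINVAL);` quoted above.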
Re: [PATCH net-next 2/9] bpf: introduce percpu_freelist
On 03/07/2016 02:58 AM, Alexei Starovoitov wrote: Introduce simple percpu_freelist to keep a single list of elements spread across per-cpu singly linked lists. /* push element into the list */ void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *); /* pop element from the list */ struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *); The object is pushed to the current cpu list. Pop first tries to get the object from the current cpu list; if it's empty, it goes to the neighbour cpu list. For the bpf program usage pattern the collision rate is very low, since programs push and pop the objects typically on the same cpu. Signed-off-by: Alexei Starovoitov These bits and their usage in combination with preallocation of objects in patch 3/9 look very useful to me! This code seems generic enough and doesn't contain any BPF specifics, so other subsystems could potentially utilize it as well; I'd suggest that this should better be placed under lib/ so it's exposed/visible for other developers too. You can still add 'F:' entries into the MAINTAINERS file to make sure patches also hit netdev: BPF (Safe dynamic programs and tools) [...] F: kernel/bpf/ F: lib/percpu_freelist.c F: include/linux/percpu_freelist.h When BPF_SYSCALL is enabled, it would then just select these library bits via Kconfig. The lib bits themselves can be a hidden Kconfig entry. Thanks, Daniel
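The push/pop semantics described in the commit message can be sketched in userspace like this. The "current cpu" is just an array index here and locking is omitted; this is purely an illustration of the neighbour-list fallback, not the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

struct pcpu_freelist_node {
	struct pcpu_freelist_node *next;
};

/* One singly linked list head per CPU. */
struct pcpu_freelist {
	struct pcpu_freelist_node *head[NR_CPUS];
};

/* Push always goes to the current CPU's list. */
static void pcpu_freelist_push(struct pcpu_freelist *fl, int cpu,
			       struct pcpu_freelist_node *node)
{
	node->next = fl->head[cpu];
	fl->head[cpu] = node;
}

/* Pop tries the current CPU first, then walks the neighbour CPUs'
 * lists; with the typical same-cpu push/pop pattern the first probe
 * almost always hits. */
static struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *fl,
						    int cpu)
{
	int i;

	for (i = 0; i < NR_CPUS; i++) {
		int c = (cpu + i) % NR_CPUS;
		struct pcpu_freelist_node *node = fl->head[c];

		if (node) {
			fl->head[c] = node->next;
			return node;
		}
	}
	return NULL;	/* all per-cpu lists empty */
}
```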
Re: [PATCH net-next 1/9] bpf: prevent kprobe+bpf deadlocks
On 03/07/2016 02:58 AM, Alexei Starovoitov wrote: if kprobe is placed within update or delete hash map helpers that hold bucket spin lock and triggered bpf program is trying to grab the spinlock for the same bucket on the same cpu, it will deadlock. Fix it by extending existing recursion prevention mechanism. Note, map_lookup and other tracing helpers don't have this problem, since they don't hold any locks and don't modify global data. bpf_trace_printk has its own recursive check and ok as well. Signed-off-by: Alexei Starovoitov LGTM Acked-by: Daniel Borkmann
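The recursion-prevention idea can be illustrated with a single-threaded stand-in for the per-cpu guard. All names here are hypothetical; the kernel uses a per-cpu variable with preemption-safe accessors rather than a plain global:

```c
#include <assert.h>

/* Stand-in for the per-cpu "bpf program active" flag. */
static int bpf_prog_active;

static int run_prog(void);

/* Returns the result of the simulated nested invocation:
 * -1 means the guard rejected it before it could try to take the
 * already-held bucket lock again. */
static int htab_update_with_guard(void)
{
	if (bpf_prog_active)	/* nested on this cpu: deadlock risk */
		return -1;
	bpf_prog_active = 1;
	/* ...bucket spinlock held, element updated; a kprobe placed
	 * inside this region fires the attached program... */
	int nested = run_prog();
	bpf_prog_active = 0;
	return nested;
}

/* Simulates a bpf program attached to a kprobe inside the helper,
 * which itself calls the map-update helper again. */
static int run_prog(void)
{
	return htab_update_with_guard();
}
```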
Re: linux-next: manual merge of the net-next tree with the net tree
On 03/04/2016 03:09 AM, Stephen Rothwell wrote: Hi all, Today's linux-next merge of the net-next tree got a conflict in: drivers/net/vxlan.c between commit: 4024fcf70556 ("vxlan: fix missing options_len update on RX with collect metadata") from the net tree and commit: 3288af0892e3 ("vxlan: move GBP header parsing to a separate function") from the net-next tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). Looks good, thanks!
Re: linux-next: manual merge of the net-next tree with the net tree
On 02/26/2016 01:13 AM, Stephen Rothwell wrote: [...] I fixed it up (see below) and can carry the fix as necessary (no action is required). Looks good to me, thanks Stephen! Best, Daniel
Re: [PATCH v2] bpf: grab rcu read lock for bpf_percpu_hash_update
On 02/19/2016 07:53 PM, Sasha Levin wrote: bpf_percpu_hash_update() expects rcu lock to be held and warns if it's not, which pointed out a missing rcu read lock. Fixes: 15a07b338 ("bpf: add lookup/update support for per-cpu hash and array maps") Signed-off-by: Sasha Levin LGTM, patch is against net-next tree. Acked-by: Daniel Borkmann
Re: [RFC][PATCH 00/10] Add trace event support to eBPF
On 02/18/2016 10:27 PM, Tom Zanussi wrote: On Tue, 2016-02-16 at 20:51 -0800, Alexei Starovoitov wrote: On Tue, Feb 16, 2016 at 04:35:27PM -0600, Tom Zanussi wrote: On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote: [...] Take a look at all the tools written on top of it: https://github.com/iovisor/bcc/tree/master/tools That's great, but it's all out-of-tree. Supporting out-of-tree users has never been justification for merging in-kernel code (or for blocking it from being merged). huh? perf is the only in-tree user space project. All others tools and libraries are out-of-tree and that makes sense. What about all the other things under tools/? Actually would be great to merge bcc with perf eventually, but choice of C++ isn't going to make it easy. The only real difference between perf+bpf and bcc is that bcc integrates clang/llvm as a library whereas perf+bpf deals with elf files and standalone compiler. There are pros and cons for both and it's great that both are actively growing and gaining user traction. Why worry about merging bcc with perf? Why not a tools/bcc? It would indeed be great to mid-term have bcc internals to some degree merged(/rewritten) into perf. tools/bcc doesn't make much sense to me as it really should be perf, where already the rest of the eBPF front-end logic resides that Wang et al initially integrated. So efforts could consolidate from that side. The user could be given a choice whether his use-case is to load the object file, or directly pass in a C file where perf does the rest. And f.e. Brendan's tools could ship natively as perf "built-ins" that users can go and try out from there directly and/or use the code as a starting point for their own proglets. Cheers, Daniel
Re: [PATCH] vsprintf: do not append unset Scope ID to IPv6
On 02/03/2016 10:47 PM, Joe Perches wrote: On Wed, 2016-02-03 at 22:14 +0100, Jason A. Donenfeld wrote: The idea here is to be able to printk a sockaddr_in6, and have it show something that looks like what the user would naturally pass to getaddrinfo(3), which is entirely complete. However, I could be convinced that this kind of behavior belongs in it's own flag. Maybe I'll cook up a flag for that instead. I think that'd be best. Agreed. Maybe using something like %pISG for this that would optionally show these flow and scope values only when non-zero. Something like: Looks good to me, having this as a single generic option seems even cleaner. Thanks!
Re: [PATCH] vsprintf: flowinfo in IPv6 is optional too
On 02/03/2016 06:56 PM, Joe Perches wrote: On Wed, 2016-02-03 at 13:13 +0100, Jason A. Donenfeld wrote: Signed-off-by: Jason A. Donenfeld --- lib/vsprintf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 1b1b1c8..85e6645 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -1189,7 +1189,7 @@ char *ip6_addr_string_sa(char *buf, char *end, const struct sockaddr_in6 *sa, *p++ = ':'; p = number(p, pend, ntohs(sa->sin6_port), spec); } - if (have_f) { + if (have_f && (sa->sin6_flowinfo & IPV6_FLOWINFO_MASK)) { *p++ = '/'; p = number(p, pend, ntohl(sa->sin6_flowinfo & IPV6_FLOWINFO_MASK), spec); Why does this matter at all? The format string "%pIS[...]f" is not used currently in the kernel. If one were to call out this 'f' qualifier to %pIS, wouldn't it be better to show /0 than elide the / output completely? +1 Another possibility also in regards to your other patch would be to have a different format string char option instead of 'f'/'s' that would then allow a developer for having both options to choose from. Dunno if it's really worth it, but if you have a use case that definitely needs it, then it'd be probably better. Less surprises during debugging, at least.
Re: [PATCH] vsprintf: do not append unset Scope ID to IPv6
On 02/03/2016 11:41 AM, Jason A. Donenfeld wrote: The sockaddr_in6 has the sin6_scope_id parameter. This contains the netdev index of the output device. When set to 0, sin6_scope_id is considered to be "unset" -- it has no Scope ID (see RFC4007). When it is set to >0, it has a Scope ID. [...] So, here, either it has Scope ID, in which case it's returned, or it doesn't need a Scope ID, in which case 0 - "no scope id" - is returned. Therefore, vsprintf should treat the 0 Scope ID the same way the rest of the v6 stack and the RFC treats it: 0 Scope ID means no Scope ID. And if there is no Scope ID, there is nothing to be printed. I don't see an "issue" here. I.e. if the developer requested to append 's' to the specifier, then it's expected to dump what is in the scope id, even if that is 0 meaning "unset" scope. It would be principle of least surprise -- I could imagine if not dumped, then a developer would check the source code to find out whether he did a mistake or it's expected behavior, just to make sure. That's what I would do probably. Is this causing a _real_ issue somewhere (I couldn't parse one out of your commit message other than 'it would be nice to have')? So, this patch only prints the Scope ID when one exists, and does not print a Scope ID when one does not exist. It turns out, too, that this is what developers want. The most common purpose of %pISpfsc is to show "what is in that sockaddr so that I can have some idea of how to connect to it?" Signed-off-by: Jason A. Donenfeld --- lib/vsprintf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 48ff9c3..1b1b1c8 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -1194,7 +1194,7 @@ char *ip6_addr_string_sa(char *buf, char *end, const struct sockaddr_in6 *sa, p = number(p, pend, ntohl(sa->sin6_flowinfo & IPV6_FLOWINFO_MASK), spec); } - if (have_s) { + if (have_s && sa->sin6_scope_id) { *p++ = '%'; p = number(p, pend, sa->sin6_scope_id, spec); }
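The two behaviors under discussion can be contrasted with a small userspace sketch. format_scope() is a hypothetical helper for illustration, not the vsprintf code:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Appends "%<scope_id>" to an already-formatted address string.
 * With elide_zero set, a zero (unset, per RFC 4007) scope id prints
 * nothing -- the behavior the patch proposes; without it, "%0" is
 * printed -- the current least-surprise behavior Daniel argues for. */
static void format_scope(char *buf, size_t len, const char *addr,
			 uint32_t scope_id, int elide_zero)
{
	if (elide_zero && !scope_id)
		snprintf(buf, len, "%s", addr);
	else
		snprintf(buf, len, "%s%%%u", addr, scope_id);
}
```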
Re: [PATCH] af_packet: Raw socket destruction warning fix
On 01/21/2016 12:40 PM, Maninder Singh wrote: The other sock_put() in packet_release() to drop the final ref and call into sk_free(), which drops the 1 ref on the sk_wmem_alloc from init time. Since you got into __sk_free() via the sock_wfree() destructor, your socket must have invoked packet_release() prior to this (perhaps kernel destroying the process). What kernel do you use? Issue is coming for 3.10.58. [ sorry for late reply ] What driver are you using (is that in-tree)? Can you reproduce the same issue with the latest -net kernel, for example (or, a 'reasonably' recent one like 4.3 or 4.4)? There have been quite a few changes in err queue handling (which also accounts rmem) as well. How reliably can you trigger the issue? Does it trigger with a completely different in-tree network driver as well with your tests? Would be useful to track/debug sk_rmem_alloc increases/decreases to see from which path new rmem is being charged in the time between packet_release() and packet_sock_destruct() for that socket ... Driver calls dev_kfree_skb_any->dev_kfree_skb_irq and it adds the buffer to the completion queue to free and raises softirq NET_TX_SOFTIRQ: net_tx_action->__kfree_skb->skb_release_all->skb_release_head_state->sock_wfree-> __sk_free->packet_sock_destruct Also, purging of the receive queue has been taken care of in other protocols.
Re: net: GPF in netlink_getsockbyportid
On 01/23/2016 08:25 PM, Florian Westphal wrote: Dmitry Vyukov wrote: [ CC nf-devel, not sure if its nfnetlink fault or NETLINK_MMAP ] The following program causes GPF in netlink_getsockbyportid: // autogenerated by syzkaller (http://github.com/google/syzkaller) #include #include #include #include #include int main() { syscall(SYS_mmap, 0x2000ul, 0xe65000ul, 0x3ul, 0x32ul, 0xul, 0x0ul); int fd = syscall(SYS_socket, 0x10ul, 0x803ul, 0xcul, 0, 0, 0); *(uint32_t*)0x20e64000 = (uint32_t)0x28; *(uint32_t*)0x20e64004 = (uint32_t)0x10; *(uint64_t*)0x20e64008 = (uint64_t)0x0; *(uint64_t*)0x20e64010 = (uint64_t)0x3; *(uint64_t*)0x20e64018 = (uint64_t)0xfff; *(uint16_t*)0x20e64020 = (uint16_t)0x5; syscall(SYS_write, fd, 0x20e64000ul, 0x28ul, 0, 0, 0); return 0; } CONFIG_NETLINK_MMAP and nfnetlink batching strike in unison :-/ root cause is in nfnetlink_rcv_batch(): 296 replay: 297 status = 0; 298 299 skb = netlink_skb_clone(oskb, GFP_KERNEL); The clone op doesn't copy oskb->sk, so we oops in __netlink_alloc_skb -> netlink_getsockbyportid() when nfnetlink_rcv_batch tries to send netlink ack. If indeed oskb is the mmap'ed netlink skb, then it's not even allowed to call into skb_clone() as it would access skb shared info data that can be controlled by the user space mmap buffer, iirc, we had that in the past with nlmon where skb_clone() was accidentally used.
Re: clang --target=bpf missing on f23 was: Re: [PATCH 1/2] perf test: Add libbpf relocation checker
On 01/22/2016 06:35 PM, Adam Jackson wrote: On Fri, 2016-01-22 at 14:22 -0300, Arnaldo Carvalho de Melo wrote: the 'bpf' target for clang is being used together with perf to build scriptlets into object code that then gets uploaded to the kernel via sys_bpf(), was the decision not to include 'bpf' just an accident? I wouldn't call it a "decision", that would imply intent. The main reason I explicitly list targets for llvm is to limit the CPU backends to arches Fedora actually runs on (which itself is because I really only care about llvmpipe, and am only touching llvm because it's in my way). Had no idea there was a bpf backend, so never thought to enable it. llvm-3.7.0-4.fc2{3,4} are building now with the bpf backend enabled, I'll create an update for F23 when it's built. Awesome, thanks!
Re: [PATCH 31/33] bpf: Add __bpf_prog_run() to stacktool whitelist
On 01/21/2016 11:49 PM, Josh Poimboeuf wrote: stacktool reports the following false positive warnings: stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x5c: sibling call from callable instruction with changed frame pointer stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x60: function has unreachable instruction stacktool: kernel/bpf/core.o: __bpf_prog_run()+0x64: function has unreachable instruction [...] It's confused by the following dynamic jump instruction in __bpf_prog_run(): jmp *(%r12,%rax,8) which corresponds to the following line in the C code: goto *jumptable[insn->code]; There's no way for stacktool to deterministically find all possible branch targets for a dynamic jump, so it can't verify this code. In this case the jumps all stay within the function, and there's nothing unusual going on related to the stack, so we can whitelist the function. Signed-off-by: Josh Poimboeuf Cc: Alexei Starovoitov Cc: net...@vger.kernel.org Fine by me: Acked-by: Daniel Borkmann
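The goto *jumptable[insn->code] pattern that confuses stacktool looks like this in miniature. The opcodes are invented for illustration; like the real interpreter, this relies on the GCC "labels as values" extension, and the resulting indirect jmp has targets no static checker can enumerate:

```c
#include <stdint.h>

enum { OP_ADD, OP_MUL, OP_RET };

struct insn {
	uint8_t code;
	int imm;
};

static int run(const struct insn *insn)
{
	/* Table of label addresses; dispatch is a single computed goto,
	 * which compiles to an indirect jump like jmp *(%reg,%rax,8). */
	static const void *jumptable[] = {
		[OP_ADD] = &&do_add,
		[OP_MUL] = &&do_mul,
		[OP_RET] = &&do_ret,
	};
	int acc = 0;

#define NEXT	goto *jumptable[(++insn)->code]
	goto *jumptable[insn->code];
do_add:
	acc += insn->imm;
	NEXT;
do_mul:
	acc *= insn->imm;
	NEXT;
do_ret:
	return acc;
#undef NEXT
}
```

All jumps stay within run(), which is why whitelisting the function is safe even though the tool cannot prove it.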
Re: [PATCH] net: filter: make JITs zero A for SKF_AD_ALU_XOR_X
On 01/05/2016 05:03 PM, Rabin Vincent wrote: On Tue, Jan 05, 2016 at 08:00:45AM -0800, Eric Dumazet wrote: On Tue, 2016-01-05 at 16:23 +0100, Rabin Vincent wrote: The SKF_AD_ALU_XOR_X ancillary is not like the other ancillary data instructions since it XORs A with X while all the others replace A with some loaded value. All the BPF JITs fail to clear A if this is used as the first instruction in a filter. Is x86_64 part of this 'All' subset ? ;) No, because it's an eBPF JIT. Correct, filter conversion to eBPF clears it already. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] net: filter: make JITs zero A for SKF_AD_ALU_XOR_X
On 01/05/2016 04:23 PM, Rabin Vincent wrote: The SKF_AD_ALU_XOR_X ancillary is not like the other ancillary data instructions since it XORs A with X while all the others replace A with some loaded value. All the BPF JITs fail to clear A if this is used as the first instruction in a filter. This was found using american fuzzy lop. Add a helper to determine if A needs to be cleared given the first instruction in a filter, and use this in the JITs. Except for ARM, the rest have only been compile-tested. Fixes: 3480593131e0 ("net: filter: get rid of BPF_S_* enum") Signed-off-by: Rabin Vincent Excellent catch, thanks a lot! The fix looks good to me and should go to -net tree. Acked-by: Daniel Borkmann If you're interested, feel free to add a small test case for the SKF_AD_ALU_XOR_X issue to lib/test_bpf.c for -net-next tree. Thanks!
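A simplified sketch of such a helper, covering only the case the patch fixes: loading the SKF_AD_ALU_XOR_X ancillary computes A = A ^ X, i.e. it *reads* A, so the JIT must emit a clear first. The constants mirror the uapi values but are redefined here for a self-contained sketch, and the real kernel helper handles more first-instruction cases than this:

```c
#include <stdbool.h>
#include <stdint.h>

#define BPF_LD			0x00
#define BPF_W			0x00
#define BPF_ABS			0x20
#define SKF_AD_OFF		(-0x1000)
#define SKF_AD_ALU_XOR_X	40

/* Classic BPF instruction, as in <linux/filter.h>. */
struct sock_filter {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

/* Simplified: A must be cleared if the filter starts with the one
 * ancillary load that reads A instead of replacing it. */
static bool bpf_needs_clear_a(const struct sock_filter *first)
{
	return first->code == (BPF_LD | BPF_W | BPF_ABS) &&
	       first->k == (uint32_t)(SKF_AD_OFF + SKF_AD_ALU_XOR_X);
}
```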
Re: [PATCH v2 3/3] bpf: hash: use per-bucket spinlock
On 12/29/2015 03:40 PM, Ming Lei wrote: Both htab_map_update_elem() and htab_map_delete_elem() can be called from an eBPF program, and they may be in the kernel hot path, so it isn't efficient to use a per-hashtable lock in these two helpers. The per-hashtable spinlock is used for protecting the bucket's hlist, and a per-bucket lock is just enough. This patch converts the per-hashtable lock into a per-bucket spinlock, so that contention can be decreased a lot. Signed-off-by: Ming Lei Looks better, thanks! Acked-by: Daniel Borkmann
Re: [PATCH v1 3/3] bpf: hash: use per-bucket spinlock
On 12/28/2015 01:55 PM, Ming Lei wrote: Both htab_map_update_elem() and htab_map_delete_elem() can be called from eBPF program, and they may be in kernel hot path, so it isn't efficient to use a per-hashtable lock in this two helpers. The per-hashtable spinlock is used for protecting bucket's hlist, and per-bucket lock is just enough. This patch converts the per-hashtable lock into per-bucket spinlock, so that contention can be decreased a lot. Signed-off-by: Ming Lei --- kernel/bpf/hashtab.c | 46 ++ 1 file changed, 30 insertions(+), 16 deletions(-) diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index d857fcb..67222a9 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -14,10 +14,14 @@ #include #include +struct bucket { + struct hlist_head head; + raw_spinlock_t lock; +}; + struct bpf_htab { struct bpf_map map; - struct hlist_head *buckets; - raw_spinlock_t lock; + struct bucket *buckets; atomic_t count; /* number of elements in this hashtable */ u32 n_buckets; /* number of hash buckets */ u32 elem_size; /* size of each element in bytes */ @@ -88,24 +92,25 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) /* make sure page count doesn't overflow */ goto free_htab; When adapting memory accounting and allocation sizes below where you replace sizeof(struct hlist_head) with sizeof(struct bucket), is there a reason why you don't update the overflow checks along with it? [...] /* prevent zero size kmalloc and check for u32 overflow */ if (htab->n_buckets == 0 || htab->n_buckets > U32_MAX / sizeof(struct hlist_head)) goto free_htab; if ((u64) htab->n_buckets * sizeof(struct hlist_head) + (u64) htab->elem_size * htab->map.max_entries >= U32_MAX - PAGE_SIZE) /* make sure page count doesn't overflow */ goto free_htab; [...] 
- htab->map.pages = round_up(htab->n_buckets * sizeof(struct hlist_head) + + htab->map.pages = round_up(htab->n_buckets * sizeof(struct bucket) + htab->elem_size * htab->map.max_entries, PAGE_SIZE) >> PAGE_SHIFT; err = -ENOMEM; - htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct hlist_head), + htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct bucket), GFP_USER | __GFP_NOWARN); if (!htab->buckets) { - htab->buckets = vmalloc(htab->n_buckets * sizeof(struct hlist_head)); + htab->buckets = vmalloc(htab->n_buckets * sizeof(struct bucket)); if (!htab->buckets) goto free_htab; } [...] Thanks, Daniel
Re: [PATCH 3/3] bpf: hash: use per-bucket spinlock
On 12/26/2015 10:31 AM, Ming Lei wrote: From: Ming Lei Both htab_map_update_elem() and htab_map_delete_elem() can be called from eBPF program, and they may be in kernel hot path, so it isn't efficient to use a per-hashtable lock in this two helpers. The per-hashtable spinlock is used just for protecting bucket's hlist, and per-bucket lock should be enough. This patch converts the per-hashtable lock into per-bucket spinlock, so that contention can be decreased a lot. Signed-off-by: Ming Lei --- kernel/bpf/hashtab.c | 35 +-- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index d857fcb..66bad7a 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -14,9 +14,14 @@ #include #include +struct bucket { + struct hlist_head head; + raw_spinlock_t lock; +}; + struct bpf_htab { struct bpf_map map; - struct hlist_head *buckets; + struct bucket *buckets; raw_spinlock_t lock; Shouldn't the lock member be removed from here? atomic_t count; /* number of elements in this hashtable */ u32 n_buckets; /* number of hash buckets */ @@ -88,24 +93,25 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) /* make sure page count doesn't overflow */ goto free_htab; - htab->map.pages = round_up(htab->n_buckets * sizeof(struct hlist_head) + + htab->map.pages = round_up(htab->n_buckets * sizeof(struct bucket) + htab->elem_size * htab->map.max_entries, PAGE_SIZE) >> PAGE_SHIFT; err = -ENOMEM; - htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct hlist_head), + htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct bucket), GFP_USER | __GFP_NOWARN); if (!htab->buckets) { - htab->buckets = vmalloc(htab->n_buckets * sizeof(struct hlist_head)); + htab->buckets = vmalloc(htab->n_buckets * sizeof(struct bucket)); if (!htab->buckets) goto free_htab; } - for (i = 0; i < htab->n_buckets; i++) - INIT_HLIST_HEAD(&htab->buckets[i]); + for (i = 0; i < htab->n_buckets; i++) { + INIT_HLIST_HEAD(&htab->buckets[i].head); 
+ raw_spin_lock_init(&htab->buckets[i].lock); + } - raw_spin_lock_init(&htab->lock); atomic_set(&htab->count, 0); return &htab->map; @@ -120,11 +126,16 @@ static inline u32 htab_map_hash(const void *key, u32 key_len) return jhash(key, key_len, 0); } -static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash) +static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash) { return &htab->buckets[hash & (htab->n_buckets - 1)]; } +static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash) +{ + return &__select_bucket(htab, hash)->head; +} + static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash, void *key, u32 key_size) { @@ -227,6 +238,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, struct bpf_htab *htab = container_of(map, struct bpf_htab, map); struct htab_elem *l_new, *l_old; struct hlist_head *head; + struct bucket *b; unsigned long flags; u32 key_size; int ret; @@ -248,7 +260,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, memcpy(l_new->key + round_up(key_size, 8), value, map->value_size); l_new->hash = htab_map_hash(l_new->key, key_size); - head = select_bucket(htab, l_new->hash); + b = __select_bucket(htab, l_new->hash); + head = &b->head; /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&htab->lock, flags); I am a bit confused on this one, though. The raw spin lock is still per hashtable (htab->lock), and has not been converted in this patch to the per bucket one (as opposed to what the commit message says), so this patch seems to go into the right direction, but is a bit incomplete? Shouldn't the above f.e. take b->lock, etc? 
@@ -299,6 +312,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key) { struct bpf_htab *htab = container_of(map, struct bpf_htab, map); struct hlist_head *head; + struct bucket *b; struct htab_elem *l; unsigned long flags; u32 hash, key_size; @@ -309,7 +323,8 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key) key_size = map->key_size; hash = htab_map_hash(key, key_size); - head = select_bucket(htab, hash); + b = __select_bucket(htab, hash); + head = &b->head; raw_spin_lock_irqsave(&htab->
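The layout the review is asking for, with the lock embedded next to the hlist head it protects so update/delete take b->lock rather than a single per-hashtable lock, can be sketched in userspace roughly as follows. atomic_flag stands in for raw_spinlock_t; the names otherwise follow the patch:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define N_BUCKETS 4	/* power of two, as in the real map */

struct node {
	struct node *next;
	uint32_t hash;
};

/* Per-bucket state: the lock lives beside the list it protects. */
struct bucket {
	struct node *head;
	atomic_flag lock;	/* stand-in for raw_spinlock_t */
};

static void bucket_lock(struct bucket *b)
{
	while (atomic_flag_test_and_set(&b->lock))
		;	/* spin */
}

static void bucket_unlock(struct bucket *b)
{
	atomic_flag_clear(&b->lock);
}

static void htab_init(struct bucket *buckets)
{
	int i;

	for (i = 0; i < N_BUCKETS; i++) {
		buckets[i].head = NULL;
		atomic_flag_clear(&buckets[i].lock);
	}
}

static struct bucket *select_bucket(struct bucket *buckets, uint32_t hash)
{
	return &buckets[hash & (N_BUCKETS - 1)];
}

static void htab_insert(struct bucket *buckets, struct node *n)
{
	struct bucket *b = select_bucket(buckets, n->hash);

	bucket_lock(b);		/* per-bucket, not per-table */
	n->next = b->head;
	b->head = n;
	bucket_unlock(b);
}
```

Updates to different buckets then never contend on each other's locks, which is where the reduction in contention comes from.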
Re: [PATCH 2/3] bpf: hash: move select_bucket() out of htab's spinlock
On 12/26/2015 10:31 AM, Ming Lei wrote: The spinlock is just used for protecting the per-bucket hlist, so it isn't needed for selecting the bucket. Signed-off-by: Ming Lei Acked-by: Daniel Borkmann
Re: [PATCH 1/3] bpf: hash: use atomic count
On 12/26/2015 10:31 AM, Ming Lei wrote: Preparing for removing the global per-hashtable lock, so the counter needs to be defined as atomic_t first. Signed-off-by: Ming Lei Acked-by: Daniel Borkmann