Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.
On 4/2/24 6:58 PM, Andrii Nakryiko wrote: On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee wrote: rethook_find_ret_addr() prints a warning message and returns 0 when the target task is running and is not the "current" task, to prevent returning an incorrect return address. However, this check is incomplete: the target task can still transition to the running state while the return address is being looked up, although this is safe with RCU. The issue we encounter is that the kernel frequently prints warning messages when BPF profiling programs call bpf_get_task_stack() on running tasks. The callers should be aware of, and willing to take, the risk of receiving an incorrect return address from a task that is running and is not the "current" one. A warning is not needed here as the callers accept this behavior. Signed-off-by: Kui-Feng Lee --- kernel/trace/rethook.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..4297a132a7ae 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
 	if (WARN_ON_ONCE(!cur))
 		return 0;

-	if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
+	if (tsk != current && task_is_running(tsk))
 		return 0;

This should probably go through Masami's tree, but the change makes sense to me, given this is an expected condition. Acked-by: Andrii Nakryiko

Masami, I assume you'll pick this up? Thanks, Daniel
Re: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
Hi Sebastian, On 12/15/23 6:07 PM, Sebastian Andrzej Siewior wrote: The per-CPU variables used during bpf_prog_run_xdp() invocation and later during xdp_do_redirect() rely on disabled BH for their protection. Without locking in local_bh_disable() on PREEMPT_RT, these data structures require explicit locking. This is a follow-up to the previous change, which introduced bpf_run_lock.redirect_lock, and now uses it within drivers. The simple way is to acquire the lock before bpf_prog_run_xdp() is invoked and hold it until the end of the function. This does not always work because some drivers (cpsw, atlantic) invoke xdp_do_flush() in the same context. Acquiring the lock in bpf_prog_run_xdp() and dropping it in xdp_do_redirect() (without touching drivers) does not work because not all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and invoke xdp_do_redirect()). Ideally the minimal locking scope would be bpf_prog_run_xdp() + xdp_do_redirect(), and everything else (error recovery, DMA unmapping, free/alloc of memory, …) would happen outside of the locked section. [...]

 drivers/net/hyperv/netvsc_bpf.c |  1 +
 drivers/net/netkit.c            | 13 +++
 drivers/net/tun.c               | 28 +--
 drivers/net/veth.c              | 40 -
 drivers/net/virtio_net.c        |  1 +
 drivers/net/xen-netfront.c      |  1 +
 6 files changed, 52 insertions(+), 32 deletions(-)

[...] Please exclude netkit from this set given it does not support XDP, but instead only accepts tc BPF typed programs.
Thanks, Daniel

diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 39171380ccf29..fbcf78477bda8 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
 	netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
 	skb->dev = peer;
 	entry = rcu_dereference(nk->active);
-	if (entry)
-		ret = netkit_run(entry, skb, ret);
+	if (entry) {
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			ret = netkit_run(entry, skb, ret);
+			if (ret == NETKIT_REDIRECT) {
+				dev_sw_netstats_tx_add(dev, 1, len);
+				skb_do_redirect(skb);
+			}
+		}
+	}
 	switch (ret) {
 	case NETKIT_NEXT:
 	case NETKIT_PASS:
@@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 		break;
 	case NETKIT_REDIRECT:
-		dev_sw_netstats_tx_add(dev, 1, len);
-		skb_do_redirect(skb);
 		break;
 	case NETKIT_DROP:
 	default:
Re: [PATCH net] bpf: test_run: fix WARNING in format_decode
On 11/22/23 6:28 AM, Yonghong Song wrote: On 11/21/23 7:50 PM, Edward Adam Davis wrote: Confirm that skb->len is not 0 to ensure that the skb length is valid. Fixes: 114039b34201 ("bpf: Move skb->len == 0 checks into __bpf_redirect") Reported-by: syzbot+e2c932aec5c8a6e1d...@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis

Stan, could you take a look at this patch?

I think this only papers over the bug.. also BPF selftests seem to break with this change. Looking again at the syzkaller trace:

[...] Please remove unsupported %\0 in format string
WARNING: CPU: 0 PID: 5068 at lib/vsprintf.c:2675 format_decode+0xa03/0xba0 lib/vsprintf.c:2675
[...]

We need to fix bpf_bprintf_prepare() instead to reject invalid fmts such as %\0 and similar.

 net/bpf/test_run.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c9fdcc5cdce1..78258a822a5c 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -845,6 +845,9 @@ static int convert___skb_to_skb(struct sk_buff *skb, struct __sk_buff *__skb)
 {
 	struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb;

+	if (!skb->len)
+		return -EINVAL;
+
 	if (!__skb)
 		return 0;
Re: [PATCH bpf-next v2 3/4] libbpf: add low level TC-BPF API
On 4/19/21 11:43 PM, Toke Høiland-Jørgensen wrote: Daniel Borkmann writes: On 4/19/21 2:18 PM, Kumar Kartikeya Dwivedi wrote: This adds functions that wrap the netlink API used for adding, manipulating, and removing traffic control filters. These functions operate directly on the loaded prog's fd, and return a handle to the filter using an out parameter named id. The basic featureset is covered to allow for attaching, manipulation of properties, and removal of filters. Some additional features like TCA_BPF_POLICE and TCA_RATE for tc_cls have been omitted. These can be added on top later by extending the bpf_tc_cls_opts struct. Support for binding actions directly to a classifier by passing them in during filter creation has also been omitted for now. These actions have an auto clean up property because their lifetime is bound to the filter they are attached to. This can be added later, but was omitted for now as direct action mode is a better alternative to it, which is enabled by default. An API summary: bpf_tc_act_{attach, change, replace} may be used to attach, change, and typo on bpf_tc_act_{...} ? ^^^ replace SCHED_CLS bpf classifier. The protocol field can be set as 0, in which case it is substituted as ETH_P_ALL by default. Do you have an actual user that needs anything other than ETH_P_ALL? Why is it even needed? Why not stick to just ETH_P_ALL? The behavior of the three functions is as follows: attach = create filter if it does not exist, fail otherwise change = change properties of the classifier of existing filter replace = create filter, and replace any existing filter This touches on tc oddities quite a bit. Why do we need to expose them? Can't we simplify/abstract this e.g. i) create or update instance, ii) delete instance, iii) get instance ? What concrete use case do you have that you need those three above? bpf_tc_cls_detach may be used to detach existing SCHED_CLS filter. 
The bpf_tc_cls_attach_id object filled in during attach, change, or replace must be passed in to the detach functions for them to remove the filter and its attached classifier correctly. bpf_tc_cls_get_info is a helper that can be used to obtain attributes for the filter and classifier. The opts structure may be used to choose the granularity of search, such that info for a specific filter corresponding to the same loaded bpf program can be obtained. By default, the first match is returned to the user. Examples: struct bpf_tc_cls_attach_id id = {}; struct bpf_object *obj; struct bpf_program *p; int fd, r; obj = bpf_object_open("foo.o"); if (IS_ERR_OR_NULL(obj)) return PTR_ERR(obj); p = bpf_object__find_program_by_title(obj, "classifier"); if (IS_ERR_OR_NULL(p)) return PTR_ERR(p); if (bpf_object__load(obj) < 0) return -1; fd = bpf_program__fd(p); r = bpf_tc_cls_attach(fd, if_nametoindex("lo"), BPF_TC_CLSACT_INGRESS, NULL, &id); if (r < 0) return r; ... which is roughly equivalent to (after clsact qdisc setup): # tc filter add dev lo ingress bpf obj foo.o sec classifier da ... as direct action mode is always enabled. If a user wishes to modify existing options on an attached classifier, the bpf_tc_cls_change API may be used. Only the class_id parameter can be modified; the rest are filled in to identify the correct filter. protocol can be left out if it was not chosen explicitly (defaulting to ETH_P_ALL). Example: /* Optional parameters necessary to select the right filter */ DECLARE_LIBBPF_OPTS(bpf_tc_cls_opts, opts, .handle = id.handle, .priority = id.priority, .chain_index = id.chain_index) Why do we need chain_index as part of the basic API? opts.class_id = TC_H_MAKE(1UL << 16, 12); r = bpf_tc_cls_change(fd, if_nametoindex("lo"), BPF_TC_CLSACT_INGRESS, &opts, &id); Also, I'm not sure whether the prefix should even be named bpf_tc_cls_*() tbh, yes, despite being "low level" api. 
I think in the context of bpf we should stop regarding this as 'classifier' and 'action' objects since it's really just a single entity and not separate ones. It's weird enough to explain this concept to new users, and if a libbpf-based API could cleanly abstract it, I would be all for it. I don't think we need to map 1:1 the old tc legacy even in the low level api, tbh, as it feels backwards. I think the 'handle' & 'priority' bits are okay, but I would remove the others. Hmm, I'm OK with dropping the TC oddities (including the cls_ in the name), but I think we should be documenting it so that users that do come from TC will not be completely lost
Re: [PATCH bpf-next v2 3/4] libbpf: add low level TC-BPF API
On 4/19/21 2:18 PM, Kumar Kartikeya Dwivedi wrote: This adds functions that wrap the netlink API used for adding, manipulating, and removing traffic control filters. These functions operate directly on the loaded prog's fd, and return a handle to the filter using an out parameter named id. The basic featureset is covered to allow for attaching, manipulation of properties, and removal of filters. Some additional features like TCA_BPF_POLICE and TCA_RATE for tc_cls have been omitted. These can be added on top later by extending the bpf_tc_cls_opts struct. Support for binding actions directly to a classifier by passing them in during filter creation has also been omitted for now. These actions have an auto clean up property because their lifetime is bound to the filter they are attached to. This can be added later, but was omitted for now as direct action mode is a better alternative to it, which is enabled by default. An API summary: bpf_tc_act_{attach, change, replace} may be used to attach, change, and typo on bpf_tc_act_{...} ? ^^^ replace SCHED_CLS bpf classifier. The protocol field can be set as 0, in which case it is substituted as ETH_P_ALL by default. Do you have an actual user that needs anything other than ETH_P_ALL? Why is it even needed? Why not stick to just ETH_P_ALL? The behavior of the three functions is as follows: attach = create filter if it does not exist, fail otherwise change = change properties of the classifier of existing filter replace = create filter, and replace any existing filter This touches on tc oddities quite a bit. Why do we need to expose them? Can't we simplify/abstract this e.g. i) create or update instance, ii) delete instance, iii) get instance ? What concrete use case do you have that you need those three above? bpf_tc_cls_detach may be used to detach existing SCHED_CLS filter. 
The bpf_tc_cls_attach_id object filled in during attach, change, or replace must be passed in to the detach functions for them to remove the filter and its attached classifier correctly. bpf_tc_cls_get_info is a helper that can be used to obtain attributes for the filter and classifier. The opts structure may be used to choose the granularity of search, such that info for a specific filter corresponding to the same loaded bpf program can be obtained. By default, the first match is returned to the user. Examples: struct bpf_tc_cls_attach_id id = {}; struct bpf_object *obj; struct bpf_program *p; int fd, r; obj = bpf_object_open("foo.o"); if (IS_ERR_OR_NULL(obj)) return PTR_ERR(obj); p = bpf_object__find_program_by_title(obj, "classifier"); if (IS_ERR_OR_NULL(p)) return PTR_ERR(p); if (bpf_object__load(obj) < 0) return -1; fd = bpf_program__fd(p); r = bpf_tc_cls_attach(fd, if_nametoindex("lo"), BPF_TC_CLSACT_INGRESS, NULL, &id); if (r < 0) return r; ... which is roughly equivalent to (after clsact qdisc setup): # tc filter add dev lo ingress bpf obj foo.o sec classifier da ... as direct action mode is always enabled. If a user wishes to modify existing options on an attached classifier, the bpf_tc_cls_change API may be used. Only the class_id parameter can be modified; the rest are filled in to identify the correct filter. protocol can be left out if it was not chosen explicitly (defaulting to ETH_P_ALL). Example: /* Optional parameters necessary to select the right filter */ DECLARE_LIBBPF_OPTS(bpf_tc_cls_opts, opts, .handle = id.handle, .priority = id.priority, .chain_index = id.chain_index) Why do we need chain_index as part of the basic API? opts.class_id = TC_H_MAKE(1UL << 16, 12); r = bpf_tc_cls_change(fd, if_nametoindex("lo"), BPF_TC_CLSACT_INGRESS, &opts, &id); Also, I'm not sure whether the prefix should even be named bpf_tc_cls_*() tbh, yes, despite being "low level" api. 
I think in the context of bpf we should stop regarding this as 'classifier' and 'action' objects since it's really just a single entity and not separate ones. It's weird enough to explain this concept to new users, and if a libbpf-based API could cleanly abstract it, I would be all for it. I don't think we need to map 1:1 the old tc legacy even in the low level api, tbh, as it feels backwards. I think the 'handle' & 'priority' bits are okay, but I would remove the others. if (r < 0) return r; struct bpf_tc_cls_info info = {}; r = bpf_tc_cls_get_info(fd, if_nametoindex("lo"), BPF_TC_CLSACT_INGRESS, &opts, &info); if (r < 0) return r; assert(info.class_id ==
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On 4/16/21 12:22 AM, Andrii Nakryiko wrote: On Thu, Apr 15, 2021 at 3:10 PM Daniel Borkmann wrote: On 4/15/21 1:58 AM, Andrii Nakryiko wrote: On Wed, Apr 14, 2021 at 4:32 PM Daniel Borkmann wrote: On 4/15/21 1:19 AM, Andrii Nakryiko wrote: On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov wrote: On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote: On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote: On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi wrote: [...] All of these things are messy because of tc legacy. bpf tried to follow tc style with cls and act distinction and it didn't quite work. cls with direct-action is the only thing that became mainstream while tc style attach wasn't really addressed. There were several incidents where tc had tens of thousands of progs attached because of this attach/query/index weirdness described above. I think the only way to address this properly is to introduce a bpf_link style of attaching to tc. Such a bpf_link would support ingress/egress only. direct-action will be implied. There won't be any index, and query will be obvious. Note that we already have bpf_link support working (without support for pinning, of course) in a limited way. The ifindex, protocol, parent_id, priority, handle, chain_index tuple uniquely identifies a filter, so we stash this in the bpf_link and are able to operate on the exact filter during release. Except they're not unique. The library can stash them, but something else doing detach via iproute2 or its own netlink calls will detach the prog. This other app can attach a different prog to the same spot, and now bpf_link__destroy will be detaching somebody else's prog. 
So I would like to propose to take this patch set a step further from what Daniel said: int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}); and make this proposed API return an FD. To detach from tc ingress/egress just close(fd). You mean adding an fd-based TC API to the kernel? yes. I'm totally for bpf_link-based TC attachment. But I think *also* having "legacy" netlink-based APIs will allow applications to handle older kernels in a much nicer way without extra dependency on iproute2. We have a similar situation with kprobe, where currently libbpf only supports "modern" fd-based attachment, but users periodically ask questions and struggle to figure out issues on older kernels that don't support new APIs. +1; I am OK with adding a new bpf_link-based way to attach TC programs, but we still need to support the netlink API in libbpf. So I think we'd have to support legacy TC APIs, but I agree with Alexei and Daniel that we should keep it to the simplest and most straightforward API of supporting direct-action attachments and setting up qdisc transparently (if I'm getting all the terminology right, after reading Quentin's blog post). That coincidentally should probably match how the bpf_link-based TC API will look, so all that can be abstracted behind a single bpf_link__attach_tc() API as well, right? That's the plan for dealing with kprobe right now, btw. Libbpf will detect the best available API and transparently fall back (maybe with some warning for awareness, due to inherent downsides of legacy APIs: no auto-cleanup being the most prominent one). Yup, SGTM: Expose both in the low-level API (in bpf.c), and make the high-level API auto-detect. That way users can also still use the netlink attach function if they don't want the fd-based auto-close behaviour of bpf_link. So I thought a bit more about this, and it feels like the right move would be to expose only higher-level TC BPF API behind bpf_link. 
It will keep the API complexity and amount of APIs that libbpf will have to support to the minimum, and will keep the API itself simple: direct-attach with the minimum amount of input arguments. By not exposing low-level APIs we also table the whole bpf_tc_cls_attach_id design discussion, as we now can keep as much info as needed inside bpf_link_tc (which will embed bpf_link internally as well) to support detachment and possibly some additional querying, if needed. But then there would be no way for the caller to explicitly select a mechanism? I.e., if I write a BPF program using this mechanism targeting a 5.12 kernel, I'll get netlink attachment, which can stick around when I do bpf_link__disconnect(). But then if the kernel gets upgraded to support bpf_link for TC programs I'll suddenly transparently get bpf_link and the attachments will go away unless I pin them. This seems... less than ideal? That's what we are doing with bpf_program__attach_kprobe(), though. And so far I've only seen people (privately) sayi
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On 4/15/21 1:58 AM, Andrii Nakryiko wrote: On Wed, Apr 14, 2021 at 4:32 PM Daniel Borkmann wrote: On 4/15/21 1:19 AM, Andrii Nakryiko wrote: On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov wrote: On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote: On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote: On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi wrote: [...] All of these things are messy because of tc legacy. bpf tried to follow tc style with cls and act distinction and it didn't quite work. cls with direct-action is the only thing that became mainstream while tc style attach wasn't really addressed. There were several incidents where tc had tens of thousands of progs attached because of this attach/query/index weirdness described above. I think the only way to address this properly is to introduce a bpf_link style of attaching to tc. Such a bpf_link would support ingress/egress only. direct-action will be implied. There won't be any index, and query will be obvious. Note that we already have bpf_link support working (without support for pinning, of course) in a limited way. The ifindex, protocol, parent_id, priority, handle, chain_index tuple uniquely identifies a filter, so we stash this in the bpf_link and are able to operate on the exact filter during release. Except they're not unique. The library can stash them, but something else doing detach via iproute2 or its own netlink calls will detach the prog. This other app can attach a different prog to the same spot, and now bpf_link__destroy will be detaching somebody else's prog. 
So I would like to propose to take this patch set a step further from what Daniel said: int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}); and make this proposed API return an FD. To detach from tc ingress/egress just close(fd). You mean adding an fd-based TC API to the kernel? yes. I'm totally for bpf_link-based TC attachment. But I think *also* having "legacy" netlink-based APIs will allow applications to handle older kernels in a much nicer way without extra dependency on iproute2. We have a similar situation with kprobe, where currently libbpf only supports "modern" fd-based attachment, but users periodically ask questions and struggle to figure out issues on older kernels that don't support new APIs. +1; I am OK with adding a new bpf_link-based way to attach TC programs, but we still need to support the netlink API in libbpf. So I think we'd have to support legacy TC APIs, but I agree with Alexei and Daniel that we should keep it to the simplest and most straightforward API of supporting direct-action attachments and setting up qdisc transparently (if I'm getting all the terminology right, after reading Quentin's blog post). That coincidentally should probably match how the bpf_link-based TC API will look, so all that can be abstracted behind a single bpf_link__attach_tc() API as well, right? That's the plan for dealing with kprobe right now, btw. Libbpf will detect the best available API and transparently fall back (maybe with some warning for awareness, due to inherent downsides of legacy APIs: no auto-cleanup being the most prominent one). Yup, SGTM: Expose both in the low-level API (in bpf.c), and make the high-level API auto-detect. That way users can also still use the netlink attach function if they don't want the fd-based auto-close behaviour of bpf_link. So I thought a bit more about this, and it feels like the right move would be to expose only higher-level TC BPF API behind bpf_link. 
It will keep the API complexity and amount of APIs that libbpf will have to support to the minimum, and will keep the API itself simple: direct-attach with the minimum amount of input arguments. By not exposing low-level APIs we also table the whole bpf_tc_cls_attach_id design discussion, as we now can keep as much info as needed inside bpf_link_tc (which will embed bpf_link internally as well) to support detachment and possibly some additional querying, if needed. But then there would be no way for the caller to explicitly select a mechanism? I.e., if I write a BPF program using this mechanism targeting a 5.12 kernel, I'll get netlink attachment, which can stick around when I do bpf_link__disconnect(). But then if the kernel gets upgraded to support bpf_link for TC programs I'll suddenly transparently get bpf_link and the attachments will go away unless I pin them. This seems... less than ideal? That's what we are doing with bpf_program__attach_kprobe(), though. And so far I've only seen people (privately) saying how good it would be to have bpf_link-based TC APIs, doesn't seem like anyone with a realis
Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode
On 4/15/21 11:32 AM, Jianlin Lv wrote: For debugging JITs, dumping the JITed image to the kernel log is discouraged; "bpftool prog dump jited" is a much better way to examine JITed dumps. This patch gets rid of the code related to the bpf_jit_enable=2 mode, updates the proc handler of bpf_jit_enable, and adds auxiliary information explaining how to use the bpf_jit_disasm tool after this change. Signed-off-by: Jianlin Lv [...]

diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 0a7a2870f111..8d36b4658076 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -2566,9 +2566,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		cond_resched();
 	}

-	if (bpf_jit_enable > 1)
-		bpf_jit_dump(prog->len, proglen, pass + 1, image);
-
 	if (image) {
 		bpf_jit_binary_lock_ro(header);
 		prog->bpf_func = (void *)image;

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index c8496c1142c9..990b1720c7a4 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,16 +273,8 @@ static int proc_dointvec_minmax_bpf_enable(struct ctl_table *table, int write,
 	tmp.data = &jit_enable;
 	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
-	if (write && !ret) {
-		if (jit_enable < 2 ||
-		    (jit_enable == 2 && bpf_dump_raw_ok(current_cred()))) {
-			*(int *)table->data = jit_enable;
-			if (jit_enable == 2)
-				pr_warn("bpf_jit_enable = 2 was set! NEVER use this in production, only for JIT debugging!\n");
-		} else {
-			ret = -EPERM;
-		}
-	}
+	if (write && !ret)
+		*(int *)table->data = jit_enable;
 	return ret;
 }
@@ -389,7 +381,7 @@ static struct ctl_table net_core_table[] = {
 		.extra2 = SYSCTL_ONE,
 # else
 		.extra1 = SYSCTL_ZERO,
-		.extra2 = &two,
+		.extra2 = SYSCTL_ONE,
 # endif
 	},
 # ifdef CONFIG_HAVE_EBPF_JIT

diff --git a/tools/bpf/bpf_jit_disasm.c b/tools/bpf/bpf_jit_disasm.c
index c8ae95804728..efa4b17ae016 100644
--- a/tools/bpf/bpf_jit_disasm.c
+++ b/tools/bpf/bpf_jit_disasm.c
@@ -7,7 +7,7 @@
  *
  * To get the disassembly of the JIT code, do the following:
  *
- * 1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
+ * 1) Insert bpf_jit_dump() and recompile the kernel to output JITed image into log

Hmm, if we remove bpf_jit_dump(), the next drive-by cleanup patch will be thrown at bpf@vger stating that bpf_jit_dump() has no in-tree users and should be removed. Maybe we should be removing bpf_jit_disasm.c along with it, as well as bpf_jit_dump() itself ... I guess if it's ever needed on those rare occasions for JIT debugging, we can resurrect it from old kernels just locally. But yeah, bpftool's jit dump should suffice for the vast majority of use cases. There was a recent set for the ppc32 JIT which was merged into the ppc tree and will create a merge conflict with this one [0]. So we would need a rebase, and maybe take it during the merge window once the ppc32 bits have landed.

[0] https://lore.kernel.org/bpf/cover.1616430991.git.christophe.le...@csgroup.eu/

 * 2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
 * 3) Run e.g. 
`bpf_jit_disasm -o` to read out the last JIT code *

diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c
index 40a88df275f9..98c7eec2923f 100644
--- a/tools/bpf/bpftool/feature.c
+++ b/tools/bpf/bpftool/feature.c
@@ -203,9 +203,6 @@ static void probe_jit_enable(void)
 	case 1:
 		printf("JIT compiler is enabled\n");
 		break;
-	case 2:
-		printf("JIT compiler is enabled with debugging traces in kernel logs\n");
-		break;

This would still need to be there for older kernels ...

 	case -1:
 		printf("Unable to retrieve JIT-compiler status\n");
 		break;
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On 4/15/21 1:19 AM, Andrii Nakryiko wrote: On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen wrote: Andrii Nakryiko writes: On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov wrote: On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote: On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote: On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi wrote: [...] All of these things are messy because of tc legacy. bpf tried to follow tc style with cls and act distinction and it didn't quite work. cls with direct-action is the only thing that became mainstream while tc style attach wasn't really addressed. There were several incidents where tc had tens of thousands of progs attached because of this attach/query/index weirdness described above. I think the only way to address this properly is to introduce a bpf_link style of attaching to tc. Such a bpf_link would support ingress/egress only. direct-action will be implied. There won't be any index, and query will be obvious. Note that we already have bpf_link support working (without support for pinning, of course) in a limited way. The ifindex, protocol, parent_id, priority, handle, chain_index tuple uniquely identifies a filter, so we stash this in the bpf_link and are able to operate on the exact filter during release. Except they're not unique. The library can stash them, but something else doing detach via iproute2 or its own netlink calls will detach the prog. This other app can attach a different prog to the same spot, and now bpf_link__destroy will be detaching somebody else's prog. So I would like to propose to take this patch set a step further from what Daniel said: int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}); and make this proposed API return an FD. To detach from tc ingress/egress just close(fd). 
You mean adding an fd-based TC API to the kernel? yes. I'm totally for bpf_link-based TC attachment. But I think *also* having "legacy" netlink-based APIs will allow applications to handle older kernels in a much nicer way without extra dependency on iproute2. We have a similar situation with kprobe, where currently libbpf only supports "modern" fd-based attachment, but users periodically ask questions and struggle to figure out issues on older kernels that don't support new APIs. +1; I am OK with adding a new bpf_link-based way to attach TC programs, but we still need to support the netlink API in libbpf. So I think we'd have to support legacy TC APIs, but I agree with Alexei and Daniel that we should keep it to the simplest and most straightforward API of supporting direct-action attachments and setting up qdisc transparently (if I'm getting all the terminology right, after reading Quentin's blog post). That coincidentally should probably match how the bpf_link-based TC API will look, so all that can be abstracted behind a single bpf_link__attach_tc() API as well, right? That's the plan for dealing with kprobe right now, btw. Libbpf will detect the best available API and transparently fall back (maybe with some warning for awareness, due to inherent downsides of legacy APIs: no auto-cleanup being the most prominent one). Yup, SGTM: Expose both in the low-level API (in bpf.c), and make the high-level API auto-detect. That way users can also still use the netlink attach function if they don't want the fd-based auto-close behaviour of bpf_link. So I thought a bit more about this, and it feels like the right move would be to expose only higher-level TC BPF API behind bpf_link. It will keep the API complexity and amount of APIs that libbpf will have to support to the minimum, and will keep the API itself simple: direct-attach with the minimum amount of input arguments. 
By not exposing low-level APIs we also table the whole bpf_tc_cls_attach_id design discussion, as we now can keep as much info as needed inside bpf_link_tc (which will embed bpf_link internally as well) to support detachment and possibly some additional querying, if needed. But then there would be no way for the caller to explicitly select a mechanism? I.e., if I write a BPF program using this mechanism targeting a 5.12 kernel, I'll get netlink attachment, which can stick around when I do bpf_link__disconnect(). But then if the kernel gets upgraded to support bpf_link for TC programs I'll suddenly transparently get bpf_link and the attachments will go away unless I pin them. This seems... less than ideal? That's what we are doing with bpf_program__attach_kprobe(), though. And so far I've only seen people (privately) saying how good it would be to have bpf_link-based TC APIs, doesn't seem like anyone with a realistic use case prefers the current APIs. So I suspect it's not going to be a problem in practice. But at least I'd start there and
Re: [PATCH 5.10 39/41] bpf, x86: Validate computation of branch displacements for x86-32
On 4/9/21 9:51 PM, Sudip Mukherjee wrote: On Fri, Apr 09, 2021 at 11:54:01AM +0200, Greg Kroah-Hartman wrote: From: Piotr Krysiuk commit 26f55a59dc65ff77cd1c4b37991e26497fc68049 upstream. I cannot find this in Linus's tree, and I also don't see the change in the master branch. Am I missing something? Both are in the -net tree at this point, so the commit SHA won't change anymore. David or Jakub will likely send their -net PR to Linus today or tomorrow for landing in mainline. For stable, things had to move a bit quicker given the announcement in [0] yesterday. Timing was a bit unfortunate here as I would have preferred for things to land in stable the regular way first (aka merge to mainline, cherry-picking to stable, minor stable release, then announcement). Thanks, Daniel [0] https://www.openwall.com/lists/oss-security/2021/04/08/1
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On 3/31/21 11:44 AM, Kumar Kartikeya Dwivedi wrote: On Wed, Mar 31, 2021 at 02:55:47AM IST, Daniel Borkmann wrote: Do we even need the _block variant? I would rather prefer to take the chance and make it as simple as possible, and only iff really needed extend with other APIs, for example: The block variant can be dropped, I'll use the TC_BLOCK/TC_DEV alternative which sets parent_id/ifindex properly. bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}); Internally, this will create the sch_clsact qdisc & cls_bpf filter instance iff not present yet, and attach to a default prio 1 handle 1, and _always_ in direct-action mode. This is /as simple as it gets/ and we don't need to bother users with more complex tc/cls_bpf internals unless desired. For example, extended APIs could add prio/parent so that multi-prog can be attached to a single cls_bpf instance, but even that could be a second step, imho. I am not opposed to clsact qdisc setup if INGRESS/EGRESS is supplied (not sure how others feel about it). What speaks against it? It would be 100% clear from API side where the prog is being attached. Same as with tc cmdline where you specify 'ingress'/'egress'. We could make direct_action mode default, and similarly choose prio To be honest, I wouldn't even support a mode from the lib/API side where direct_action is not set. It should always be forced to true. Everything else is rather broken setup-wise, imho, since it won't scale. We added direct_action a bit later to the kernel than original cls_bpf, but if I would do it again today, I'd make it the only available option. I don't see a reasonable use-case where you have it to false. as 1 by default instead of letting the kernel do it. Then you can just pass in NULL for bpf_tc_cls_opts and be close to what you're proposing. For protocol we can choose ETH_P_ALL by default too if the user doesn't set it. Same here with ETH_P_ALL, I'm not sure anyone uses anything other than ETH_P_ALL, so yes, that should be default. 
With these modifications, the equivalent would look like bpf_tc_cls_attach(prog_fd, TC_DEV(ifindex, INGRESS), NULL, &id); A few things compared to bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}): 1) nit, but why even 'cls' in the name. I think we shouldn't expose such old-days tc semantics to a user. Just bpf_tc_attach() is cleaner/simpler to understand. 2) What's the 'TC_DEV(ifindex, INGRESS)' macro doing exactly? Looks unnecessary, why not regular args to the API? 3) Exposed bpf_tc_attach() API could internally call a bpf_tc_attach_opts() API with preset defaults, and the latter could have all the custom bits if the user needs to go beyond the simple API, so from your bpf_tc_cls_attach() I'd also drop the NULL. 4) For the simple API I'd likely also drop the id (you could have a query API if needed). So as long as the user doesn't care about other details, they can just pass opts as NULL.
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On 3/30/21 10:39 PM, Andrii Nakryiko wrote: On Sun, Mar 28, 2021 at 1:11 AM Kumar Kartikeya Dwivedi wrote: On Sun, Mar 28, 2021 at 10:12:40AM IST, Andrii Nakryiko wrote: Is there some succinct but complete enough documentation/tutorial/etc that I can reasonably read to understand kernel APIs provided by TC (w.r.t. BPF, of course). I'm trying to wrap my head around this and whether API makes sense or not. Please share links, if you have some. Hi Andrii, Unfortunately for the kernel API part, I couldn't find any when I was working on this. So I had to read the iproute2 tc code (tc_filter.c, f_bpf.c, m_action.c, m_bpf.c) and the kernel side bits (cls_api.c, cls_bpf.c, act_api.c, act_bpf.c) to grok anything I didn't understand. There's also similar code in libnl (lib/route/{act,cls}.c). Other than that, these resources were useful (perhaps you already went through some/all of them): https://docs.cilium.io/en/latest/bpf/#tc-traffic-control https://qmonnet.github.io/whirl-offload/2020/04/11/tc-bpf-direct-action/ tc(8), and tc-bpf(8) man pages I hope this is helpful! Thanks! I'll take a look. Sorry, I'm a bit behind with all the stuff, trying to catch up. I was just wondering if it would be more natural instead of having _dev _block variants and having to specify __u32 ifindex, __u32 parent_id, __u32 protocol, to have some struct specifying TC "destination"? Maybe not, but I thought I'd bring this up early. So you'd have just bpf_tc_cls_attach(), and you'd so something like bpf_tc_cls_attach(prog_fd, TC_DEV(ifindex, parent_id, protocol)) or bpf_tc_cls_attach(prog_fd, TC_BLOCK(block_idx, protocol)) ? Or it's taking it too far? But even if not, I think detaching can be unified between _dev and _block, can't it? Do we even need the _block variant? 
I would rather prefer to take the chance and make it as simple as possible, and only iff really needed extend with other APIs, for example: bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}); Internally, this will create the sch_clsact qdisc & cls_bpf filter instance iff not present yet, and attach to a default prio 1 handle 1, and _always_ in direct-action mode. This is /as simple as it gets/ and we don't need to bother users with more complex tc/cls_bpf internals unless desired. For example, extended APIs could add prio/parent so that multi-prog can be attached to a single cls_bpf instance, but even that could be a second step, imho. Thanks, Daniel
Re: [PATCH v3] bpf: Fix memory leak in copy_process()
On 3/17/21 4:09 AM, qiang.zh...@windriver.com wrote: From: Zqiang The syzbot report a memleak follow: BUG: memory leak unreferenced object 0x888101b41d00 (size 120): comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s) backtrace: [] alloc_pid+0x66/0x560 [] copy_process+0x1465/0x25e0 [] kernel_clone+0xf3/0x670 [] kernel_thread+0x61/0x80 [] call_usermodehelper_exec_work [] call_usermodehelper_exec_work+0xc4/0x120 [] process_one_work+0x2c9/0x600 [] worker_thread+0x59/0x5d0 [] kthread+0x178/0x1b0 [] ret_from_fork+0x1f/0x30 unreferenced object 0x888110ef5c00 (size 232): comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s) backtrace: [] kmem_cache_zalloc [] __alloc_file+0x1f/0xf0 [] alloc_empty_file+0x69/0x120 [] alloc_file+0x33/0x1b0 [] alloc_file_pseudo+0xb2/0x140 [] create_pipe_files+0x138/0x2e0 [] umd_setup+0x33/0x220 [] call_usermodehelper_exec_async+0xb4/0x1b0 [] ret_from_fork+0x1f/0x30 after the UMD process exits, the pipe_to_umh/pipe_from_umh and tgid need to be release. Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.") Reported-by: syzbot+44908bb56d2bfe56b...@syzkaller.appspotmail.com Signed-off-by: Zqiang Applied to bpf, thanks (also did minor style fixups to fix kernel style)!
Re: linux-next: manual merge of the net-next tree with the net tree
On 3/19/21 4:33 PM, Alexei Starovoitov wrote: On Fri, Mar 19, 2021 at 8:17 AM Yonghong Song wrote: On 3/19/21 12:21 AM, Daniel Borkmann wrote: On 3/19/21 3:11 AM, Piotr Krysiuk wrote: Hi Daniel, On Fri, Mar 19, 2021 at 12:16 AM Stephen Rothwell wrote: diff --cc kernel/bpf/verifier.c index 44e4ec1640f1,f9096b049cd6.. --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@@ -5876,10 -6056,22 +6060,23 @@@ static int retrieve_ptr_limit(const str if (mask_to_left) *ptr_limit = MAX_BPF_STACK + off; else - *ptr_limit = -off; - return 0; + *ptr_limit = -off - 1; + return *ptr_limit >= max ? -ERANGE : 0; + case PTR_TO_MAP_KEY: + /* Currently, this code is not exercised as the only use +* is bpf_for_each_map_elem() helper which requires +* bpf_capble. The code has been tested manually for +* future use. +*/ + if (mask_to_left) { + *ptr_limit = ptr_reg->umax_value + ptr_reg->off; + } else { + off = ptr_reg->smin_value + ptr_reg->off; + *ptr_limit = ptr_reg->map_ptr->key_size - off; + } + return 0; PTR_TO_MAP_VALUE logic above looks like copy-paste of old PTR_TO_MAP_VALUE code from before "bpf: Fix off-by-one for area size in creating mask to left" and is apparently affected by the same off-by-one, except this time on "key_size" area and not "value_size". This needs to be fixed in the same way as we did with PTR_TO_MAP_VALUE. What is the best way to proceed? Hm, not sure why PTR_TO_MAP_KEY was added by 69c087ba6225 in the first place, I presume noone expects this to be used from unprivileged as the comment says. Resolution should be to remove the PTR_TO_MAP_KEY case entirely from that switch until we have an actual user. Alexei suggested so that we don't forget it in the future if bpf_capable() requirement is removed. https://lore.kernel.org/bpf/c837ae55-2487-2f39-47f6-a18781dc6...@fb.com/ I am okay with either way, fix it or remove it. I prefer to fix it. 
If the bpf_capable() check is removed, the verifier would bail out on PTR_TO_MAP_KEY if it is not covered in the switch, given the recent fixes we did. I can fix it up after the merge if we think bpf_for_each_map_elem() will be used by unpriv in the future.
Re: linux-next: manual merge of the net-next tree with the net tree
On 3/19/21 3:11 AM, Piotr Krysiuk wrote: Hi Daniel, On Fri, Mar 19, 2021 at 12:16 AM Stephen Rothwell wrote: diff --cc kernel/bpf/verifier.c index 44e4ec1640f1,f9096b049cd6.. --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@@ -5876,10 -6056,22 +6060,23 @@@ static int retrieve_ptr_limit(const str if (mask_to_left) *ptr_limit = MAX_BPF_STACK + off; else - *ptr_limit = -off; - return 0; + *ptr_limit = -off - 1; + return *ptr_limit >= max ? -ERANGE : 0; + case PTR_TO_MAP_KEY: + /* Currently, this code is not exercised as the only use +* is bpf_for_each_map_elem() helper which requires +* bpf_capble. The code has been tested manually for +* future use. +*/ + if (mask_to_left) { + *ptr_limit = ptr_reg->umax_value + ptr_reg->off; + } else { + off = ptr_reg->smin_value + ptr_reg->off; + *ptr_limit = ptr_reg->map_ptr->key_size - off; + } + return 0; PTR_TO_MAP_VALUE logic above looks like copy-paste of old PTR_TO_MAP_VALUE code from before "bpf: Fix off-by-one for area size in creating mask to left" and is apparently affected by the same off-by-one, except this time on "key_size" area and not "value_size". This needs to be fixed in the same way as we did with PTR_TO_MAP_VALUE. What is the best way to proceed? Hm, not sure why PTR_TO_MAP_KEY was added by 69c087ba6225 in the first place, I presume noone expects this to be used from unprivileged as the comment says. Resolution should be to remove the PTR_TO_MAP_KEY case entirely from that switch until we have an actual user. Thanks, Daniel
Re: [PATCH] selftests/bpf: fix warning comparing pointer to 0
On 3/18/21 2:55 AM, Jiapeng Chong wrote: Fix the following coccicheck warning: ./tools/testing/selftests/bpf/progs/fentry_test.c:76:15-16: WARNING comparing pointer to 0. Reported-by: Abaci Robot Signed-off-by: Jiapeng Chong --- tools/testing/selftests/bpf/progs/fentry_test.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/fentry_test.c b/tools/testing/selftests/bpf/progs/fentry_test.c index 5f645fd..d4247d6 100644 --- a/tools/testing/selftests/bpf/progs/fentry_test.c +++ b/tools/testing/selftests/bpf/progs/fentry_test.c @@ -64,7 +64,7 @@ struct bpf_fentry_test_t { SEC("fentry/bpf_fentry_test7") int BPF_PROG(test7, struct bpf_fentry_test_t *arg) { - if (arg == 0) + if (!arg) test7_result = 1; return 0; } @@ -73,7 +73,7 @@ int BPF_PROG(test7, struct bpf_fentry_test_t *arg) SEC("fentry/bpf_fentry_test8") int BPF_PROG(test8, struct bpf_fentry_test_t *arg) { - if (arg->a == 0) + if (!arg->a) test8_result = 1; return 0; } This doesn't apply. Please rebase against bpf-next tree, and also make sure to squash any other such patches into a single one.
Re: [PATCH bpf-next v2] bpf: Simplify expression for identify bpf mem type
On 3/18/21 7:36 AM, Jianlin Lv wrote: Added BPF_LD_ST_SIZE_MASK macro as mask of size modifier that help to reduce the evaluation of expressions in if statements, and remove BPF_SIZE_MASK in netronome driver. Signed-off-by: Jianlin Lv --- v2: Move the bpf_LD_ST_SIZE_MASK macro definition to include/linux/bpf.h --- drivers/net/ethernet/netronome/nfp/bpf/main.h | 8 +++- include/linux/bpf.h | 1 + kernel/bpf/verifier.c | 12 3 files changed, 8 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h index d0e17eebddd9..e90981e69763 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h @@ -346,8 +346,6 @@ struct nfp_insn_meta { struct list_head l; }; -#define BPF_SIZE_MASK 0x18 - static inline u8 mbpf_class(const struct nfp_insn_meta *meta) { return BPF_CLASS(meta->insn.code); @@ -375,7 +373,7 @@ static inline bool is_mbpf_alu(const struct nfp_insn_meta *meta) static inline bool is_mbpf_load(const struct nfp_insn_meta *meta) { - return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_LDX | BPF_MEM); + return (meta->insn.code & ~BPF_LD_ST_SIZE_MASK) == (BPF_LDX | BPF_MEM); } static inline bool is_mbpf_jmp32(const struct nfp_insn_meta *meta) @@ -395,7 +393,7 @@ static inline bool is_mbpf_jmp(const struct nfp_insn_meta *meta) static inline bool is_mbpf_store(const struct nfp_insn_meta *meta) { - return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_MEM); + return (meta->insn.code & ~BPF_LD_ST_SIZE_MASK) == (BPF_STX | BPF_MEM); } static inline bool is_mbpf_load_pkt(const struct nfp_insn_meta *meta) @@ -430,7 +428,7 @@ static inline bool is_mbpf_classic_store_pkt(const struct nfp_insn_meta *meta) static inline bool is_mbpf_atomic(const struct nfp_insn_meta *meta) { - return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_ATOMIC); + return (meta->insn.code & ~BPF_LD_ST_SIZE_MASK) == (BPF_STX | BPF_ATOMIC); } static inline bool 
is_mbpf_mul(const struct nfp_insn_meta *meta) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index a25730eaa148..e85924719c65 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -995,6 +995,7 @@ struct bpf_array { BPF_F_RDONLY_PROG |\ BPF_F_WRONLY | \ BPF_F_WRONLY_PROG) +#define BPF_LD_ST_SIZE_MASK 0x18 /* mask of size modifier */ #define BPF_MAP_CAN_READ BIT(0) #define BPF_MAP_CAN_WRITE BIT(1) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f9096b049cd6..29fdfdb8abfa 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -11384,15 +11384,11 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) for (i = 0; i < insn_cnt; i++, insn++) { bpf_convert_ctx_access_t convert_ctx_access; - if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) || - insn->code == (BPF_LDX | BPF_MEM | BPF_H) || - insn->code == (BPF_LDX | BPF_MEM | BPF_W) || - insn->code == (BPF_LDX | BPF_MEM | BPF_DW)) + /* opcode: BPF_MEM | <size> | BPF_LDX */ + if ((insn->code & ~BPF_LD_ST_SIZE_MASK) == (BPF_LDX | BPF_MEM)) type = BPF_READ; - else if (insn->code == (BPF_STX | BPF_MEM | BPF_B) || -insn->code == (BPF_STX | BPF_MEM | BPF_H) || -insn->code == (BPF_STX | BPF_MEM | BPF_W) || -insn->code == (BPF_STX | BPF_MEM | BPF_DW)) + /* opcode: BPF_MEM | <size> | BPF_STX */ + else if ((insn->code & ~BPF_LD_ST_SIZE_MASK) == (BPF_STX | BPF_MEM)) type = BPF_WRITE; else continue; To me this cleanup makes the code harder to read, in particular on the verifier side, I don't think it's worth it, especially given it's not in (highly) performance critical code. Thanks, Daniel
Re: [PATCH] MIPS/bpf: Enable bpf_probe_read{, str}() on MIPS again
On 3/17/21 8:15 AM, Tiezhu Yang wrote: After commit 0ebeea8ca8a4 ("bpf: Restrict bpf_probe_read{, str}() only to archs where they work"), the bpf_probe_read{, str}() functions were no longer available on MIPS, so some errors occur when running a bpf program: root@linux:/home/loongson/bcc# python examples/tracing/task_switch.py bpf: Failed to load program: Invalid argument [...] 11: (85) call bpf_probe_read#4 unknown func bpf_probe_read#4 [...] Exception: Failed to load BPF program count_sched: Invalid argument So select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE in arch/mips/Kconfig, otherwise the old bpf helper bpf_probe_read() will not be available. This is similar to commit d195b1d1d1196 ("powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again"). Fixes: 0ebeea8ca8a4 ("bpf: Restrict bpf_probe_read{, str}() only to archs where they work") Signed-off-by: Tiezhu Yang Thomas, I presume you pick this up via mips tree (with typos fixed)? Or do you want us to route the fix via bpf with your ACK? (I'm fine either way.) Thanks, Daniel
Re: [PATCH] libbpf: avoid inline hint definition from 'linux/stddef.h'
On 3/16/21 10:34 PM, Andrii Nakryiko wrote: On Tue, Mar 16, 2021 at 2:01 PM Daniel Borkmann wrote: On 3/14/21 6:38 PM, Pedro Tammela wrote: Linux headers might pull 'linux/stddef.h' which defines '__always_inline' as the following: #ifndef __always_inline #define __always_inline __inline__ #endif This becomes an issue if the program picks up the 'linux/stddef.h' definition as the macro now just hints inline to clang. How did the program end up including linux/stddef.h ? Would be good to also have some more details on how we got here for the commit desc. It's an UAPI header, so why not? Is there anything special about stddef.h that makes it unsuitable to be included? Hm, fair enough, looks like linux/types.h already pulls it in, so no. We defined our own stddef.h longer time ago, so looks like we never ran into this issue. This change now enforces the proper definition for BPF programs regardless of the include order. Signed-off-by: Pedro Tammela --- tools/lib/bpf/bpf_helpers.h | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h index ae6c975e0b87..5fa483c0b508 100644 --- a/tools/lib/bpf/bpf_helpers.h +++ b/tools/lib/bpf/bpf_helpers.h @@ -29,9 +29,12 @@ */ #define SEC(NAME) __attribute__((section(NAME), used)) -#ifndef __always_inline +/* + * Avoid 'linux/stddef.h' definition of '__always_inline'. + */ I think the comment should have more details on 'why' we undef it as in few months looking at it again, the next question to dig into would be what was wrong with linux/stddef.h. Providing a better rationale would be nice for readers here. So for whatever reason commit bot didn't send notification, but I've already landed this yesterday. To me, with #undef + #define it's pretty clear that we "force-define" __always_inline exactly how we want it, but we can certainly add clarifying comment in the follow up, if you think it's needed. 
Up to you, but given you applied it it's probably not worth the trouble; missed it earlier given I didn't see the patchbot message in the thread initially. :/ +#undef __always_inline #define __always_inline inline __attribute__((always_inline)) -#endif + #ifndef __noinline #define __noinline __attribute__((noinline)) #endif
Re: [PATCH] libbpf: avoid inline hint definition from 'linux/stddef.h'
On 3/14/21 6:38 PM, Pedro Tammela wrote: Linux headers might pull 'linux/stddef.h' which defines '__always_inline' as the following: #ifndef __always_inline #define __always_inline __inline__ #endif This becomes an issue if the program picks up the 'linux/stddef.h' definition as the macro now just hints inline to clang. How did the program end up including linux/stddef.h ? Would be good to also have some more details on how we got here for the commit desc. This change now enforces the proper definition for BPF programs regardless of the include order. Signed-off-by: Pedro Tammela --- tools/lib/bpf/bpf_helpers.h | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h index ae6c975e0b87..5fa483c0b508 100644 --- a/tools/lib/bpf/bpf_helpers.h +++ b/tools/lib/bpf/bpf_helpers.h @@ -29,9 +29,12 @@ */ #define SEC(NAME) __attribute__((section(NAME), used)) -#ifndef __always_inline +/* + * Avoid 'linux/stddef.h' definition of '__always_inline'. + */ I think the comment should have more details on 'why' we undef it as in few months looking at it again, the next question to dig into would be what was wrong with linux/stddef.h. Providing a better rationale would be nice for readers here. +#undef __always_inline #define __always_inline inline __attribute__((always_inline)) -#endif + #ifndef __noinline #define __noinline __attribute__((noinline)) #endif
Re: [PATCH v2] bpf: Fix memory leak in copy_process()
On 3/15/21 9:58 AM, qiang.zh...@windriver.com wrote: From: Zqiang nit: I presume it should be s/Zqiang/Qiang Zhang/ as real name for 'From' instead of abbreviation? The syzbot report a memleak follow: BUG: memory leak unreferenced object 0x888101b41d00 (size 120): comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s) backtrace: [] alloc_pid+0x66/0x560 [] copy_process+0x1465/0x25e0 [] kernel_clone+0xf3/0x670 [] kernel_thread+0x61/0x80 [] call_usermodehelper_exec_work [] call_usermodehelper_exec_work+0xc4/0x120 [] process_one_work+0x2c9/0x600 [] worker_thread+0x59/0x5d0 [] kthread+0x178/0x1b0 [] ret_from_fork+0x1f/0x30 unreferenced object 0x888110ef5c00 (size 232): comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s) backtrace: [] kmem_cache_zalloc [] __alloc_file+0x1f/0xf0 [] alloc_empty_file+0x69/0x120 [] alloc_file+0x33/0x1b0 [] alloc_file_pseudo+0xb2/0x140 [] create_pipe_files+0x138/0x2e0 [] umd_setup+0x33/0x220 [] call_usermodehelper_exec_async+0xb4/0x1b0 [] ret_from_fork+0x1f/0x30 after the UMD process exits, the pipe_to_umh/pipe_from_umh and tgid need to be release. Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.") Reported-by: syzbot+44908bb56d2bfe56b...@syzkaller.appspotmail.com Signed-off-by: Zqiang nit: Ditto --- v1->v2: Judge whether the pointer variable tgid is valid. 
kernel/bpf/preload/bpf_preload_kern.c | 24 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/kernel/bpf/preload/bpf_preload_kern.c b/kernel/bpf/preload/bpf_preload_kern.c index 79c5772465f1..5009875f01d3 100644 --- a/kernel/bpf/preload/bpf_preload_kern.c +++ b/kernel/bpf/preload/bpf_preload_kern.c @@ -4,6 +4,7 @@ #include #include #include +#include #include #include "bpf_preload.h" @@ -20,6 +21,14 @@ static struct bpf_preload_ops umd_ops = { .owner = THIS_MODULE, }; +static void bpf_preload_umh_cleanup(struct umd_info *info) +{ + fput(info->pipe_to_umh); + fput(info->pipe_from_umh); + put_pid(info->tgid); + info->tgid = NULL; +} The above is pretty much a reimplementation of ... static void umd_cleanup(struct subprocess_info *info) { struct umd_info *umd_info = info->data; /* cleanup if umh_setup() was successful but exec failed */ if (info->retval) { fput(umd_info->pipe_to_umh); fput(umd_info->pipe_from_umh); put_pid(umd_info->tgid); umd_info->tgid = NULL; } } ... so if there are ever changes to umd_cleanup() for additional resource cleanup, we'd be missing those easily in bpf_preload_umh_cleanup(). I'd suggest to refactor a common helper inside kernel/usermode_driver.c that is then exported as symbol which the driver here can use. 
static int preload(struct bpf_preload_info *obj) { int magic = BPF_PRELOAD_START; @@ -61,8 +70,10 @@ static int finish(void) if (n != sizeof(magic)) return -EPIPE; tgid = umd_ops.info.tgid; - wait_event(tgid->wait_pidfd, thread_group_exited(tgid)); - umd_ops.info.tgid = NULL; + if (tgid) { + wait_event(tgid->wait_pidfd, thread_group_exited(tgid)); + bpf_preload_umh_cleanup(&umd_ops.info); + } return 0; } @@ -80,10 +91,15 @@ static int __init load_umd(void) static void __exit fini_umd(void) { + struct pid *tgid; bpf_preload_ops = NULL; /* kill UMD in case it's still there due to earlier error */ - kill_pid(umd_ops.info.tgid, SIGKILL, 1); - umd_ops.info.tgid = NULL; + tgid = umd_ops.info.tgid; + if (tgid) { + kill_pid(tgid, SIGKILL, 1); + wait_event(tgid->wait_pidfd, thread_group_exited(tgid)); + bpf_preload_umh_cleanup(&umd_ops.info); + } umd_unload_blob(&umd_ops.info); } late_initcall(load_umd);
Re: [PATCH v2] bpf: Fix memory leak in copy_process()
On 3/15/21 9:18 AM, qiang.zh...@windriver.com wrote: From: Zqiang Hello Zqiang, please resend this patch with b...@vger.kernel.org in Cc, so it actually reaches the rest of BPF community for review, thanks! The syzbot report a memleak follow: BUG: memory leak unreferenced object 0x888101b41d00 (size 120): comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s) backtrace: [] alloc_pid+0x66/0x560 [] copy_process+0x1465/0x25e0 [] kernel_clone+0xf3/0x670 [] kernel_thread+0x61/0x80 [] call_usermodehelper_exec_work [] call_usermodehelper_exec_work+0xc4/0x120 [] process_one_work+0x2c9/0x600 [] worker_thread+0x59/0x5d0 [] kthread+0x178/0x1b0 [] ret_from_fork+0x1f/0x30 unreferenced object 0x888110ef5c00 (size 232): comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s) backtrace: [] kmem_cache_zalloc [] __alloc_file+0x1f/0xf0 [] alloc_empty_file+0x69/0x120 [] alloc_file+0x33/0x1b0 [] alloc_file_pseudo+0xb2/0x140 [] create_pipe_files+0x138/0x2e0 [] umd_setup+0x33/0x220 [] call_usermodehelper_exec_async+0xb4/0x1b0 [] ret_from_fork+0x1f/0x30 after the UMD process exits, the pipe_to_umh/pipe_from_umh and tgid need to be release. Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.") Reported-by: syzbot+44908bb56d2bfe56b...@syzkaller.appspotmail.com Signed-off-by: Zqiang --- v1->v2: Judge whether the pointer variable tgid is valid.
Re: [PATCH] tools include: Add __sum16 and __wsum definitions.
On 3/7/21 11:30 PM, Ian Rogers wrote: This adds definitions available in the uapi version. Explanation: In the kernel include of types.h the uapi version is included. In tools the uapi/linux/types.h and linux/types.h are distinct. For BPF programs a definition of __wsum is needed by the generated bpf_helpers.h. The definition comes either from a generated vmlinux.h or from linux/types.h that may be transitively included from bpf.h. The perf build prefers linux/types.h over uapi/linux/types.h for *. To allow tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c to compile with the same include path used for perf then these definitions are necessary. There is likely a wider conversation about exactly how types.h should be specified and the include order used by the perf build - it is somewhat confusing that tools/include/uapi/linux/bpf.h is using the non-uapi types.h. *see tools/perf/Makefile.config: ... INC_FLAGS += -I$(srctree)/tools/include/ INC_FLAGS += -I$(srctree)/tools/arch/$(SRCARCH)/include/uapi INC_FLAGS += -I$(srctree)/tools/include/uapi ... The include directories are scanned from left-to-right: https://gcc.gnu.org/onlinedocs/gcc/Directory-Options.html As tools/include/linux/types.h appears before tools/include/uapi/linux/types.h then I say it is preferred. Signed-off-by: Ian Rogers Given this is more related to perf build infra, I presume Arnaldo would pick this one up? Thanks, Daniel
Re: [PATCH bpf-next] selftests_bpf: extend test_tc_tunnel test with vxlan
On 3/5/21 5:15 PM, Willem de Bruijn wrote: On Fri, Mar 5, 2021 at 11:10 AM Daniel Borkmann wrote: On 3/5/21 4:08 PM, Willem de Bruijn wrote: On Fri, Mar 5, 2021 at 7:34 AM Xuesen Huang wrote: From: Xuesen Huang Add BPF_F_ADJ_ROOM_ENCAP_L2_ETH flag to the existing tests which encapsulates the ethernet as the inner l2 header. Update a vxlan encapsulation test case. Signed-off-by: Xuesen Huang Signed-off-by: Li Wang Signed-off-by: Willem de Bruijn Please don't add my signed off by without asking. Agree, I can remove it if you prefer while applying and only keep the ack instead. That would be great. Thanks, Daniel! Done & applied, thanks everyone!
Re: [PATCH bpf-next] selftests_bpf: extend test_tc_tunnel test with vxlan
On 3/5/21 4:08 PM, Willem de Bruijn wrote: On Fri, Mar 5, 2021 at 7:34 AM Xuesen Huang wrote: From: Xuesen Huang Add BPF_F_ADJ_ROOM_ENCAP_L2_ETH flag to the existing tests which encapsulates the ethernet as the inner l2 header. Update a vxlan encapsulation test case. Signed-off-by: Xuesen Huang Signed-off-by: Li Wang Signed-off-by: Willem de Bruijn Please don't add my signed off by without asking. Agree, I can remove it if you prefer while applying and only keep the ack instead. That said, Acked-by: Willem de Bruijn
Re: [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit)
On 2/18/21 9:49 PM, Alexander Lobakin wrote: This series introduces XSK generic zerocopy xmit by adding XSK umem pages as skb frags instead of copying data to linear space. The only requirement for this for drivers is to be able to xmit skbs with skb_headlen(skb) == 0, i.e. all data including hard headers starts from frag 0. To indicate whether a particular driver supports this, a new netdev priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net as it's already capable of doing it). So consider implementing this in your drivers to greatly speed-up generic XSK xmit. [...] Applied, thanks!
Re: [PATCH] bpf: fix a warning message in mark_ptr_not_null_reg()
On 2/16/21 10:10 PM, KP Singh wrote: On Tue, Feb 16, 2021 at 8:37 PM Dan Carpenter wrote: The WARN_ON() argument is a condition, and it generates a stack trace but it doesn't print the warning. Fixes: 4ddb74165ae5 ("bpf: Extract nullable reg type conversion into a helper function") Signed-off-by: Dan Carpenter --- kernel/bpf/verifier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 056df6be3e30..bd4d1dfca73c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1120,7 +1120,7 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg) reg->type = PTR_TO_RDWR_BUF; break; default: - WARN_ON("unknown nullable register type"); + WARN(1, "unknown nullable register type"); Should we use WARN_ONCE here? Also, I think the fix should be targeted for bpf-next as the patch that introduced this hasn't made it to bpf yet. [...] Usually we have something like `verbose(env, "kernel subsystem misconfigured verifier\n")`, but in this case we'd need to drag env all the way to here. :/ I agree with WARN_ONCE().
Re: [PATCH bpf] devmap: Use GFP_KERNEL for xdp bulk queue allocation
On 2/9/21 9:24 AM, NOMURA JUNICHI(野村 淳一) wrote: The devmap bulk queue is allocated with GFP_ATOMIC and the allocation may fail if there is no available space in existing percpu pool. Since commit 75ccae62cb8d42 ("xdp: Move devmap bulk queue into struct net_device") moved the bulk queue allocation to NETDEV_REGISTER callback, whose context is allowed to sleep, use GFP_KERNEL instead of GFP_ATOMIC to let percpu allocator extend the pool when needed and avoid possible failure of netdev registration. As the required alignment is natural, we can simply use alloc_percpu(). Signed-off-by: Jun'ichi Nomura Applied, thanks!
Re: [PATCH/v2] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH
On 2/10/21 3:50 PM, Willem de Bruijn wrote: On Wed, Feb 10, 2021 at 1:59 AM huangxuesen wrote: From: huangxuesen bpf_skb_adjust_room sets the inner_protocol as skb->protocol for packets encapsulation. But that is not appropriate when pushing Ethernet header. Add an option to further specify encap L2 type and set the inner_protocol as ETH_P_TEB. Suggested-by: Willem de Bruijn Signed-off-by: huangxuesen Signed-off-by: chengzhiyong Signed-off-by: wangli Thanks, this is exactly what I meant. Acked-by: Willem de Bruijn One small point regarding Signed-off-by: It is customary to capitalize family and given names. +1, huangxuesen, would be great if you could resubmit with capitalized names in your SoB as well as From (both seem affected). Thanks, Daniel
Re: [PATCH bpf-next v6 2/5] bpf: Expose bpf_get_socket_cookie to tracing programs
On 1/30/21 12:45 PM, Florent Revest wrote: On Fri, Jan 29, 2021 at 1:49 PM Daniel Borkmann wrote: On 1/29/21 11:57 AM, Daniel Borkmann wrote: On 1/27/21 10:01 PM, Andrii Nakryiko wrote: On Tue, Jan 26, 2021 at 10:36 AM Florent Revest wrote: This needs a new helper that: - can work in a sleepable context (using sock_gen_cookie) - takes a struct sock pointer and checks that it's not NULL Signed-off-by: Florent Revest Acked-by: KP Singh --- include/linux/bpf.h| 1 + include/uapi/linux/bpf.h | 8 kernel/trace/bpf_trace.c | 2 ++ net/core/filter.c | 12 tools/include/uapi/linux/bpf.h | 8 5 files changed, 31 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 1aac2af12fed..26219465e1f7 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1874,6 +1874,7 @@ extern const struct bpf_func_proto bpf_per_cpu_ptr_proto; extern const struct bpf_func_proto bpf_this_cpu_ptr_proto; extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto; extern const struct bpf_func_proto bpf_sock_from_file_proto; +extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto; const struct bpf_func_proto *bpf_tracing_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0b735c2729b2..5855c398d685 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1673,6 +1673,14 @@ union bpf_attr { * Return * A 8-byte long unique number. * + * u64 bpf_get_socket_cookie(void *sk) should the type be `struct sock *` then? Checking libbpf's generated bpf_helper_defs.h it generates: /* * bpf_get_socket_cookie * * If the **struct sk_buff** pointed by *skb* has a known socket, * retrieve the cookie (generated by the kernel) of this socket. * If no cookie has been set yet, generate a new cookie. Once * generated, the socket cookie remains stable for the life of the * socket. 
This helper can be useful for monitoring per socket * networking traffic statistics as it provides a global socket * identifier that can be assumed unique. * * Returns * A 8-byte long non-decreasing number on success, or 0 if the * socket field is missing inside *skb*. */ static __u64 (*bpf_get_socket_cookie)(void *ctx) = (void *) 46; So in terms of helper comment it's picking up the description from the `u64 bpf_get_socket_cookie(struct sk_buff *skb)` signature. With that in mind it would likely make sense to add the actual `struct sock *` type to the comment to make it more clear in here. One thought that still came to mind when looking over the series again, do we need to blacklist certain functions from bpf_get_socket_cookie() under tracing e.g. when attaching to, say fexit? For example, if sk_prot_free() would be temporary uninlined/exported for testing and bpf_get_socket_cookie() was invoked from a prog upon fexit where sock was already passed back to allocator, I presume there's risk of mem corruption, no? Mh, this is interesting. I can try to add a deny list in v7 but I'm not sure whether I'll be able to catch them all. I'm assuming that __sk_destruct, sk_destruct, __sk_free, sk_free would be other problematic functions but potentially there would be more. I was just looking at bpf_skb_output() from a7658e1a4164 ("bpf: Check types of arguments passed into helpers") which afaiu has similar issue, back at the time this was introduced there was no fentry/fexit but rather raw tp progs, so you could still safely dump skb this way including for kfree_skb() tp. Presumably if you bpf_skb_output() at 'fexit/kfree_skb' you might be able to similarly crash your kernel which I don't think is intentional (also given we go above and beyond in other areas to avoid crashing or destabilizing e.g. [0] to mention one). 
Maybe these should really only be for BPF_TRACE_RAW_TP (or BPF_PROG_TYPE_LSM) where it can be audited that it's safe to use like a7658e1a4164's original intention ... or have some sort of function annotation like __acquires/__releases but for tracing e.g. __frees(skb) where use would then be blocked (not sure if feasible). [0] https://lore.kernel.org/bpf/20210126001219.845816-1-...@fb.com/
Re: [PATCH] bpf: Fix integer overflow in argument calculation for bpf_map_area_alloc
On 1/27/21 5:23 AM, Bui Quang Minh wrote: On Tue, Jan 26, 2021 at 09:36:57AM +, Lorenz Bauer wrote: On Tue, 26 Jan 2021 at 08:26, Bui Quang Minh wrote: On a 32-bit architecture, the result of sizeof() is a 32-bit integer, so the expression becomes a multiplication between two 32-bit integers, which can potentially lead to integer overflow. As a result, bpf_map_area_alloc() allocates less memory than needed. Fix this by casting one operand to u64. Some quick thoughts: * Should this have a Fixes tag? Ok, I will add a Fixes tag in a later version of the patch. * Seems like there are quite a few similar calls scattered around (cpumap, etc.). Did you audit these as well? [...] In cpumap, static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) { cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *), cmap->map.numa_node); } I think this is safe because max_entries is not permitted to be larger than NR_CPUS. Yes. In stackmap, there is a place that I'm not very sure about static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) { u32 elem_size = sizeof(struct stack_map_bucket) + smap->map.value_size; smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries, smap->map.numa_node); } This is called after another bpf_map_area_alloc in stack_map_alloc(). In the first bpf_map_area_alloc() the argument is calculated in a u64 variable; so if there is an integer overflow in the second one, then the first one must be called with size > 4GB. I think the first one will probably fail (I am not sure about the actual limit of vmalloc()), so the second one might not be called. I would sanity check this as well. Looks like k*alloc()/v*alloc() call sites typically use array_size() which returns SIZE_MAX on overflow, 610b15c50e86 ("overflow.h: Add allocation size calculation helpers"). Thanks, Daniel
Re: [PATCH bpf-next] samples/bpf: Add include dir for MIPS Loongson64 to fix build errors
On 1/26/21 3:05 PM, Tiezhu Yang wrote: There exist many build errors when running make M=samples/bpf on the Loongson platform; this issue is MIPS related, x86 compiles just fine. Here are some errors: [...] So we can do similar things in samples/bpf/Makefile, just add the platform-specific and generic include dirs for MIPS Loongson64 to fix the build errors. Your patch from [0] said ... There exist many build warnings when running make M=samples/bpf on the Loongson platform; this issue is MIPS related, x86 compiles just fine. Here are some warnings: [...] With #ifndef __SANE_USERSPACE_TYPES__ in tools/include/linux/types.h, the above error has gone and this ifndef change does not hurt other compilations. ... which gave the impression that all the issues were fixed. What else is needed aside from this patch here? More samples/bpf fixes coming? If yes, please submit them all as a series instead of individual ones. [0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=190d1c921ad0862da14807e1670f54020f48e889 Signed-off-by: Tiezhu Yang --- samples/bpf/Makefile | 4 1 file changed, 4 insertions(+) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 362f314..45ceca4 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -185,6 +185,10 @@ endif ifeq ($(ARCH), mips) TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__ +ifdef CONFIG_MACH_LOONGSON64 +BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-loongson64 +BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-generic +endif endif TPROGS_CFLAGS += -Wall -O2
Re: [PATCH bpf 1/2] bpf: support PTR_TO_MEM{,_OR_NULL} register spilling
On 1/12/21 8:46 PM, Andrii Nakryiko wrote: On Tue, Jan 12, 2021 at 1:14 AM Gilad Reti wrote: Add support for pointer to mem register spilling, to allow the verifier to track pointers to valid memory addresses. Such pointers are returned for example by a successful call of the bpf_ringbuf_reserve helper. This patch was suggested as a solution by Yonghong Song. The patch was partially contributed by CyberArk Software, Inc. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Surprised no one mentioned this yet. The Fixes tag should always be on a single unwrapped line, however long it is, please fix. Especially for first-time contributors there is usually some luxury that we would have fixed this up on the fly while applying. ;) But given a v2 is going to be sent anyway it's good to point it out for future reference, agree. Thanks, Daniel
Re: [PATCH 2/2] selftests/bpf: add verifier test for PTR_TO_MEM spill
On 1/12/21 4:35 PM, Gilad Reti wrote: On Tue, Jan 12, 2021 at 4:56 PM KP Singh wrote: On Tue, Jan 12, 2021 at 10:16 AM Gilad Reti wrote: Add a test to check that the verifier is able to recognize spilling of PTR_TO_MEM registers. It would be nice to have some explanation of what the test does to recognize the spilling of the PTR_TO_MEM registers in the commit log as well. Would it be possible to augment an existing test_progs program like tools/testing/selftests/bpf/progs/test_ringbuf.c to test this functionality? How would you guarantee that LLVM generates the spill/fill, via inline asm? It may be possible, but from what I understood from Daniel's comment here https://lore.kernel.org/bpf/17629073-4fab-a922-ecc3-25b019960...@iogearbox.net/ the test should be a part of the verifier tests (which is reasonable to me since it is a verifier bugfix) Yeah, the test_verifier case as you have is definitely the most straightforward way to add coverage in this case.
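For reference, a test_verifier table entry for this kind of coverage can look roughly like the sketch below, modeled on the conventions of tools/testing/selftests/bpf/verifier/spill_fill.c. The instruction sequence and the `fixup_map_ringbuf` hook are illustrative assumptions, not the merged test:

```c
/* Sketch: reserve ringbuf memory (PTR_TO_MEM in R0), spill it to the
 * stack, fill it back into R7, and check the filled pointer is still
 * usable as memory before submitting the reservation. */
{
	"check valid spill/fill of ptr to ringbuf mem",
	.insns = {
	BPF_LD_MAP_FD(BPF_REG_1, 0),
	BPF_MOV64_IMM(BPF_REG_2, 8),	/* reserve 8 bytes */
	BPF_MOV64_IMM(BPF_REG_3, 0),
	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_ringbuf_reserve),
	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),	/* reservation failed? */
	/* spill R0 (PTR_TO_MEM) to the stack, then fill it into R7 */
	BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
	BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_10, -8),
	/* the filled pointer must still be usable as mem */
	BPF_ST_MEM(BPF_DW, BPF_REG_7, 0, 42),
	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
	BPF_MOV64_IMM(BPF_REG_2, 0),
	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_ringbuf_submit),
	BPF_MOV64_IMM(BPF_REG_0, 0),
	BPF_EXIT_INSN(),
	},
	.fixup_map_ringbuf = { 0 },	/* assumed ringbuf map fixup hook */
	.result = ACCEPT,
},
```

Without the verifier fix, the fill would not restore PTR_TO_MEM type information and the store through R7 would be rejected.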
Re: [PATCH] Signed-off-by: giladreti
Hello Gilad, On 1/11/21 4:31 PM, giladreti wrote: Added support for pointer to mem register spilling, to allow the verifier to track pointer to valid memory addresses. Such pointers are returned for example by a successful call of the bpf_ringbuf_reserve helper. This patch was suggested as a solution by Yonghong Song. The SoB should not be in subject line but as part of the commit message instead and with proper name, e.g. Signed-off-by: Gilad Reti For subject line, please use a short summary that fits the patch prefixed with the subsystem "bpf: [...]", see also [0] as an example. Thanks. It would be good if you could also add a BPF selftest for this [1]. [0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=e22d7f05e445165e58feddb4e40cc9c0f94453bc [1] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/ https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/verifier/spill_fill.c --- kernel/bpf/verifier.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 17270b8404f1..36af69fac591 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -2217,6 +2217,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type) case PTR_TO_RDWR_BUF: case PTR_TO_RDWR_BUF_OR_NULL: case PTR_TO_PERCPU_BTF_ID: + case PTR_TO_MEM: + case PTR_TO_MEM_OR_NULL: return true; default: return false;
Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist
On 1/7/21 3:44 PM, Willem de Bruijn wrote: On Thu, Jan 7, 2021 at 8:33 AM Daniel Borkmann wrote: On 1/7/21 2:05 PM, Willem de Bruijn wrote: On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann wrote: On 1/7/21 12:40 PM, Dongseok Yi wrote: On 2021-01-07 20:05, Daniel Borkmann wrote: On 1/7/21 1:39 AM, Dongseok Yi wrote: [...] PF_PACKET handles untouched fraglist. To modify the payload only for udp_rcv_segment, skb_unclone is necessary. I don't parse this last sentence here, please elaborate in more detail on why it is necessary. For example, if tc layer would modify mark on the skb, then __copy_skb_header() in skb_segment_list() will propagate it. If tc layer would modify payload, the skb_ensure_writable() will take care of that internally and if needed pull in parts from fraglist into linear section to make it private. The purpose of the skb_clone() above iff shared is to make the struct itself private (to safely modify its struct members). What am I missing? If tc writes, it will call skb_make_writable and thus pskb_expand_head to create a private linear section for the head_skb. skb_segment_list overwrites part of the skb linear section of each fragment itself. Even after skb_clone, the frag_skbs share their linear section with their clone in pf_packet, so we need a pskb_expand_head here, too. Ok, got it, thanks for the explanation. Would be great to have it in the v3 commit log as well, as that was clearer than the above. Too bad in that case (otoh the pf_packet situation could be considered a corner case ...); either way, fix makes sense then. Thanks for double checking the tricky logic. Pf_packet + BPF is indeed a peculiar corner case. But there are perhaps more, like raw sockets, and other BPF hooks that can trigger an skb_make_writable. Come to think of it, the "no touching shared data" rule is also violated without a BPF program?
Then there is nothing in the frag_skbs themselves signifying that they are shared, but the head_skb is cloned, so its data may still not be modified. The skb_ensure_writable() is used in plenty of places from bpf, ovs, netfilter to core stack (e.g. vlan, mpls, icmp), but either way there should be nothing wrong with that as it's making sure to pull the data into its linear section before modification. Uncareful use of skb_store_bits() with offset into skb_frags for example could defeat that, but I don't see any in-tree occurrence that would be problematic at this point.. This modification happens to be safe in practice, as the pf_packet clones don't access the bytes before skb->data where this path inserts headers. But still.
Re: [PATCH net v3] net: fix use-after-free when UDP GRO with shared fraglist
On 1/8/21 3:28 AM, Dongseok Yi wrote: skbs in fraglist could be shared by a BPF filter loaded at TC. If TC writes, it will call skb_ensure_writable -> pskb_expand_head to create a private linear section for the head_skb. And then call skb_clone_fraglist -> skb_get on each skb in the fraglist. skb_segment_list overwrites part of the skb linear section of each fragment itself. Even after skb_clone, the frag_skbs share their linear section with their clone in PF_PACKET. Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have a link for the same frag_skbs chain. If a new skb (not frags) is queued to one of the sk_receive_queue, multiple ptypes can see and release this. It causes use-after-free. [ 4443.426215] [ cut here ] [ 4443.426222] refcount_t: underflow; use-after-free. [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190 refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO) [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8 [ 4443.426808] Call trace: [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426823] skb_release_data+0x144/0x264 [ 4443.426828] kfree_skb+0x58/0xc4 [ 4443.426832] skb_queue_purge+0x64/0x9c [ 4443.426844] packet_set_ring+0x5f0/0x820 [ 4443.426849] packet_setsockopt+0x5a4/0xcd0 [ 4443.426853] __sys_setsockopt+0x188/0x278 [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38 [ 4443.426869] el0_svc_common+0xf0/0x1d0 [ 4443.426873] el0_svc_handler+0x74/0x98 [ 4443.426880] el0_svc+0x8/0xc Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.) Signed-off-by: Dongseok Yi Acked-by: Willem de Bruijn Acked-by: Daniel Borkmann
Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist
On 1/7/21 2:05 PM, Willem de Bruijn wrote: On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann wrote: On 1/7/21 12:40 PM, Dongseok Yi wrote: On 2021-01-07 20:05, Daniel Borkmann wrote: On 1/7/21 1:39 AM, Dongseok Yi wrote: skbs in fraglist could be shared by a BPF filter loaded at TC. It triggers skb_ensure_writable -> pskb_expand_head -> skb_clone_fraglist -> skb_get on each skb in the fraglist. While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist. But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist chain made by skb_segment_list. If the new skb (not fraglist) is queued to one of the sk_receive_queue, multiple ptypes can see this. The skb could be released by ptypes and it causes use-after-free. [ 4443.426215] [ cut here ] [ 4443.426222] refcount_t: underflow; use-after-free. [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190 refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO) [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8 [ 4443.426808] Call trace: [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426823] skb_release_data+0x144/0x264 [ 4443.426828] kfree_skb+0x58/0xc4 [ 4443.426832] skb_queue_purge+0x64/0x9c [ 4443.426844] packet_set_ring+0x5f0/0x820 [ 4443.426849] packet_setsockopt+0x5a4/0xcd0 [ 4443.426853] __sys_setsockopt+0x188/0x278 [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38 [ 4443.426869] el0_svc_common+0xf0/0x1d0 [ 4443.426873] el0_svc_handler+0x74/0x98 [ 4443.426880] el0_svc+0x8/0xc Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.) 
Signed-off-by: Dongseok Yi Acked-by: Willem de Bruijn --- net/core/skbuff.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) v2: Expand the commit message to clarify a BPF filter loaded diff --git a/net/core/skbuff.c b/net/core/skbuff.c index f62cae3..1dcbda8 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, unsigned int delta_truesize = 0; unsigned int delta_len = 0; struct sk_buff *tail = NULL; - struct sk_buff *nskb; + struct sk_buff *nskb, *tmp; + int err; skb_push(skb, -skb_network_offset(skb) + offset); @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, nskb = list_skb; list_skb = list_skb->next; + err = 0; + if (skb_shared(nskb)) { + tmp = skb_clone(nskb, GFP_ATOMIC); + if (tmp) { + kfree_skb(nskb); Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking for drops in the stack. I will use to consume_skb() on the next version. + nskb = tmp; + err = skb_unclone(nskb, GFP_ATOMIC); Could you elaborate why you also need to unclone? This looks odd here. tc layer (independent of BPF) from ingress & egress side generally assumes unshared skb, so above clone + dropping ref of nskb looks okay to make the main skb struct private for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of the additional skb_unclone() in this context? Willem de Bruijn said: udp_rcv_segment later converts the udp-gro-list skb to a list of regular packets to pass these one-by-one to udp_queue_rcv_one_skb. Now all the frags are fully fledged packets, with headers pushed before the payload. Yes. PF_PACKET handles untouched fraglist. To modify the payload only for udp_rcv_segment, skb_unclone is necessary. I don't parse this last sentence here, please elaborate in more detail on why it is necessary. For example, if tc layer would modify mark on the skb, then __copy_skb_header() in skb_segment_list() will propagate it. 
If tc layer would modify payload, the skb_ensure_writable() will take care of that internally and if needed pull in parts from fraglist into linear section to make it private. The purpose of the skb_clone() above iff shared is to make the struct itself private (to safely modify its struct members). What am I missing? If tc writes, it will call skb_make_writable and thus pskb_expand_head to create a private linear section for the head_skb. skb_segment_list overwrites part of the skb linear section of each fragment itself. Even after skb_clone, the frag_skbs share their linear section with their clone in pf_packet, so we need a pskb_expand_head here, too. Ok, got it, thanks for the explanation. Would be great to have it in the v3 commit log as well, as that was clearer than the above. Too bad in that case (otoh the pf_packet situation could be considered a corner case ...); either way, fix makes sense then.
Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist
On 1/7/21 12:40 PM, Dongseok Yi wrote: On 2021-01-07 20:05, Daniel Borkmann wrote: On 1/7/21 1:39 AM, Dongseok Yi wrote: skbs in fraglist could be shared by a BPF filter loaded at TC. It triggers skb_ensure_writable -> pskb_expand_head -> skb_clone_fraglist -> skb_get on each skb in the fraglist. While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist. But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist chain made by skb_segment_list. If the new skb (not fraglist) is queued to one of the sk_receive_queue, multiple ptypes can see this. The skb could be released by ptypes and it causes use-after-free. [ 4443.426215] [ cut here ] [ 4443.426222] refcount_t: underflow; use-after-free. [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190 refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO) [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8 [ 4443.426808] Call trace: [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426823] skb_release_data+0x144/0x264 [ 4443.426828] kfree_skb+0x58/0xc4 [ 4443.426832] skb_queue_purge+0x64/0x9c [ 4443.426844] packet_set_ring+0x5f0/0x820 [ 4443.426849] packet_setsockopt+0x5a4/0xcd0 [ 4443.426853] __sys_setsockopt+0x188/0x278 [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38 [ 4443.426869] el0_svc_common+0xf0/0x1d0 [ 4443.426873] el0_svc_handler+0x74/0x98 [ 4443.426880] el0_svc+0x8/0xc Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.) 
Signed-off-by: Dongseok Yi Acked-by: Willem de Bruijn --- net/core/skbuff.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) v2: Expand the commit message to clarify a BPF filter loaded diff --git a/net/core/skbuff.c b/net/core/skbuff.c index f62cae3..1dcbda8 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, unsigned int delta_truesize = 0; unsigned int delta_len = 0; struct sk_buff *tail = NULL; - struct sk_buff *nskb; + struct sk_buff *nskb, *tmp; + int err; skb_push(skb, -skb_network_offset(skb) + offset); @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, nskb = list_skb; list_skb = list_skb->next; + err = 0; + if (skb_shared(nskb)) { + tmp = skb_clone(nskb, GFP_ATOMIC); + if (tmp) { + kfree_skb(nskb); Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking for drops in the stack. I will use to consume_skb() on the next version. + nskb = tmp; + err = skb_unclone(nskb, GFP_ATOMIC); Could you elaborate why you also need to unclone? This looks odd here. tc layer (independent of BPF) from ingress & egress side generally assumes unshared skb, so above clone + dropping ref of nskb looks okay to make the main skb struct private for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of the additional skb_unclone() in this context? Willem de Bruijn said: udp_rcv_segment later converts the udp-gro-list skb to a list of regular packets to pass these one-by-one to udp_queue_rcv_one_skb. Now all the frags are fully fledged packets, with headers pushed before the payload. Yes. PF_PACKET handles untouched fraglist. To modify the payload only for udp_rcv_segment, skb_unclone is necessary. I don't parse this last sentence here, please elaborate in more detail on why it is necessary. For example, if tc layer would modify mark on the skb, then __copy_skb_header() in skb_segment_list() will propagate it. 
If tc layer would modify payload, the skb_ensure_writable() will take care of that internally and if needed pull in parts from fraglist into linear section to make it private. The purpose of the skb_clone() above iff shared is to make the struct itself private (to safely modify its struct members). What am I missing? + } else { + err = -ENOMEM; + } + } + if (!tail) skb->next = nskb; else tail->next = nskb; + if (unlikely(err)) { + nskb->next = list_skb; + goto err_linearize; + } + tail = nskb; delta_len += nskb->len;
Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist
On 1/7/21 1:39 AM, Dongseok Yi wrote: skbs in fraglist could be shared by a BPF filter loaded at TC. It triggers skb_ensure_writable -> pskb_expand_head -> skb_clone_fraglist -> skb_get on each skb in the fraglist. While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist. But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist chain made by skb_segment_list. If the new skb (not fraglist) is queued to one of the sk_receive_queue, multiple ptypes can see this. The skb could be released by ptypes and it causes use-after-free. [ 4443.426215] [ cut here ] [ 4443.426222] refcount_t: underflow; use-after-free. [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190 refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO) [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8 [ 4443.426808] Call trace: [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426823] skb_release_data+0x144/0x264 [ 4443.426828] kfree_skb+0x58/0xc4 [ 4443.426832] skb_queue_purge+0x64/0x9c [ 4443.426844] packet_set_ring+0x5f0/0x820 [ 4443.426849] packet_setsockopt+0x5a4/0xcd0 [ 4443.426853] __sys_setsockopt+0x188/0x278 [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38 [ 4443.426869] el0_svc_common+0xf0/0x1d0 [ 4443.426873] el0_svc_handler+0x74/0x98 [ 4443.426880] el0_svc+0x8/0xc Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.) 
Signed-off-by: Dongseok Yi Acked-by: Willem de Bruijn --- net/core/skbuff.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) v2: Expand the commit message to clarify a BPF filter loaded diff --git a/net/core/skbuff.c b/net/core/skbuff.c index f62cae3..1dcbda8 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, unsigned int delta_truesize = 0; unsigned int delta_len = 0; struct sk_buff *tail = NULL; - struct sk_buff *nskb; + struct sk_buff *nskb, *tmp; + int err; skb_push(skb, -skb_network_offset(skb) + offset); @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, nskb = list_skb; list_skb = list_skb->next; + err = 0; + if (skb_shared(nskb)) { + tmp = skb_clone(nskb, GFP_ATOMIC); + if (tmp) { + kfree_skb(nskb); Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking for drops in the stack. + nskb = tmp; + err = skb_unclone(nskb, GFP_ATOMIC); Could you elaborate why you also need to unclone? This looks odd here. tc layer (independent of BPF) from ingress & egress side generally assumes unshared skb, so above clone + dropping ref of nskb looks okay to make the main skb struct private for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of the additional skb_unclone() in this context? + } else { + err = -ENOMEM; + } + } + if (!tail) skb->next = nskb; else tail->next = nskb; + if (unlikely(err)) { + nskb->next = list_skb; + goto err_linearize; + } + tail = nskb; delta_len += nskb->len;
Re: [PATCH 03/15] perf: Add build id data in mmap2 event
Hey Arnaldo, On 12/15/20 4:52 PM, Arnaldo Carvalho de Melo wrote: On Mon, Dec 14, 2020 at 11:54:45 AM +0100, Jiri Olsa wrote: Adding support to carry build id data in mmap2 event. The build id data replaces the maj/min/ino/ino_generation fields, which are also used to identify a map's binary, so it's ok to replace them with build id data: union { struct { u32 maj; u32 min; u64 ino; u64 ino_generation; }; struct { u8 build_id_size; u8 __reserved_1; u16 __reserved_2; u8 build_id[20]; }; }; Alexei/Daniel, this one depends on BPF's build id routines to be exported for use by the perf kernel subsys, PeterZ already acked this, so can you guys consider getting the first three patches in this series via the bpf tree? The BPF bits were acked by Song. All the net-next and therefore also bpf-next bits for 5.11 were just merged by Linus into his tree. If you need the first 3 from [0] to land for this merge window, it's probably easiest if you take them in and send them via the perf tree directly in case you didn't send out a pull-req yet (alternatively I'll ping David/Jakub if they plan to make a 2nd net-next pull-req end of this week). Thanks, Daniel [0] https://lore.kernel.org/lkml/20201214105457.543111-1-jo...@kernel.org/
Re: [PATCH bpf-next v2] libbpf: Expose libbpf ring buffer epoll_fd
On 12/14/20 12:38 PM, Brendan Jackman wrote: This provides a convenient perf ringbuf -> libbpf ringbuf migration path for users of external polling systems. It is analogous to perf_buffer__epoll_fd. Signed-off-by: Brendan Jackman --- Difference from v1: Added entry to libbpf.map. tools/lib/bpf/libbpf.h | 1 + tools/lib/bpf/libbpf.map | 1 + tools/lib/bpf/ringbuf.c | 6 ++ 3 files changed, 8 insertions(+) diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index 6909ee81113a..cde07f64771e 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -536,6 +536,7 @@ LIBBPF_API int ring_buffer__add(struct ring_buffer *rb, int map_fd, ring_buffer_sample_fn sample_cb, void *ctx); LIBBPF_API int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms); LIBBPF_API int ring_buffer__consume(struct ring_buffer *rb); +LIBBPF_API int ring_buffer__epoll_fd(struct ring_buffer *rb); /* Perf buffer APIs */ struct perf_buffer; diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index 7c4126542e2b..7be850271be6 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -348,4 +348,5 @@ LIBBPF_0.3.0 { btf__new_split; xsk_setup_xdp_prog; xsk_socket__update_xskmap; +ring_buffer__epoll_fd; Fyi, this had a whitespace issue, Andrii fixed it up while applying. } LIBBPF_0.2.0; diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c index 5c6522c89af1..45a36648b403 100644 --- a/tools/lib/bpf/ringbuf.c +++ b/tools/lib/bpf/ringbuf.c @@ -282,3 +282,9 @@ int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms) } return cnt < 0 ? -errno : res; } + +/* Get an fd that can be used to sleep until data is available in the ring(s) */ +int ring_buffer__epoll_fd(struct ring_buffer *rb) +{ + return rb->epoll_fd; +} base-commit: b4fe9fec51ef48011f11c2da4099f0b530449c92 -- 2.29.2.576.ga3fc446d84-goog
Re: [PATCH bpf-next v4 2/4] bpf: Expose bpf_get_socket_cookie to tracing programs
On 12/9/20 2:26 PM, Florent Revest wrote: This needs two new helpers, one that works in a sleepable context (using sock_gen_cookie which disables/enables preemption) and one that does not (for performance reasons). Both take a struct sock pointer and need to check it for NULLness. This helper could also be useful to other BPF program types such as LSM. Looks like this commit description is now stale and needs to be updated since we only really add one helper? Signed-off-by: Florent Revest --- include/linux/bpf.h| 1 + include/uapi/linux/bpf.h | 7 +++ kernel/trace/bpf_trace.c | 2 ++ net/core/filter.c | 12 tools/include/uapi/linux/bpf.h | 7 +++ 5 files changed, 29 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 07cb5d15e743..5a858e8c3f1a 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1860,6 +1860,7 @@ extern const struct bpf_func_proto bpf_per_cpu_ptr_proto; extern const struct bpf_func_proto bpf_this_cpu_ptr_proto; extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto; extern const struct bpf_func_proto bpf_sock_from_file_proto; +extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto; const struct bpf_func_proto *bpf_tracing_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index ba59309f4d18..9ac66cf25959 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1667,6 +1667,13 @@ union bpf_attr { *Return *A 8-byte long unique number. * + * u64 bpf_get_socket_cookie(void *sk) + * Description + * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts + * *sk*, but gets socket from a BTF **struct sock**. Maybe add a small comment that this one also works for sleepable [tracing] progs? + * Return + * A 8-byte long unique number. ... or 0 if *sk* is NULL. * u32 bpf_get_socket_uid(struct sk_buff *skb) *Return *The owner UID of the socket associated to *skb*. 
If the socket diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 52ddd217d6a1..be5e96de306d 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1760,6 +1760,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return _sk_storage_delete_tracing_proto; case BPF_FUNC_sock_from_file: return _sock_from_file_proto; + case BPF_FUNC_get_socket_cookie: + return _get_socket_ptr_cookie_proto; #endif case BPF_FUNC_seq_printf: return prog->expected_attach_type == BPF_TRACE_ITER ? diff --git a/net/core/filter.c b/net/core/filter.c index 255aeee72402..13ad9a64f04f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4631,6 +4631,18 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +BPF_CALL_1(bpf_get_socket_ptr_cookie, struct sock *, sk) +{ + return sk ? sock_gen_cookie(sk) : 0; +} + +const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto = { + .func = bpf_get_socket_ptr_cookie, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, +}; + BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx) { return __sock_gen_cookie(ctx->sk); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index ba59309f4d18..9ac66cf25959 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1667,6 +1667,13 @@ union bpf_attr { *Return *A 8-byte long unique number. * + * u64 bpf_get_socket_cookie(void *sk) + * Description + * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts + * *sk*, but gets socket from a BTF **struct sock**. + * Return + * A 8-byte long unique number. + * * u32 bpf_get_socket_uid(struct sk_buff *skb) *Return *The owner UID of the socket associated to *skb*. If the socket
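The semantics of the new helper (return the socket's cookie, or 0 for a NULL pointer) can be sketched as a small userspace model. `struct sock_model` and `sock_gen_cookie_model()` below are illustrative stand-ins for the kernel's `struct sock` and `sock_gen_cookie()`, not real kernel APIs:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the kernel's struct sock. */
struct sock_model {
	uint64_t cookie;
};

/* Stand-in for sock_gen_cookie(); the real function lazily assigns a
 * unique cookie and disables/enables preemption while doing so. */
static uint64_t sock_gen_cookie_model(struct sock_model *sk)
{
	return sk->cookie;
}

/* Mirrors BPF_CALL_1(bpf_get_socket_ptr_cookie, ...) from the patch:
 * a BTF-typed socket argument may legitimately be NULL, so 0 is
 * returned instead of dereferencing -- which is why the Return doc
 * should say "... or 0 if *sk* is NULL". */
static uint64_t get_socket_ptr_cookie(struct sock_model *sk)
{
	return sk ? sock_gen_cookie_model(sk) : 0;
}
```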
Re: [PATCH bpf-next v2 1/3] bpf: Expose bpf_get_socket_cookie to tracing programs
On 12/8/20 8:30 PM, Florent Revest wrote: On Fri, 2020-12-04 at 20:03 +0100, Daniel Borkmann wrote: On 12/4/20 7:56 PM, Daniel Borkmann wrote: On 12/3/20 10:33 PM, Florent Revest wrote: This creates a new helper proto because the existing bpf_get_socket_cookie_sock_proto has a ARG_PTR_TO_CTX argument and only works for BPF programs where the context is a sock. This helper could also be useful to other BPF program types such as LSM. Signed-off-by: Florent Revest --- include/uapi/linux/bpf.h | 7 +++ kernel/trace/bpf_trace.c | 4 net/core/filter.c | 7 +++ tools/include/uapi/linux/bpf.h | 7 +++ 4 files changed, 25 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c3458ec1f30a..3e0e33c43998 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1662,6 +1662,13 @@ union bpf_attr { * Return * A 8-byte long non-decreasing number. * + * u64 bpf_get_socket_cookie(void *sk) + * Description + * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts + * *sk*, but gets socket from a BTF **struct sock**. + * Return + * A 8-byte long non-decreasing number. I would not mention this here since it's not fully correct and we should avoid users taking non-decreasing granted in their progs. The only assumption you can make is that it can be considered a unique number. See also [0] with reverse counter.. [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92acdc58ab11af66fcaef485433fde61b5e32fac Ah this is a good point, thank you! I will send a v3 with an extra patch that s/non-decreasing/unique/ in the other descriptions. I had not given it any extra thought, I just stupidly copied/pasted existing descriptions. :) One more thought, in case you plan to use this from sleepable context, you would need to use sock_gen_cookie() variant in the BPF helper instead. Out of curiosity, why don't we just always call sock_gen_cookie? 
Is it to avoid the performance impact of increasing the preempt counter and introducing memory barriers? Yes, all the other contexts where the existing helpers are used already have preemption disabled, so the extra preempt_{disable,enable}() is unnecessary overhead given we want the cookie generation to be efficient. Thanks, Daniel
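The convention being described here, a double-underscore variant that assumes the caller already holds the guard plus a plain variant that takes the guard itself, can be modeled in userspace with a mutex standing in for preempt_disable()/preempt_enable(). All names below are illustrative, not kernel APIs:

```c
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;
static uint64_t counter;

/* __next_cookie(): like __sock_gen_cookie(), assumes the caller has
 * already taken the guard (in the kernel: preemption is disabled). */
static uint64_t __next_cookie(void)
{
	return ++counter;
}

/* next_cookie(): like sock_gen_cookie(), takes and drops the guard
 * around the inner helper.  Callers that already hold the guard call
 * the __ variant directly and skip this overhead, which is the
 * performance point made above. */
static uint64_t next_cookie(void)
{
	uint64_t v;

	pthread_mutex_lock(&guard);
	v = __next_cookie();
	pthread_mutex_unlock(&guard);
	return v;
}
```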
Re: [PATCH bpf-next v9 00/34] bpf: switch to memcg-based memory accounting
On 12/3/20 4:26 AM, Roman Gushchin wrote: On Wed, Dec 02, 2020 at 06:54:46PM -0800, Alexei Starovoitov wrote: On Tue, Dec 1, 2020 at 1:59 PM Roman Gushchin wrote: 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. ... v9: - always charge the saved memory cgroup, by Daniel, Toke and Alexei - added bpf_map_kzalloc() - rebase and minor fixes This looks great. Applied. Thanks! Please follow up with a change to libbpf's pr_perm_msg(). That helpful warning should stay for old kernels, but it would be misleading for new kernels. libbpf probably needs a feature check to make this warning conditional. I think we discussed this several months ago, and at that time we didn't find a good way to check this feature. I'll think again, but if somebody has any ideas here, I'd appreciate it a lot. Hm, a bit tricky, agreed .. given we only throw the warning in pr_perm_msg() for non-root and thus probing options are also limited, otherwise just probing for a helper that was added in this same cycle would have been good enough as a simple heuristic. I wonder if it would make sense to add some hint inside the bpf_{prog,map}_show_fdinfo() to indicate that accounting with memcg is enabled for the prog/map one way or another? Not just for the sake of pr_perm_msg(), but in general for apps to stop messing with rlimit at this point. Maybe also the bpftool feature probe could be extended to indicate that as well (e.g. the json output can be fed into Go natively).
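To make the fdinfo idea concrete: if bpf_{prog,map}_show_fdinfo() grew a line such as "memcg_accounted:\t1" (a purely hypothetical key invented for this sketch; nothing like it exists upstream), libbpf could scan the map's fdinfo text before deciding whether to print pr_perm_msg(). A probe might look like:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical probe: scan /proc/<pid>/fdinfo/<fd> contents for a
 * "memcg_accounted:" key.  Both the key name and the approach are
 * assumptions for illustration only -- upstream exposes no such
 * field at the time of this thread. */
static bool fdinfo_has_memcg(const char *fdinfo_text)
{
	const char *line = fdinfo_text;

	while (line && *line) {
		if (!strncmp(line, "memcg_accounted:", 16))
			return true;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return false;
}
```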
Re: [PATCH bpf-next v2 1/3] bpf: Expose bpf_get_socket_cookie to tracing programs
On 12/4/20 7:56 PM, Daniel Borkmann wrote: On 12/3/20 10:33 PM, Florent Revest wrote: This creates a new helper proto because the existing bpf_get_socket_cookie_sock_proto has a ARG_PTR_TO_CTX argument and only works for BPF programs where the context is a sock. This helper could also be useful to other BPF program types such as LSM. Signed-off-by: Florent Revest --- include/uapi/linux/bpf.h | 7 +++ kernel/trace/bpf_trace.c | 4 net/core/filter.c | 7 +++ tools/include/uapi/linux/bpf.h | 7 +++ 4 files changed, 25 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c3458ec1f30a..3e0e33c43998 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1662,6 +1662,13 @@ union bpf_attr { * Return * A 8-byte long non-decreasing number. * + * u64 bpf_get_socket_cookie(void *sk) + * Description + * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts + * *sk*, but gets socket from a BTF **struct sock**. + * Return + * A 8-byte long non-decreasing number. I would not mention this here since it's not fully correct and we should avoid users taking non-decreasing granted in their progs. The only assumption you can make is that it can be considered a unique number. See also [0] with reverse counter.. [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92acdc58ab11af66fcaef485433fde61b5e32fac One more thought, in case you plan to use this from sleepable context, you would need to use sock_gen_cookie() variant in the BPF helper instead.
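For reference, the cited commit 92acdc58ab11 is why "non-decreasing" is the wrong promise: cookies are assigned lazily from a counter, so the only safe assumptions are uniqueness and per-socket stability, not ordering. A single-threaded userspace model (the real code handles the concurrent first-assignment race with atomics; whether the counter goes up or down is an implementation detail):

```c
#include <stdint.h>

/* Cookie source; this model counts down from the top so that 0 can
 * keep meaning "not yet assigned".  The direction is illustrative --
 * programs must not rely on any ordering of cookie values. */
static uint64_t cookie_gen = UINT64_MAX;

struct sock_model {
	uint64_t cookie;	/* 0 == no cookie assigned yet */
};

/* Lazy assignment on first use, stable afterwards -- the two
 * properties a BPF program may actually depend on. */
static uint64_t sock_gen_cookie_model(struct sock_model *sk)
{
	if (!sk->cookie)
		sk->cookie = cookie_gen--;
	return sk->cookie;
}
```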
Re: [PATCH bpf-next v2 1/3] bpf: Expose bpf_get_socket_cookie to tracing programs
On 12/3/20 10:33 PM, Florent Revest wrote: This creates a new helper proto because the existing bpf_get_socket_cookie_sock_proto has a ARG_PTR_TO_CTX argument and only works for BPF programs where the context is a sock. This helper could also be useful to other BPF program types such as LSM. Signed-off-by: Florent Revest --- include/uapi/linux/bpf.h | 7 +++ kernel/trace/bpf_trace.c | 4 net/core/filter.c | 7 +++ tools/include/uapi/linux/bpf.h | 7 +++ 4 files changed, 25 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c3458ec1f30a..3e0e33c43998 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1662,6 +1662,13 @@ union bpf_attr { *Return *A 8-byte long non-decreasing number. * + * u64 bpf_get_socket_cookie(void *sk) + * Description + * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts + * *sk*, but gets socket from a BTF **struct sock**. + * Return + * A 8-byte long non-decreasing number. I would not mention this here since it's not fully correct and we should avoid users taking non-decreasing granted in their progs. The only assumption you can make is that it can be considered a unique number. See also [0] with reverse counter.. [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92acdc58ab11af66fcaef485433fde61b5e32fac + * * u32 bpf_get_socket_uid(struct sk_buff *skb) *Return *The owner UID of the socket associated to *skb*. 
If the socket diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index d255bc9b2bfa..14ad96579813 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1725,6 +1725,8 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) } } +extern const struct bpf_func_proto bpf_get_socket_cookie_sock_tracing_proto; + const struct bpf_func_proto * tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@ -1748,6 +1750,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return _sk_storage_get_tracing_proto; case BPF_FUNC_sk_storage_delete: return _sk_storage_delete_tracing_proto; + case BPF_FUNC_get_socket_cookie: + return _get_socket_cookie_sock_tracing_proto; #endif case BPF_FUNC_seq_printf: return prog->expected_attach_type == BPF_TRACE_ITER ? diff --git a/net/core/filter.c b/net/core/filter.c index 2ca5eecebacf..177c4e5e529d 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4631,6 +4631,13 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +const struct bpf_func_proto bpf_get_socket_cookie_sock_tracing_proto = { + .func = bpf_get_socket_cookie_sock, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, +}; +
Re: linux-next: build failure after merge of the bpf-next tree
On 12/1/20 9:07 AM, Stephen Rothwell wrote: Hi all, After merging the bpf-next tree, today's linux-next build (x86_64 allnoconfig) failed like this: In file included from fs/select.c:32: include/net/busy_poll.h: In function 'sk_mark_napi_id_once': include/net/busy_poll.h:150:36: error: 'const struct sk_buff' has no member named 'napi_id' 150 | __sk_mark_napi_id_once_xdp(sk, skb->napi_id); |^~ Caused by commit b02e5a0ebb17 ("xsk: Propagate napi_id to XDP socket Rx path") Fixed it up in bpf-next, thanks for reporting!
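The failure mode is the usual one for config-dependent struct members: skb->napi_id only exists when the kernel config enables busy polling, so any inline that touches it must be compiled out under the same guard. A minimal userspace illustration, with CONFIG_DEMO standing in for CONFIG_NET_RX_BUSY_POLL and deliberately left undefined to reproduce the allnoconfig situation:

```c
struct sk_buff_model {
	unsigned int len;
#ifdef CONFIG_DEMO
	unsigned int napi_id;	/* member only exists when configured */
#endif
};

static unsigned int napi_id_of(const struct sk_buff_model *skb)
{
#ifdef CONFIG_DEMO
	return skb->napi_id;
#else
	/* Without this guard the function would reference a member that
	 * does not exist, which is exactly the reported build failure. */
	(void)skb;
	return 0;
#endif
}
```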
Re: [PATCH] bpf: remove trailing semicolon in macro definition
On 11/27/20 8:27 PM, t...@redhat.com wrote: From: Tom Rix The macro use will already have a semicolon. Signed-off-by: Tom Rix --- include/trace/events/xdp.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h index cd24e8a59529..65ffedf8386f 100644 --- a/include/trace/events/xdp.h +++ b/include/trace/events/xdp.h @@ -146,13 +146,13 @@ DEFINE_EVENT(xdp_redirect_template, xdp_redirect_err, ); #define _trace_xdp_redirect(dev, xdp, to) \ -trace_xdp_redirect(dev, xdp, NULL, 0, NULL, to); +trace_xdp_redirect(dev, xdp, NULL, 0, NULL, to) #define _trace_xdp_redirect_err(dev, xdp, to, err) \ trace_xdp_redirect_err(dev, xdp, NULL, err, NULL, to); #define _trace_xdp_redirect_map(dev, xdp, to, map, index) \ -trace_xdp_redirect(dev, xdp, to, 0, map, index); +trace_xdp_redirect(dev, xdp, to, 0, map, index) #define _trace_xdp_redirect_map_err(dev, xdp, to, map, index, err) \ trace_xdp_redirect_err(dev, xdp, to, err, map, index); This looks random; why those but not the other locations? Thanks, Daniel
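Why such trailing semicolons matter for statement-like macros: with the semicolon baked into the macro body, a call site inside an unbraced if/else expands to an extra empty statement that detaches the else. A compilable illustration (macro names invented here):

```c
/* A macro body that ends in a semicolon... */
#define DOUBLE_BAD(x)  ((x) * 2);
/* ...versus one that leaves the semicolon to the call site. */
#define DOUBLE_GOOD(x) ((x) * 2)

static int clamp_double(int v)
{
	int out;

	/* With DOUBLE_BAD, "out = DOUBLE_BAD(v);" would expand to
	 * "out = ((v) * 2);;" -- the stray empty statement ends the
	 * unbraced if-body, so the else below would fail to parse.
	 * DOUBLE_GOOD expands cleanly: */
	if (v >= 0)
		out = DOUBLE_GOOD(v);
	else
		out = DOUBLE_GOOD(-v);

	return out;
}
```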
Re: [PATCH bpf v2 2/2] xsk: change the tx writeable condition
On 11/25/20 7:48 AM, Xuan Zhuo wrote: Modify the tx writeable condition from the queue is not full to the number of present tx queues is less than the half of the total number of queues. Because the tx queue not full is a very short time, this will cause a large number of EPOLLOUT events, and cause a large number of process wake up. Signed-off-by: Xuan Zhuo This one doesn't apply cleanly against bpf tree, please rebase. Small comment inline while looking over the patch: --- net/xdp/xsk.c | 16 +--- net/xdp/xsk_queue.h | 6 ++ 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 0df8651..22e35e9 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -211,6 +211,14 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len, return 0; } +static bool xsk_tx_writeable(struct xdp_sock *xs) +{ + if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2) + return false; + + return true; +} + static bool xsk_is_bound(struct xdp_sock *xs) { if (READ_ONCE(xs->state) == XSK_BOUND) { @@ -296,7 +304,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool) rcu_read_lock(); list_for_each_entry_rcu(xs, >xsk_tx_list, tx_list) { __xskq_cons_release(xs->tx); - xs->sk.sk_write_space(>sk); + if (xsk_tx_writeable(xs)) + xs->sk.sk_write_space(>sk); } rcu_read_unlock(); } @@ -499,7 +508,8 @@ static int xsk_generic_xmit(struct sock *sk) out: if (sent_frame) - sk->sk_write_space(sk); + if (xsk_tx_writeable(xs)) + sk->sk_write_space(sk); mutex_unlock(>mutex); return err; @@ -556,7 +566,7 @@ static __poll_t xsk_poll(struct file *file, struct socket *sock, if (xs->rx && !xskq_prod_is_empty(xs->rx)) mask |= EPOLLIN | EPOLLRDNORM; - if (xs->tx && !xskq_cons_is_full(xs->tx)) + if (xs->tx && xsk_tx_writeable(xs)) mask |= EPOLLOUT | EPOLLWRNORM; return mask; diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h index b936c46..b655004 100644 --- a/net/xdp/xsk_queue.h +++ b/net/xdp/xsk_queue.h @@ -307,6 +307,12 @@ static inline bool 
xskq_cons_is_full(struct xsk_queue *q) { return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer) == q->nentries; } +static inline __u64 xskq_cons_present_entries(struct xsk_queue *q) Types prefixed with __ are mainly for user-space facing things like uapi headers, so in-kernel this should be u64. Is there a reason this is not done as u32 (and thus the same as producer and consumer)? +{ + /* No barriers needed since data is not accessed */ + return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer); +} + /* Functions for producers */ static inline u32 xskq_prod_nb_free(struct xsk_queue *q, u32 max)
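The new writeable condition can be modeled in userspace; `struct ring` below is an illustrative stand-in for `struct xsk_queue`, and the unsigned subtraction mirrors how xskq_cons_present_entries() handles producer wrap-around:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of the ring state (not the kernel's struct layout). */
struct ring {
	uint32_t producer;
	uint32_t consumer;
	uint32_t nentries;
};

/* Entries currently sitting in the ring; unsigned arithmetic makes
 * wrap-around of the producer index come out right. */
static uint32_t present_entries(const struct ring *q)
{
	return q->producer - q->consumer;
}

/* The patch's heuristic: report the socket writeable only while the
 * ring is at most half full, so EPOLLOUT fires far less often than
 * with the old "not completely full" test and fewer processes are
 * needlessly woken. */
static bool tx_writeable(const struct ring *q)
{
	return present_entries(q) <= q->nentries / 2;
}
```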
Re: [PATCH bpf-next v8 06/34] bpf: prepare for memcg-based memory accounting for bpf maps
On 11/25/20 4:00 AM, Roman Gushchin wrote: In the absolute majority of cases, if a process is making a kernel allocation, its memory cgroup is getting charged. Bpf maps can be updated from an interrupt context, and in such a case there is no process that can be charged. It makes the memory accounting of bpf maps non-trivial. Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel memcg accounting from interrupt contexts") and b87d8cefe43c ("mm, memcg: rework remote charging API to support nesting") it's finally possible. To do it, a pointer to the memory cgroup of the process which created the map is saved, and this cgroup can be charged for all allocations made from an interrupt context. This commit introduces 2 helpers: bpf_map_kmalloc_node() and bpf_map_alloc_percpu(). They can be used in the bpf code for accounted memory allocations, both in process and interrupt contexts. In the interrupt context they use the saved memory cgroup; otherwise the current cgroup is charged. Signed-off-by: Roman Gushchin Thanks for updating the cover letter; replying in this series instead with one more item that came to mind: [...]
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index f3fe9f53f93c..4154c616788c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -31,6 +31,8 @@ #include #include #include +#include +#include #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \ @@ -456,6 +458,77 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock) __release(_idr_lock); } +#ifdef CONFIG_MEMCG_KMEM +static void bpf_map_save_memcg(struct bpf_map *map) +{ + map->memcg = get_mem_cgroup_from_mm(current->mm); +} + +static void bpf_map_release_memcg(struct bpf_map *map) +{ + mem_cgroup_put(map->memcg); +} + +void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags, + int node) +{ + struct mem_cgroup *old_memcg; + bool in_interrupt; + void *ptr; + + /* +* If the memory allocation is performed from an interrupt context, +* the memory cgroup to charge can't be determined from the context +* of the current task. Instead, we charge the memory cgroup, which +* contained the process created the map. +*/ + in_interrupt = in_interrupt(); + if (in_interrupt) + old_memcg = set_active_memcg(map->memcg); + + ptr = kmalloc_node(size, flags, node); + + if (in_interrupt) + set_active_memcg(old_memcg); + + return ptr; +} + +void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, + size_t align, gfp_t gfp) +{ + struct mem_cgroup *old_memcg; + bool in_interrupt; + void *ptr; + + /* +* If the memory allocation is performed from an interrupt context, +* the memory cgroup to charge can't be determined from the context +* of the current task. Instead, we charge the memory cgroup, which +* contained the process created the map. 
+*/ + in_interrupt = in_interrupt(); + if (in_interrupt) + old_memcg = set_active_memcg(map->memcg); + + ptr = __alloc_percpu_gfp(size, align, gfp); + + if (in_interrupt) + set_active_memcg(old_memcg); For this and above bpf_map_kmalloc_node() one, wouldn't it make more sense to perform the temporary memcg unconditionally? old_memcg = set_active_memcg(map->memcg); ptr = kmalloc_node(size, flags, node); set_active_memcg(old_memcg); I think the semantics are otherwise a bit weird and the charging unpredictable; this way it would /always/ be accounted against the prog in the memcg that originally created the map. E.g. maps could be shared between progs attached to, say, XDP/tc where in_interrupt() holds true with progs attached to skb-cgroup/egress where we're still in process context. So some part of the memory is charged against the original map's memcg and some other part against the current process' memcg which seems odd, no? Or, for example, if we start to run a tracing BPF prog which updates state in a BPF map ... that tracing prog now interferes with processes in other memcgs which may not be intentional & could lead to potential failures there as opposed when the tracing prog is not run. My concern is that the semantics are not quite clear and behavior unpredictable compared to always charging against map->memcg. Similarly, what if an orchestration prog creates dedicated memcg(s) for maps with individual limits ... the assumed behavior (imho) would be that whatever memory is accounted on the map it can be accurately retrieved from there & similarly limits enforced, no? It seems that would not be the case currently. Thoughts? +
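Daniel's suggested shape, charging map->memcg unconditionally rather than only from interrupt context, can be sketched with a thread-local variable standing in for the task's active memcg. The struct names and malloc() below are userspace stand-ins for the kernel APIs, assumed for illustration:

```c
#include <stddef.h>
#include <stdlib.h>

struct mem_cgroup { const char *name; };

/* Thread-local stand-in for the task's active memcg. */
static _Thread_local struct mem_cgroup *active_memcg;

/* Models the kernel's set_active_memcg(): install a new memcg,
 * hand back the previous one so the caller can restore it. */
static struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg)
{
	struct mem_cgroup *old = active_memcg;

	active_memcg = memcg;
	return old;
}

struct bpf_map_model { struct mem_cgroup *memcg; };

/* The suggestion above: always charge map->memcg, not just from IRQ
 * context, so every allocation for the map lands in one predictable
 * cgroup regardless of which task or context triggers the update. */
static void *map_kmalloc(struct bpf_map_model *map, size_t size)
{
	struct mem_cgroup *old = set_active_memcg(map->memcg);
	void *p = malloc(size);	/* models kmalloc_node() + __GFP_ACCOUNT */

	set_active_memcg(old);
	return p;
}
```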
Re: [PATCH bpf-next v3 1/3] ima: Implement ima_inode_hash
On 11/25/20 1:04 PM, KP Singh wrote: On Tue, Nov 24, 2020 at 6:35 PM Yonghong Song wrote: On 11/24/20 7:12 AM, KP Singh wrote: From: KP Singh This is in preparation for adding a helper for BPF LSM programs to use IMA hashes when attached to LSM hooks. There are LSM hooks like inode_unlink which do not have a struct file * argument and cannot use the existing ima_file_hash API. An inode based API is, therefore, useful in LSM based detections like an executable trying to delete itself, which rely on the inode_unlink LSM hook. Moreover, the ima_file_hash function does nothing with the struct file pointer apart from calling file_inode on it and converting it to an inode. Signed-off-by: KP Singh There is no change for this patch compared to the previous version, so you can carry my Ack. Acked-by: Yonghong Song I am guessing: * We need an Ack from Mimi/James. Yes. * As regards which tree, I guess bpf-next would be better since the BPF helper and the selftest depend on it. Yep, bpf-next is my preference as otherwise we're running into unnecessary merge conflicts. Thanks, Daniel
Re: [PATCH bpf-next v7 00/34] bpf: switch to memcg-based memory accounting
On 11/19/20 6:37 PM, Roman Gushchin wrote: Currently bpf is using the memlock rlimit for the memory accounting. This approach has its downsides and over time has created a significant amount of problems: 1) The limit is per-user, but because most bpf operations are performed as root, the limit has a little value. 2) It's hard to come up with a specific maximum value. Especially because the counter is shared with non-bpf users (e.g. memlock() users). Any specific value is either too low and creates false failures or too high and useless. 3) Charging is not connected to the actual memory allocation. Bpf code should manually calculate the estimated cost and precharge the counter, and then take care of uncharging, including all fail paths. It adds to the code complexity and makes it easy to leak a charge. 4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from being convenient. 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. In order to overcome these problems let's switch to the memcg-based memory accounting of bpf objects. With the recent addition of the percpu memory accounting, now it's possible to provide a comprehensive accounting of the memory used by bpf programs and maps. This approach has the following advantages: 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows a better control over memory usage by different workloads. Of course, it requires enabled cgroups and kernel memory accounting and properly configured cgroup tree, but it's a default configuration for a modern Linux system. 2) The actual memory consumption is taken into account. It happens automatically on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also performed automatically on releasing the memory. So the code on the bpf side becomes simpler and safer. 3) There is a simple way to get the current value and statistics. 
In general, if a process performs a bpf operation (e.g. creates or updates a map), its memory cgroup is charged. However, map updates performed from an interrupt context are charged to the memory cgroup that contained the process which created the map. Providing a 1:1 replacement for the rlimit-based memory accounting is a non-goal of this patchset. Users and memory cgroups are completely orthogonal, so it's not possible even in theory. Memcg-based memory accounting requires a properly configured cgroup tree to be actually useful. However, that is how memory is managed on a modern Linux system. The cover letter here only describes the advantages of this series, but leaves out discussion of the disadvantages. They definitely must be part of the series to provide a clear description of the semantic changes to readers. Last time we discussed them, they were i) no mem limits in general on unprivileged users when memory cgroups are not configured in the kernel, and ii) no mem limits by default if not configured in the cgroup specifically. Did we make any progress on these in the meantime? How do we want to address them? What is the concrete justification for not addressing them? Also, I wonder about the risk of regressions here. For example, an existing orchestrator has configured memory cgroup limits that are tailored to the application's needs .. now, with a kernel upgrade, BPF will start to interfere, e.g. if a BPF program attached to cgroups (e.g. connect/sendmsg/recvmsg or the general cgroup skb egress hook) starts charging to the process' memcg due to map updates? [0] https://lore.kernel.org/bpf/20200803190639.gd1020...@carbon.dhcp.thefacebook.com/ The patchset consists of the following parts: 1) 4 mm patches, which are already in the mm tree, but are required to avoid a regression (otherwise vmallocs cannot be mapped to userspace).
2) memcg-based accounting for various bpf objects: progs and maps 3) removal of the rlimit-based accounting 4) removal of rlimit adjustments in userspace samples First 4 patches are not supposed to be merged via the bpf tree. I'm including them to make sure bpf tests will pass. v7: - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei - switched allocations made from an interrupt context to new helpers, by Daniel - rebase and minor fixes
Re: [PATCH] bpf: Check the return value of dev_get_by_index_rcu()
On 11/20/20 4:19 PM, David Ahern wrote: On 11/20/20 8:13 AM, Daniel Borkmann wrote: [ +David ] On 11/19/20 8:04 AM, xiakaixu1...@gmail.com wrote: From: Kaixu Xia The return value of dev_get_by_index_rcu() can be NULL, so it is necessary here to check the return value and return an error code if it is NULL. Signed-off-by: Kaixu Xia --- net/core/filter.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/core/filter.c b/net/core/filter.c index 2ca5eecebacf..1263fe07170a 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5573,6 +5573,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, struct net_device *dev; dev = dev_get_by_index_rcu(net, params->ifindex); + if (unlikely(!dev)) + return -EINVAL; if (!is_skb_forwardable(dev, skb)) rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; The rcu lock is held, right? It is impossible for dev to be NULL here. Yes, we're under the RCU read side. I was wondering whether we could unlink it from the list but not yet free it, but that also seems not possible, since we'd first need to close it, which already has a synchronize_net(). So what Kaixu describes in the commit msg is not an issue, agree.
Re: [PATCH] bpf: Check the return value of dev_get_by_index_rcu()
[ +David ] On 11/19/20 8:04 AM, xiakaixu1...@gmail.com wrote: From: Kaixu Xia The return value of dev_get_by_index_rcu() can be NULL, so it is necessary here to check the return value and return an error code if it is NULL. Signed-off-by: Kaixu Xia --- net/core/filter.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/core/filter.c b/net/core/filter.c index 2ca5eecebacf..1263fe07170a 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5573,6 +5573,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, struct net_device *dev; dev = dev_get_by_index_rcu(net, params->ifindex); + if (unlikely(!dev)) + return -EINVAL; if (!is_skb_forwardable(dev, skb)) rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; The above logic is quite ugly anyway, given we fetched the dev pointer already earlier in bpf_ipv{4,6}_fib_lookup() and now need to redo it again ... so yeah, there could be a tiny race in here. We wanted to bring this logic closer to what XDP does anyway, something like below, for example. David, thoughts?
Thx Subject: [PATCH] diff mtu check Signed-off-by: Daniel Borkmann --- net/core/filter.c | 22 +- 1 file changed, 5 insertions(+), 17 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 2ca5eecebacf..3bab0a97fa38 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5547,9 +5547,6 @@ static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = { BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, struct bpf_fib_lookup *, params, int, plen, u32, flags) { - struct net *net = dev_net(skb->dev); - int rc = -EAFNOSUPPORT; - if (plen < sizeof(*params)) return -EINVAL; @@ -5559,25 +5556,16 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, switch (params->family) { #if IS_ENABLED(CONFIG_INET) case AF_INET: - rc = bpf_ipv4_fib_lookup(net, params, flags, false); - break; + return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags, + !skb_is_gso(skb)); #endif #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: - rc = bpf_ipv6_fib_lookup(net, params, flags, false); - break; + return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags, + !skb_is_gso(skb)); #endif } - - if (!rc) { - struct net_device *dev; - - dev = dev_get_by_index_rcu(net, params->ifindex); - if (!is_skb_forwardable(dev, skb)) - rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; - } - - return rc; + return -EAFNOSUPPORT; } static const struct bpf_func_proto bpf_skb_fib_lookup_proto = { -- 2.21.0
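The length test this refactor leans on can be reduced to a small model. fwd_len_ok() below is a simplification of is_skb_forwardable() (the kernel also allows some slack for VLAN headers), and the GSO short-circuit is why the sketch passes !skb_is_gso(skb) as the check flag: GSO packets are segmented later, so their total length is not checked against the MTU here.

```c
#include <stdbool.h>

/* Simplified userspace model of the is_skb_forwardable() MTU test:
 * a non-GSO packet longer than the egress device MTU cannot be
 * forwarded, which is what BPF_FIB_LKUP_RET_FRAG_NEEDED reports. */
static bool fwd_len_ok(unsigned int dev_mtu, unsigned int pkt_len,
		       bool is_gso)
{
	if (is_gso)
		return true;	/* resegmented later, skip the check */
	return pkt_len <= dev_mtu;
}
```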
Re: [PATCH bpf-next v6 06/34] bpf: prepare for memcg-based memory accounting for bpf maps
On 11/18/20 2:28 AM, Roman Gushchin wrote: On Tue, Nov 17, 2020 at 05:11:00PM -0800, Alexei Starovoitov wrote: On Tue, Nov 17, 2020 at 5:07 PM Roman Gushchin wrote: On Tue, Nov 17, 2020 at 04:46:34PM -0800, Roman Gushchin wrote: On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote: On 11/17/20 4:40 AM, Roman Gushchin wrote: In the absolute majority of cases if a process is making a kernel allocation, it's memory cgroup is getting charged. Bpf maps can be updated from an interrupt context and in such case there is no process which can be charged. It makes the memory accounting of bpf maps non-trivial. Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel memcg accounting from interrupt contexts") and b87d8cefe43c ("mm, memcg: rework remote charging API to support nesting") it's finally possible. To do it, a pointer to the memory cgroup of the process which created the map is saved, and this cgroup is getting charged for all allocations made from an interrupt context. Allocations made from a process context will be accounted in a usual way. Signed-off-by: Roman Gushchin Acked-by: Song Liu [...] +#ifdef CONFIG_MEMCG_KMEM +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key, + void *value, u64 flags) +{ + struct mem_cgroup *old_memcg; + bool in_interrupt; + int ret; + + /* + * If update from an interrupt context results in a memory allocation, + * the memory cgroup to charge can't be determined from the context + * of the current task. Instead, we charge the memory cgroup, which + * contained a process created the map. 
+ */ + in_interrupt = in_interrupt(); + if (in_interrupt) + old_memcg = set_active_memcg(map->memcg); + + ret = map->ops->map_update_elem(map, key, value, flags); + + if (in_interrupt) + set_active_memcg(old_memcg); + + return ret; Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps") which removes the indirect call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is not invoked for the vast majority of cases. I see. Well, the first option is to move these calls into map-specific update functions, but the list is relatively long: nsim_map_update_elem() cgroup_storage_update_elem() htab_map_update_elem() htab_percpu_map_update_elem() dev_map_update_elem() dev_map_hash_update_elem() trie_update_elem() cpu_map_update_elem() bpf_pid_task_storage_update_elem() bpf_fd_inode_storage_update_elem() bpf_fd_sk_storage_update_elem() sock_map_update_elem() xsk_map_update_elem() Alternatively, we can set the active memcg for the whole duration of bpf execution. It's simpler, but will add some overhead. Maybe we can somehow mark programs calling into update helpers and skip all others? Actually, this is problematic if a program updates several maps, because in theory they can belong to different cgroups. So it seems that the first option is the way to go. Do you agree? May be instead of kmalloc_node() that is used by most of the map updates introduce bpf_map_kmalloc_node() that takes a map pointer as an argument? And do set_memcg inside? I suspect it's not only kmalloc_node(), but if there will be 2-3 allocation helpers, it sounds like a good idea to me! I'll try and get back with v7 soon. 
Could this be baked into kmalloc*() API itself given we also need to pass in __GFP_ACCOUNT everywhere, so we'd have a new API with additional argument where we pass the memcg pointer to tell it directly where to account it for instead of having to have the extra set_active_memcg() set/restore dance via BPF wrapper? It seems there would be not much specifics on BPF itself and if it's more generic it could also be used by other subsystems. Thanks, Daniel
Re: [PATCH bpf-next v6 06/34] bpf: prepare for memcg-based memory accounting for bpf maps
On 11/17/20 4:40 AM, Roman Gushchin wrote: In the absolute majority of cases if a process is making a kernel allocation, it's memory cgroup is getting charged. Bpf maps can be updated from an interrupt context and in such case there is no process which can be charged. It makes the memory accounting of bpf maps non-trivial. Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel memcg accounting from interrupt contexts") and b87d8cefe43c ("mm, memcg: rework remote charging API to support nesting") it's finally possible. To do it, a pointer to the memory cgroup of the process which created the map is saved, and this cgroup is getting charged for all allocations made from an interrupt context. Allocations made from a process context will be accounted in a usual way. Signed-off-by: Roman Gushchin Acked-by: Song Liu [...] +#ifdef CONFIG_MEMCG_KMEM +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key, +void *value, u64 flags) +{ + struct mem_cgroup *old_memcg; + bool in_interrupt; + int ret; + + /* +* If update from an interrupt context results in a memory allocation, +* the memory cgroup to charge can't be determined from the context +* of the current task. Instead, we charge the memory cgroup, which +* contained a process created the map. +*/ + in_interrupt = in_interrupt(); + if (in_interrupt) + old_memcg = set_active_memcg(map->memcg); + + ret = map->ops->map_update_elem(map, key, value, flags); + + if (in_interrupt) + set_active_memcg(old_memcg); + + return ret; Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps") which removes the indirect call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is not invoked for the vast majority of cases. 
+} +#else +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key, +void *value, u64 flags) +{ + return map->ops->map_update_elem(map, key, value, flags); +} +#endif + BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key, void *, value, u64, flags) { WARN_ON_ONCE(!rcu_read_lock_held()); - return map->ops->map_update_elem(map, key, value, flags); + + return __bpf_map_update_elem(map, key, value, flags); } const struct bpf_func_proto bpf_map_update_elem_proto = { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index f3fe9f53f93c..2d77fc2496da 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -31,6 +31,7 @@ #include #include #include +#include #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \ @@ -456,6 +457,27 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock) __release(_idr_lock); } +#ifdef CONFIG_MEMCG_KMEM +static void bpf_map_save_memcg(struct bpf_map *map) +{ + map->memcg = get_mem_cgroup_from_mm(current->mm); +} + +static void bpf_map_release_memcg(struct bpf_map *map) +{ + mem_cgroup_put(map->memcg); +} + +#else +static void bpf_map_save_memcg(struct bpf_map *map) +{ +} + +static void bpf_map_release_memcg(struct bpf_map *map) +{ +} +#endif + /* called from workqueue */ static void bpf_map_free_deferred(struct work_struct *work) { @@ -464,6 +486,7 @@ static void bpf_map_free_deferred(struct work_struct *work) bpf_map_charge_move(, >memory); security_bpf_map_free(map); + bpf_map_release_memcg(map); /* implementation dependent freeing */ map->ops->map_free(map); bpf_map_charge_finish(); @@ -875,6 +898,8 @@ static int map_create(union bpf_attr *attr) if (err) goto free_map_sec; + bpf_map_save_memcg(map); + err = bpf_map_new_fd(map, f_flags); if (err < 0) { /* failed to allocate fd.
Re: [PATCH bpf-next v3 2/2] bpf: Add tests for bpf_lsm_set_bprm_opts
On 11/17/20 3:13 AM, KP Singh wrote: [...] + +static int run_set_secureexec(int map_fd, int secureexec) +{ + ^ same here + int child_pid, child_status, ret, null_fd; + + child_pid = fork(); + if (child_pid == 0) { + null_fd = open("/dev/null", O_WRONLY); + if (null_fd == -1) + exit(errno); + dup2(null_fd, STDOUT_FILENO); + dup2(null_fd, STDERR_FILENO); + close(null_fd); + + /* Ensure that all executions from hereon are +* secure by setting a local storage which is read by +* the bprm_creds_for_exec hook and sets bprm->secureexec. +*/ + ret = update_storage(map_fd, secureexec); + if (ret) + exit(ret); + + /* If the binary is executed with secureexec=1, the dynamic +* loader ignores and unsets certain variables like LD_PRELOAD, +* TMPDIR etc. TMPDIR is used here to simplify the example, as +* LD_PRELOAD requires a real .so file. +* +* If the value of TMPDIR is set, the bash command returns 10 +* and if the value is unset, it returns 20. +*/ + execle("/bin/bash", "bash", "-c", + "[[ -z \"${TMPDIR}\" ]] || exit 10 && exit 20", NULL, + bash_envp); + exit(errno); + } else if (child_pid > 0) { + waitpid(child_pid, &child_status, 0); + ret = WEXITSTATUS(child_status); + + /* If a secureexec occurred, the exit status should be 20. +*/ + if (secureexec && ret == 20) + return 0; + + /* If normal execution happened the exit code should be 10. +*/ + if (!secureexec && ret == 10) + return 0; + and here (rest looks good to me) + } + + return -EINVAL; +} +
Re: [PATCH bpf-next v3 1/2] bpf: Add bpf_lsm_set_bprm_opts helper
On 11/17/20 3:13 AM, KP Singh wrote: From: KP Singh The helper allows modification of certain bits on the linux_binprm struct starting with the secureexec bit which can be updated using the BPF_LSM_F_BPRM_SECUREEXEC flag. secureexec can be set by the LSM for privilege gaining executions to set the AT_SECURE auxv for glibc. When set, the dynamic linker disables the use of certain environment variables (like LD_PRELOAD). Signed-off-by: KP Singh --- include/uapi/linux/bpf.h | 18 ++ kernel/bpf/bpf_lsm.c | 27 +++ scripts/bpf_helpers_doc.py | 2 ++ tools/include/uapi/linux/bpf.h | 18 ++ 4 files changed, 65 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 162999b12790..bfa79054d106 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3787,6 +3787,18 @@ union bpf_attr { **ARG_PTR_TO_BTF_ID* of type *task_struct*. *Return *Pointer to the current task. + * + * long bpf_lsm_set_bprm_opts(struct linux_binprm *bprm, u64 flags) + * small nit: should have no extra newline (same for the tools/ copy) + * Description + * Set or clear certain options on *bprm*: + * + * **BPF_LSM_F_BPRM_SECUREEXEC** Set the secureexec bit + * which sets the **AT_SECURE** auxv for glibc. The bit + * is cleared if the flag is not specified. + * Return + * **-EINVAL** if invalid *flags* are passed. 
+ * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3948,6 +3960,7 @@ union bpf_attr { FN(task_storage_get), \ FN(task_storage_delete),\ FN(get_current_task_btf), \ + FN(lsm_set_bprm_opts), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -4119,6 +4132,11 @@ enum bpf_lwt_encap_mode { BPF_LWT_ENCAP_IP, }; +/* Flags for LSM helpers */ +enum { + BPF_LSM_F_BPRM_SECUREEXEC = (1ULL << 0), +}; + #define __bpf_md_ptr(type, name) \ union { \ type name; \ diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c index 553107f4706a..cd85482228a0 100644 --- a/kernel/bpf/bpf_lsm.c +++ b/kernel/bpf/bpf_lsm.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -51,6 +52,30 @@ int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, return 0; } +/* Mask for all the currently supported BPRM option flags */ +#define BPF_LSM_F_BRPM_OPTS_MASK BPF_LSM_F_BPRM_SECUREEXEC + +BPF_CALL_2(bpf_lsm_set_bprm_opts, struct linux_binprm *, bprm, u64, flags) +{ + ditto Would have fixed up these things on the fly while applying, but one small item I wanted to bring up here given uapi which will then freeze: it would be cleaner to call the helper just bpf_bprm_opts_set() or so given it's implied that we attach to lsm here and we don't use _lsm in the naming for the others either. Similarly, I'd drop the _LSM from the flag/mask. 
+ if (flags & ~BPF_LSM_F_BRPM_OPTS_MASK) + return -EINVAL; + + bprm->secureexec = (flags & BPF_LSM_F_BPRM_SECUREEXEC); + return 0; +} + +BTF_ID_LIST_SINGLE(bpf_lsm_set_bprm_opts_btf_ids, struct, linux_binprm) + +const static struct bpf_func_proto bpf_lsm_set_bprm_opts_proto = { + .func = bpf_lsm_set_bprm_opts, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID, + .arg1_btf_id = &bpf_lsm_set_bprm_opts_btf_ids[0], + .arg2_type = ARG_ANYTHING, +}; + static const struct bpf_func_proto * bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@ -71,6 +96,8 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_task_storage_get_proto; case BPF_FUNC_task_storage_delete: return &bpf_task_storage_delete_proto; + case BPF_FUNC_lsm_set_bprm_opts: + return &bpf_lsm_set_bprm_opts_proto; default: return tracing_prog_func_proto(func_id, prog); }
Re: [PATCH bpf-next 1/2] bpf: Add bpf_lsm_set_bprm_opts helper
On 11/16/20 3:01 PM, KP Singh wrote: From: KP Singh The helper allows modification of certain bits on the linux_binprm struct starting with the secureexec bit which can be updated using the BPF_LSM_F_BPRM_SECUREEXEC flag. secureexec can be set by the LSM for privilege gaining executions to set the AT_SECURE auxv for glibc. When set, the dynamic linker disables the use of certain environment variables (like LD_PRELOAD). Signed-off-by: KP Singh [...] /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -4119,6 +4128,11 @@ enum bpf_lwt_encap_mode { BPF_LWT_ENCAP_IP, }; +/* Flags for LSM helpers */ +enum { + BPF_LSM_F_BPRM_SECUREEXEC = (1ULL << 0), +}; + #define __bpf_md_ptr(type, name) \ union { \ type name; \ diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c index 553107f4706a..4d04fc490a14 100644 --- a/kernel/bpf/bpf_lsm.c +++ b/kernel/bpf/bpf_lsm.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -51,6 +52,23 @@ int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, return 0; } +BPF_CALL_2(bpf_lsm_set_bprm_opts, struct linux_binprm *, bprm, u64, flags) +{ This should also reject invalid flags. I'd rather change this helper from RET_VOID to RET_INTEGER and throw -EINVAL for everything other than BPF_LSM_F_BPRM_SECUREEXEC passed in here including zero so it can be extended in future. + bprm->secureexec = (flags & BPF_LSM_F_BPRM_SECUREEXEC); + return 0; +} + +BTF_ID_LIST_SINGLE(bpf_lsm_set_bprm_opts_btf_ids, struct, linux_binprm) + +const static struct bpf_func_proto bpf_lsm_set_bprm_opts_proto = { + .func = bpf_lsm_set_bprm_opts, + .gpl_only = false, + .ret_type = RET_VOID, + .arg1_type = ARG_PTR_TO_BTF_ID, + .arg1_btf_id= _lsm_set_bprm_opts_btf_ids[0], + .arg2_type = ARG_ANYTHING, +}; +
Re: [PATCH bpf-next 2/2] bpf: Expose bpf_d_path helper to sleepable LSM hooks
On 11/13/20 4:18 AM, Yonghong Song wrote: On 11/12/20 9:19 AM, KP Singh wrote: From: KP Singh Sleepable hooks are never called from an NMI/interrupt context, so it is safe to use the bpf_d_path helper in LSM programs attaching to these hooks. The helper is not restricted to sleepable programs and merely uses the list of sleeable hooks as the initial subset of LSM hooks where it can sleeable => sleepable probably no need to resend if no other major changes. The maintainer can just fix it up before merging. Did while rebasing & applying, thanks everyone! be used. Signed-off-by: KP Singh Acked-by: Yonghong Song
Re: [PATCH v4 bpf] tools: bpftool: Add missing close before bpftool net attach exit
On 11/13/20 12:51 PM, Wang Hai wrote: progfd is created by prog_parse_fd(), before 'bpftool net attach' exit, it should be closed. Fixes: 04949ccc273e ("tools: bpftool: add net attach command to attach XDP on interface") Signed-off-by: Wang Hai Applied & improved commit msg a bit, thanks!
Re: [PATCH bpf-next v2 1/2] bpf: Augment the set of sleepable LSM hooks
On 11/12/20 9:03 PM, KP Singh wrote: From: KP Singh Update the set of sleepable hooks with the ones that do not trigger a warning with might_fault() when exercised with the correct kernel config options enabled, i.e. DEBUG_ATOMIC_SLEEP=y LOCKDEP=y PROVE_LOCKING=y This means that a sleepable LSM eBPF program can be attached to these LSM hooks. A new helper method bpf_lsm_is_sleepable_hook is added and the set is maintained locally in bpf_lsm.c. A comment is added about the list of LSM hooks that have been observed to be called from softirqs, atomic contexts, or the ones that can trigger pagefaults and thus should not be added to this list. [...] +/* The set of hooks which are called without pagefaults disabled and are allowed + * to "sleep" and thus can be used for sleepable BPF programs. + * + * There are some hooks which have been observed to be called from a + * non-sleepable context and should not be added to this set: + * + * bpf_lsm_bpf_prog_free_security + * bpf_lsm_capable + * bpf_lsm_cred_free + * bpf_lsm_d_instantiate + * bpf_lsm_file_alloc_security + * bpf_lsm_file_mprotect + * bpf_lsm_file_send_sigiotask + * bpf_lsm_inet_conn_request + * bpf_lsm_inet_csk_clone + * bpf_lsm_inode_alloc_security + * bpf_lsm_inode_follow_link + * bpf_lsm_inode_permission + * bpf_lsm_key_permission + * bpf_lsm_locked_down + * bpf_lsm_mmap_addr + * bpf_lsm_perf_event_read + * bpf_lsm_ptrace_access_check + * bpf_lsm_req_classify_flow + * bpf_lsm_sb_free_security + * bpf_lsm_sk_alloc_security + * bpf_lsm_sk_clone_security + * bpf_lsm_sk_free_security + * bpf_lsm_sk_getsecid + * bpf_lsm_socket_sock_rcv_skb + * bpf_lsm_sock_graft + * bpf_lsm_task_free + * bpf_lsm_task_getioprio + * bpf_lsm_task_getscheduler + * bpf_lsm_task_kill + * bpf_lsm_task_setioprio + * bpf_lsm_task_setnice + * bpf_lsm_task_setpgid + * bpf_lsm_task_setrlimit + * bpf_lsm_unix_may_send + * bpf_lsm_unix_stream_connect + * bpf_lsm_vm_enough_memory + */ I think this is very useful info. 
I was wondering whether it would make sense to annotate these more closely to the code so there's less chance this info becomes stale? Maybe something like below, not sure ... issue is if you would just place a cant_sleep() in there it might be wrong since this should just document that it can be invoked from non-sleepable context but it might not have to. diff --git a/security/security.c b/security/security.c index a28045dc9e7f..7899bf32cdaa 100644 --- a/security/security.c +++ b/security/security.c @@ -94,6 +94,11 @@ static __initdata bool debug; pr_info(__VA_ARGS__); \ } while (0) +/* + * Placeholder for now to document that hook implementation cannot sleep + * since it could potentially be called from non-sleepable context, too. + */ +#define hook_cant_sleep() do { } while (0) + static bool __init is_enabled(struct lsm_info *lsm) { if (!lsm->enabled) @@ -2522,6 +2527,7 @@ void security_bpf_map_free(struct bpf_map *map) } void security_bpf_prog_free(struct bpf_prog_aux *aux) { + hook_cant_sleep(); call_void_hook(bpf_prog_free_security, aux); } #endif /* CONFIG_BPF_SYSCALL */
Re: [PATCH v3 bpf] tools: bpftool: Add missing close before bpftool net attach exit
On 11/11/20 2:54 PM, Wang Hai wrote: progfd is created by prog_parse_fd(), before 'bpftool net attach' exit, it should be closed. Fixes: 04949ccc273e ("tools: bpftool: add net attach command to attach XDP on interface") Signed-off-by: Wang Hai --- v2->v3: add 'err = 0' before successful return v1->v2: use cleanup tag instead of repeated closes tools/bpf/bpftool/net.c | 18 +++--- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c index 910e7bac6e9e..f927392271cc 100644 --- a/tools/bpf/bpftool/net.c +++ b/tools/bpf/bpftool/net.c @@ -578,8 +578,8 @@ static int do_attach(int argc, char **argv) ifindex = net_parse_dev(&argc, &argv); if (ifindex < 1) { - close(progfd); - return -EINVAL; + err = -EINVAL; + goto cleanup; } if (argc) { @@ -587,8 +587,8 @@ static int do_attach(int argc, char **argv) overwrite = true; } else { p_err("expected 'overwrite', got: '%s'?", *argv); - close(progfd); - return -EINVAL; + err = -EINVAL; + goto cleanup; } } @@ -597,16 +597,20 @@ static int do_attach(int argc, char **argv) err = do_attach_detach_xdp(progfd, attach_type, ifindex, overwrite); - if (err < 0) { + if (err) { p_err("interface %s attach failed: %s", attach_type_strings[attach_type], strerror(-err)); - return err; + goto cleanup; } if (json_output) jsonw_null(json_wtr); - return 0; + err = 0; Why is the 'err = 0' still needed here given we test for err != 0 earlier? Would just remove it, otherwise looks good. +cleanup: + close(progfd); + return err; } static int do_detach(int argc, char **argv)
Re: [PATCH v3] bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id
On 11/11/20 6:03 AM, xiakaixu1...@gmail.com wrote: From: Kaixu Xia The unsigned variable datasec_id is assigned a return value from the call to check_pseudo_btf_id(), which may return negative error code. Fixes coccicheck warning: ./kernel/bpf/verifier.c:9616:5-15: WARNING: Unsigned expression compared with zero: datasec_id > 0 Reported-by: Tosk Robot Signed-off-by: Kaixu Xia Looks good, applied & also added Fixes tags, thanks!
Re: [selftest/bpf] b83590ee1a: BUG:KASAN:slab-out-of-bounds_in_l
Hi Daniel, On 11/9/20 3:54 PM, kernel test robot wrote: Greeting, FYI, we noticed the following commit (built with gcc-9): commit: b83590ee1add052518603bae607b0524632b7793 ("[PATCH bpf v3 2/2] selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL") url: https://github.com/0day-ci/linux/commits/Daniel-Xu/Fix-bpf_probe_read_user_str-overcopying/20201106-033210 base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf.git master I've tossed them from the tree for now as it looks like these are adding regressions for regular strncpy_from_user() calls, please take a look. Thanks! in testcase: trinity version: trinity-x86_64-af355e9-1_2019-12-03 with following parameters: runtime: 300s test-description: Trinity is a linux system call fuzz tester. test-url: http://codemonkey.org.uk/projects/trinity/ on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): ++++ || e65411d04b | b83590ee1a | ++++ | BUG:KASAN:slab-out-of-bounds_in_l | 0 | 4 | ++++ If you fix the issue, kindly add following tag Reported-by: kernel test robot [ 54.933739] BUG: KASAN: slab-out-of-bounds in link_path_walk+0x8f5/0xa80 [ 54.935295] Read of size 1 at addr 88815f726951 by task modprobe/114 [ 54.936720] [ 54.937199] CPU: 1 PID: 114 Comm: modprobe Not tainted 5.9.0-13439-gb83590ee1add #1 [ 54.938907] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 54.940683] Call Trace: [ 54.941008] dump_stack+0x84/0xad [ 54.941008] print_address_description+0x2f/0x220 [ 54.941008] ? pm_suspend.cold+0x70e/0x70e [ 54.941008] ? _raw_write_lock_irqsave+0x150/0x150 [ 54.941008] ? link_path_walk+0x8f5/0xa80 [ 54.941008] kasan_report.cold+0x37/0x7c [ 54.941008] ? link_path_walk+0x8f5/0xa80 [ 54.941008] __asan_report_load1_noabort+0x14/0x20 [ 54.941008] link_path_walk+0x8f5/0xa80 [ 54.941008] ? walk_component+0x670/0x670 [ 54.941008] ? 
deactivate_slab+0x3d9/0x690 [ 54.941008] link_path_walk+0x91/0xb0 [ 54.941008] path_lookupat+0x12f/0x430 [ 54.941008] filename_lookup+0x19a/0x2d0 [ 54.941008] ? may_linkat+0x180/0x180 [ 54.941008] ? __check_object_size+0x2bf/0x390 [ 54.941008] ? strncpy_from_user+0x24b/0x490 [ 54.941008] ? getname_flags+0x13a/0x4a0 [ 54.941008] user_path_at_empty+0x3f/0x50 [ 54.941008] do_faccessat+0xc1/0x5d0 [ 54.941008] ? stream_open+0x60/0x60 [ 54.941008] ? exit_to_user_mode_prepare+0xb9/0x190 [ 54.941008] __x64_sys_access+0x56/0x80 [ 54.941008] do_syscall_64+0x5d/0x70 [ 54.941008] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 54.941008] RIP: 0033:0x7f3cc3f345f7 [ 54.941008] Code: c8 ff c3 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 01 c3 48 8d 0d 19 9b 20 00 f7 d8 89 01 48 83 c8 ff c3 b8 15 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8d 0d f9 9a 20 00 f7 d8 89 01 48 83 [ 54.941008] RSP: 002b:7ffde47f0b68 EFLAGS: 0246 ORIG_RAX: 0015 [ 54.941008] RAX: ffda RBX: RCX: 7f3cc3f345f7 [ 54.941008] RDX: 0006 RSI: 0004 RDI: 7f3cc3f39bd0 [ 54.941008] RBP: 7ffde47f1c70 R08: R09: [ 54.941008] R10: 0022 R11: 0246 R12: [ 54.941008] R13: 7ffde47f9330 R14: 000f R15: 7f3cc413e150 [ 54.941008] [ 54.941008] Allocated by task 114: [ 54.941008] kasan_save_stack+0x23/0x50 [ 54.941008] __kasan_kmalloc+0xe1/0xf0 [ 54.941008] kasan_slab_alloc+0xe/0x10 [ 54.941008] kmem_cache_alloc+0x166/0x360 [ 54.941008] getname_flags+0x4e/0x4a0 [ 54.941008] user_path_at_empty+0x2b/0x50 [ 54.941008] do_faccessat+0xc1/0x5d0 [ 54.941008] __x64_sys_access+0x56/0x80 [ 54.941008] do_syscall_64+0x5d/0x70 [ 54.941008] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 54.941008] [ 54.941008] The buggy address belongs to the object at 88815f725900 [ 54.941008] which belongs to the cache names_cache of size 4096 [ 54.941008] The buggy address is located 81 bytes to the right of [ 54.941008] 4096-byte region [88815f725900, 88815f726900) [ 54.941008] The buggy address belongs to the page: [ 54.941008] page:(ptrval) refcount:1 mapcount:0 
mapping: index:0x0 pfn:0x15f720 [ 54.941008] head:(ptrval) order:3 compound_mapcount:0 compound_pincount:0 [ 54.941008] flags: 0x80010200(slab|head) [
Re: [PATCH bpf-next] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator
On 11/4/20 9:18 PM, Daniel Xu wrote: On Wed Nov 4, 2020 at 8:24 AM PST, Daniel Borkmann wrote: On 11/4/20 3:29 AM, Daniel Xu wrote: do_strncpy_from_user() may copy some extra bytes after the NUL terminator into the destination buffer. This usually does not matter for normal string operations. However, when BPF programs key BPF maps with strings, this matters a lot. A BPF program may read strings from user memory by calling the bpf_probe_read_user_str() helper which eventually calls do_strncpy_from_user(). The program can then key a map with the resulting string. BPF map keys are fixed-width and string-agnostic, meaning that map keys are treated as a set of bytes. The issue is when do_strncpy_from_user() overcopies bytes after the NUL terminator, it can result in seemingly identical strings occupying multiple slots in a BPF map. This behavior is subtle and totally unexpected by the user. This commit uses the proper word-at-a-time APIs to avoid overcopying. Signed-off-by: Daniel Xu It looks like this is a regression from the recent refactoring of the mem probing util functions? I think it was like this from the beginning, at 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"). The old bpf_probe_read_str() used the kernel's byte-by-byte copying routine. bpf_probe_read_user_str() started using strncpy_from_user() which has been doing the long-sized strides since ~2012 or earlier. I tried to build and test the kernel at that commit but it seems my compiler is too new to build that old code. Bunch of build failures. I assume the refactor you're referring to is 8d92db5c04d1 ("bpf: rework the compat kernel probe handling"). Ah I see, it was just reusing 3d7081822f7f ("uaccess: Add non-pagefault user-space read functions"). Potentially it might be safer choice to just rework the strncpy_from_user_nofault() to mimic strncpy_from_kernel_nofault() in that regard? 
Could we add a Fixes tag and then we'd also need to target the fix against bpf tree instead of bpf-next, no? Sure, will do in v2. Moreover, a BPF kselftest would help to make sure it doesn't regress in future again. Ditto. [..] Thanks, Daniel
Re: [PATCH bpf-next] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator
On 11/4/20 3:29 AM, Daniel Xu wrote: do_strncpy_from_user() may copy some extra bytes after the NUL terminator into the destination buffer. This usually does not matter for normal string operations. However, when BPF programs key BPF maps with strings, this matters a lot. A BPF program may read strings from user memory by calling the bpf_probe_read_user_str() helper which eventually calls do_strncpy_from_user(). The program can then key a map with the resulting string. BPF map keys are fixed-width and string-agnostic, meaning that map keys are treated as a set of bytes. The issue is when do_strncpy_from_user() overcopies bytes after the NUL terminator, it can result in seemingly identical strings occupying multiple slots in a BPF map. This behavior is subtle and totally unexpected by the user. This commit uses the proper word-at-a-time APIs to avoid overcopying. Signed-off-by: Daniel Xu It looks like this is a regression from the recent refactoring of the mem probing util functions? Could we add a Fixes tag and then we'd also need to target the fix against bpf tree instead of bpf-next, no? Moreover, a BPF kselftest would help to make sure it doesn't regress in future again. Thanks, Daniel
Re: [PATCH] bpf: don't rely on GCC __attribute__((optimize)) to disable GCSE
On 10/27/20 9:57 PM, Ard Biesheuvel wrote: Commit 3193c0836f203 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()") introduced a __no_fgcse macro that expands to a function scope __attribute__((optimize("-fno-gcse"))), to disable a GCC specific optimization that was causing trouble on x86 builds, and was not expected to have any positive effect in the first place. However, as the GCC manual documents, __attribute__((optimize)) is not for production use, and results in all other optimization options to be forgotten for the function in question. This can cause all kinds of trouble, but in one particular reported case, Looks like there are couple more as well aside from __no_fgcse, are you also planning to fix them? arch/powerpc/kernel/setup.h:14:#define __nostackprotector __attribute__((__optimize__("no-stack-protector"))) tools/include/linux/compiler-gcc.h:37:#define __no_tail_call __attribute__((optimize("no-optimize-sibling-calls"))) it causes -fno-asynchronous-unwind-tables to be disregarded, resulting in .eh_frame info to be emitted for the function inadvertently. Would have been useful to add a pointer to the original discussion with Link tag. Link: https://lore.kernel.org/lkml/CAMuHMdUg0WJHEcq6to0-eODpXPOywLot6UD2=gfhpzoj_hc...@mail.gmail.com/ This reverts commit 3193c0836f203, and instead, it disables the -fgcse optimization for the entire source file, but only when building for X86. Cc: Nick Desaulniers Cc: Arvind Sankar Cc: Randy Dunlap Cc: Josh Poimboeuf Cc: Thomas Gleixner Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Peter Zijlstra (Intel) Cc: Geert Uytterhoeven Cc: Kees Cook Fixes: 3193c0836f203 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()") Signed-off-by: Ard Biesheuvel [...] 
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index bdc8cd1b6767..02b58f44c479 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1,6 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 obj-y := core.o -CFLAGS_core.o += $(call cc-disable-warning, override-init) +# ___bpf_prog_run() needs GCSE disabled on x86; see 3193c0836f203 for details +cflags-core-$(CONFIG_X86) := -fno-gcse +CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-core-y) Also, this needs to be guarded behind !CONFIG_RETPOLINE and !CONFIG_BPF_JIT_ALWAYS_ON in particular the latter since only in this case interpreter is compiled in ... most distros have the CONFIG_BPF_JIT_ALWAYS_ON set these days for x86. Do you have an analysis for the commit log on what else this penalizes in core.c if it's now for the entire translation unit? obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 9268d77898b7..55454d2278b1 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -1369,7 +1369,7 @@ u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr) * * Decode and execute eBPF instructions. */ -static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack) +static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack) { #define BPF_INSN_2_LBL(x, y) [BPF_##x | BPF_##y] = &&x##_##y #define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z
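One way to address the guarding point above is to scope the flag to configurations that actually compile and use the interpreter. This is an untested kbuild sketch against the hunk above, not a proposed final change — the exact condition (and whether retpoline builds need their own handling) would need verification:

```make
# ___bpf_prog_run() needs GCSE disabled on x86; see 3193c0836f203 for details.
# Only relevant when the interpreter can actually run, i.e. not with
# CONFIG_BPF_JIT_ALWAYS_ON, and not for retpoline builds.
ifeq ($(CONFIG_RETPOLINE)$(CONFIG_BPF_JIT_ALWAYS_ON),)
cflags-core-$(CONFIG_X86) := -fno-gcse
endif
CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-core-y)
```

The concatenated `ifeq` is empty only when both options are unset, which keeps the workaround off for the common distro configuration.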
Re: linux-next: build failure after merge of the net-next tree
On 10/6/20 7:41 AM, Stephen Rothwell wrote: On Tue, 6 Oct 2020 07:13:01 +0200 Christoph Hellwig wrote: On Tue, Oct 06, 2020 at 02:58:47PM +1100, Stephen Rothwell wrote: Hi all, After merging the net-next tree, today's linux-next build (x86_64 allmodconfig) failed like this: It actually doesn't need that or the two other internal headers. Bjoern has a fix, and it was supposed to be queued up according to patchwork. Yeah, it is in the bpf-next tree but not merged into the net-next tree yet. Yep, applied yesterday. Given a3cf4abf ("dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-mapping.h>") is in dma-mapping tree and not yet affecting bpf-next nor net-next, we were planning to ship bpf-next at the usual cadence this week, so it'll be in net-next end of week for sure. (If there is urgent reason to have it in net-next today, please let us know of course.) Thanks, Daniel
Re: mb2q experience and couple issues
On 10/1/20 11:13 AM, Thomas Gleixner wrote: On Wed, Sep 30 2020 at 11:12, Alexei Starovoitov wrote: For the last couple years we've been using mb2q tool to normalize patches and it worked wonderfully. Fun. I thought I'm the only user of it :) We're using it pretty much daily since you've put it on korg :) It's in a bunch of scripts^hacks we use for bpf trees: https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/pw.git/ Recently we've hit few bugs: curl -s https://patchwork.kernel.org/patch/11807443/mbox/ > /tmp/mbox.i; ~/bin/mb2q --mboxout mbox.o /tmp/mbox.i Drop Message w/o Message-ID: No subject No patches found in mbox I've tried to debug it, but couldn't figure out what's going on. The subject and message-id fields are parsed correctly, but later something happens. Could you please take a look? The problem is the mbox storage format. The mbox created by curl has a mail body which has a line starting with 'From' in the mail body: From the VAR btf_id, the verifier can also read the address of the ksym's corresponding kernel var from kallsyms and use that to fill dst_reg. The mailbox parser trips over that From and takes it as start of the next message. http://qmail.org/qmail-manual-html/man5/mbox.html Usually mailbox storage escapes a From at the start of a newline with '>': >From the VAR btf_id, the verifier can also read the address of the ksym's corresponding kernel var from kallsyms and use that to fill dst_reg. Yes, it's ugly and I haven't figured out a proper way to deal with that. There are quite some mbox formats out there and they all are incompatible with each other and all of them have different horrors. Let me think about it. It seems these issues only appeared since maybe a month or so. Perhaps also something changed on ozlabs/patchwork side. Cheers, Daniel
Re: [PATCH] powerpc: net: bpf_jit_comp: Fix misuse of fallthrough
On 9/28/20 11:00 AM, zhe...@windriver.com wrote: From: He Zhe The user defined label following "fallthrough" is not considered by GCC and causes build failure. kernel-source/include/linux/compiler_attributes.h:208:41: error: attribute 'fallthrough' not preceding a case label or default label [-Werror] 208 | # define fallthrough __attribute__((__fallthrough__)) ^ Signed-off-by: He Zhe Applied, thanks! I've also added Fixes tag with df561f6688fe ("treewide: Use fallthrough pseudo-keyword") which added the bug.
Re: [PATCH v7 bpf-next 4/8] selftests/bpf: add bpf_snprintf_btf helper tests
On 9/28/20 1:31 PM, Alan Maguire wrote: Tests verifying snprintf()ing of various data structures, flags combinations using a tp_btf program. Tests are skipped if __builtin_btf_type_id is not available to retrieve BTF type ids. Signed-off-by: Alan Maguire [...] +void test_snprintf_btf(void) +{ + struct netif_receive_skb *skel; + struct netif_receive_skb__bss *bss; + int err, duration = 0; + + skel = netif_receive_skb__open(); + if (CHECK(!skel, "skel_open", "failed to open skeleton\n")) + return; + + err = netif_receive_skb__load(skel); + if (CHECK(err, "skel_load", "failed to load skeleton: %d\n", err)) + goto cleanup; + + bss = skel->bss; + + err = netif_receive_skb__attach(skel); + if (CHECK(err, "skel_attach", "skeleton attach failed: %d\n", err)) + goto cleanup; + + /* generate receive event */ + system("ping -c 1 127.0.0.1 > /dev/null"); This generates the following new warning when compiling BPF selftests: [...] EXT-OBJ [test_progs] cgroup_helpers.o EXT-OBJ [test_progs] trace_helpers.o EXT-OBJ [test_progs] network_helpers.o EXT-OBJ [test_progs] testing_helpers.o TEST-OBJ [test_progs] snprintf_btf.test.o /root/bpf-next/tools/testing/selftests/bpf/prog_tests/snprintf_btf.c: In function ‘test_snprintf_btf’: /root/bpf-next/tools/testing/selftests/bpf/prog_tests/snprintf_btf.c:30:2: warning: ignoring return value of ‘system’, declared with attribute warn_unused_result [-Wunused-result] system("ping -c 1 127.0.0.1 > /dev/null"); ^ [...] Please fix, thx!
Re: [PATCH] tools build feature: cleanup feature files on make clean
Hi Arnaldo, On 9/3/20 9:03 PM, Arnaldo Carvalho de Melo wrote: Em Thu, Aug 27, 2020 at 10:53:36AM +0200, Jesper Dangaard Brouer escreveu: The system for "Auto-detecting system features" located under tools/build/ are (currently) used by perf, libbpf and bpftool. It can contain stalled feature detection files, which are not cleaned up by libbpf and bpftool on make clean (side-note: perf tool is correct). Fix this by making the users invoke the make clean target. Some details about the changes. The libbpf Makefile already had a clean-config target (which seems to be copy-pasted from perf), but this target was not "connected" (a make dependency) to clean target. Choose not to rename target as someone might be using it. Did change the output from "CLEAN config" to "CLEAN feature-detect", to make it more clear what happens. Since this mostly touches BPF, should it go via the BPF tree? Already applied roughly a week ago: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=661b37cd437ef49cd28444f79b9b0c71ea76e8c8 Thanks, Daniel
Re: [PATCH] xsk: Free variable on error path
On 9/2/20 6:33 PM, Alex Dewar wrote: In xp_create_dma_map(), memory is allocated to dma_map->dma_pages, but then dma_map is erroneously compared to NULL, rather than the member. Fix this. Addresses-Coverity: ("Dead code") Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings") Signed-off-by: Alex Dewar Thanks, already applied a fix that was sent earlier [0]. [0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=1d6fd78a213ee3874f46bdce083b7a41d208886d
Re: [PATCH][next] xsk: fix incorrect memory allocation failure check on dma_map->dma_pages
On 9/2/20 6:13 PM, Colin King wrote: From: Colin Ian King The failed memory allocation check for dma_map->dma_pages is incorrect; it is null-checking dma_map and not dma_map->dma_pages. Fix this. Addresses-Coverity: ("Logically dead code") Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings") Signed-off-by: Colin Ian King Thanks, already applied a fix that was sent earlier [0]. [0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=1d6fd78a213ee3874f46bdce083b7a41d208886d
Re: [PATCH][next] xsk: Fix null check on error return path
On 9/2/20 5:07 PM, Gustavo A. R. Silva wrote: Currently, dma_map is being checked, when the right object identifier to be null-checked is dma_map->dma_pages, instead. Fix this by null-checking dma_map->dma_pages. Addresses-Coverity-ID: 1496811 ("Logically dead code") Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings") Signed-off-by: Gustavo A. R. Silva Applied, thanks!
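All three patches above address the same bug class; a minimal userspace sketch of it follows (the struct and function names are illustrative, not the xsk code itself):

```c
#include <assert.h>
#include <stdlib.h>

struct dma_map {
	void **dma_pages;
};

/* Returns 0 on success. The corrected check tests the freshly
 * allocated member; the buggy version tested the containing object,
 * which is non-NULL regardless of whether the allocation succeeded,
 * making the error path dead code (as Coverity reported). */
static int alloc_pages_array(struct dma_map *dma_map, size_t nr)
{
	dma_map->dma_pages = calloc(nr, sizeof(*dma_map->dma_pages));
	if (!dma_map->dma_pages)	/* was: if (!dma_map), a dead check */
		return -1;
	return 0;
}
```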
Re: KASAN: use-after-free Write in xp_put_pool
On 9/2/20 8:57 AM, syzbot wrote: Hello, syzbot found the following issue on: Magnus/Bjorn, ptal, thanks! HEAD commit:dc1a9bf2 octeontx2-pf: Add UDP segmentation offload support git tree: net-next console output: https://syzkaller.appspot.com/x/log.txt?x=16ff67de90 kernel config: https://syzkaller.appspot.com/x/.config?x=b6856d16f78d8fa9 dashboard link: https://syzkaller.appspot.com/bug?extid=5334f62e4d22804e646a compiler: gcc (GCC) 10.1.0-syz 20200507 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12e9f27990 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=125f3e1e90 The issue was bisected to: commit a1132430c2c55af62d13e9fca752d46f14d548b3 Author: Magnus Karlsson Date: Fri Aug 28 08:26:26 2020 + xsk: Add shared umem support between devices bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=10a373de90 final oops: https://syzkaller.appspot.com/x/report.txt?x=12a373de90 console output: https://syzkaller.appspot.com/x/log.txt?x=14a373de90 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+5334f62e4d22804e6...@syzkaller.appspotmail.com Fixes: a1132430c2c5 ("xsk: Add shared umem support between devices") == BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline] BUG: KASAN: use-after-free in atomic_fetch_sub_release include/asm-generic/atomic-instrumented.h:220 [inline] BUG: KASAN: use-after-free in refcount_sub_and_test include/linux/refcount.h:266 [inline] BUG: KASAN: use-after-free in refcount_dec_and_test include/linux/refcount.h:294 [inline] BUG: KASAN: use-after-free in xp_put_pool+0x2c/0x1e0 net/xdp/xsk_buff_pool.c:262 Write of size 4 at addr 8880a6a4d860 by task ksoftirqd/0/9 CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.9.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x18f/0x20d lib/dump_stack.c:118 
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:192 instrument_atomic_write include/linux/instrumented.h:71 [inline] atomic_fetch_sub_release include/asm-generic/atomic-instrumented.h:220 [inline] refcount_sub_and_test include/linux/refcount.h:266 [inline] refcount_dec_and_test include/linux/refcount.h:294 [inline] xp_put_pool+0x2c/0x1e0 net/xdp/xsk_buff_pool.c:262 xsk_destruct+0x7d/0xa0 net/xdp/xsk.c:1138 __sk_destruct+0x4b/0x860 net/core/sock.c:1764 rcu_do_batch kernel/rcu/tree.c:2428 [inline] rcu_core+0x5c7/0x1190 kernel/rcu/tree.c:2656 __do_softirq+0x2de/0xa24 kernel/softirq.c:298 run_ksoftirqd kernel/softirq.c:652 [inline] run_ksoftirqd+0x89/0x100 kernel/softirq.c:644 smpboot_thread_fn+0x655/0x9e0 kernel/smpboot.c:165 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Allocated by task 6854: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461 kmalloc_node include/linux/slab.h:577 [inline] kvmalloc_node+0x61/0xf0 mm/util.c:574 kvmalloc include/linux/mm.h:750 [inline] kvzalloc include/linux/mm.h:758 [inline] xp_create_and_assign_umem+0x58/0x8d0 net/xdp/xsk_buff_pool.c:54 xsk_bind+0x9a0/0xed0 net/xdp/xsk.c:709 __sys_bind+0x1e9/0x250 net/socket.c:1656 __do_sys_bind net/socket.c:1667 [inline] __se_sys_bind net/socket.c:1665 [inline] __x64_sys_bind+0x6f/0xb0 net/socket.c:1665 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 6854: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track+0x1c/0x30 mm/kasan/common.c:56 kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422 __cache_free 
mm/slab.c:3418 [inline] kfree+0x103/0x2c0 mm/slab.c:3756 kvfree+0x42/0x50 mm/util.c:603 xp_destroy net/xdp/xsk_buff_pool.c:44 [inline] xp_destroy+0x45/0x60 net/xdp/xsk_buff_pool.c:38 xsk_bind+0xbdd/0xed0 net/xdp/xsk.c:719 __sys_bind+0x1e9/0x250 net/socket.c:1656 __do_sys_bind net/socket.c:1667 [inline] __se_sys_bind net/socket.c:1665 [inline] __x64_sys_bind+0x6f/0xb0 net/socket.c:1665 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at 8880a6a4d800 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 96 bytes inside of 1024-byte region [8880a6a4d800, 8880a6a4dc00) The buggy address belongs to the page: page:dd5fc18f
Re: [PATCH] tools build feature: cleanup feature files on make clean
On 8/27/20 10:53 AM, Jesper Dangaard Brouer wrote: The system for "Auto-detecting system features" located under tools/build/ are (currently) used by perf, libbpf and bpftool. It can contain stalled feature detection files, which are not cleaned up by libbpf and bpftool on make clean (side-note: perf tool is correct). Fix this by making the users invoke the make clean target. Some details about the changes. The libbpf Makefile already had a clean-config target (which seems to be copy-pasted from perf), but this target was not "connected" (a make dependency) to clean target. Choose not to rename target as someone might be using it. Did change the output from "CLEAN config" to "CLEAN feature-detect", to make it more clear what happens. This is related to the complaint and troubleshooting in link: Link: https://lore.kernel.org/lkml/20200818122007.2d1cfe2d@carbon/ Signed-off-by: Jesper Dangaard Brouer Applied to bpf-next, thanks!
Re: [PATCH bpf-next] libbpf: simplify the return expression of build_map_pin_path()
On 8/19/20 4:53 AM, Xu Wang wrote: Simplify the return expression. Signed-off-by: Xu Wang Applied, thanks!
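The shape of the simplification, sketched with invented helpers (the real patch touches build_map_pin_path() in libbpf):

```c
#include <assert.h>

/* stand-in for the callee whose result is being forwarded */
static int do_op(int v)
{
	return v < 0 ? -1 : 0;
}

/* before: "err" only forwards the callee's result */
static int wrapper_verbose(int v)
{
	int err = do_op(v);

	if (err)
		return err;
	return 0;
}

/* after: return the expression directly; behavior is identical */
static int wrapper(int v)
{
	return do_op(v);
}
```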
Re: [PATCH bpf-next v2] bpf: fix segmentation fault of test_progs
On 8/10/20 5:39 PM, Jianlin Lv wrote: test_progs reports a segmentation fault as below $ sudo ./test_progs -t mmap --verbose test_mmap:PASS:skel_open_and_load 0 nsec .. test_mmap:PASS:adv_mmap1 0 nsec test_mmap:PASS:adv_mmap2 0 nsec test_mmap:PASS:adv_mmap3 0 nsec test_mmap:PASS:adv_mmap4 0 nsec Segmentation fault This issue was triggered because mmap() and munmap() used inconsistent length parameters; mmap() creates a new mapping of 3*page_size, but the length parameter set in the subsequent re-map and munmap() functions is 4*page_size; this leads to the destruction of the process space. To fix this issue, first create 4 pages of anonymous mapping, then do all the mmap() with MAP_FIXED. Another issue is that when unmapping the second page fails, the length parameter used to delete the tmp1 mappings should be 4*page_size. Signed-off-by: Jianlin Lv Applied, thanks!
Re: [PATCH] kernel: bpf: delete repeated words in comments
On 8/7/20 5:31 AM, Randy Dunlap wrote: Drop repeated words in kernel/bpf/. {has, the} Signed-off-by: Randy Dunlap Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: net...@vger.kernel.org Cc: b...@vger.kernel.org Applied, thanks!
Re: [PATCH bpf] bpf: doc: remove references to warning message when using bpf_trace_printk()
On 8/7/20 1:50 PM, Alan Maguire wrote: The BPF helper bpf_trace_printk() no longer uses trace_printk(); it now triggers a dedicated trace event. Hence the described warning is no longer present, so remove the discussion of it as it may confuse people. Fixes: ac5a72ea5c89 ("bpf: Use dedicated bpf_trace_printk event instead of trace_printk()") Signed-off-by: Alan Maguire Applied, thanks!
Re: [PATCH bpf-next v3 00/29] bpf: switch to memcg-based memory accounting
On 8/3/20 7:05 PM, Roman Gushchin wrote: On Mon, Aug 03, 2020 at 06:39:01PM +0200, Daniel Borkmann wrote: On 8/3/20 5:34 PM, Roman Gushchin wrote: On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote: On 7/30/20 11:22 PM, Roman Gushchin wrote: Currently bpf is using the memlock rlimit for the memory accounting. This approach has its downsides and over time has created a significant amount of problems: 1) The limit is per-user, but because most bpf operations are performed as root, the limit has a little value. 2) It's hard to come up with a specific maximum value. Especially because the counter is shared with non-bpf users (e.g. memlock() users). Any specific value is either too low and creates false failures or too high and useless. 3) Charging is not connected to the actual memory allocation. Bpf code should manually calculate the estimated cost and precharge the counter, and then take care of uncharging, including all fail paths. It adds to the code complexity and makes it easy to leak a charge. 4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from being convenient. 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. In order to overcome these problems let's switch to the memcg-based memory accounting of bpf objects. With the recent addition of the percpu memory accounting, now it's possible to provide a comprehensive accounting of memory used by bpf programs and maps. This approach has the following advantages: 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows a better control over memory usage by different workloads. 2) The actual memory consumption is taken into account. It happens automatically on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also performed automatically on releasing the memory. So the code on the bpf side becomes simpler and safer. 
3) There is a simple way to get the current value and statistics. The patchset consists of the following parts: 1) memcg-based accounting for various bpf objects: progs and maps 2) removal of the rlimit-based accounting 3) removal of rlimit adjustments in userspace samples The diff stat looks nice & agree that rlimit sucks, but I'm missing how this is set is supposed to work reliably, at least I currently fail to see it. Elaborating on this in more depth especially for the case of unprivileged users should be a /fundamental/ part of the commit message. Lets take an example: unprivileged user adds a max sized hashtable to one of its programs, and configures the map that it will perform runtime allocation. The load succeeds as it doesn't surpass the limits set for the current memcg. Kernel then processes packets from softirq. Given the runtime allocations, we end up mischarging to whoever ended up triggering __do_softirq(). If, for example, ksoftirq thread, then it's probably reasonable to assume that this might not be accounted e.g. limits are not imposed on the root cgroup. If so we would probably need to drag the context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably. Otherwise how do you protect unprivileged users to OOM the machine? this is a valid concern, thank you for bringing it in. It can be resolved by associating a map with a memory cgroup on creation, so that we can charge this memory cgroup later, even from a soft-irq context. The question here is whether we want to do it for all maps, or just for dynamic hashtables (or any similar cases, if there are any)? I think the second option is better. With the first option we have to annotate all memory allocations in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(), so it's easy to mess it up in the future. What do you think? We would need to do it for all maps that are configured with non-prealloc, e.g. not only hash/LRU table but also others like LPM maps etc. 
I wonder whether program entry/ exit could do the memalloc_use_memcg() / memalloc_unuse_memcg() and then everything would be accounted against the prog's memcg from runtime side, but then there's the usual issue with 'unuse'-restore on tail calls, and it doesn't solve the syscall side. But seems like the memalloc_{use,unuse}_memcg()'s remote charging is lightweight anyway compared to some of the other map update work such as taking bucket lock etc. I'll explore it and address in the next version. Thank you for suggestions! Ok. I'm probably still missing one more thing, but could you elaborate what limits would be enforced if an unprivileged user creates a prog/map on the host (w/o further action such as moving to a specific cgroup)? From what I can tell via looking at systemd: $ cat /proc/self/cgroup 11:cpuset:/ 10:hugetlb:/ 9:devices:/user.slice 8:cpu,cpuacct
Re: [PATCH bpf-next v3 00/29] bpf: switch to memcg-based memory accounting
On 8/3/20 5:34 PM, Roman Gushchin wrote: On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote: On 7/30/20 11:22 PM, Roman Gushchin wrote: Currently bpf is using the memlock rlimit for the memory accounting. This approach has its downsides and over time has created a significant amount of problems: 1) The limit is per-user, but because most bpf operations are performed as root, the limit has a little value. 2) It's hard to come up with a specific maximum value. Especially because the counter is shared with non-bpf users (e.g. memlock() users). Any specific value is either too low and creates false failures or too high and useless. 3) Charging is not connected to the actual memory allocation. Bpf code should manually calculate the estimated cost and precharge the counter, and then take care of uncharging, including all fail paths. It adds to the code complexity and makes it easy to leak a charge. 4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from being convenient. 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. In order to overcome these problems let's switch to the memcg-based memory accounting of bpf objects. With the recent addition of the percpu memory accounting, now it's possible to provide a comprehensive accounting of memory used by bpf programs and maps. This approach has the following advantages: 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows a better control over memory usage by different workloads. 2) The actual memory consumption is taken into account. It happens automatically on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also performed automatically on releasing the memory. So the code on the bpf side becomes simpler and safer. 3) There is a simple way to get the current value and statistics. 
The patchset consists of the following parts: 1) memcg-based accounting for various bpf objects: progs and maps 2) removal of the rlimit-based accounting 3) removal of rlimit adjustments in userspace samples The diff stat looks nice & agree that rlimit sucks, but I'm missing how this is set is supposed to work reliably, at least I currently fail to see it. Elaborating on this in more depth especially for the case of unprivileged users should be a /fundamental/ part of the commit message. Lets take an example: unprivileged user adds a max sized hashtable to one of its programs, and configures the map that it will perform runtime allocation. The load succeeds as it doesn't surpass the limits set for the current memcg. Kernel then processes packets from softirq. Given the runtime allocations, we end up mischarging to whoever ended up triggering __do_softirq(). If, for example, ksoftirq thread, then it's probably reasonable to assume that this might not be accounted e.g. limits are not imposed on the root cgroup. If so we would probably need to drag the context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably. Otherwise how do you protect unprivileged users to OOM the machine? this is a valid concern, thank you for bringing it in. It can be resolved by associating a map with a memory cgroup on creation, so that we can charge this memory cgroup later, even from a soft-irq context. The question here is whether we want to do it for all maps, or just for dynamic hashtables (or any similar cases, if there are any)? I think the second option is better. With the first option we have to annotate all memory allocations in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(), so it's easy to mess it up in the future. What do you think? We would need to do it for all maps that are configured with non-prealloc, e.g. not only hash/LRU table but also others like LPM maps etc. 
I wonder whether program entry/ exit could do the memalloc_use_memcg() / memalloc_unuse_memcg() and then everything would be accounted against the prog's memcg from runtime side, but then there's the usual issue with 'unuse'-restore on tail calls, and it doesn't solve the syscall side. But seems like the memalloc_{use,unuse}_memcg()'s remote charging is lightweight anyway compared to some of the other map update work such as taking bucket lock etc. Similarly, what happens to unprivileged users if kmemcg was not configured into the kernel or has been disabled? Well, I don't think we can address it. Memcg-based memory accounting requires enabled memory cgroups, a properly configured cgroup tree and also the kernel memory accounting turned on to function properly. Because we at Facebook are using cgroup for the memory accounting and control everywhere, I might be biased. If there are real !memcg systems which are actively using non-privileged bpf, we should keep the old system in place and make i
Re: [PATCH] tools/bpf/bpftool: Fix wrong return value in do_dump()
On 8/2/20 1:15 PM, Tianjia Zhang wrote: In case the btf_id does not exist, a negative error code -ENOENT should be returned. Fixes: c93cc69004df3 ("bpftool: add ability to dump BTF types") Cc: Andrii Nakryiko Signed-off-by: Tianjia Zhang Applied, thanks!
Re: [PATCH bpf-next v3 00/29] bpf: switch to memcg-based memory accounting
On 7/30/20 11:22 PM, Roman Gushchin wrote: Currently bpf is using the memlock rlimit for the memory accounting. This approach has its downsides and over time has created a significant amount of problems: 1) The limit is per-user, but because most bpf operations are performed as root, the limit has a little value. 2) It's hard to come up with a specific maximum value. Especially because the counter is shared with non-bpf users (e.g. memlock() users). Any specific value is either too low and creates false failures or too high and useless. 3) Charging is not connected to the actual memory allocation. Bpf code should manually calculate the estimated cost and precharge the counter, and then take care of uncharging, including all fail paths. It adds to the code complexity and makes it easy to leak a charge. 4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from being convenient. 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. In order to overcome these problems let's switch to the memcg-based memory accounting of bpf objects. With the recent addition of the percpu memory accounting, now it's possible to provide a comprehensive accounting of memory used by bpf programs and maps. This approach has the following advantages: 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows a better control over memory usage by different workloads. 2) The actual memory consumption is taken into account. It happens automatically on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also performed automatically on releasing the memory. So the code on the bpf side becomes simpler and safer. 3) There is a simple way to get the current value and statistics. 
The patchset consists of the following parts: 1) memcg-based accounting for various bpf objects: progs and maps 2) removal of the rlimit-based accounting 3) removal of rlimit adjustments in userspace samples The diff stat looks nice & agree that rlimit sucks, but I'm missing how this is set is supposed to work reliably, at least I currently fail to see it. Elaborating on this in more depth especially for the case of unprivileged users should be a /fundamental/ part of the commit message. Lets take an example: unprivileged user adds a max sized hashtable to one of its programs, and configures the map that it will perform runtime allocation. The load succeeds as it doesn't surpass the limits set for the current memcg. Kernel then processes packets from softirq. Given the runtime allocations, we end up mischarging to whoever ended up triggering __do_softirq(). If, for example, ksoftirq thread, then it's probably reasonable to assume that this might not be accounted e.g. limits are not imposed on the root cgroup. If so we would probably need to drag the context of /where/ this must be charged to __memcg_kmem_charge_page() to do it reliably. Otherwise how do you protect unprivileged users to OOM the machine? Similarly, what happens to unprivileged users if kmemcg was not configured into the kernel or has been disabled? Thanks, Daniel
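The mechanism under discussion, reduced to a kernel-side pseudocode sketch (not runnable in userspace; the function name is invented): with memcg-based accounting, the charge happens implicitly at allocation time via __GFP_ACCOUNT instead of being manually pre-charged against the memlock rlimit.

```c
/* Kernel-side sketch only. Passing __GFP_ACCOUNT charges the
 * allocation to the current task's memory cgroup and uncharges it
 * automatically on kfree(), which is what makes the manual
 * precharge/uncharge bookkeeping of the rlimit scheme unnecessary. */
static void *bpf_alloc_accounted_sketch(size_t size)
{
	return kzalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
}
```

The softirq concern raised above is precisely that "current" at allocation time may not be the task that created the map, which is why charging would need to be redirected to the map's memcg, e.g. via memalloc_use_memcg()/memalloc_unuse_memcg().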
Re: [PATCH bpf-next v3] Documentation/bpf: Use valid and new links in index.rst
On 7/31/20 10:29 AM, Tiezhu Yang wrote: There exists an error "404 Not Found" when I click the html link of "Documentation/networking/filter.rst" in the BPF documentation [1], fix it. Additionally, use the new links about "BPF and XDP Reference Guide" and "bpf(2)" to avoid redirects. [1] https://www.kernel.org/doc/html/latest/bpf/ Fixes: d9b9170a2653 ("docs: bpf: Rename README.rst to index.rst") Fixes: cb3f0d56e153 ("docs: networking: convert filter.txt to ReST") Signed-off-by: Tiezhu Yang That looks better, applied, thanks!
Re: [PATCH bpf-next] bpf: fix compilation warning of selftests
On 7/31/20 8:16 AM, Jianlin Lv wrote: Clang compiler version: 12.0.0 The following warnings appear during the selftests/bpf compilation: prog_tests/send_signal.c:51:3: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result] 51 | write(pipe_c2p[1], buf, 1); | ^~ prog_tests/send_signal.c:54:3: warning: ignoring return value of ‘read’, declared with attribute warn_unused_result [-Wunused-result] 54 | read(pipe_p2c[0], buf, 1); | ^ .. prog_tests/stacktrace_build_id_nmi.c:13:2: warning: ignoring return value of ‘fscanf’, declared with attribute warn_unused_result [-Wunused-result] 13 | fscanf(f, "%llu", _freq); | ^~~ test_tcpnotify_user.c:133:2: warning: ignoring return value of ‘system’, declared with attribute warn_unused_result [-Wunused-result] 133 | system(test_script); | ^~~ test_tcpnotify_user.c:138:2: warning: ignoring return value of ‘system’, declared with attribute warn_unused_result [-Wunused-result] 138 | system(test_script); | ^~~ test_tcpnotify_user.c:143:2: warning: ignoring return value of ‘system’, declared with attribute warn_unused_result [-Wunused-result] 143 | system(test_script); | ^~~ Add code that fixes the compilation warnings about ignored return values and handles any errors; checking the return value of library APIs makes the code more secure. Signed-off-by: Jianlin Lv Looks good overall, there is one small bug that slipped in, see below: [...]
diff --git a/tools/testing/selftests/bpf/test_tcpnotify_user.c b/tools/testing/selftests/bpf/test_tcpnotify_user.c index f9765ddf0761..869e28c92d73 100644 --- a/tools/testing/selftests/bpf/test_tcpnotify_user.c +++ b/tools/testing/selftests/bpf/test_tcpnotify_user.c @@ -130,17 +130,26 @@ int main(int argc, char **argv) sprintf(test_script, "iptables -A INPUT -p tcp --dport %d -j DROP", TESTPORT); - system(test_script); + if (system(test_script)) { + printf("FAILED: execute command: %s\n", test_script); + goto err; + } sprintf(test_script, "nc 127.0.0.1 %d < /etc/passwd > /dev/null 2>&1 ", TESTPORT); - system(test_script); + if (system(test_script)) { + printf("FAILED: execute command: %s\n", test_script); + goto err; + } Did you try to run this test case? With the patch here it will fail: # ./test_tcpnotify_user FAILED: execute command: nc 127.0.0.1 12877 < /etc/passwd > /dev/null 2>&1 This is because nc returns 1 as exit code and for the test it is actually expected to fail given the iptables rule we installed for TESTPORT right above and remove again below. Please adapt this and send a v2, thanks! sprintf(test_script, "iptables -D INPUT -p tcp --dport %d -j DROP", TESTPORT); - system(test_script); + if (system(test_script)) { + printf("FAILED: execute command: %s\n", test_script); + goto err; + } rv = bpf_map_lookup_elem(bpf_map__fd(global_map), , ); if (rv != 0) {
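One hedged way the requested v2 could distinguish the nc call, which is expected to fail because of the DROP rule, from the iptables calls, which must succeed (the helper name is an assumption, not the actual fix):

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Run a command that is *expected* to exit nonzero; returns 0 when it
 * failed as expected, -1 otherwise. */
static int run_cmd_expect_fail(const char *cmd)
{
	int status = system(cmd);

	if (status == -1)
		return -1;	/* could not spawn the shell at all */
	if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
		return 0;	/* failed as expected */
	return -1;		/* unexpectedly succeeded */
}
```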
Re: [PATCH bpf-next 1/1] arm64: bpf: Add BPF exception tables
On 7/30/20 11:14 PM, Jean-Philippe Brucker wrote: On Thu, Jul 30, 2020 at 09:47:39PM +0200, Daniel Borkmann wrote: On 7/30/20 4:22 PM, Jean-Philippe Brucker wrote: On Thu, Jul 30, 2020 at 08:28:56AM -0400, Qian Cai wrote: On Tue, Jul 28, 2020 at 05:21:26PM +0200, Jean-Philippe Brucker wrote: When a tracing BPF program attempts to read memory without using the bpf_probe_read() helper, the verifier marks the load instruction with the BPF_PROBE_MEM flag. Since the arm64 JIT does not currently recognize this flag it falls back to the interpreter. Add support for BPF_PROBE_MEM, by appending an exception table to the BPF program. If the load instruction causes a data abort, the fixup infrastructure finds the exception table and fixes up the fault, by clearing the destination register and jumping over the faulting instruction. To keep the compact exception table entry format, inspect the pc in fixup_exception(). A more generic solution would add a "handler" field to the table entry, like on x86 and s390. Signed-off-by: Jean-Philippe Brucker This will fail to compile on arm64, https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config arch/arm64/mm/extable.o: In function `fixup_exception': arch/arm64/mm/extable.c:19: undefined reference to `arm64_bpf_fixup_exception' Thanks for the report, I attached a fix. Daniel, can I squash it and resend as v2 or is it too late? If you want I can squash your attached snippet into the original patch of yours. If you want to send a v2 that is fine as well of course. Let me know. Yes please squash it into the original patch, sorry for the mess Done, thanks!
Re: [PATCH bpf-next 1/1] arm64: bpf: Add BPF exception tables
On 7/30/20 4:22 PM, Jean-Philippe Brucker wrote: On Thu, Jul 30, 2020 at 08:28:56AM -0400, Qian Cai wrote: On Tue, Jul 28, 2020 at 05:21:26PM +0200, Jean-Philippe Brucker wrote: When a tracing BPF program attempts to read memory without using the bpf_probe_read() helper, the verifier marks the load instruction with the BPF_PROBE_MEM flag. Since the arm64 JIT does not currently recognize this flag it falls back to the interpreter. Add support for BPF_PROBE_MEM, by appending an exception table to the BPF program. If the load instruction causes a data abort, the fixup infrastructure finds the exception table and fixes up the fault, by clearing the destination register and jumping over the faulting instruction. To keep the compact exception table entry format, inspect the pc in fixup_exception(). A more generic solution would add a "handler" field to the table entry, like on x86 and s390. Signed-off-by: Jean-Philippe Brucker This will fail to compile on arm64, https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config arch/arm64/mm/extable.o: In function `fixup_exception': arch/arm64/mm/extable.c:19: undefined reference to `arm64_bpf_fixup_exception' Thanks for the report, I attached a fix. Daniel, can I squash it and resend as v2 or is it too late? If you want I can squash your attached snippet into the original patch of yours. If you want to send a v2 that is fine as well of course. Let me know. Thanks, Daniel
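The fixup path described in the commit message, as a kernel-side pseudocode sketch (the helper names are assumptions; the real entry format packs the destination register into the compact fixup field, inspected from fixup_exception() as the patch explains):

```c
/* Pseudocode sketch, not taken from the patch. On a data abort in a
 * BPF_PROBE_MEM load, the exception fixup makes the load read as zero
 * and resumes execution after the faulting instruction. */
static int arm64_bpf_fixup_sketch(const struct exception_table_entry *ex,
				  struct pt_regs *regs)
{
	int dst_reg = extract_dst_reg(ex);	/* hypothetical helper */

	regs->regs[dst_reg] = 0;	/* destination register cleared */
	regs->pc = fixup_target(ex);	/* skip the faulting load */
	return 1;			/* fault handled, no SIGSEGV */
}
```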
Re: [Linux-kernel-mentees] [PATCH net v2] xdp: Prevent kernel-infoleak in xsk_getsockopt()
On 7/28/20 7:36 AM, Peilin Ye wrote: xsk_getsockopt() is copying uninitialized stack memory to userspace when `extra_stats` is `false`. Fix it. Fixes: 8aa5a33578e9 ("xsk: Add new statistics") Suggested-by: Dan Carpenter Signed-off-by: Peilin Ye --- Doing `= {};` is sufficient since currently `struct xdp_statistics` is defined as follows: struct xdp_statistics { __u64 rx_dropped; __u64 rx_invalid_descs; __u64 tx_invalid_descs; __u64 rx_ring_full; __u64 rx_fill_ring_empty_descs; __u64 tx_ring_empty_descs; }; When being copied to the userspace, `stats` will not contain any uninitialized "holes" between struct fields. I've added above explanation to the commit log since it's useful reasoning for later on 'why' something has been done a certain way. Applied, thanks Peilin!
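Why `= {};` suffices here can be checked with a small userspace model of the struct (the model name is mine; __u64 is replaced with uint64_t):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the quoted struct: only 64-bit members, hence no
 * padding holes whose bytes could leak to userspace. */
struct xdp_statistics_model {
	uint64_t rx_dropped;
	uint64_t rx_invalid_descs;
	uint64_t tx_invalid_descs;
	uint64_t rx_ring_full;
	uint64_t rx_fill_ring_empty_descs;
	uint64_t tx_ring_empty_descs;
};

/* Returns 1 when every byte of a "= {}"-initialized instance is zero,
 * i.e. nothing uninitialized would be copied out. The empty
 * initializer is a GNU extension (standard only since C23), but it is
 * accepted by the compilers used to build the kernel. */
static int stats_fully_zeroed(void)
{
	struct xdp_statistics_model stats = {};
	static const struct xdp_statistics_model zero;

	return memcmp(&stats, &zero, sizeof(stats)) == 0;
}
```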
Re: [PATCH][next] bpf: fix swapped arguments in calls to check_buffer_access
On 7/27/20 11:39 PM, Yonghong Song wrote: On 7/27/20 10:54 AM, Colin King wrote: From: Colin Ian King In a couple of calls to check_buffer_access, the boolean flag zero_size_allowed and the char pointer buf_info arguments are swapped by mistake. Fix these by swapping them back into the correct argument ordering. Addresses-Coverity: ("Array compared to 0") Fixes: afbf21dce668 ("bpf: Support readonly/readwrite buffers in verifier") Signed-off-by: Colin Ian King Thanks for the fix! Acked-by: Yonghong Song Sigh, thanks for the fix Colin, applied! Yonghong, could you follow up with BPF selftest test cases that exercise these paths? Thx
Re: [PATCH bpf-next] bpf: Generate cookie for new non-initial net NS
On 7/20/20 4:09 PM, Jianlin Lv wrote: For non-initial network NS, the net cookie is generated when bpf_get_netns_cookie_sock is called for the first time, but it is more reasonable to complete the cookie generation work when creating a new network NS, just like init_net. net_gen_cookie() is moved into setup_net() so that it can serve both the initial and non-initial network namespaces. Signed-off-by: Jianlin Lv What use-case are you trying to solve? Why should it be different than, say, socket cookie generation? I'm currently not seeing much of a point in moving this. When it's not used in the system, it would actually create more work. --- net/core/net_namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index dcd61aca343e..5937bd0df56d 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -336,6 +336,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns) idr_init(&net->netns_ids); spin_lock_init(&net->nsid_lock); mutex_init(&net->ipv4.ra_mutex); + net_gen_cookie(net); list_for_each_entry(ops, &pernet_list, list) { error = ops_init(ops, net); @@ -1101,7 +1102,6 @@ static int __init net_ns_init(void) panic("Could not allocate generic netns"); rcu_assign_pointer(init_net.gen, ng); - net_gen_cookie(&init_net); down_write(&pernet_ops_rwsem); if (setup_net(&init_net, &init_user_ns))
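For context on the objection ("when it's not used in the system, it would actually create more work"): the lazy scheme generates a cookie only the first time it is asked for. A hedged userspace model of that pattern — names and types are invented for illustration, not the kernel's implementation — looks roughly like this:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Userspace model of lazy cookie assignment. Namespaces whose cookie
 * is never read pay nothing; the first reader generates it. */
static atomic_uint_fast64_t cookie_gen;

struct netns_model {
	_Atomic uint_fast64_t cookie;	/* 0 = not yet generated */
};

static uint64_t netns_cookie(struct netns_model *ns)
{
	uint_fast64_t res = atomic_load(&ns->cookie);

	if (res)
		return res;
	/* Racing readers may each bump the counter, but the
	 * compare-exchange makes exactly one value stick. */
	res = atomic_fetch_add(&cookie_gen, 1) + 1;
	uint_fast64_t expected = 0;
	if (!atomic_compare_exchange_strong(&ns->cookie, &expected, res))
		res = expected;	/* another reader won the race */
	return res;
}
```

Moving generation to namespace creation trades this laziness for eager work on every setup_net() call, which is the cost Daniel is questioning.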
Re: [PATCH] Revert "test_bpf: flag tests that cannot be jited on s390"
On 7/16/20 4:39 PM, Seth Forshee wrote: This reverts commit 3203c9010060806ff88c9989aeab4dc8d9a474dc. The s390 bpf JIT previously had a restriction on the maximum program size, which required some tests in test_bpf to be flagged as expected failures. The program size limitation has been removed, and the tests now pass, so these tests should no longer be flagged. Fixes: d1242b10ff03 ("s390/bpf: Remove JITed image size limitations") Signed-off-by: Seth Forshee Applied, thanks!
Re: [Linux-kernel-mentees] [PATCH v3] bpf: Fix NULL pointer dereference in __btf_resolve_helper_id()
On 7/14/20 8:09 PM, Peilin Ye wrote: Prevent __btf_resolve_helper_id() from dereferencing `btf_vmlinux` as NULL. This patch fixes the following syzbot bug: https://syzkaller.appspot.com/bug?id=f823224ada908fa5c207902a5a62065e53ca0fcc Reported-by: syzbot+ee09bda7017345f1f...@syzkaller.appspotmail.com Signed-off-by: Peilin Ye Looks good, applied, thanks! As far as I can tell all the other occurrences are gated behind btf_parse_vmlinux() where we also init struct_opts, etc. So for bpf-next this would then end up looking like ... int btf_resolve_helper_id(struct bpf_verifier_log *log, const struct bpf_func_proto *fn, int arg) { int id; if (fn->arg_type[arg] != ARG_PTR_TO_BTF_ID) return -EINVAL; id = fn->btf_id[arg]; if (!id || !btf_vmlinux || id > btf_vmlinux->nr_types) return -EINVAL; return id; } --- Sorry, I got the link wrong. Thank you for pointing that out. Change in v3: - Fix incorrect syzbot dashboard link. Change in v2: - Split NULL and IS_ERR cases. kernel/bpf/btf.c | 5 + 1 file changed, 5 insertions(+) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 30721f2c2d10..092116a311f4 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -4088,6 +4088,11 @@ static int __btf_resolve_helper_id(struct bpf_verifier_log *log, void *fn, const char *tname, *sym; u32 btf_id, i; + if (!btf_vmlinux) { + bpf_log(log, "btf_vmlinux doesn't exist\n"); + return -EINVAL; + } + if (IS_ERR(btf_vmlinux)) { bpf_log(log, "btf_vmlinux is malformed\n"); return -EINVAL;
Re: [PATCH] tools/bpftool: Fix error return code in do_skeleton()
On 7/15/20 5:13 AM, YueHaibing wrote: The error return code should be PTR_ERR(obj) rather than PTR_ERR(NULL). Fixes: 5dc7a8b21144 ("bpftool, selftests/bpf: Embed object file inside skeleton") Signed-off-by: YueHaibing --- tools/bpf/bpftool/gen.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c index 10de76b296ba..35f62273cdbd 100644 --- a/tools/bpf/bpftool/gen.c +++ b/tools/bpf/bpftool/gen.c @@ -305,8 +305,9 @@ static int do_skeleton(int argc, char **argv) opts.object_name = obj_name; obj = bpf_object__open_mem(obj_data, file_sz, &opts); if (IS_ERR(obj)) { + err = PTR_ERR(obj); + p_err("failed to open BPF object file: %ld", err); obj = NULL; - p_err("failed to open BPF object file: %ld", PTR_ERR(obj)); goto out; Instead of just the error number, could we dump something useful to the user here via libbpf_strerror() given you touch this line? Also, I think the convention in do_skeleton() is to just return {0,-1} given this is propagated as return code for bpftool. }
Re: [PATCH v2 bpf-next 1/2] bpf: use dedicated bpf_trace_printk event instead of trace_printk()
On 7/10/20 4:22 PM, Alan Maguire wrote: The bpf helper bpf_trace_printk() uses trace_printk() under the hood. This leads to an alarming warning message originating from trace buffer allocation which occurs the first time a program using bpf_trace_printk() is loaded. We can instead create a trace event for bpf_trace_printk() and enable it in-kernel when/if we encounter a program using the bpf_trace_printk() helper. With this approach, trace_printk() is not used directly and no warning message appears. This work was started by Steven (see Link) and finished by Alan; added Steven's Signed-off-by with his permission. Link: https://lore.kernel.org/r/20200628194334.6238b...@oasis.local.home Signed-off-by: Steven Rostedt (VMware) Signed-off-by: Alan Maguire --- kernel/trace/Makefile | 2 ++ kernel/trace/bpf_trace.c | 41 - kernel/trace/bpf_trace.h | 34 ++ 3 files changed, 72 insertions(+), 5 deletions(-) create mode 100644 kernel/trace/bpf_trace.h diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index 6575bb0..aeba5ee 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -31,6 +31,8 @@ ifdef CONFIG_GCOV_PROFILE_FTRACE GCOV_PROFILE := y endif +CFLAGS_bpf_trace.o := -I$(src) + CFLAGS_trace_benchmark.o := -I$(src) CFLAGS_trace_events_filter.o := -I$(src) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 1d874d8..1414bf5 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -19,6 +20,9 @@ #include "trace_probe.h" #include "trace.h" +#define CREATE_TRACE_POINTS +#include "bpf_trace.h" + #define bpf_event_rcu_dereference(p) \ rcu_dereference_protected(p, lockdep_is_held(&bpf_event_mutex)) @@ -374,6 +378,29 @@ static void bpf_trace_copy_string(char *buf, void *unsafe_ptr, char fmt_ptype, } } +static DEFINE_RAW_SPINLOCK(trace_printk_lock); + +#define BPF_TRACE_PRINTK_SIZE 1024 + + nit: double newline +static inline __printf(1, 0) int 
bpf_do_trace_printk(const char *fmt, ...) +{ + static char buf[BPF_TRACE_PRINTK_SIZE]; + unsigned long flags; + va_list ap; + int ret; + + raw_spin_lock_irqsave(&trace_printk_lock, flags); + va_start(ap, fmt); + ret = vsnprintf(buf, BPF_TRACE_PRINTK_SIZE, fmt, ap); nit: s/BPF_TRACE_PRINTK_SIZE/sizeof(buf)/ + va_end(ap); + if (ret >= 0) + trace_bpf_trace_printk(buf); Is there a specific reason you added the 'ret >= 0' check on top of [0]? Given the vsnprintf() internals you either return 0 or number of characters generated, no? [0] https://lore.kernel.org/bpf/20200628194334.6238b...@oasis.local.home/ Rest lgtm, thanks! + raw_spin_unlock_irqrestore(&trace_printk_lock, flags); + + return ret; +} + /* * Only limited trace_printk() conversion specifiers allowed: * %d %i %u %x %ld %li %lu %lx %lld %lli %llu %llx %p %pB %pks %pus %s @@ -483,8 +510,7 @@ static void bpf_trace_copy_string(char *buf, void *unsafe_ptr, char fmt_ptype, */ #define __BPF_TP_EMIT() __BPF_ARG3_TP() #define __BPF_TP(...) \ - __trace_printk(0 /* Fake ip */, \ - fmt, ##__VA_ARGS__) + bpf_do_trace_printk(fmt, ##__VA_ARGS__) #define __BPF_ARG1_TP(...) \ ((mod[0] == 2 || (mod[0] == 1 && __BITS_PER_LONG == 64))\ @@ -521,10 +547,15 @@ static void bpf_trace_copy_string(char *buf, void *unsafe_ptr, char fmt_ptype, const struct bpf_func_proto *bpf_get_trace_printk_proto(void) { /* -* this program might be calling bpf_trace_printk, -* so allocate per-cpu printk buffers +* This program might be calling bpf_trace_printk, +* so enable the associated bpf_trace/bpf_trace_printk event. +* Repeat this each time as it is possible a user has +* disabled bpf_trace_printk events. By loading a program +* calling bpf_trace_printk() however the user has expressed +* the intent to see such events. 
*/ - trace_printk_init_buffers(); + if (trace_set_clr_event("bpf_trace", "bpf_trace_printk", 1)) + pr_warn_ratelimited("could not enable bpf_trace_printk events"); return &bpf_trace_printk_proto; } diff --git a/kernel/trace/bpf_trace.h b/kernel/trace/bpf_trace.h new file mode 100644 index 000..9acbc11 --- /dev/null +++ b/kernel/trace/bpf_trace.h @@ -0,0 +1,34 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM bpf_trace + +#if !defined(_TRACE_BPF_TRACE_H) || defined(TRACE_HEADER_MULTI_READ) + +#define _TRACE_BPF_TRACE_H + +#include + +TRACE_EVENT(bpf_trace_printk, + + TP_PROTO(const char *bpf_string), + + TP_ARGS(bpf_string),
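The pattern under review — format into one shared static buffer and emit the event, both under a single lock — can be modeled in userspace, with a pthread mutex standing in for the kernel's raw spinlock. All model_* names below are invented for illustration; this is not the kernel code:

```c
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define MODEL_PRINTK_SIZE 1024

static pthread_mutex_t model_printk_lock = PTHREAD_MUTEX_INITIALIZER;
static char model_buf[MODEL_PRINTK_SIZE];

/* Captures what would be handed to trace_bpf_trace_printk(). */
static char last_event[MODEL_PRINTK_SIZE];

static void model_trace_event(const char *s)
{
	strncpy(last_event, s, sizeof(last_event) - 1);
}

/* Userspace model of bpf_do_trace_printk(): because the buffer is
 * shared, both the vsnprintf() step and the event emission must sit
 * inside the same critical section, which is why the lock in the
 * patch spans both. */
static __attribute__((format(printf, 1, 2)))
int model_do_trace_printk(const char *fmt, ...)
{
	va_list ap;
	int ret;

	pthread_mutex_lock(&model_printk_lock);
	va_start(ap, fmt);
	ret = vsnprintf(model_buf, sizeof(model_buf), fmt, ap);
	va_end(ap);
	if (ret >= 0)
		model_trace_event(model_buf);
	pthread_mutex_unlock(&model_printk_lock);

	return ret;
}
```

This also shows the point behind Daniel's question on the `ret >= 0` check: vsnprintf() returns the number of characters that would have been written, which for these format strings is never negative, so the guard is effectively always true.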
Re: [PATCH] MAINTAINERS: XDP: restrict N: and K:
On 7/10/20 8:17 AM, Alexander A. Klimov wrote: Am 09.07.20 um 22:37 schrieb Daniel Borkmann: On 7/9/20 9:42 PM, Alexander A. Klimov wrote: Rationale: Documentation/arm/ixp4xx.rst contains "xdp" as part of "ixdp465" which has nothing to do with XDP. Signed-off-by: Alexander A. Klimov --- See also: https://lore.kernel.org/lkml/20200709132607.7fb42415@carbon/ MAINTAINERS | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 1d4aa7f942de..2bb7feb838af 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18708,8 +18708,8 @@ F: include/trace/events/xdp.h F: kernel/bpf/cpumap.c F: kernel/bpf/devmap.c F: net/core/xdp.c -N: xdp -K: xdp +N: (?:\b|_)xdp(?:\b|_) +K: (?:\b|_)xdp(?:\b|_) Please also include \W to generally match on non-alphanumeric char given you explicitly want to avoid [a-z0-9] around the term xdp. Aren't \W, ^ and $ already covered by \b? Ah, true; it says '\b really means (?:(?<=\w)(?!\w)|(?
Re: [PATCH bpf] selftests: bpf: fix detach from sockmap tests
On 7/9/20 1:51 PM, Lorenz Bauer wrote: Fix sockmap tests which rely on old bpf_prog_detach behaviour. In the first case, the tests check that detaching without giving a program succeeds. Since these are not the desired semantics, invert the condition. In the second case, the clean up code doesn't supply the necessary program fds. Reported-by: Martin KaFai Lau Signed-off-by: Lorenz Bauer Fixes: bb0de3131f4c ("bpf: sockmap: Require attach_bpf_fd when detaching a program") Applied, thanks!