Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields
On 03/14/2015 02:46 AM, Daniel Borkmann wrote:
...
> Previously, it was much more consistent, which I like better.

And only because of the simple BUILD_BUG_ON()? :/

Alternative is to move all of them into a central place, something like
in twsk_build_assert() or __mld2_query_bugs[].
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields
On 03/14/2015 03:08 AM, Alexei Starovoitov wrote:
> On 3/13/15 7:06 PM, Daniel Borkmann wrote:
>> On 03/14/2015 02:46 AM, Daniel Borkmann wrote:
>> ...
>>> Previously, it was much more consistent, which I like better.
>>
>> And only because of the simple BUILD_BUG_ON()? :/
>>
>> Alternative is to move all of them into a central place, something
>> like in twsk_build_assert() or __mld2_query_bugs[].
>
> nope. that defeats the purpose of bug_on.

Well, it doesn't. ;) It throws a build error, thus the user is forced to
investigate that further.
Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields
On 03/14/2015 05:59 AM, Alexei Starovoitov wrote:
> On 3/13/15 7:27 PM, Alexei Starovoitov wrote:
>> On 3/13/15 7:16 PM, Daniel Borkmann wrote:
>>> On 03/14/2015 03:08 AM, Alexei Starovoitov wrote:
>>>> On 3/13/15 7:06 PM, Daniel Borkmann wrote:
>>>>> On 03/14/2015 02:46 AM, Daniel Borkmann wrote:
>>>>> ...
>>>>>> Previously, it was much more consistent, which I like better.
>>>>>
>>>>> And only because of the simple BUILD_BUG_ON()? :/
>>>>>
>>>>> Alternative is to move all of them into a central place, something
>>>>> like in twsk_build_assert() or __mld2_query_bugs[].
>>>>
>>>> nope. that defeats the purpose of bug_on.
>>>
>>> Well, it doesn't. ;) It throws a build error, thus the user is
>>> forced to investigate that further.
>>
>> according to this distorted logic all build_bug_on can be in one file
>> across the whole tree, since 'user is forced to investigate' ?!

That was not what I was suggesting, and I assume you know that ...

> also note that this case and twsk_build_assert are different.
> twsk_build_assert has no other choice then to have one function that
> covers logic in the whole file, whereas in this patch:
>
> +	BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
> +	*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
> +			      offsetof(struct sk_buff, mark));
>
> the build_bug_on protect the line directly below. Separating them just
> doesn't make sense at all.

I also like the above approach better, I only suggested that as a
possible alternative since you were saying earlier in this thread:

> I thought about it, but didn't add it, since we already have them in
> the same file several lines above this spot. I think one build error
> per .c file should be enough to attract attention. Though I'll add
> a comment to convert_bpf_extensions() that build_bug_on errors should
> be addressed in two places.

Cheers,
Daniel
Re: [PATCH net-next 1/2] bpf: allow extended BPF programs access skb fields
On 03/13/2015 03:21 AM, Alexei Starovoitov wrote:
> introduce user accessible mirror of in-kernel 'struct sk_buff':
>
> struct __sk_buff {
>     __u32 len;
>     __u32 pkt_type;
>     __u32 mark;
>     __u32 ifindex;
>     __u32 queue_mapping;
> };
>
> bpf programs can do:
>
> struct __sk_buff *ptr;
> var = ptr->pkt_type;
>
> which will be compiled to bpf assembler as:
>
> dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)
>
> bpf verifier will check validity of access and will convert it to:
>
> dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
> dst_reg &= 7
>
> since 'pkt_type' is a bitfield.
>
> When pkt_type field is moved around, goes into different structure,
> removed or its size changes, the function sk_filter_convert_ctx_access()
> would need to be updated. Just like the function convert_bpf_extensions()
> in case of classic bpf.

For each member, I'd also add BUILD_BUG_ON()s similarly as we have in
convert_bpf_extensions(). That way, people won't forget to adjust the
code.

General idea for this offset map looks good, imho. Well defined members
that are already exported to uapi e.g. through classic socket filters or
other socket api places could be used here.

> Signed-off-by: Alexei Starovoitov a...@plumgrid.com
...
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 3fa1af8a58d7..66a82d6cd75b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -168,4 +168,12 @@ enum bpf_func_id {
>  	__BPF_FUNC_MAX_ID,
>  };
>
> +struct __sk_buff {
> +	__u32 len;
> +	__u32 pkt_type;
> +	__u32 mark;
> +	__u32 ifindex;
> +	__u32 queue_mapping;
> +};

I'd add a comment saying that fields may _only_ be safely added at the
end of the structure. Rearranging or removing members here naturally
would break user space.

The remaining fields we export in classic BPF would be skb->hash,
skb->protocol, skb->vlan_tci, are we adding them as well to match up
functionality with classic BPF? For example, I can see hash being useful
as a key to be used with eBPF maps, etc.
...
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e6b522496250..c22ebd36fa4b 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
...
> +/* convert load instructions that access fields of 'struct __sk_buff'
> + * into sequence of instructions that access fields of 'struct sk_buff'
> + */
> +static int convert_ctx_accesses(struct verifier_env *env)
> +{
> +	struct bpf_insn *insn = env->prog->insnsi;
> +	int insn_cnt = env->prog->len;
> +	struct bpf_insn insn_buf[16];
> +	struct bpf_prog *new_prog;
> +	u32 cnt;
> +	int i;
> +
> +	if (!env->prog->aux->ops->convert_ctx_access)
> +		return 0;
> +
> +	for (i = 0; i < insn_cnt; i++, insn++) {
> +		if (insn->code != (BPF_LDX | BPF_MEM | BPF_W))
> +			continue;
> +
> +		if (insn->imm != PTR_TO_CTX) {
> +			/* clear internal mark */
> +			insn->imm = 0;
> +			continue;
> +		}
> +
> +		cnt = env->prog->aux->ops->
> +			convert_ctx_access(insn->dst_reg, insn->src_reg,
> +					   insn->off, insn_buf);
> +		if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) {
> +			verbose("bpf verifier is misconfigured\n");
> +			return -EINVAL;
> +		}
> +
> +		if (cnt == 1) {
> +			memcpy(insn, insn_buf, sizeof(*insn));
> +			continue;
> +		}
> +
> +		/* several new insns need to be inserted. Make room for them */
> +		insn_cnt += cnt - 1;
> +		new_prog = bpf_prog_realloc(env->prog,
> +					    bpf_prog_size(insn_cnt),
> +					    GFP_USER);
> +		if (!new_prog)
> +			return -ENOMEM;

Seems a bit expensive, do you think we could speculatively allocate a
bit more space in bpf_prog_load() when we detect that we have access to
ctx that we need to convert?
> +		new_prog->len = insn_cnt;
> +
> +		memmove(new_prog->insnsi + i + cnt, new_prog->insnsi + i + 1,
> +			sizeof(*insn) * (insn_cnt - i - cnt));
> +
> +		/* copy substitute insns in place of load instruction */
> +		memcpy(new_prog->insnsi + i, insn_buf, sizeof(*insn) * cnt);
> +
> +		/* adjust branches in the whole program */
> +		adjust_branches(new_prog, i, cnt - 1);
> +
> +		/* keep walking new program and skip insns we just inserted */
> +		env->prog = new_prog;
> +		insn = new_prog->insnsi + i + cnt - 1;
> +		i += cnt - 1;
> +	}
> +
> +	return 0;
> +}
> +
>  static void free_states(struct verifier_env *env)
>  {
>  	struct verifier_state_list *sl, *sln;
...
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7a4eb7030dba..b5fcc7e2b608 100644
> --- a/net/core/filter.c
> +++
Re: [PATCH net-next 0/2] bpf: allow extended BPF programs access skb fields
On 03/13/2015 03:21 AM, Alexei Starovoitov wrote:
...
> Daniel, patch 1 includes a bit of code that does prog_realloc and
> branch adjustment to make room for new instructions. I think you'd
> need the same for your 'constant blinding' work. If indeed that would
> be the case, we'll make it into a helper function.

Yes, thanks, will take care of that.

Cheers,
Daniel
Re: [PATCH net-next 1/2] bpf: allow extended BPF programs access skb fields
On 03/13/2015 05:22 PM, Alexei Starovoitov wrote:
> On 3/13/15 2:57 AM, Daniel Borkmann wrote:
>> On 03/13/2015 03:21 AM, Alexei Starovoitov wrote:
>>> introduce user accessible mirror of in-kernel 'struct sk_buff':
>>
>> For each member, I'd also add BUILD_BUG_ON()s similarly as we have in
>> convert_bpf_extensions(). That way, people won't forget to adjust the
>> code.
>
> I thought about it, but didn't add it, since we already have them in
> the same file several lines above this spot. I think one build error
> per .c file should be enough to attract attention. Though I'll add
> a comment to convert_bpf_extensions() that build_bug_on errors should
> be addressed in two places.

My concern would be that in case the code gets changed, this spot could
easily be overlooked by someone possibly unfamiliar with the code, since
no build error triggers there. So I guess it wouldn't hurt or cost us
much to also add the BUILD_BUG_ON()s there instead of a comment.

...
>> The remaining fields we export in classic BPF would be skb->hash,
>> skb->protocol, skb->vlan_tci, are we adding them as well to match up
>> functionality with classic BPF? For example, I can see hash being
>> useful as a key to be used with eBPF maps, etc.
>
> I want to do skb->protocol and skb->vlan_tci differently and hopefully
> more generically than classic did them, so they will come in the
> follow on patches. skb->hash also should be done differently. So yes.
> All of them are subjects of future patches/discussions.

Ok, sounds good.

Thanks,
Daniel
Re: [RFC] bpf: Suggestion on bpf syscall interface
On 03/28/2015 06:21 PM, Alexei Starovoitov wrote:
> On 3/28/15 4:36 AM, He Kuang wrote:
>> Hi, Alexei
>>
>> In our end-end IO module project, we use bpf maps to record
>> configurations. According to the current bpf syscall interface, we
>> have to specify map_fd to lookup/update bpf maps, so we are
>> restricted to doing config in the same user program.
>
> you can pass map_fd and prog_fd from one process to another via normal
> scm_rights mechanism.

+1, I've just tried that out in the context of a different work and it
works like a charm.

>> My suggestion is to export this kind of operations to sysfs, so we
>> can load/attach bpf progs and config them separately. We implement
>> this feature in our demo project. What's your opinion on this?
>
> Eventually we may use single sysfs file for lsmod-like listings, but
> I definitely don't want to create parallel interface to maps via
> sysfs.

Yes, that would be a bad design decision. Btw, even more lightweight for
the kernel side would be to just implement .show_fdinfo() for the anon
inodes on the map/prog store and have some meta information exported
from there. You can then grab that via /proc/pid/fdinfo/fd, I would
consider such a thing a slow-path operation anyway, and you would also
get the app info using it for free.

> It's way too expensive and not really suitable for binary key/values.

+1
Re: [PATCH v5 tip 1/7] bpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs
On 03/02/2015 12:51 PM, Masami Hiramatsu wrote:
> (2015/03/02 20:10), Daniel Borkmann wrote:
>> On 03/02/2015 11:53 AM, Masami Hiramatsu wrote:
>> ...
>>> Hmm, it seems that this still doesn't hide some APIs which are
>>> provided only when CONFIG_BPF_SYSCALL, for example
>>> bpf_register_map_type etc. I think all those APIs should be hidden
>>> in #ifdef or at least be commented so that the user doesn't refer to
>>> them without the kconfig. (I don't think we need to provide dummy
>>> functions for those APIs, but it's better to clarify which API we
>>> can use with which kconfig)
>>
>> Well, currently all possible map types (hash table, array map) that
>> would actually call into bpf_register_map_type() are only built when
>> CONFIG_BPF_SYSCALL is enabled (see kernel/bpf/Makefile). I don't
>> think new map additions should be added that are not under
>> kernel/bpf/ and/or enabled outside the CONFIG_BPF_SYSCALL, as it
>> should be considered part of the eBPF core code.
>>
>> The difference here (this patch) is simply that we don't want users
>> to build ifdef spaghetti constructs in user code, so the API that is
>> actually used by eBPF _users_ is being properly ifdef'ed in the
>> headers. So, I don't think this is a big problem, but we could move
>> these bits under the ifdef CONFIG_BPF_SYSCALL w/o providing a dummy
>> in the else part. I can do that outside of the scope of this set.
>
> Or, maybe we'd better move them into a new include/linux/bpf_prog.h
> which includes the basic include/linux/bpf.h. Then, users can include
> bpf_prog.h instead of bpf.h. Also, we can check CONFIG_BPF_SYSCALL=y
> at the top of bpf_prog.h. This makes things clearer :)

I'm preferring the 1st variant, though. We currently have two native
eBPF users, that is, socket filters and tc's cls_bpf (queued in
net-next), and looking at the code/API usage, it's really not that hard,
where it would justify moving this to an extra header file, really.

I'm cooking a patch for net-next right now with the first variant (which
is on top of this patch that resides in net-next as-is as well).
Thanks,
Daniel
Re: [PATCH v5 tip 1/7] bpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs
On 03/02/2015 11:53 AM, Masami Hiramatsu wrote:
...
> Hmm, it seems that this still doesn't hide some APIs which are
> provided only when CONFIG_BPF_SYSCALL, for example
> bpf_register_map_type etc. I think all those APIs should be hidden in
> #ifdef or at least be commented so that the user doesn't refer to them
> without the kconfig. (I don't think we need to provide dummy functions
> for those APIs, but it's better to clarify which API we can use with
> which kconfig)

Well, currently all possible map types (hash table, array map) that
would actually call into bpf_register_map_type() are only built when
CONFIG_BPF_SYSCALL is enabled (see kernel/bpf/Makefile). I don't think
new map additions should be added that are not under kernel/bpf/ and/or
enabled outside the CONFIG_BPF_SYSCALL, as it should be considered part
of the eBPF core code.

The difference here (this patch) is simply that we don't want users to
build ifdef spaghetti constructs in user code, so the API that is
actually used by eBPF _users_ is being properly ifdef'ed in the headers.
So, I don't think this is a big problem, but we could move these bits
under the ifdef CONFIG_BPF_SYSCALL w/o providing a dummy in the else
part. I can do that outside of the scope of this set.
[PATCH net-next 08/10] x86: unexport set_memory_ro and set_memory_rw
This effectively unexports the set_memory_ro and set_memory_rw
functions, and thus reverts a03352d2c1dc ("x86: export set_memory_ro
and set_memory_rw"). They have been introduced for debugging purposes
in e1000e, but no module user is in the mainline kernel (anymore?) and
we explicitly do not want modules to use these functions, as they i.e.
protect eBPF (interpreted and JIT'ed) images from malicious
modifications or bugs.

Outside of eBPF scope, I believe also other set_memory_* functions
should be unexported on x86 for modules.

Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Cc: Bruce Allan bruce.w.al...@intel.com
Cc: Jesse Brandeburg jesse.brandeb...@intel.com
Cc: Ingo Molnar mi...@kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Alexei Starovoitov a...@plumgrid.com
---
 arch/x86/mm/pageattr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 536ea2f..81e8282 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1654,13 +1654,11 @@ int set_memory_ro(unsigned long addr, int numpages)
 {
 	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
 }
-EXPORT_SYMBOL_GPL(set_memory_ro);
 
 int set_memory_rw(unsigned long addr, int numpages)
 {
 	return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_RW), 0);
 }
-EXPORT_SYMBOL_GPL(set_memory_rw);
 
 int set_memory_np(unsigned long addr, int numpages)
 {
-- 
1.9.3
[PATCH net-next 09/10] arm64: unexport set_memory_ro and set_memory_rw
This effectively unexports the set_memory_ro and set_memory_rw
functions from commit 11d91a770f1f ("arm64: Add
CONFIG_DEBUG_SET_MODULE_RONX support"). No module user of those is in
the mainline kernel and we explicitly do not want modules to use these
functions, as they i.e. protect eBPF (interpreted and JIT'ed) images
from malicious modifications or bugs.

Outside of eBPF scope, I believe also other set_memory_* functions
should be unexported on arm64 for modules.

Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Cc: Laura Abbott lau...@codeaurora.org
Cc: Will Deacon will.dea...@arm.com
Cc: linux-kernel@vger.kernel.org
Acked-by: Alexei Starovoitov a...@plumgrid.com
---
 arch/arm64/mm/pageattr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index bb0ea94..8659357 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -70,7 +70,6 @@ int set_memory_ro(unsigned long addr, int numpages)
 					__pgprot(PTE_RDONLY),
 					__pgprot(PTE_WRITE));
 }
-EXPORT_SYMBOL_GPL(set_memory_ro);
 
 int set_memory_rw(unsigned long addr, int numpages)
 {
@@ -78,7 +77,6 @@ int set_memory_rw(unsigned long addr, int numpages)
 					__pgprot(PTE_WRITE),
 					__pgprot(PTE_RDONLY));
 }
-EXPORT_SYMBOL_GPL(set_memory_rw);
 
 int set_memory_nx(unsigned long addr, int numpages)
 {
-- 
1.9.3
Re: [PATCH net-next 09/10] arm64: unexport set_memory_ro and set_memory_rw
On 02/27/2015 08:54 PM, Will Deacon wrote:
...
> Looks good to me. Can this be applied independently, or does it need
> to remain part of your series?

Ideally, it should be seen as part of this series, but I have no problem
if this one goes via the arm64 tree, instead. Whatever Dave and you
prefer. ;)

Thanks,
Daniel
Re: Linux XIA - merge proposal
On 03/03/2015 06:29 PM, Michel Machado wrote:
...
> We're fine with clearly marking Linux XIA as being under staging as
> well as helping to define this review process for network stacks.

With regard to staging, the code there is usually horrible and I'm not
sure anyone really looks there; that would only defer the review problem
to the time when you try to get it out of there, so I'm not sure it
brings anything. ;)

+1 on what Eric said. It would have also been nice if you had clearly
described in your mail (w/o buzz words) what it is and what it does. Are
you trying to introduce a new network stack as an alternative to the
current one, e.g. something like FreeBSD's netgraph?
Re: [PATCH net-next 09/10] arm64: unexport set_memory_ro and set_memory_rw
Hi Will,

On 02/27/2015 09:05 PM, Daniel Borkmann wrote:
> On 02/27/2015 08:54 PM, Will Deacon wrote:
> ...
>> Looks good to me. Can this be applied independently, or does it need
>> to remain part of your series?
>
> Ideally, it should be seen as part of this series, but I have no
> problem if this one goes via the arm64 tree, instead. Whatever Dave
> and you prefer. ;)

I'll resend you this one directly as a stand-alone patch to arm64, with
the Acked-by's preserved.

Thanks,
Daniel
[PATCH] arm64: mm: unexport set_memory_ro and set_memory_rw
This effectively unexports the set_memory_ro and set_memory_rw
functions from commit 11d91a770f1f ("arm64: Add
CONFIG_DEBUG_SET_MODULE_RONX support"). No module user of those is in
the mainline kernel and we explicitly do not want modules to use these
functions, as they i.e. RO-protect eBPF (interpreted and JIT'ed) images
from malicious modifications/bugs.

Outside of eBPF scope, I believe also other set_memory_* functions
should be unexported on arm64 due to non-existent mainline module
users. Laura mentioned that they have some uses for modules doing
set_memory_*, but none that are in mainline and it's unclear if they
would ever get there.

Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Acked-by: Alexei Starovoitov a...@plumgrid.com
Acked-by: Laura Abbott lau...@codeaurora.org
---
 arch/arm64/mm/pageattr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index bb0ea94..8659357 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -70,7 +70,6 @@ int set_memory_ro(unsigned long addr, int numpages)
 					__pgprot(PTE_RDONLY),
 					__pgprot(PTE_WRITE));
 }
-EXPORT_SYMBOL_GPL(set_memory_ro);
 
 int set_memory_rw(unsigned long addr, int numpages)
 {
@@ -78,7 +77,6 @@ int set_memory_rw(unsigned long addr, int numpages)
 					__pgprot(PTE_WRITE),
 					__pgprot(PTE_RDONLY));
 }
-EXPORT_SYMBOL_GPL(set_memory_rw);
 
 int set_memory_nx(unsigned long addr, int numpages)
 {
-- 
1.9.3
Re: [PATCH] rhashtable: initialize all rhashtable walker members
On 02/21/2015 09:55 PM, Sasha Levin wrote:
> Commit "rhashtable: Introduce rhashtable_walk_*" forgot to initialize
> the members of struct rhashtable_walker after allocating it, which
> caused an undefined value for 'resize' which is used later on.
>
> Signed-off-by: Sasha Levin sasha.le...@oracle.com
> ---
>  lib/rhashtable.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 9cc4c4a..030484c 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -894,6 +894,9 @@ int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter)
>  	if (!iter->walker)
>  		return -ENOMEM;
>
> +	INIT_LIST_HEAD(&iter->walker->list);
> +	iter->walker->resize = false;
> +

The change seems fine to me, although the INIT_LIST_HEAD() is
unnecessary due to the below list_add()? Anyway, setting resize to false
is definitely correct. In practice this shouldn't cause much of an issue
though, as rhashtable_walk_start() would only reset iterator meta data
and set resize to false, but let's fix it.

>  	mutex_lock(&ht->mutex);
>  	list_add(&iter->walker->list, &ht->walkers);
>  	mutex_unlock(&ht->mutex);

Thanks,
Daniel
Re: Use-after-free oops in next-20150204 - probably netlink related
On 02/21/2015 04:36 PM, Shachar Raindel wrote:
...
>> This is most likely rhashtable related. The fixes for the
>> use-after-free issues have been merged Feb 6 so they are probably not
>> included in the Feb 04 snapshot that you use. The relevant net-next
>> commits are:
>>
>> commit 020219a69d40a205dad12b0ea1e6a46153793368
>> commit cf52d52f9ccb9966ac019d9f79824195583e3e6c
>> commit 2af4b52988fd4f7ae525fcada29d4db8680033d6
>> commit a5ec68e3b8f2c95ea1a5d23dd543abbe0c8d0624
>
> Most likely so - haven't seen this reproducing so far on next-20150219,
> which contains the relevant commits. Thanks for double checking!
>
> BTW, why is there no MAINTAINERS entry for the netlink subsystem?

Well, if in doubt where to send a bug report, then:

  scripts/get_maintainer.pl -f net/netlink/

Make sure to Cc netdev in any case.
Re: [PATCH 2/4] seccomp: rework seccomp_prepare_filter().
On 04/29/2015 03:37 PM, Nicolas Schichan wrote:
> - Try to use the classic BPF JIT via bpf_jit_compile().
>
> - Use bpf_migrate_filter() from NET filter code instead of the double
>   bpf_convert_filter() followed by bpf_prog_select_runtime() if classic
>   bpf_jit_compile() did not succeed in producing native code.
>
> Signed-off-by: Nicolas Schichan nschic...@freebox.fr

[ I had to look that one up manually, would be good if you keep people
  in Cc, also netdev for BPF in general. ]

I see, you need that to make it available to the old bpf_jit_compile()
for probing on classic JITs. Actually, I really would prefer, if instead
of duplicating that code, you could export bpf_prepare_filter() and pass
seccomp_check_filter() as an argument to bpf_prepare_filter().
Otherwise, in case bpf_prepare_filter() changes, people will easily
forget to update seccomp related code, really.

Thanks,
Daniel
Re: [PATCH] compiler-intel: fix wrong compiler barrier() macro
On 04/29/2015 06:40 PM, Pranith Kumar wrote:
> On Wed, Apr 29, 2015 at 10:59 AM, mancha security manc...@zoho.com wrote:
>> The problem is that ICC defines __GNUC__ so barrier() gets defined in
>> compiler-gcc.h. Your commit removed an #undef from compiler-intel.h
>> so compiler.h will never define barrier to __memory_barrier().
>
> OK, I see your point. But, ICC has support for GCC inline assembly. So
> the change does not seem to be making any difference. We are using our
> own asm barrier rather than the inbuilt one provided by ICC.

It does make a difference: gcc inline assembly is not supported by
/ecc/, see that it's wrapped within the ifdef __ECC part. I believe that
should be for ia64, which we have under arch/, no?
Re: [PATCH 0/4] Seccomp filter JIT support on ARM.
On 04/29/2015 03:37 PM, Nicolas Schichan wrote:
...
> The fourth and final patch fixes a bug in the emit_udiv() function when
> used to convert a BPF_ALU | BPF_DIV | BPF_K instruction in the ARM BPF
> JIT code.

Shouldn't that fix go separately, so it can be included into 4.1 resp.
-stable? Would be good if you also Cc Mircea, who wrote the code. Was
that caught by the lib/test_bpf.c suite (if not, it would be good to add
a test case for it)?
[PATCH v2] compiler-intel: fix wrong compiler barrier() macro
Cleanup commit 73679e508201 ("compiler-intel.h: Remove duplicate
definition") removed the double definition of the __memory_barrier()
intrinsics. However, in doing so, it also removed the preceding #undef
barrier by accident, meaning the actual barrier() macro from
compiler-gcc.h with inline asm is still in place, as __GNUC__ is
provided. Subsequently, barrier() can never be defined as
__memory_barrier() from compiler.h since it already has a definition in
place, and if we trust the comment in compiler-intel.h, ecc doesn't
support gcc specific asm statements. I don't have an ecc at hand, so a
revert of that cleanup would be the safest option, imho, as it has been
like this since pre-git times.

Fixes: 73679e508201 ("compiler-intel.h: Remove duplicate definition")
Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Reviewed-by: Pranith Kumar bobby.pr...@gmail.com
Cc: Pranith Kumar bobby.pr...@gmail.com
Cc: H. Peter Anvin h...@linux.intel.com
Cc: Ingo Molnar mi...@kernel.org
Cc: mancha security manc...@zoho.com
---
 v1->v2:
  - fixed the wrong commit hash in 1st occurrence of commit msg
  - rest as is, also kept Pranith's Reviewed-by tag

 include/linux/compiler-intel.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/compiler-intel.h b/include/linux/compiler-intel.h
index ba147a1..5529c52 100644
--- a/include/linux/compiler-intel.h
+++ b/include/linux/compiler-intel.h
@@ -13,9 +13,12 @@
 /* Intel ECC compiler doesn't support gcc specific asm stmts.
  * It uses intrinsics to do the equivalent things.
  */
+#undef barrier
 #undef RELOC_HIDE
 #undef OPTIMIZER_HIDE_VAR
 
+#define barrier() __memory_barrier()
+
 #define RELOC_HIDE(ptr, off)					\
   ({ unsigned long __ptr;					\
      __ptr = (unsigned long) (ptr);				\
-- 
1.9.3
Re: [PATCH 0/4] Seccomp filter JIT support on ARM.
On 04/30/2015 02:35 PM, Nicolas Schichan wrote:
> On 04/29/2015 06:37 PM, Daniel Borkmann wrote:
>> On 04/29/2015 03:37 PM, Nicolas Schichan wrote:
>> ...
>>> The fourth and final patch fixes a bug in the emit_udiv() function
>>> when used to convert a BPF_ALU | BPF_DIV | BPF_K instruction in the
>>> ARM BPF JIT code.
>>
>> Shouldn't that fix go separately, so it can be included into 4.1
>> resp. -stable?
>
> Sure, shall I resend that separately from the v2 of the serie or is it
> fine in its current form ?

Would be great if you could send that as an individual patch, since it's
a stand-alone fix and independent from the (feature) patch set.

>> Would be good if you also Cc Mircea, who wrote the code. Was that
>> caught by lib/test_bpf.c suite (if not, would be good to add a test
>> case for it) ?
>
> It was detected by an internal test suite we have here. I see that
> there are some BPF_ALU | BPF_DIV | BPF_K instructions so it might also
> be caught by lib/test_bpf.c as well.

If there are some additional tests that are not yet covered by
lib/test_bpf.c, I'd be happy if you could add them there. This can also
be done as a follow-up, but if we can increase coverage for others as
well, the better.

Thanks again,
Daniel
Re: [PATCH 2/4] seccomp: rework seccomp_prepare_filter().
On 04/30/2015 02:27 PM, Nicolas Schichan wrote: ... I'll take more care about the receiver list for the v2 of this series. Ok, cool. I see, you need that to make it available to the old bpf_jit_compile() for probing on classic JITs. Actually, I really would prefer, if instead of duplicating that code, you could export bpf_prepare_filter() and pass seccomp_check_filter() as an argument to bpf_prepare_filter(). Just to be sure: you want me to pass a pointer to seccomp_check_filter to bpf_prepare_filter so that it can run it between bpf_check_classic() and bpf_jit_compile? For example, what comes to mind is something along these lines:

struct bpf_prog *
bpf_prepare_filter(struct bpf_prog *fp,
		   int (*aux_trans_classic)(struct sock_filter *filter,
					    unsigned int flen))
{
	int err;

	fp->bpf_func = NULL;
	fp->jited = false;

	err = bpf_check_classic(fp->insns, fp->len);
	if (err) {
		__bpf_prog_release(fp);
		return ERR_PTR(err);
	}

	/* There might be additional checks and transformations
	 * needed on classic filters, f.e. in case of seccomp.
	 */
	if (aux_trans_classic) {
		err = aux_trans_classic(fp->insns, fp->len);
		if (err) {
			__bpf_prog_release(fp);
			return ERR_PTR(err);
		}
	}

	/* Probe if we can JIT compile the filter and if so, do
	 * the compilation of the filter.
	 */
	bpf_jit_compile(fp);

	/* JIT compiler couldn't process this filter, so do the
	 * internal BPF translation for the optimized interpreter.
	 */
	if (!fp->jited)
		fp = bpf_migrate_filter(fp);

	return fp;
}

From the seccomp side, you invoke:

	...
	bpf_prepare_filter(fp, seccomp_check_filter);

I would actually like to see that customized code in seccomp_prepare_filter() further reduced instead of increased; it would certainly make it easier to maintain and understand. Do you think something like the above is possible? Otherwise, in case bpf_prepare_filter() changes, people will easily forget to update the seccomp related code, really. Fair point.
Thanks,

Thanks a lot, Daniel
Re: [BUG/PATCH] kernel RNG and its secrets
On 04/27/2015 10:41 PM, Stephan Mueller wrote: ... It seems you have the code already in mind, so please if you could write it :-) Ok, sure. I'll cook something by tomorrow morning. Cheers, Daniel
Re: [BUG/PATCH] kernel RNG and its secrets
On 04/27/2015 09:10 PM, Stephan Mueller wrote: ... I posted the issue on the clang mailing list on April 10 -- no word so far. I would interpret this as a sign that it is a no-issue for them. Hm. ;) Here's a bug report on the topic, gcc vs llvm: https://llvm.org/bugs/show_bug.cgi?id=15495 Let's add a new barrier macro to linux/compiler{,-gcc}.h, f.e.

  #define barrier_data(ptr) __asm__ __volatile__("" : : "r" (ptr) : "memory")

or the version Mancha proposed. You could wrap that ...

  #define OPTIMIZER_HIDE(ptr) barrier_data(ptr)

... and use that one for memzero_explicit() instead:

  void memzero_explicit(void *s, size_t count)
  {
  	memset(s, 0, count);
  	OPTIMIZER_HIDE(s);
  }

It certainly needs comments explaining in what situations to use which OPTIMIZER_HIDE* variants, etc. Do you want to send a patch?
Re: [PATCH 3.19 176/177] netfilter: x_tables: fix cgroup matching on non-full sks
Hi Greg, hi Pablo, On 05/03/2015 08:45 PM, Greg Kroah-Hartman wrote: On Sun, May 03, 2015 at 12:47:17AM +0300, Thomas Backlund wrote: On 02.05.2015 22:03, Greg Kroah-Hartman wrote: 3.19-stable review patch. If anyone has any objections, please let me know. -- From: Daniel Borkmann dan...@iogearbox.net commit afb7718016fcb0370ac29a83b2839c78b76c2960 upstream. While originally only being intended for outgoing traffic, commit a00e76349f35 (netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks) enabled xt_cgroups for the NF_INET_LOCAL_IN hook as well, in order to allow for nfacct accounting. Besides being currently limited to early demuxes only, commit a00e76349f35 forgot to add a check whether we deal with full sockets, i.e. in this case not with time wait sockets. TCP time wait sockets do not have the same memory layout as full sockets, have a lower memory footprint and consequently also don't have a sk_classid member; probing for the sk_classid member there could potentially lead to a crash.
Fixes: a00e76349f35 (netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks)
Cc: Alexey Perevalov a.pereva...@samsung.com
Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Signed-off-by: Pablo Neira Ayuso pa...@netfilter.org
Signed-off-by: Greg Kroah-Hartman gre...@linuxfoundation.org
---
 net/netfilter/xt_cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -39,7 +39,7 @@ cgroup_mt(const struct sk_buff *skb, str
 {
 	const struct xt_cgroup_info *info = par->matchinfo;

-	if (skb->sk == NULL)
+	if (skb->sk == NULL || !sk_fullsock(skb->sk))
 		return false;

 	return (info->id == skb->sk->sk_classid) ^ info->invert;

This one breaks the build with:

 net/netfilter/xt_cgroup.c: In function 'cgroup_mt':
 net/netfilter/xt_cgroup.c:42:2: error: implicit declaration of function 'sk_fullsock' [-Werror=implicit-function-declaration]

In order to fix it, you also need to add: From 1d0ab253872cdd3d8e7913f59c266c7fd01771d0 Mon Sep 17 00:00:00 2001 From: Eric Dumazet eduma...@google.com Date: Sun, 15 Mar 2015 21:12:12 -0700 Subject: [PATCH] net: add sk_fullsock() helper which in turn needs this one: From 10feb428a5045d5eb18a5d755fbb8f0cc9645626 Mon Sep 17 00:00:00 2001 From: Eric Dumazet eduma...@google.com Date: Thu, 12 Mar 2015 16:44:04 -0700 Subject: [PATCH] inet: add TCP_NEW_SYN_RECV state I've just dropped the patch, thanks for letting me know; odd that my build tests missed it. If you're nevertheless interested in this fix, you could use this version, which should apply/build just fine: http://patchwork.ozlabs.org/patch/455546/ I believe Pablo usually sends netfilter patches in bundles to you. Thanks a lot, Daniel
[PATCH] compiler-intel: fix wrong compiler barrier() macro
Cleanup commit 23ebdedc67e (compiler-intel.h: Remove duplicate definition) removed the double definition of __memory_barrier() intrinsics. However, in doing so, it also removed the preceding #undef barrier, meaning, the actual barrier() macro from compiler-gcc.h with inline asm is still in place when __GNUC__ is provided. Subsequently, barrier() can never be defined as __memory_barrier() from compiler.h since it already has a definition in place and if we trust the comment in compiler-intel.h, ecc doesn't support gcc specific asm statements. I don't have an ecc at hand, so a revert of that cleanup would be the safest option, imho, as it has been like this since pre-git times.

Fixes: 73679e508201 (compiler-intel.h: Remove duplicate definition)
Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Cc: Pranith Kumar bobby.pr...@gmail.com
Cc: H. Peter Anvin h...@linux.intel.com
Cc: Ingo Molnar mi...@kernel.org
Cc: mancha security manc...@zoho.com
---
 include/linux/compiler-intel.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/compiler-intel.h b/include/linux/compiler-intel.h
index ba147a1..5529c52 100644
--- a/include/linux/compiler-intel.h
+++ b/include/linux/compiler-intel.h
@@ -13,9 +13,12 @@
 /* Intel ECC compiler doesn't support gcc specific asm stmts.
  * It uses intrinsics to do the equivalent things.
  */
+#undef barrier
 #undef RELOC_HIDE
 #undef OPTIMIZER_HIDE_VAR

+#define barrier() __memory_barrier()
+
 #define RELOC_HIDE(ptr, off) \
   ({ unsigned long __ptr; \
      __ptr = (unsigned long) (ptr); \
--
1.9.3
Re: [PATCH] compiler-intel: fix wrong compiler barrier() macro
On 04/29/2015 04:51 PM, Pranith Kumar wrote: ... message says in 73679e508201 (your commit message has the wrong hash). Sorry for that; the Fixes tag actually got it right.
Re: [PATCH] ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.
On 05/07/2015 05:14 PM, Nicolas Schichan wrote: The ARM JIT code emits ldr rX, [pc, #offset] to access the literal pool. The #offset maximum value is 4095, and if the generated code is too large, the #offset value can overflow and not point to the expected slot in the literal pool. Additionally, when overflow occurs, bits of the overflow can end up changing the destination register of the ldr instruction. Fix that by detecting the overflow in imm_offset() and setting a flag that is checked for each BPF instruction converted in build_body(). As of now it can only be detected in the second pass. As a result the second build_body() call can now fail, so add the corresponding cleanup code in that case. Using multiple literal pools in the JITed code is going to require lots of intrusive changes to the JIT code (which would better be done as a feature instead of a fix); just delegating to the kernel BPF interpreter in that case is a more straightforward, minimal fix and easy to backport. Signed-off-by: Nicolas Schichan nschic...@freebox.fr Fix looks good to me. Fixes: ddecdfcea0ae (ARM: 7259/3: net: JIT compiler for packet filters) Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [PATCH net-next] test: bpf: extend load 64-bit immediate testcase
On 05/09/2015 10:14 AM, Xi Wang wrote: Extend the testcase to catch a signedness bug in the arm64 JIT: test_bpf: #58 load 64-bit immediate jited:1 ret -1 != 1 FAIL (1 times) This is useful to ensure other JITs won't have a similar bug. Link: https://lkml.org/lkml/2015/5/8/458 Cc: Alexei Starovoitov a...@plumgrid.com Cc: Will Deacon will.dea...@arm.com Signed-off-by: Xi Wang xi.w...@gmail.com Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On 04/16/2015 10:56 AM, George Dunlap wrote: On 04/15/2015 07:19 PM, Eric Dumazet wrote: On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote: Maybe you should stop wasting all of our time and just tell us what you're thinking. I think you are making me waste my time. I already gave all the hints in prior discussions. Right, and I suggested these two options: So mid-term, it would be much more beneficial if you attempt to fix the underlying driver issues that actually cause high tx completion delays, instead of reintroducing bufferbloat. So that we all can move forward and not backwards in time. Obviously one solution would be to allow the drivers themselves to set the tcp_limit_output_bytes, but that seems like a maintenance nightmare. Possible, but very hacky, as you penalize globally. Another simple solution would be to allow drivers to indicate whether they have a high transmit latency, and have the kernel use a higher value by default when that's the case. [1] Neither of which you commented on. Instead you pointed me to a comment What Eric described to you was that you introduce a new netdev member like netdev->needs_bufferbloat, set that indication from the driver side, and cache that in the socket that binds to it, so you can adjust the test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication for such devices. Hmm? that only partially described what the limitations were. (I.e., it described the two packets or 1ms, but not how they related, nor how they related to the max of 2 64k packets outstanding of the default tcp_limit_output_bytes setting.)
-George [1] http://marc.info/?i=CAFLBxZYt7-v29ysm=f+5qmow64_qhesjzj98udba+1cs-pf...@mail.gmail.com
Re: linux-next: manual merge of the tip tree with the net-next tree
On 04/07/2015 09:11 AM, Stephen Rothwell wrote: ... Today's linux-next merge of the tip tree got a conflict in include/uapi/linux/bpf.h between commit 96be4325f443 (ebpf: add sched_cls_type and map it to sk_filter's verifier ops), 03e69b508b6f (ebpf: add prandom helper for packet sampling), c04167ce2ca0 (ebpf: add helper for obtaining current processor id), 94caee8c312d (ebpf: add sched_act_type and map it to sk_filter's verifier ops), 608cd71a9c7c (tc: bpf: generalize pedit action) and 91bc4822c3d6 (tc: bpf: add checksum helpers) from the net-next tree and commit 2541517c32be (tracing, perf: Implement BPF programs attached to kprobes), d9847d310ab4 (tracing: Allow BPF programs to call bpf_ktime_get_ns()) and 9c959c863f82 (tracing: Allow BPF programs to call bpf_trace_printk()) from the tip tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). Looks good to me, thanks!
Re: linux-next: manual merge of the tip tree with the net-next tree
On 04/07/2015 09:00 AM, Stephen Rothwell wrote: ... Today's linux-next merge of the tip tree got a conflict in include/linux/bpf.h between commit 0fc174dea545 (ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs) from the net-next tree and commit 4e537f7fbdce (bpf: Make internal bpf API independent of CONFIG_BPF_SYSCALL #ifdefs) from the tip tree. I fixed it up (they are the same patch and there are other changes in the net-next tree, so I just used that version) and can carry the fix as necessary (no action is required). Thanks, Stephen! That is correct.
Re: linux-next: build failure after merge of the tip tree
On 04/07/2015 10:48 AM, Ingo Molnar wrote: * Stephen Rothwell s...@canb.auug.org.au wrote: Hi all, After merging the tip tree, today's linux-next build (powerpc ppc64_defconfig) failed like this:

 kernel/events/core.c: In function 'perf_event_set_bpf_prog':
 kernel/events/core.c:6732:15: error: 'struct bpf_prog_aux' has no member named 'prog_type'
   if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) {
               ^

Caused by commit 2541517c32be (tracing, perf: Implement BPF programs attached to kprobes). Note, this must be some (rarely triggered) aspect of the ppc64 defconfig that neither x86 randconfigs nor most other arch defconfigs expose? Note, this is a merge conflict with the work that went via the net-next tree, i.e. 24701ecea76b (ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux). I believe that is why it didn't trigger on the tip tree. You should be able to resolve it in linux-next by changing the test to:

 if (prog->prog_type != BPF_PROG_TYPE_KPROBE) {

Thanks, Daniel
Re: linux-next: build failure after merge of the tip tree
[ Cc'ing Dave, fyi ] On 04/07/2015 11:05 AM, Stephen Rothwell wrote: On Tue, 07 Apr 2015 10:56:13 +0200 Daniel Borkmann dan...@iogearbox.net wrote: On 04/07/2015 10:48 AM, Ingo Molnar wrote: * Stephen Rothwell s...@canb.auug.org.au wrote: After merging the tip tree, today's linux-next build (powerpc ppc64_defconfig) failed like this: kernel/events/core.c: In function 'perf_event_set_bpf_prog': kernel/events/core.c:6732:15: error: 'struct bpf_prog_aux' has no member named 'prog_type' if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) { ^ Caused by commit 2541517c32be (tracing, perf: Implement BPF programs attached to kprobes). Note, this must be some (rarely triggered) aspect of the ppc64 defconfig that neither x86 randconfigs nor most other arch defconfigs expose? Note, this is a merge conflict with the work that went via the net-next tree, i.e. 24701ecea76b (ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux). I believe that is why it didn't trigger on the tip tree. You should be able to resolve it in linux-next by changing the test to: if (prog->prog_type != BPF_PROG_TYPE_KPROBE) { Thanks Daniel, I will do that tomorrow. Someone will have to remember to tell Linus. Yes, indeed, depending which tree is merged first. Thanks, Daniel
Re: linux-next: build failure after merge of the tip tree
On 04/07/2015 06:18 PM, Alexei Starovoitov wrote: On 4/7/15 4:13 AM, Daniel Borkmann wrote: [ Cc'ing Dave, fyi ] On 04/07/2015 11:05 AM, Stephen Rothwell wrote: On Tue, 07 Apr 2015 10:56:13 +0200 Daniel Borkmann dan...@iogearbox.net wrote: On 04/07/2015 10:48 AM, Ingo Molnar wrote: * Stephen Rothwell s...@canb.auug.org.au wrote: After merging the tip tree, today's linux-next build (powerpc ppc64_defconfig) failed like this: kernel/events/core.c: In function 'perf_event_set_bpf_prog': kernel/events/core.c:6732:15: error: 'struct bpf_prog_aux' has no member named 'prog_type' if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) { ^ Caused by commit 2541517c32be (tracing, perf: Implement BPF programs attached to kprobes). Note, this must be some (rarely triggered) aspect of the ppc64 defconfig that neither x86 randconfigs nor most other arch defconfigs expose? Note, this is a merge conflict with the work that went via the net-next tree, i.e. 24701ecea76b (ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux). I believe that is why it didn't trigger on the tip tree. You should be able to resolve it in linux-next by changing the test to: if (prog->prog_type != BPF_PROG_TYPE_KPROBE) { Thanks Daniel, I will do that tomorrow. Someone will have to remember to tell Linus. Yes, indeed, depending which tree is merged first. Daniel's analysis is correct, but the fix for kernel/events/core.c should be:

- if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) {
+ if (prog->type != BPF_PROG_TYPE_KPROBE) {

instead of 'prog->prog_type'. Yes, absolutely, thanks!
[PATCH akpm] compiler-intel: fix wrong compiler barrier() macro
Cleanup commit 73679e508201 (compiler-intel.h: Remove duplicate definition) removed the double definition of __memory_barrier() intrinsics. However, in doing so, it also removed the preceding #undef barrier by accident, meaning, the actual barrier() macro from compiler-gcc.h with inline asm is still in place as __GNUC__ is provided. Subsequently, barrier() can never be defined as __memory_barrier() from compiler.h since it already has a definition in place and if we trust the comment in compiler-intel.h, ecc doesn't support gcc specific asm statements. I don't have an ecc at hand (unsure if that's still used in the field?) and only found this by accident during code review; a revert of that cleanup would be the simplest option.

Fixes: 73679e508201 (compiler-intel.h: Remove duplicate definition)
Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Reviewed-by: Pranith Kumar bobby.pr...@gmail.com
Cc: Pranith Kumar bobby.pr...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: mancha security manc...@zoho.com
---
Only resending fix for Andrew's tree, rebased.

 include/linux/compiler-intel.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/compiler-intel.h b/include/linux/compiler-intel.h
index 0c9a2f2..d4c7113 100644
--- a/include/linux/compiler-intel.h
+++ b/include/linux/compiler-intel.h
@@ -13,10 +13,12 @@
 /* Intel ECC compiler doesn't support gcc specific asm stmts.
  * It uses intrinsics to do the equivalent things.
  */
+#undef barrier
 #undef barrier_data
 #undef RELOC_HIDE
 #undef OPTIMIZER_HIDE_VAR

+#define barrier() __memory_barrier()
 #define barrier_data(ptr) barrier()

 #define RELOC_HIDE(ptr, off) \
--
1.9.3
Re: [PATCH v2] bpf: BPF based latency tracing
      ... - 131071  : 2     |                    |
   131072 - 262143  : 0     |                    |
   262144 - 524287  : 1     |                    |
   524288 - 1048575 : 174   |                    |

CPU 3
  latency           : count distribution
        1 - 1       : 0     |                    |
        2 - 3       : 0     |                    |
        4 - 7       : 0     |                    |
        8 - 15      : 0     |                    |
       16 - 31      : 0     |                    |
       32 - 63      : 0     |                    |
       64 - 127     : 0     |                    |
      128 - 255     : 0     |                    |
      256 - 511     : 0     |                    |
      512 - 1023    : 0     |                    |
     1024 - 2047    : 0     |                    |
     2048 - 4095    : 29626 |***                 |
     4096 - 8191    : 2704  |**                  |
     8192 - 16383   : 1090  |                    |
    16384 - 32767   : 160   |                    |
    32768 - 65535   : 72    |                    |
    65536 - 131071  : 32    |                    |
   131072 - 262143  : 26    |                    |
   262144 - 524287  : 12    |                    |
   524288 - 1048575 : 298   |                    |

All this is based on the trace3 examples written by Alexei Starovoitov a...@plumgrid.com. Signed-off-by: Daniel Wagner daniel.wag...@bmw-carit.de I think it would be useful to perhaps have two options: 1) User specifies a specific CPU and gets one such output as above. 2) Summary view, i.e. to have the samples of each CPU for comparison next to each other in columns and maybe the histogram view a bit more compressed (perhaps a summary of all CPUs). Anyway, it's sample code people can go with and modify individually. Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [PATCH net-next] test_bpf: extend tests for 32-bit endianness conversion
On 06/26/2015 05:25 PM, Xi Wang wrote: Currently ALU_END_FROM_BE 32 and ALU_END_FROM_LE 32 do not test if the upper bits of the result are zeros (the arm64 JIT had such bugs). Extend the two tests to catch this. Cc: Alexei Starovoitov a...@plumgrid.com Signed-off-by: Xi Wang xi.w...@gmail.com Thanks for extending the test suite! Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [PATCH] selftests: add seccomp suite
On 06/16/2015 07:54 PM, Kees Cook wrote: This imports the existing seccomp test suite into the kernel's selftests tree. It contains extensive testing of seccomp features and corner cases. There remain additional tests to move into the kernel tree, but they have not yet been ported to all the architectures seccomp supports: https://github.com/redpig/seccomp/tree/master/tests Signed-off-by: Kees Cook keesc...@chromium.org Awesome, thanks a bunch, Kees!
Re: [PATCH v2 net-next 0/3] bpf: share helpers between tracing and networking
On 06/16/2015 07:10 PM, Alexei Starovoitov wrote: ... Ideally we would allow a blend of tracing and networking programs, then the best solution would be one or two stable tracepoints in the networking stack where the skb is visible and the receiving/transmitting task is also visible, then skb->len and task->pid together would give a nice foundation for accurate stats. I think combining both seems interesting anyway; we need to find a way to make this gluing of both worlds easy to use, though. It's certainly interesting for stats/diagnostics, but one wouldn't be able to use the current/future skb eBPF helpers from {cls,act}_bpf in that context.
Re: [PATCH v2 net-next 3/3] bpf: let kprobe programs use bpf_get_smp_processor_id() helper
On 06/13/2015 04:39 AM, Alexei Starovoitov wrote: It's useful to do per-cpu histograms. Suggested-by: Daniel Wagner daniel.wag...@bmw-carit.de Signed-off-by: Alexei Starovoitov a...@plumgrid.com Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [PATCH v2 net-next 0/3] bpf: share helpers between tracing and networking
On 06/16/2015 05:28 AM, Alexei Starovoitov wrote: On 6/15/15 4:01 PM, David Miller wrote: Although I agree with the sentiment that this thing can cause surprising results and can be asking for trouble. If someone wants to filter traffic by UID they might make a simple ingress TC ebpf program using these new interfaces and expect it to work. But the UID their program will see will be the UID of whatever randomly happened to be executing when the packet was received and processed. yes, you're right. Such tc filters will be incorrect. Will send a partial revert disallowing them in tc. Sorry for the late reply [on vacation]; if you really want to, you could go via skb->sk->sk_socket->file and then retrieve credentials from there for the egress side (you can have a look at xt_owner). You'd need a different *_proto helper for tc in that case, which would then map to BPF_FUNC_get_current_uid_gid, etc. But that doesn't work for ingress, however, even if you had early demux, so you would need to let the eBPF helper function return an error code in that case.
Re: Edited draft of bpf(2) man page for review/enhancement
On 05/27/2015 10:43 AM, Michael Kerrisk (man-pages) wrote: Hello Alexei, I took the draft 3 of the bpf(2) man page that you sent back in March and did some substantial editing to clarify the language and add a few technical details. Could you please check the revised version below, to ensure I did not inject any errors. I also added a number of FIXMEs for pieces of the page that need further work. Could you take a look at these and let me know your thoughts, please. That's great, thanks! Minor comments: ... .TH BPF 2 2015-03-10 Linux Linux Programmer's Manual .SH NAME bpf - perform a command on an extended BPF map or program .SH SYNOPSIS .nf .B #include <linux/bpf.h> .sp .BI int bpf(int cmd, union bpf_attr *attr, unsigned int size); .SH DESCRIPTION The .BR bpf () system call performs a range of operations related to extended Berkeley Packet Filters. Extended BPF (or eBPF) is similar to the original BPF (or classic BPF) used to filter network packets. For both BPF and eBPF programs, the kernel statically analyzes the programs before loading them, in order to ensure that they cannot harm the running system. .P eBPF extends classic BPF in multiple ways including the ability to call in-kernel helper functions (via the .B BPF_CALL opcode extension provided by eBPF) and access shared data structures such as BPF maps. I would perhaps emphasize that maps can be shared among in-kernel eBPF programs, but also between kernel and user space. The programs can be written in a restricted C that is compiled into .\" FIXME In the next line, what is a restricted C? Where does .\" one get further information about it? So far only from the kernel samples directory and for tc classifier and action, from the tc man page and/or examples/bpf/ in the tc git tree. eBPF bytecode and executed on the in-kernel virtual machine or just-in-time compiled into native code. .SS Extended BPF Design/Architecture .P .\" FIXME In the following line, what does different data types mean?
.\" Are the values in a map not just blobs? Sort of, currently, these blobs can have different sizes of keys and values (you can even have structs as keys). For the map itself they are treated as a blob internally. However, recently, the bpf tail call got added where you can look up another program from an array map and call into it. Here, that particular type of map can only have entries of type eBPF program fd. I think, if needed, adding a paragraph on the tail call could be done as a follow-up after we have an initial man page in the tree included. BPF maps are a generic data structure for storage of different data types. A user process can create multiple maps (with key/value-pairs being opaque bytes of data) and access them via file descriptors. BPF programs can access maps from inside the kernel in parallel. It's up to the user process and BPF program to decide what they store inside maps. .P BPF programs are similar to kernel modules. They are loaded by the user process and automatically unloaded when the process exits. Generally that's true. Btw, in the 4.1 kernel, tc(8) also got support for eBPF classifiers and actions, and here it's slightly different: in tc, we load the programs, maps etc, and push down the eBPF program fd in order to let the kernel hold a reference on the program itself. Thus, there, the program fd that the application owns is gone when the application terminates, but the eBPF program itself still lives on inside the kernel. But perhaps it's already too much detail to mention here ... Each BPF program is a set of instructions that is safe to run until its completion. The in-kernel BPF verifier statically determines that the program terminates and is safe to execute. .\" FIXME In the following sentence, what does takes hold mean? Takes a reference. Meaning, that maps cannot disappear under us while the eBPF program that is using them in the kernel is still alive.
During verification, the program takes hold of maps that it intends to use, so selected maps cannot be removed until the program is unloaded. BPF programs can be attached to different events. .\" FIXME: In the next sentence, packets are not events. What .\" do you really mean to say here? (the arrival of a network packet?) Socket filters are meant here: setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, ...); These events can be packets, tracing events, and other types that may be added in the future. There's already one more type worth mentioning: tc classifier and actions (tc/tc_bpf.{c,h}). A new event triggers execution of the BPF program, which may store information about the event in the maps. Beyond storing data, BPF programs may call into in-kernel helper functions. I would mention that these in-kernel helpers are a fixed set of functions provided by the kernel (linux/bpf.h: BPF_FUNC_*), so you cannot call into arbitrary ones. The same program can be attached to multiple events and different programs can access the same map: .\" FIXME Can maps
Re: [PATCH net-next 1/4] bpf: allow bpf programs to tail-call other bpf programs
On 05/20/2015 01:59 AM, Alexei Starovoitov wrote: introduce bpf_tail_call(ctx, jmp_table, index) helper function which can be used from BPF programs like: int bpf_prog(struct pt_regs *ctx) { ... bpf_tail_call(ctx, jmp_table, index); ... } that is roughly equivalent to: int bpf_prog(struct pt_regs *ctx) { ... if (jmp_table[index]) return (*jmp_table[index])(ctx); ... } The important detail that it's not a normal call, but a tail call. The kernel stack is precious, so this helper reuses the current stack frame and jumps into another BPF program without adding extra call frame. It's trivially done in interpreter and a bit trickier in JITs. In case of x64 JIT the bigger part of generated assembler prologue is common for all programs, so it is simply skipped while jumping. Other JITs can do similar prologue-skipping optimization or do stack unwind before jumping into the next program. ... Signed-off-by: Alexei Starovoitov a...@plumgrid.com LGTM, thanks! Acked-by: Daniel Borkmann dan...@iogearbox.net -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/6] test_bpf: allow tests to specify an skb fragment.
On 08/03/2015 04:02 PM, Nicolas Schichan wrote: This introduce a new test->aux flag (FLAG_SKB_FRAG) to tell the populate_skb() function to add a fragment to the test skb containing the data specified in test->frag_data). Signed-off-by: Nicolas Schichan nschic...@freebox.fr Acked-by: Alexei Starovoitov a...@plumgrid.com Acked-by: Daniel Borkmann dan...@iogearbox.net I'm good with this change here, just a comment below in general. enum { CLASSIC = BIT(6), /* Old BPF instructions only. */ @@ -81,6 +83,7 @@ struct bpf_test { __u32 result; } test[MAX_SUBTESTS]; int (*fill_helper)(struct bpf_test *self); + __u8 frag_data[MAX_DATA]; }; We now have 286 tests, which is awesome! Perhaps, we need to start thinking of a better test description method soonish as the test_bpf.ko module grew to ~1.6M, i.e. whenever we add to struct bpf_test, it adds memory overhead upon all test cases. /* Large test cases need separate allocation and fill handler. */ @@ -4525,6 +4528,10 @@ static struct sk_buff *populate_skb(char *buf, int size) static void *generate_test_data(struct bpf_test *test, int sub) { + struct sk_buff *skb; + struct page *page; + void *ptr; + if (test->aux & FLAG_NO_DATA) return NULL; @@ -4532,7 +4539,36 @@ static void *generate_test_data(struct bpf_test *test, int sub) * subtests generate skbs of different sizes based on * the same data. */ - return populate_skb(test->data, test->test[sub].data_size); + skb = populate_skb(test->data, test->test[sub].data_size); + if (!skb) + return NULL; + + if (test->aux & FLAG_SKB_FRAG) { Really minor nit: declaration of page, ptr could have been only in this block.
Re: [PATCH 4/6] test_bpf: add module parameters to filter the tests to run.
On 08/03/2015 04:02 PM, Nicolas Schichan wrote: When developping on the interpreter or a particular JIT, it can be insteresting to restrict the test list to a specific test or a s/insteresting/interesting/ particular range of tests. This patch adds the following module parameters to the test_bpf module: * test_name=string: only the specified named test will be run. * test_id=number: only the test with the specified id will be run (see the output of test_pbf without parameters to get the test id). s/test_pbf/test_bpf/ * test_range=number,number: only the tests with IDs in the specified id range are run (see the output of test_pbf without s/test_pbf/test_bpf/ parameters to get the test ids). Any invalid range, test id or test name will result in -EINVAL being returned and no tests being run. Signed-off-by: Nicolas Schichan nschic...@freebox.fr Seems very useful for the test suite, thanks. Acked-by: Daniel Borkmann dan...@iogearbox.net --- lib/test_bpf.c | 73 ++ 1 file changed, 73 insertions(+) diff --git a/lib/test_bpf.c b/lib/test_bpf.c index f5606fb..abd0507 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -4870,10 +4870,72 @@ static int run_one(const struct bpf_prog *fp, struct bpf_test *test) return err_cnt; } +static char test_name[64]; +module_param_string(test_name, test_name, sizeof(test_name), 0); + +static int test_id = -1; +module_param(test_id, int, 0); + +static int test_range[2] = { -1, -1 }; +module_param_array(test_range, int, NULL, 0); + +static __init int find_test_index(const char *test_name) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + if (!strcmp(tests[i].descr, test_name)) + return i; + } + return -1; +} + static __init int prepare_bpf_tests(void) { int i; + if (test_id >= 0) { + /* +* if a test_id was specified, use test_range to +* conver only that test. s/conver/cover/ +*/ + if (test_id >= ARRAY_SIZE(tests)) { + pr_err("test_bpf: invalid test_id specified.\n"); + return -EINVAL; + } [...]
@@ -4893,6 +4955,14 @@ static __init void destroy_bpf_tests(void) } } +static bool exclude_test(int test_id) +{ + if (test_range[0] >= 0 && + (test_id < test_range[0] || test_id > test_range[1])) + return true; + return false; Minor nit: could directly return it, f.e.: return test_range[0] >= 0 && (test_id < test_range[0] || test_id > test_range[1]); Btw, for the range test in prepare_bpf_tests(), you could also reject a negative lower bound index right there.
Re: [PATCH 1/6] test_bpf: avoid oopsing the kernel when generate_test_data() fails.
On 08/03/2015 04:02 PM, Nicolas Schichan wrote: Signed-off-by: Nicolas Schichan nschic...@freebox.fr Acked-by: Alexei Starovoitov a...@plumgrid.com Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [PATCH v4 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event
On 07/28/2015 01:17 PM, Kaixu Xia wrote: Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'. This map only stores the pointer to struct perf_event. The user space event FDs from perf_event_open() syscall are converted to the pointer to struct perf_event and stored in map. Signed-off-by: Kaixu Xia xiaka...@huawei.com --- include/linux/bpf.h| 1 + include/linux/perf_event.h | 2 ++ include/uapi/linux/bpf.h | 1 + kernel/bpf/arraymap.c | 57 ++ kernel/bpf/verifier.c | 15 kernel/events/core.c | 17 ++ 6 files changed, 93 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 610b730..3c9c0eb 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -10,6 +10,7 @@ #include <uapi/linux/bpf.h> #include <linux/workqueue.h> #include <linux/file.h> +#include <linux/perf_event.h> struct bpf_map; diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 2027809..2ea4067 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -641,6 +641,7 @@ extern int perf_event_init_task(struct task_struct *child); extern void perf_event_exit_task(struct task_struct *child); extern void perf_event_free_task(struct task_struct *task); extern void perf_event_delayed_put(struct task_struct *task); +extern struct perf_event *perf_event_get(unsigned int fd); extern void perf_event_print_debug(void); extern void perf_pmu_disable(struct pmu *pmu); extern void perf_pmu_enable(struct pmu *pmu); @@ -979,6 +980,7 @@ static inline int perf_event_init_task(struct task_struct *child) { return 0; } static inline void perf_event_exit_task(struct task_struct *child){ } static inline void perf_event_free_task(struct task_struct *task) { } static inline void perf_event_delayed_put(struct task_struct *task) { } +static struct perf_event *perf_event_get(unsigned int fd) { return NULL; } static inline void perf_event_print_debug(void) { } static inline int perf_event_task_disable(void) { return -EINVAL; } static inline int perf_event_task_enable(void) {
return -EINVAL; } diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 29ef6f9..69a1f6b 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -114,6 +114,7 @@ enum bpf_map_type { BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_PROG_ARRAY, + BPF_MAP_TYPE_PERF_EVENT_ARRAY, }; enum bpf_prog_type { diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 4784cdc..a9e0235 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -297,3 +297,60 @@ static int __init register_prog_array_map(void) return 0; } late_initcall(register_prog_array_map); + +static void *perf_event_fd_array_get_ptr(struct bpf_array *array, int fd) +{ + struct perf_event *event; + + event = perf_event_get(fd); + if (IS_ERR(event)) + return event; So in case of !CONFIG_PERF_EVENTS, you define perf_event_get() to return NULL. Seems like this will give a nice NULL ptr deref below. ;) + /* +* prevent some crazy events so we can make our life easier +*/ + if (event->attr.type != PERF_TYPE_RAW && + event->attr.type != PERF_TYPE_HARDWARE) { + perf_event_release_kernel(event); + return ERR_PTR(-EINVAL); + } + return event; +} + +static void perf_event_fd_array_put_ptr(struct bpf_array *array, void *ptr) +{ + struct perf_event *event = (struct perf_event *)ptr; ( Same comment as in previous patch. ) + perf_event_release_kernel(event); +} + +static const struct fd_array_map_ops perf_event_fd_array_map_ops = { + .get_ptr= perf_event_fd_array_get_ptr, + .put_ptr= perf_event_fd_array_put_ptr, +}; ...
+late_initcall(register_perf_event_array_map); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 039d866..c70f7e7 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -924,6 +924,21 @@ static int check_call(struct verifier_env *env, int func_id) */ return -EINVAL; + if (map && map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY && + func_id != BPF_FUNC_perf_event_read) + /* perf_event_array map type needs extra care: +* only allow to pass it into bpf_perf_event_read() for now. +* bpf_map_update/delete_elem() must only be done via syscall +*/ + return -EINVAL; + + if (func_id == BPF_FUNC_perf_event_read && + map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY) + /* don't allow any other map type to be passed into +* bpf_perf_event_read() +*/ + return -EINVAL; Maybe a
Re: [PATCH v4 4/4] samples/bpf: example of get selected PMU counter value
On 07/28/2015 01:17 PM, Kaixu Xia wrote: This is a simple example and shows how to use the new ability to get the selected Hardware PMU counter value. Signed-off-by: Kaixu Xia xiaka...@huawei.com ... diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c new file mode 100644 index 000..e607eac --- /dev/null +++ b/samples/bpf/tracex6_user.c @@ -0,0 +1,67 @@ +#include <stdio.h> +#include <unistd.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> +#include <fcntl.h> +#include <poll.h> +#include <sys/ioctl.h> +#include <linux/perf_event.h> +#include <linux/bpf.h> +#include "libbpf.h" +#include "bpf_load.h" + +static void test_bpf_perf_event(void) +{ + int nr_cpus = sysconf(_SC_NPROCESSORS_CONF); + int *pmu_fd = malloc(nr_cpus * sizeof(int)); + unsigned long value; + int i; + + struct perf_event_attr attr_insn_pmu = { + .freq = 0, + .sample_period = 0x7fffULL, Nit: #define ... + .inherit = 0, + .type = PERF_TYPE_HARDWARE, + .read_format = 0, + .sample_type = 0, + .config = 0,/* PMU: cycles */ + }; + + for (i = 0; i < nr_cpus; i++) { + pmu_fd[i] = perf_event_open(&attr_insn_pmu, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0); + if (pmu_fd[i] < 0) + printf("event syscall failed\n"); + + bpf_update_elem(map_fd[0], &i, (pmu_fd + i), BPF_ANY); Nit: you use pmu_fd[i], but here (pmu_fd + i)? Maybe better &pmu_fd[i]? + ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0); + }
Re: [PATCH v4 1/4] bpf: Make the bpf_prog_array_map more generic
On 07/28/2015 01:17 PM, Kaixu Xia wrote: From: Wang Nan wangn...@huawei.com According to the comments from Daniel, rewrite part of the bpf_prog_array map code and make it more generic. So the new perf_event_array map type can reuse most of code with bpf_prog_array map and add fewer lines of special code. Tested the samples/bpf/tracex5 after this patch: $ sudo ./tracex5 ... dd-1051 [000] d...26.682903: : mmap dd-1051 [000] d...26.698348: : syscall=102 (one of get/set uid/pid/gid) dd-1051 [000] d...26.703892: : read(fd=0, buf=0078c010, size=512) dd-1051 [000] d...26.705847: : write(fd=1, buf=0078c010, size=512) dd-1051 [000] d...26.707914: : read(fd=0, buf=0078c010, size=512) dd-1051 [000] d...26.710988: : write(fd=1, buf=0078c010, size=512) dd-1051 [000] d...26.711865: : read(fd=0, buf=0078c010, size=512) dd-1051 [000] d...26.712704: : write(fd=1, buf=0078c010, size=512) ... Signed-off-by: Wang Nan wangn...@huawei.com Signed-off-by: Kaixu Xia xiaka...@huawei.com --- include/linux/bpf.h | 6 ++- kernel/bpf/arraymap.c | 104 +++--- kernel/bpf/syscall.c | 4 +- 3 files changed, 80 insertions(+), 34 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 4383476..610b730 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -130,6 +130,8 @@ struct bpf_prog_aux { }; }; +struct fd_array_map_ops; + struct bpf_array { struct bpf_map map; u32 elem_size; @@ -140,15 +142,17 @@ struct bpf_array { */ enum bpf_prog_type owner_prog_type; bool owner_jited; + const struct fd_array_map_ops* fd_ops; union { char value[0] __aligned(8); + void *ptrs[0] __aligned(8); struct bpf_prog *prog[0] __aligned(8); After your conversion, prog member from the union is not used here anymore (only as offsetof(...) in JITs). We should probably get rid of it then. 
}; }; #define MAX_TAIL_CALL_CNT 32 u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5); -void bpf_prog_array_map_clear(struct bpf_map *map); +void bpf_fd_array_map_clear(struct bpf_map *map); bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp); const struct bpf_func_proto *bpf_get_trace_printk_proto(void); diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index cb31229..4784cdc 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -150,15 +150,62 @@ static int __init register_array_map(void) } late_initcall(register_array_map); -static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr) +struct fd_array_map_ops { + void *(*get_ptr)(struct bpf_array *array, int fd); + void (*put_ptr)(struct bpf_array *array, void *ptr); +}; + +static void *prog_fd_array_get_ptr(struct bpf_array *array, int fd) +{ + struct bpf_prog *prog = bpf_prog_get(fd); + if (IS_ERR(prog)) + return prog; + + if (!bpf_prog_array_compatible(array, prog)) { + bpf_prog_put(prog); + return ERR_PTR(-EINVAL); + } + return prog; +} + +static void prog_fd_array_put_ptr(struct bpf_array *array __maybe_unused, + void *ptr) array member seems not to be used in both implementations. It should then probably not be part of the API? +{ + struct bpf_prog *prog = (struct bpf_prog *)ptr; No cast on void * needed. 
+ + bpf_prog_put_rcu(prog); +} + +static const struct fd_array_map_ops prog_fd_array_map_ops = { + .get_ptr= prog_fd_array_get_ptr, + .put_ptr= prog_fd_array_put_ptr, +}; + +static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr, + const struct fd_array_map_ops *ops) { - /* only bpf_prog file descriptors can be stored in prog_array map */ + struct bpf_map *map; + struct bpf_array *array; + + /* only file descriptors can be stored in this type of map */ if (attr->value_size != sizeof(u32)) return ERR_PTR(-EINVAL); - return array_map_alloc(attr); + + map = array_map_alloc(attr); + if (IS_ERR(map)) + return map; + + array = container_of(map, struct bpf_array, map); + array->fd_ops = ops; + return map; +} + +static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr) +{ + return fd_array_map_alloc(attr, &prog_fd_array_map_ops); } -static void prog_array_map_free(struct bpf_map *map) +static void fd_array_map_free(struct bpf_map *map) { struct bpf_array *array = container_of(map, struct bpf_array, map); int i; @@ -167,21 +214,21 @@ static void prog_array_map_free(struct bpf_map *map) /* make sure it's empty */
Re: [PATCH v4 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
On 07/28/2015 01:17 PM, Kaixu Xia wrote: According to the perf_event_map_fd and index, the function bpf_perf_event_read() can convert the corresponding map value to the pointer to struct perf_event and return the Hardware PMU counter value. Signed-off-by: Kaixu Xia xiaka...@huawei.com --- include/linux/bpf.h| 1 + include/linux/perf_event.h | 3 ++- include/uapi/linux/bpf.h | 1 + kernel/bpf/helpers.c | 36 kernel/events/core.c | 4 ++-- kernel/trace/bpf_trace.c | 2 ++ 6 files changed, 44 insertions(+), 3 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 3c9c0eb..516992c 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -190,6 +190,7 @@ extern const struct bpf_func_proto bpf_map_lookup_elem_proto; extern const struct bpf_func_proto bpf_map_update_elem_proto; extern const struct bpf_func_proto bpf_map_delete_elem_proto; +extern const struct bpf_func_proto bpf_perf_event_read_proto; extern const struct bpf_func_proto bpf_get_prandom_u32_proto; extern const struct bpf_func_proto bpf_get_smp_processor_id_proto; extern const struct bpf_func_proto bpf_tail_call_proto; diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 2ea4067..899abcb 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -662,7 +662,8 @@ extern void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu); extern u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running); - +extern void __perf_event_read(void *info); +extern u64 perf_event_count(struct perf_event *event); struct perf_sample_data { /* diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 69a1f6b..b9b13ce 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -250,6 +250,7 @@ enum bpf_func_id { * Return: 0 on success */ BPF_FUNC_get_current_comm, + BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(map, index) */ __BPF_FUNC_MAX_ID, }; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 
1447ec0..c40c5ea 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -182,3 +182,39 @@ const struct bpf_func_proto bpf_get_current_comm_proto = { .arg1_type = ARG_PTR_TO_STACK, .arg2_type = ARG_CONST_STACK_SIZE, }; + +static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5) +{ + struct bpf_map *map = (struct bpf_map *) (unsigned long) r1; + struct bpf_array *array = container_of(map, struct bpf_array, map); + struct perf_event *event; + + if (index >= array->map.max_entries) + return -E2BIG; Maybe unlikely(...), likewise below. + event = (struct perf_event *)array->ptrs[index]; + if (!event) + return -ENOENT; + + if (event->state != PERF_EVENT_STATE_ACTIVE) + return -EINVAL; + + if (event->oncpu != raw_smp_processor_id() && + event->ctx->task != current) + return -EINVAL; + + if (event->attr.inherit) + return -EINVAL; + + __perf_event_read(event); + + return perf_event_count(event); I believe this helper should rather go somewhere such as bpf_trace.c (or under kernel/events/ ?), wouldn't we otherwise get a build error when perf events are compiled out? Anyway, I let perf folks comment on that (and the helper in general). +} + +const struct bpf_func_proto bpf_perf_event_read_proto = { + .func = bpf_perf_event_read, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_ANYTHING, +}; diff --git a/kernel/events/core.c b/kernel/events/core.c index 58f0d47..c926c6d 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3177,7 +3177,7 @@ void perf_event_exec(void) /* * Cross CPU call to read the hardware event */ -static void __perf_event_read(void *info) +void __perf_event_read(void *info) Does this need to be declared in a header file, no?
{ struct perf_event *event = info; struct perf_event_context *ctx = event->ctx; @@ -3204,7 +3204,7 @@ static void __perf_event_read(void *info) raw_spin_unlock(&ctx->lock); } -static inline u64 perf_event_count(struct perf_event *event) +u64 perf_event_count(struct perf_event *event) Likewise? Should the inlining be preserved? { if (event->pmu->count) return event->pmu->count(event); diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 88a041a..9cf094f 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -183,6 +183,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func Btw, I'm wondering if the perf event map portions should actually better go
Re: [PATCH] packet: tpacket_snd(): fix signed/unsigned comparison
On 07/28/2015 12:57 PM, Alexander Drozdov wrote: tpacket_fill_skb() can return a negative value (-errno) which is stored in tp_len variable. In that case the following condition will be (but shouldn't be) true: tp_len > dev->mtu + dev->hard_header_len as dev->mtu and dev->hard_header_len are both unsigned. That may lead to just returning an incorrect EMSGSIZE errno to the user. Signed-off-by: Alexander Drozdov al.droz...@gmail.com Looks good to me, thanks! Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [BUG] net/ipv4: inconsistent routing table
[ please cc netdev ] On 08/05/2015 10:56 AM, Zang MingJie wrote: Hi: I found a bug when remove an ip address which is referenced by a routing entry. step to reproduce: ip li add type dummy ip li set dummy0 up ip ad add 10.0.0.1/24 dev dummy0 ip ad add 10.0.0.2/24 dev dummy0 ip ro add default via 10.0.0.2/24 ip ad del 10.0.0.2/24 dev dummy0 after deleting the secondary ip address, the routing entry still pointing to 10.0.0.2 # ip ro default via 10.0.0.2 dev dummy0 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1 but actually, kernel considers the default route is directly connected. # ip ro get 1.1.1.1 1.1.1.1 dev dummy0 src 10.0.0.1 cache
Re: [PATCH 3/6] test_bpf: test LD_ABS and LD_IND instructions on fragmented skbs.
On 08/03/2015 04:02 PM, Nicolas Schichan wrote: These new tests exercise various load sizes and offsets crossing the head/fragment boundary. Signed-off-by: Nicolas Schichan nschic...@freebox.fr Acked-by: Alexei Starovoitov a...@plumgrid.com Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [PATCH 5/6] test_bpf: add more tests for LD_ABS and LD_IND.
On 08/03/2015 04:02 PM, Nicolas Schichan wrote: This exerces the LD_ABS and LD_IND instructions for various sizes and alignments. This also checks that X when used as an offset to a BPF_IND instruction first in a filter is correctly set to 0. Signed-off-by: Nicolas Schichan nschic...@freebox.fr Acked-by: Alexei Starovoitov a...@plumgrid.com Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [PATCH 6/6] test_bpf: add tests checking that JIT/interpreter sets A and X to 0.
On 08/03/2015 04:02 PM, Nicolas Schichan wrote: It is mandatory for the JIT or interpreter to reset the A and X registers to 0 before running the filter. Check that it is the case on various ALU and JMP instructions. Signed-off-by: Nicolas Schichan nschic...@freebox.fr Acked-by: Alexei Starovoitov a...@plumgrid.com Acked-by: Daniel Borkmann dan...@iogearbox.net
Re: [PATCH 4/6] test_bpf: add module parameters to filter the tests to run.
On 08/03/2015 06:23 PM, Nicolas Schichan wrote: ... Btw, for the range test in prepare_bpf_tests(), you could also reject a negative lower bound index right there. I thought it was better to have all the sanity checks grouped in prepare_bpf_tests() (with the checking of the test_name and test_id parameters nearby) ? Also a negative lower bound is meaning that no range has been set so all tests should be run. I just got a bit confused when loading test_range=-100,1 was not rejected, but they do indeed all run in this case. Thanks, Daniel
Re: [PATCH 2/6] test_bpf: allow tests to specify an skb fragment.
On 08/03/2015 06:38 PM, Nicolas Schichan wrote: On 08/03/2015 05:29 PM, Daniel Borkmann wrote: On 08/03/2015 04:02 PM, Nicolas Schichan wrote: We now have 286 tests, which is awesome! Perhaps, we need to start thinking of a better test description method soonish as the test_bpf.ko module grew to ~1.6M, i.e. whenever we add to struct bpf_test, it adds memory overhead upon all test cases. Indeed, test_bpf.ko is turning quite large (1.4M when compiled for ARM). It looks like gzip is able to do wonders on the module though as I end up with a 94.7K test_bpf.ko.gz file and if the modutils are compiled with --enable-zlib, it will be gunziped automatically before being loaded to the kernel. I think it just contains a lot of zero blocks, which then compress nicely. I think that marking tests[] array as __initdata will help with the runtime memory use if someone forgets to rmmod the test_bpf module after a completely successful run. Can be done, too, yep. Do you want to send a patch? ;)
Re: [PATCH v4 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
On 07/28/2015 01:17 PM, Kaixu Xia wrote: Previous patch v3 url: https://lkml.org/lkml/2015/7/23/203 ... Kaixu Xia (3): bpf: Add new bpf map type to store the pointer to struct perf_event bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter samples/bpf: example of get selected PMU counter value Wang Nan (1): bpf: Make the bpf_prog_array_map more generic So kernel/bpf/ patches are usually going via netdev tree as main development happens there and to avoid cross tree conflicts, etc. Thus, please Cc also netdev in the next iterations. Maybe when these patches are in a perfect shape at some point, perf folks provide their Acked-by(s) to the series to give their consent, and then this should go via net-next? Or will this be organized differently? include/linux/bpf.h| 8 ++- include/linux/perf_event.h | 5 +- include/uapi/linux/bpf.h | 2 + kernel/bpf/arraymap.c | 161 - kernel/bpf/helpers.c | 36 ++ kernel/bpf/syscall.c | 4 +- kernel/bpf/verifier.c | 15 + kernel/events/core.c | 21 +- kernel/trace/bpf_trace.c | 2 + samples/bpf/Makefile | 4 ++ samples/bpf/bpf_helpers.h | 2 + samples/bpf/tracex6_kern.c | 26 samples/bpf/tracex6_user.c | 67 +++ 13 files changed, 316 insertions(+), 37 deletions(-) create mode 100644 samples/bpf/tracex6_kern.c create mode 100644 samples/bpf/tracex6_user.c
Re: [PATCH v2 net-next] bpf: s390: Fix build error caused by the struct bpf_array member name changed
On 08/11/2015 08:53 AM, Kaixu Xia wrote: There is a build error that 'struct bpf_array' has no member named 'prog' on s390. In commit 2a36f0b, the member 'prog' of struct bpf_array is replaced by 'ptrs'. So this patch fixes it. Signed-off-by: Kaixu Xia xiaka...@huawei.com You were also asked to add a proper Fixes tag : Fixes: 2a36f0b92eb6 ("bpf: Make the bpf_prog_array_map more generic") I guess this bug was reported by Fengguang as you have him in Cc ? If that should be the case, then please also add a Reported-by: tag for his bug report. Code looks good to me.
Re: [PATCH 26/31] net/sched: use kmemdup rather than duplicating its implementation
On 08/07/2015 09:59 AM, Andrzej Hajda wrote: The patch was generated using fixed coccinelle semantic patch scripts/coccinelle/api/memdup.cocci [1]. [1]: http://permalink.gmane.org/gmane.linux.kernel/2014320 Signed-off-by: Andrzej Hajda a.ha...@samsung.com Acked-by: Daniel Borkmann dan...@iogearbox.net Not sure where the rest of this series went, but if you want this patch to be routed via net-next tree (which I recommend, to avoid cross tree conflicts), then you would need to send these patches separately, rebased to that tree, and also mention [PATCH net-next XX/YY] in the subject. Thanks, Daniel
Re: Edited draft of bpf(2) man page for review/enhancement
On 07/22/2015 04:49 PM, Michael Kerrisk (man-pages) wrote: Hi Daniel, Sorry for the long delay in following up No worries, eBPF is quite some material. ;) On 05/27/2015 11:26 AM, Daniel Borkmann wrote: On 05/27/2015 10:43 AM, Michael Kerrisk (man-pages) wrote: Hello Alexei, I took the draft 3 of the bpf(2) man page that you sent back in March and did some substantial editing to clarify the language and add a few technical details. Could you please check the revised version below, to ensure I did not inject any errors. I also added a number of FIXMEs for pieces of the page that need further work. Could you take a look at these and let me know your thoughts, please. That's great, thanks! Minor comments: ... .TH BPF 2 2015-03-10 Linux "Linux Programmer's Manual" .SH NAME bpf - perform a command on an extended BPF map or program .SH SYNOPSIS .nf .B #include <linux/bpf.h> .sp .BI int bpf(int cmd, union bpf_attr *attr, unsigned int size); .SH DESCRIPTION The .BR bpf () system call performs a range of operations related to extended Berkeley Packet Filters. Extended BPF (or eBPF) is similar to the original BPF (or classic BPF) used to filter network packets. For both BPF and eBPF programs, the kernel statically analyzes the programs before loading them, in order to ensure that they cannot harm the running system. .P eBPF extends classic BPF in multiple ways including the ability to call in-kernel helper functions (via the .B BPF_CALL opcode extension provided by eBPF) and access shared data structures such as BPF maps. I would perhaps emphasize that maps can be shared among in-kernel eBPF programs, but also between kernel and user space. This is covered later in the page, under the BPF maps subheading. Maybe you missed that? (Or did you think it doesn't suffice?) Okay, I presume you mean: Maps are a generic data structure for storage of different types and sharing data between the kernel and user-space programs.
Maybe, to emphasize both options a bit (not sure if it's better in my words, though): Maps are a generic data structure for storage of different types and allow for sharing data among eBPF kernel programs, but also between kernel and user-space applications. The programs can be written in a restricted C that is compiled into .\" FIXME In the next line, what is a restricted C? Where does .\" one get further information about it? So far only from the kernel samples directory and for tc classifier and action, from the tc man page and/or examples/bpf/ in the tc git tree. So, given that we are several weeks down the track, and things may have changed, I'll re-ask the questions ;-) : * Is this restricted C documented anywhere? Not (yet) that I'm aware of. We were thinking short-mid term to polish the stuff that resides in the kernel documentation, that is, Documentation/networking/filter.txt, to get it in a better shape, which I presume would also include documentation on the restricted C. So far, examples are provided in the tc-bpf man page (see link below). The set of available helper functions callable from eBPF resides under (enum bpf_func_id): https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h * Is the procedure for compiling this restricted C documented anywhere? (Yes, it's LLVM, but are the suitable pipelines/options documented somewhere?) eBPF bytecode and executed on the in-kernel virtual machine or just-in-time compiled into native code. .SS Extended BPF Design/Architecture .P .\" FIXME In the following line, what does different data types mean? .\" Are the values in a map not just blobs? Sort of, currently, these blobs can have different sizes of keys and values (you can even have structs as keys). For the map itself they are treated as blobs internally. However, recently, bpf tail call got added where you can lookup another program from an array map and call into it.
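Since the thread keeps circling around map creation and kernel/user-space sharing, here is a rough user-space sketch of the two bpf(2) commands involved. This is a sketch only: the sys_bpf(), map_create() and map_update() wrapper names are made up for illustration, not from the man-page draft.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw syscall; glibc ships no bpf() wrapper. */
static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

/* Create an array map with 4-byte keys and values; returns an fd or -1. */
static int map_create(uint32_t max_entries)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_ARRAY;
	attr.key_size = sizeof(uint32_t);
	attr.value_size = sizeof(uint32_t);
	attr.max_entries = max_entries;

	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* User-space side of the sharing: store one key/value pair that an
 * in-kernel eBPF program could then read in parallel via the same map.
 */
static int map_update(int fd, uint32_t key, uint32_t value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = fd;
	attr.key = (uintptr_t)&key;
	attr.value = (uintptr_t)&value;
	attr.flags = BPF_ANY;

	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
```

Whether the calls succeed depends on kernel version and privileges (e.g. EPERM when unprivileged bpf is locked down), so real callers check errno.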
Here, that particular type of map can only have entries of type of eBPF program fd. I think, if needed, adding a paragraph to the tail call could be done as follow-up after we have an initial man page in the tree included. Okay -- I've added a FIXME placeholder for this, so we can revisit. Okay. BPF maps are a generic data structure for storage of different data types. A user process can create multiple maps (with key/value-pairs being opaque bytes of data) and access them via file descriptors. BPF programs can access maps from inside the kernel in parallel. It's up to the user process and BPF program to decide what they store inside maps. .P BPF programs are similar to kernel modules. They are loaded by the user process and automatically unloaded when the process exits. Generally that's true. Btw, in 4.1 kernel, tc(8) also got support for eBPF classifier and actions, and here it's slightly different: in tc, we load the programs, maps etc, and push down the eBPF program fd in order to let the kernel hold
Re: [PATCH v2 0/5] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
On 07/22/2015 10:09 AM, Kaixu Xia wrote: Previous patch v1 url: https://lkml.org/lkml/2015/7/17/287 [ Sorry to chime in late, just noticed this series now as I wasn't in Cc for the core BPF changes. More below ... ] This patchset allows the user to read PMU events in the following way: 1. Open the PMU using perf_event_open() (for each CPU or for each process he/she'd like to watch); 2. Create a BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map; 3. Insert FDs into the map with some key-value mapping scheme (i.e. cpuid -> event on that CPU); 4. Load and attach eBPF programs as usual; 5. In the eBPF program, get the perf_event_map_fd and key (i.e. cpuid got from bpf_get_smp_processor_id()), then use bpf_perf_event_read() to read from it; 6. Do anything he/she wants. changes in V2: - put atomic_long_inc_not_zero() between fdget() and fdput(); - limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE; - only read the event counter on current CPU or on current process; - add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the pointer to the struct perf_event; - according to the perf_event_map_fd and key, the function bpf_perf_event_read() can get the hardware PMU counter value; Patch 5/5 is a simple example and shows how to use this new ability of eBPF programs. The PMU counter data can be found in /sys/kernel/debug/tracing/trace(trace_pipe). (the cycles PMU value when 'kprobe/sys_write' sampling) $ cat /sys/kernel/debug/tracing/trace_pipe $ ./tracex6 ...
cat-677 [002] d..1 210.299270: : bpf count: CPU-2 5316659 cat-677 [002] d..1 210.299316: : bpf count: CPU-2 5378639 cat-677 [002] d..1 210.299362: : bpf count: CPU-2 5440654 cat-677 [002] d..1 210.299408: : bpf count: CPU-2 5503211 cat-677 [002] d..1 210.299454: : bpf count: CPU-2 5565438 cat-677 [002] d..1 210.299500: : bpf count: CPU-2 5627433 cat-677 [002] d..1 210.299547: : bpf count: CPU-2 5690033 cat-677 [002] d..1 210.299593: : bpf count: CPU-2 5752184 cat-677 [002] d..1 210.299639: : bpf count: CPU-2 5814543 ...-548 [009] d..1 210.299667: : bpf count: CPU-9 605418074 ...-548 [009] d..1 210.299692: : bpf count: CPU-9 605452692 cat-677 [002] d..1 210.299700: : bpf count: CPU-2 5896319 ...-548 [009] d..1 210.299710: : bpf count: CPU-9 605477824 ...-548 [009] d..1 210.299728: : bpf count: CPU-9 605501726 ...-548 [009] d..1 210.299745: : bpf count: CPU-9 605525279 ...-548 [009] d..1 210.299762: : bpf count: CPU-9 605547817 ...-548 [009] d..1 210.299778: : bpf count: CPU-9 605570433 ...-548 [009] d..1 210.299795: : bpf count: CPU-9 605592743 ... The detail of the patches is as follows: Patch 1/5 introduces a new bpf map type. This map only stores the pointer to struct perf_event; Patch 2/5 introduces a map_traverse_elem() function for further use; Patch 3/5 converts event file descriptors into perf_event structures when adding a new element to the map; So far all the map backends are of generic nature, knowing absolutely nothing about a particular consumer/subsystem of eBPF (tc, socket filters, etc). The tail call is a bit special, but nevertheless generic for each user and [very] useful, so it makes sense to inherit from the array map and move the code there. I don't really like that we start adding new _special_-cased maps into the eBPF core code here, it seems quite hacky. :( From your rather terse commit description where you introduce the maps, I failed to see a detailed elaboration on this, i.e. why it cannot be abstracted any differently?
Patch 4/5 implements the function bpf_perf_event_read() that gets the selected hardware PMU counter; Patch 5/5 gives a simple example. Kaixu Xia (5): bpf: Add new bpf map type to store the pointer to struct perf_event bpf: Add function map->ops->map_traverse_elem() to traverse map elems bpf: Save the pointer to struct perf_event to map bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter samples/bpf: example of get selected PMU counter value include/linux/bpf.h| 6 +++ include/linux/perf_event.h | 5 ++- include/uapi/linux/bpf.h | 3 ++ kernel/bpf/arraymap.c | 110 + kernel/bpf/helpers.c | 42 + kernel/bpf/syscall.c | 26 +++ kernel/events/core.c | 30 - kernel/trace/bpf_trace.c | 2 + samples/bpf/Makefile | 4 ++ samples/bpf/bpf_helpers.h | 2 + samples/bpf/tracex6_kern.c | 27 +++ samples/bpf/tracex6_user.c | 67 +++ 12 files changed, 321 insertions(+), 3 deletions(-) create mode 100644
Re: Draft 3 of bpf(2) man page for review
On 07/23/2015 01:23 PM, Michael Kerrisk (man-pages) wrote: ... Btw, a user obviously can close() the map fds if he wants to, but ultimately they're freed when the program unloads. Okay. (Not sure if you meant that something should be added to the page.) I think not necessary. [...] The attributes key_size and value_size will be used by the attribute's? Nope. But I changed this to The key_size and value_size attributes will be, which may read clearer. Sorry, true, I was a bit confused. :) [...] The type __u64 is kernel internal, so if there's no strict reason to use it, we should just use what's provided by stdint.h. Agreed. Done. (By the way, what about all the __u32 and __u64 elements in the bpf_attr union?) I wouldn't change the bpf_attr from the uapi, just the provided example code here; I presume people might copy from here when they build their own library, and in userspace uint64_t seems to be more natural. [...] * map_update_elem() replaces elements in a non-atomic fashion; for atomic updates, a hash-table map should be used instead. This point here is most important, i.e. to not have false user expectations. Maybe it's also worth mentioning that when you have a value_size of sizeof(long), you can however use the __sync_fetch_and_add() atomic builtin from the LLVM backend. I think I'll leave out that detail for the moment. Ok, I guess we could revisit/clarify that at a later point in time. I'd add a TODO comment to the source or the like, as this also is related to the 2nd below use case (aggregation/accounting), where an array is typically used. Among the uses for array maps are the following: * As global eBPF variables: an array of 1 element whose key is (index) 0 and where the value is a collection of 'global' variables which eBPF programs can use to keep state between events. * Aggregation of tracing events into a fixed set of buckets. [...] * license is a license string, which must be GPL compatible to call helper functions marked gpl_only. Not strictly.
So here, the same rules apply as with kernel modules. I.e. what the kernel checks for are the following license strings: static inline int license_is_gpl_compatible(const char *license) { return (strcmp(license, "GPL") == 0 || strcmp(license, "GPL v2") == 0 || strcmp(license, "GPL and additional rights") == 0 || strcmp(license, "Dual BSD/GPL") == 0 || strcmp(license, "Dual MIT/GPL") == 0 || strcmp(license, "Dual MPL/GPL") == 0); } With any of them, the eBPF program is declared GPL compatible. Maybe of interest for those that want to use dual licensing of some sort. So, I'm a little unclear here. What text do you suggest for the page? Maybe we should mention in addition that the same licensing rules apply as in the case with kernel modules, so dual licenses could also be used. * log_buf is a pointer to a caller-allocated buffer in which the in-kernel verifier can store the verification log. This log is a multi-line string that can be checked by the program author in order to understand how the verifier came to the conclusion that the BPF program is unsafe. The format of the output can change at any time as the verifier evolves. * log_size size of the buffer pointed to by log_buf. If the size of the buffer is not large enough to store all verifier messages, -1 is returned and errno is set to ENOSPC. * log_level verbosity level of the verifier. A value of zero means that the verifier will not provide a log. Note that the log buffer is optional, as mentioned here (log_level = 0). The above example code of bpf_prog_load() suggests that it always needs to be provided. I once ran indeed into an issue where the program itself was correct, but it got rejected by the kernel because my log buffer size was too small, so in tc we now have it larger, as bpf_log_buf[65536] ... So, I'm not clear. Do you mean that some piece of text here in the page should be changed? If so, could you elaborate?
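For the log_buf/log_size point being discussed, a minimal sketch of a BPF_PROG_LOAD invocation that hands the verifier a generously sized log buffer may help. Sketch only: the wrapper name and buffer size are illustrative (mirroring the tc anecdote), and the two instructions are written as raw .code values to stay self-contained.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static char bpf_log_buf[65536];	/* generously sized, as tc ended up doing */

/* Load a trivial "return 0" socket filter, asking the verifier for a
 * log.  With log_level > 0, log_buf and log_size must both be set;
 * with log_level == 0 they must both be zero, or the load fails.
 */
static int prog_load_with_log(void)
{
	struct bpf_insn insns[] = {
		{ .code = 0xb7 },	/* BPF_ALU64 | BPF_MOV | BPF_K: r0 = 0 */
		{ .code = 0x95 },	/* BPF_JMP | BPF_EXIT: return r0 */
	};
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insns = (uintptr_t)insns;
	attr.insn_cnt = sizeof(insns) / sizeof(insns[0]);
	attr.license = (uintptr_t)"GPL";
	attr.log_buf = (uintptr_t)bpf_log_buf;
	attr.log_size = sizeof(bpf_log_buf);
	attr.log_level = 1;

	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

On rejection the caller reads the multi-line verifier trace from bpf_log_buf; an ENOSPC result means the buffer was too small for the full log.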
I'd maybe only mention in addition that in log_level=0 case, we also must not provide a log_buf and log_size, otherwise we get EINVAL. [...] I had to read this twice. ;) Maybe this needs to be reworded slightly. It just means that depending on the program type that the author selects, you might end up with a different subset of helper functions, and a different program input/context. For example tracing does not have the exact same helpers as socket filters (it might have some that can be used by both). Also, the eBPF program input (context) for socket filters is a network
Re: Draft 3 of bpf(2) man page for review
On 07/23/2015 03:36 PM, Michael Kerrisk (man-pages) wrote: ... Ok, I guess we could revisit/clarify that at a later point in time. I'd add a TODO comment to the source or the like, as this also is related to the 2nd below use case (aggregation/accounting), where an array is typically used. Okay. FIXME added. Great. [...] I'd maybe only mention in addition that in log_level=0 case, we also must not provide a log_buf and log_size, otherwise we get EINVAL. I changed the text to: * log_level verbosity level of the verifier. A value of zero means that the verifier will not provide a log; in this case, log_buf must be a NULL pointer, and log_size must be zero. Ok. [...] Thanks for clearing that up -- I added the following sentence JIT compiler for eBPF is currently available for the x86-64, arm64, and s390 architectures. Okay? Yep. Thanks, Daniel
Re: Draft 3 of bpf(2) man page for review
Hi Michael, looks good already, a couple of comments inline, on top of Alexei's feedback: On 07/22/2015 10:10 PM, Michael Kerrisk (man-pages) wrote: ... NAME bpf - perform a command on an extended eBPF map or program 'extended eBPF' should perhaps just say 'eBPF' or 'extended BPF' (the 'e' itself stands for 'extended') SYNOPSIS #include <linux/bpf.h> int bpf(int cmd, union bpf_attr *attr, unsigned int size); DESCRIPTION The bpf() system call performs a range of operations related to extended Berkeley Packet Filters. Extended BPF (or eBPF) is similar to the original (classic) BPF (cBPF) used to filter network packets. For both cBPF and eBPF programs, the kernel statically analyzes the programs before loading them, in order to ensure that they cannot harm the running system. eBPF extends cBPF in multiple ways, including the ability to call a fixed set of in-kernel helper functions (via the BPF_CALL opcode extension provided by eBPF) and access shared data structures such as eBPF maps. Extended BPF Design/Architecture BPF maps are a generic data structure for storage of different Maybe s/BPF/eBPF/ as we introduced its definition above and used 'eBPF maps' just in the previous sentence. (I would from there onwards just use either eBPF or cBPF; that probably makes it more clear.) data types. A user process can create multiple maps (with key/value-pairs being opaque bytes of data) and access them via file descriptors. Different eBPF programs can access the same maps in parallel. It's up to the user process and eBPF program to decide what they store inside maps. eBPF programs are similar to kernel modules. They are loaded by the user process and automatically unloaded when the process exits. Each program is a set of instructions that is safe to run The 1st and 2nd sentence in that order/combination may sound a bit weird. Maybe I would just drop the first sentence? I would argue that there might be a few similarities, but more differences overall.
So I guess we'd either need to elaborate on the 1st sentence or just leave it out (could perhaps be a FIXME comment to later on introduce a new section that elaborates on both?). until its completion. An in-kernel verifier statically determines that the eBPF program terminates and is safe to execute. During verification, the kernel increments reference counts for each of the maps that the eBPF program uses, so that the selected maps cannot be removed until the program is unloaded. s/selected/attached/ ? Btw, a user obviously can close() the map fds if he wants to, but ultimately they're freed when the program unloads. eBPF programs can be attached to different events. These events can be the arrival of network packets, tracing events, classification event by qdisc (for eBPF programs attached to a tc(8) classifier), and other types that may be added in the future. A Maybe: classification events by network queuing disciplines new event triggers execution of the eBPF program, which may store information about the event in eBPF maps. Beyond storing data, eBPF programs may call a fixed set of in-kernel helper functions. I think this was mentioned before, but ok. The same eBPF program can be attached to multiple events and different eBPF programs can access the same map: tracing tracing tracing packet packet event A event B event C on eth0 on eth1 | | | | | | | | | | -- tracing -- tracing socket socket prog_1 prog_2 prog_3 prog_4 | | || |--- -| |---| map_3 map_1 map_2 Maybe prog_4 example could also be: s/socket/tc ingress classifier/ ;) Arguments The operation to be performed by the bpf() system call is determined by the cmd argument. Each operation takes an accompanying argument, provided via attr, which is a pointer to a union of type bpf_attr (see below). The size argument is the size of the union pointed to by attr. The value provided in cmd is one of the following: BPF_MAP_CREATE Create a map with and return a file descriptor that refers to the map.
'Create a map with and' BPF_MAP_LOOKUP_ELEM Look up an element by key in a specified map and return its value. BPF_MAP_UPDATE_ELEM
Re: [PATCH v2] cgroup: net_cls: fix false-positive suspicious RCU usage
On 07/22/2015 11:23 AM, Konstantin Khlebnikov wrote: In dev_queue_xmit() net_cls protected with rcu-bh. ... Signed-off-by: Konstantin Khlebnikov khlebni...@yandex-team.ru --- net/core/netclassid_cgroup.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c index 1f2a126f4ffa..6441f47b1a8f 100644 --- a/net/core/netclassid_cgroup.c +++ b/net/core/netclassid_cgroup.c @@ -23,7 +23,8 @@ static inline struct cgroup_cls_state *css_cls_state(struct cgroup_subsys_state struct cgroup_cls_state *task_cls_state(struct task_struct *p) { - return css_cls_state(task_css(p, net_cls_cgrp_id)); + return css_cls_state(task_css_check(p, net_cls_cgrp_id, + rcu_read_lock_bh_held())); Did you also check that after your patch this doesn't trigger on ingress either? There, this code path could be invoked under rcu_read_lock(). So, perhaps you need to check for both. } EXPORT_SYMBOL_GPL(task_cls_state);
Re: [PATCH v2] cgroup: net_cls: fix false-positive suspicious RCU usage
On 07/22/2015 02:03 PM, Konstantin Khlebnikov wrote: On 22.07.2015 14:56, Daniel Borkmann wrote: On 07/22/2015 11:23 AM, Konstantin Khlebnikov wrote: In dev_queue_xmit() net_cls protected with rcu-bh. ... Signed-off-by: Konstantin Khlebnikov khlebni...@yandex-team.ru --- net/core/netclassid_cgroup.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c index 1f2a126f4ffa..6441f47b1a8f 100644 --- a/net/core/netclassid_cgroup.c +++ b/net/core/netclassid_cgroup.c @@ -23,7 +23,8 @@ static inline struct cgroup_cls_state *css_cls_state(struct cgroup_subsys_state struct cgroup_cls_state *task_cls_state(struct task_struct *p) { -return css_cls_state(task_css(p, net_cls_cgrp_id)); +return css_cls_state(task_css_check(p, net_cls_cgrp_id, +rcu_read_lock_bh_held())); Did you also check that after your patch this doesn't trigger on ingress either? There, this code path could be invoked under rcu_read_lock(). So, perhaps you need to check for both. I haven't seen warnings with this patch. rcu_read_lock_held() is checked inside rcu_dereference_check() inside task_css_check(). Thanks, agreed.
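The resolution above — that rcu_dereference_check() already ORs the plain rcu_read_lock_held() into the caller-supplied condition, so one annotation covers both the rcu-bh egress path and the rcu ingress path — can be illustrated with a tiny user-space mock (the mock_* names are invented; this is not kernel code):

```c
/* Toy stand-ins for lockdep's rcu_read_lock_held() and
 * rcu_read_lock_bh_held() state.
 */
static int mock_rcu_held;
static int mock_rcu_bh_held;

/* Shape of the condition rcu_dereference_check() ends up testing for
 * task_css_check(p, id, rcu_read_lock_bh_held()): the caller's extra
 * condition is OR-ed with the plain RCU check, so either context
 * satisfies the debug check.
 */
static int deref_check_ok(int caller_condition)
{
	return caller_condition || mock_rcu_held;
}
```

With only rcu-bh held (egress) or only plain rcu held (ingress) the check passes; with neither, lockdep would complain.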
Re: Further work on bpf(2)
On 07/24/2015 11:11 AM, Michael Kerrisk (man-pages) wrote: Hello Alexei, Daniel, Now that we have a version of the bpf(2) man page out the door, I'd like to see it refined, since there are a number of missing pieces. In particular: * The page lacks documentation of the BPF_MAP_TYPE_PROG_ARRAY map type. * The page needs more info on the BPF_PROG_TYPE_SOCKET_FILTER program type. * The page lacks documentation of the BPF_PROG_TYPE_KPROBE program type. commit 2541517c32be2531e0da59dfd7efc1ce844644f5 Author: Alexei Starovoitov a...@plumgrid.com Date: Wed Mar 25 12:49:20 2015 -0700 tracing, perf: Implement BPF programs attached to kprobes * The page lacks documentation of the BPF_PROG_TYPE_SCHED_CLS program type. commit 96be4325f443dbbfeb37d2a157675ac0736531a1 Author: Daniel Borkmann dan...@iogearbox.net Date: Sun Mar 1 12:31:46 2015 +0100 ebpf: add sched_cls_type and map it to sk_filter's verifier ops commit e2e9b6541dd4b31848079da80fe2253daaafb549 Author: Daniel Borkmann dan...@iogearbox.net Date: Sun Mar 1 12:31:48 2015 +0100 cls_bpf: add initial eBPF support for programmable classifiers * The page lacks documentation of the BPF_PROG_TYPE_SCHED_ACT program type. commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c Author: Daniel Borkmann dan...@iogearbox.net Date: Fri Mar 20 15:11:11 2015 +0100 ebpf: add sched_act_type and map it to sk_filter's verifier ops commit a8cb5f556b567974d75ea29c15181c445c541b1f Author: Daniel Borkmann dan...@iogearbox.net Date: Fri Mar 20 15:11:12 2015 +0100 act_bpf: add initial eBPF support for actions Between the two of you, could you propose some text for each of the above, ideally as man-page patches. There are also various other pieces of the page that need work, which I've marked with FIXMEs. Could you take a look at these please? Ok, sure. We will sort this out and get back to you with a couple of patches some time next week.
Thanks, Daniel
Re: [rhashtable] WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:301 __debug_object_init()
On 07/14/2015 07:21 AM, Fengguang Wu wrote: Sorry, please ignore -- this no longer happens in linux-next, so should be fine. Seen this before; this fixed it back then: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/lib/test_rhashtable.c?id=b7f5e5c7f8cedf6b69c9702d448cdf78ffc45c7b
Re: [PATCH] jhash: Deinline jhash, jhash2 and __jhash_nwords
On 07/16/2015 02:15 PM, Denys Vlasenko wrote: On 07/16/2015 12:41 PM, Thomas Graf wrote: On 07/16/15 at 12:02pm, Denys Vlasenko wrote: +/* jhash - hash an arbitrary key + * @k: sequence of bytes as key + * @length: the length of the key + * @initval: the previous hash, or an arbitrary value + * + * The generic version, hashes an arbitrary sequence of bytes. + * No alignment or length assumptions are made about the input key. + * + * Returns the hash value of the key. The result depends on endianness. + */ +u32 jhash(const void *key, u32 length, u32 initval) Shouldn't these live in lib/jhash.c or something? Otherwise everyone needs to depend on CONFIG_RHASHTABLE There is no CONFIG_RHASHTABLE, rhashtable.c is compiled unconditionally. I will send an alternative patch, which creates jhash.c; apply whichever version you like most. Please also Cc netdev as the networking subsys is one of the main users of jhash in various critical paths. Did you run e.g. some *_RR work load benchmarks as well to make sure there's no regression?
Re: Edited draft of bpf(2) man page for review/enhancement
Hi Michael, is there any update on the bpf(2) man-page since last time, wrt having an initial version in your tree? Thanks again, Daniel
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
dev_t devt, const struct device_type *type, + void *drvdata, const char *fmt, ...); +extern __printf(6, 7) struct device *device_create_with_groups(struct class *cls, struct device *parent, dev_t devt, void *drvdata, const struct attribute_group **groups, diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 564f1f0..55e5aad 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -63,50 +63,17 @@ struct bpf_insn { __s32 imm;/* signed immediate constant */ }; -/* BPF syscall commands */ +/* BPF syscall commands, see bpf(2) man-page for details. */ enum bpf_cmd { - /* create a map with given type and attributes -* fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size) -* returns fd or negative error -* map is deleted when fd is closed -*/ BPF_MAP_CREATE, - - /* lookup key in a given map -* err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) -* Using attr->map_fd, attr->key, attr->value -* returns zero and stores found elem into value -* or negative error -*/ BPF_MAP_LOOKUP_ELEM, - - /* create or update key/value pair in a given map -* err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) -* Using attr->map_fd, attr->key, attr->value, attr->flags -* returns zero or negative error -*/ BPF_MAP_UPDATE_ELEM, - - /* find and delete elem by key in a given map -* err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) -* Using attr->map_fd, attr->key -* returns zero or negative error -*/ BPF_MAP_DELETE_ELEM, - - /* lookup key in a given map and return next key -* err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size) -* Using attr->map_fd, attr->key, attr->next_key -* returns zero and stores next key or negative error -*/ BPF_MAP_GET_NEXT_KEY, - - /* verify and load eBPF program -* prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size) -* Using attr->prog_type, attr->insns, attr->license -* returns fd or negative error -*/ BPF_PROG_LOAD, + BPF_DEV_CREATE, + BPF_DEV_DESTROY, + BPF_DEV_CONNECT, }; enum 
bpf_map_type { @@ -160,6 +127,10 @@ union bpf_attr { __aligned_u64 log_buf;/* user supplied buffer */ __u32 kern_version; /* checked when prog_type=kprobe */ }; + + struct { /* anonymous struct used by BPF_DEV_* commands */ + __u32 fd; + }; } __attribute__((aligned(8))); /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index e6983be..f871ca6 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1,2 +1,4 @@ obj-y := core.o -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o + +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o device.o helpers.o +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 8086471..334b1bd 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -92,6 +92,7 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) fp->pages = size / PAGE_SIZE; fp->aux = aux; + fp->aux->prog = fp; return fp; } @@ -116,6 +117,7 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE); fp->pages = size / PAGE_SIZE; + fp->aux->prog = fp; /* We keep fp->aux from fp_old around in the new * reallocated structure. @@ -726,7 +728,6 @@ void bpf_prog_free(struct bpf_prog *fp) struct bpf_prog_aux *aux = fp->aux; INIT_WORK(&aux->work, bpf_prog_free_deferred); - aux->prog = fp; schedule_work(&aux->work); } EXPORT_SYMBOL_GPL(bpf_prog_free); diff --git a/kernel/bpf/device.c b/kernel/bpf/device.c new file mode 100644 index 000..711c9a4 --- /dev/null +++ b/kernel/bpf/device.c @@ -0,0 +1,441 @@ +/* + * Special file backend for persistent eBPF maps and programs, used by + * bpf() system call.
+ * + * (C) 2015 Daniel Borkmann <dan...@iogearbox.net> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define BPF_MAX_DEVS (1UL << MINORBITS) +#define BPF_MODE_DEF (S_IRUSR
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/22/2015 09:35 PM, Eric W. Biederman wrote: Daniel Borkmann <dan...@iogearbox.net> writes: On 10/20/2015 08:56 PM, Eric W. Biederman wrote: ... Just FYI: Using a device for this kind of interface is pretty much a non-starter as that quickly gets you into situations where things do not work in containers. If someone gets a version of device namespaces past GregKH it might be up for discussion to use character devices. Okay, you are referring to this discussion here: http://thread.gmane.org/gmane.linux.kernel.containers/26760 That is a piece of it. It is an old old discussion (which generally has been handled poorly). For the foreseeable future device namespaces have a firm NACK by GregKH. Which means that dynamic character device based interfaces do not work in containers. Which means if you are not talking about physical hardware, character devices are a poor fit. Yes, it breaks down with real namespace support. Reworking the set with an improved version of the fs code is already in progress. Thanks, Daniel
Re: [PATCH v8] seccomp, ptrace: add support for dumping seccomp filters
On 10/21/2015 10:12 PM, Kees Cook wrote: On Wed, Oct 21, 2015 at 12:15 PM, Tycho Andersen wrote: Hi Oleg, On Wed, Oct 21, 2015 at 08:51:46PM +0200, Oleg Nesterov wrote: On 10/20, Tycho Andersen wrote: Hi Kees, Oleg, On Tue, Oct 20, 2015 at 10:20:24PM +0200, Oleg Nesterov wrote: No, you can't do copy_to_user() from atomic context. You need to pin this filter, drop the lock/irq, then copy_to_user(). Attached is a patch which addresses this. Looks good to me, feel free to add my reviewed-by. a couple of questions, I am just curious... +long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, + void __user *data) +{ + struct seccomp_filter *filter; + struct sock_fprog_kern *fprog; + long ret; + unsigned long count = 0; + + if (!capable(CAP_SYS_ADMIN) || + current->seccomp.mode != SECCOMP_MODE_DISABLED) { + return -EACCES; + } + + spin_lock_irq(&task->sighand->siglock); + if (task->seccomp.mode != SECCOMP_MODE_FILTER) { + ret = -EINVAL; + goto out; + } + + filter = task->seccomp.filter; + while (filter) { + filter = filter->prev; + count++; + } + + if (filter_off >= count) { + ret = -ENOENT; + goto out; + } + count -= filter_off; + + filter = task->seccomp.filter; + while (filter && count > 1) { + filter = filter->prev; + count--; + } + + if (WARN_ON(count != 1)) { + /* The filter tree shouldn't shrink while we're using it. */ + ret = -ENOENT; Yes, but this looks a bit confusing. If we want this WARN_ON() check because we are paranoid, then we should do WARN_ON(count != 1 || filter); I guess you mean !filter here? We want filter to be non-null, because we use it later. And "while we're using it" looks misleading; we rely on ->siglock. Plus, if we could be shrunk, the additional check can't help anyway; we could use the freed filter. So I don't really understand this check and "filter != NULL" in the previous "while (filter && count > 1)". Nevermind... Just paranoia. You're right that we could get rid of WARN_ON and the null check.
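The two-pass walk being debated (count the ->prev-linked filters, then stop count - filter_off - 1 hops from the newest) can be mocked in user space as follows. This is purely illustrative, not the kernel implementation; the mock_* names are invented, and the loop mirrors the quoted patch verbatim.

```c
#include <stddef.h>

/* Minimal mock of seccomp's filter list: newest entry first, linked
 * via ->prev toward the first-installed filter.
 */
struct mock_filter {
	struct mock_filter *prev;
	int id;
};

/* Same shape as the quoted seccomp_get_filter() walk: with this loop,
 * filter_off == 0 lands on the first-installed filter at the tail of
 * the chain.  Returns NULL when filter_off is out of range (the
 * kernel code returns -ENOENT there).
 */
static struct mock_filter *mock_get_filter(struct mock_filter *newest,
					   unsigned long filter_off)
{
	struct mock_filter *filter;
	unsigned long count = 0;

	for (filter = newest; filter; filter = filter->prev)
		count++;

	if (filter_off >= count)
		return NULL;
	count -= filter_off;

	filter = newest;
	while (filter && count > 1) {
		filter = filter->prev;
		count--;
	}
	return filter;
}
```

Because ->siglock is held across both passes in the kernel, the list cannot shrink between the count and the walk, which is what makes the WARN_ON purely paranoia.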
I can send an updated patch to drop these bits if necessary. Kees? I like being really paranoid when dealing with the filters. Let's keep the WARN_ON (with the "|| !filter" added) but maybe wrap it in "unlikely"? Btw, the conditions inside the WARN_ON() macro would already resolve to unlikely(). Best, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
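The two-pass walk in seccomp_get_filter() above can be modeled in plain userspace C: count the singly linked ->prev chain once, then step from the head until only (count - filter_off) nodes remain. The sketch below is illustrative only; the names (toy_filter, filter_at_offset, demo_lookup) are made up and the real code holds ->siglock while walking.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the seccomp filter stack: the newest filter sits at the
 * head, each node pointing at the previously installed one via ->prev. */
struct toy_filter {
	int id;
	struct toy_filter *prev;
};

/* Mirrors the two loops in seccomp_get_filter(): count the chain, then
 * take (count - filter_off - 1) steps from the head. */
static struct toy_filter *filter_at_offset(struct toy_filter *head,
					   unsigned long filter_off)
{
	struct toy_filter *f;
	unsigned long count = 0;

	for (f = head; f; f = f->prev)
		count++;

	if (filter_off >= count)
		return NULL;		/* -ENOENT in the kernel */

	count -= filter_off;
	for (f = head; f && count > 1; f = f->prev)
		count--;

	return f;			/* count == 1 here unless the list shrank */
}

/* Self-check helper: builds a 3-deep stack [3 -> 2 -> 1] and returns the
 * id found at the given offset, or -1 for "not found". */
static int demo_lookup(unsigned long off)
{
	static struct toy_filter a = {1, NULL}, b = {2, &a}, c = {3, &b};
	struct toy_filter *f = filter_at_offset(&c, off);

	return f ? f->id : -1;
}
```

Note that with this logic filter_off counts from the tail of the chain: offset 0 lands on the oldest filter, offset count-1 on the most recently installed one.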
Re: [PATCH v8] seccomp, ptrace: add support for dumping seccomp filters
Hi Tycho, On 10/27/2015 01:04 AM, Tycho Andersen wrote: On Mon, Oct 26, 2015 at 04:07:01PM +0900, Kees Cook wrote: On Mon, Oct 26, 2015 at 3:46 PM, Kees Cook wrote: Cool, thanks. I'll get this into my tree after kernel summit. Thanks for suffering through all this Tycho! Actually, since this depends on changes in net, could this get pulled in from that direction? Acked-by: Kees Cook Can we get the attached patch into net-next? You need to make a fresh, formal submission of your patch to netdev, not as an attachment (otherwise patchwork cannot properly pick it up). Also, indicate the right tree in the subject as: [PATCH net-next] ... Thanks, Daniel
[PATCH net-next v2 1/5] bpf: abstract anon_inode_getfd invocations
Since we're going to use anon_inode_getfd() invocations in more than just the current places, make a helper function for both, so that we only need to pass a map/prog pointer to the helper itself in order to get an fd. The new helpers are called bpf_map_new_fd() and bpf_prog_new_fd(). Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> Acked-by: Alexei Starovoitov <a...@kernel.org> --- kernel/bpf/syscall.c | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 687dd6c..2b89ef0 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -111,6 +111,12 @@ static const struct file_operations bpf_map_fops = { .release = bpf_map_release, }; +static int bpf_map_new_fd(struct bpf_map *map) +{ + return anon_inode_getfd("bpf-map", &bpf_map_fops, map, + O_RDWR | O_CLOEXEC); +} + /* helper macro to check that unused fields 'union bpf_attr' are zero */ #define CHECK_ATTR(CMD) \ memchr_inv((void *) &attr->CMD##_LAST_FIELD + \ @@ -141,8 +147,7 @@ static int map_create(union bpf_attr *attr) if (err) goto free_map; - err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC); - + err = bpf_map_new_fd(map); if (err < 0) /* failed to allocate fd */ goto free_map; @@ -538,6 +543,12 @@ static const struct file_operations bpf_prog_fops = { .release = bpf_prog_release, }; +static int bpf_prog_new_fd(struct bpf_prog *prog) +{ + return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog, + O_RDWR | O_CLOEXEC); +} + static struct bpf_prog *get_prog(struct fd f) { struct bpf_prog *prog; @@ -647,7 +658,7 @@ static int bpf_prog_load(union bpf_attr *attr) if (err < 0) goto free_used_maps; - err = anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog, O_RDWR | O_CLOEXEC); + err = bpf_prog_new_fd(prog); if (err < 0) /* failed to allocate fd */ goto free_used_maps; -- 1.9.3
[PATCH net-next v2 3/5] bpf: consolidate bpf_prog_put{,_rcu} dismantle paths
We currently have duplicated cleanup code in bpf_prog_put() and bpf_prog_put_rcu() cleanup paths. Back then we decided that it was not worth it to make it a common helper called by both, but with the recent addition of resource charging, we could have avoided the fix in commit ac00737f4e81 ("bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_put") if we would have had only a single, common path. We can simplify it further by assigning aux->prog only once during allocation time. Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> Acked-by: Alexei Starovoitov <a...@kernel.org> --- kernel/bpf/core.c| 3 ++- kernel/bpf/syscall.c | 15 +-- 2 files changed, 7 insertions(+), 11 deletions(-) diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 8086471..334b1bd 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -92,6 +92,7 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) fp->pages = size / PAGE_SIZE; fp->aux = aux; + fp->aux->prog = fp; return fp; } @@ -116,6 +117,7 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE); fp->pages = size / PAGE_SIZE; + fp->aux->prog = fp; /* We keep fp->aux from fp_old around in the new * reallocated structure. 
@@ -726,7 +728,6 @@ void bpf_prog_free(struct bpf_prog *fp) struct bpf_prog_aux *aux = fp->aux; INIT_WORK(&aux->work, bpf_prog_free_deferred); - aux->prog = fp; schedule_work(&aux->work); } EXPORT_SYMBOL_GPL(bpf_prog_free); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 3fff82c..d7783cb 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -513,7 +513,7 @@ static void bpf_prog_uncharge_memlock(struct bpf_prog *prog) free_uid(user); } -static void __prog_put_rcu(struct rcu_head *rcu) +static void __prog_put_common(struct rcu_head *rcu) { struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu); @@ -525,19 +525,14 @@ static void __prog_put_rcu(struct rcu_head *rcu) /* version of bpf_prog_put() that is called after a grace period */ void bpf_prog_put_rcu(struct bpf_prog *prog) { - if (atomic_dec_and_test(&prog->aux->refcnt)) { - prog->aux->prog = prog; - call_rcu(&prog->aux->rcu, __prog_put_rcu); - } + if (atomic_dec_and_test(&prog->aux->refcnt)) + call_rcu(&prog->aux->rcu, __prog_put_common); } void bpf_prog_put(struct bpf_prog *prog) { - if (atomic_dec_and_test(&prog->aux->refcnt)) { - free_used_maps(prog->aux); - bpf_prog_uncharge_memlock(prog); - bpf_prog_free(prog); - } + if (atomic_dec_and_test(&prog->aux->refcnt)) + __prog_put_common(&prog->aux->rcu); } EXPORT_SYMBOL_GPL(bpf_prog_put); -- 1.9.3
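The consolidation pattern in this patch, where both the direct and the RCU-deferred release paths funnel into one common teardown that recovers the object from an embedded callback head via container_of(), can be sketched in userspace as follows. All names here (cb_head, obj_aux, put_common, demo) are illustrative; the real deferred path goes through call_rcu() rather than a synchronous call.

```c
#include <assert.h>
#include <stddef.h>

struct cb_head { int dummy; };		/* stands in for struct rcu_head */

struct obj_aux {
	int refcnt;
	int freed;
	struct cb_head rcu;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The single common teardown path both put variants share. */
static void put_common(struct cb_head *head)
{
	struct obj_aux *aux = container_of(head, struct obj_aux, rcu);

	aux->freed = 1;			/* stands in for free_used_maps() etc. */
}

/* Direct path: run the common teardown immediately. */
static void obj_put(struct obj_aux *aux)
{
	if (--aux->refcnt == 0)
		put_common(&aux->rcu);
}

/* Deferred path: the kernel would do call_rcu(&aux->rcu, put_common);
 * we invoke it synchronously just to show both paths share the code. */
static void obj_put_rcu(struct obj_aux *aux)
{
	if (--aux->refcnt == 0)
		put_common(&aux->rcu);
}

static int demo(void)
{
	struct obj_aux a = { .refcnt = 2, .freed = 0, .rcu = {0} };

	obj_put(&a);
	if (a.freed)
		return -1;		/* still one reference held */
	obj_put_rcu(&a);
	return a.freed;			/* 1: last ref dropped -> torn down */
}
```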
[PATCH net-next v2 0/5] BPF updates
This set adds support for persistent maps/progs. Please see the individual patches for further details. A man-page update to bpf(2) will be sent later on, also an iproute2 patch for support in tc. Thanks! v1 -> v2: - Reworked most of patch 4 and 5 - Rebased to latest net-next Daniel Borkmann (5): bpf: abstract anon_inode_getfd invocations bpf: align and clean bpf_{map,prog}_get helpers bpf: consolidate bpf_prog_put{,_rcu} dismantle paths bpf: add support for persistent maps/progs bpf: add sample usages for persistent maps/progs include/linux/bpf.h | 9 +- include/uapi/linux/bpf.h | 45 +- include/uapi/linux/magic.h | 1 + kernel/bpf/Makefile | 4 +- kernel/bpf/core.c | 3 +- kernel/bpf/inode.c | 387 + kernel/bpf/syscall.c | 95 +++ kernel/bpf/verifier.c | 3 +- samples/bpf/Makefile | 3 + samples/bpf/fds_example.c | 183 + samples/bpf/libbpf.c | 19 +++ samples/bpf/libbpf.h | 3 + 12 files changed, 683 insertions(+), 72 deletions(-) create mode 100644 kernel/bpf/inode.c create mode 100644 samples/bpf/fds_example.c -- 1.9.3
[PATCH net-next v2 4/5] bpf: add support for persistent maps/progs
This work adds support for "persistent" eBPF maps/programs. The term "persistent" here means that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users. Just to name one example: tc classifier/action. Whenever tc parses the ELF object, extracts and loads maps/progs into the kernel, these file descriptors will be out of reach after the tc instance exits. So a subsequent tc invocation won't be able to access/relocate on this resource, and therefore maps cannot easily be shared, f.e. between the ingress and egress networking data path. The current workaround is that Unix domain sockets (UDS) need to be instrumented in order to pass the created eBPF map/program file descriptors to a third party management daemon through UDS' socket passing facility. This makes it a bit complicated to deploy shared eBPF maps or programs (programs f.e. for tail calls) among various processes. We've been brainstorming on how we could tackle this issue and various approaches have been tried out so far, which can be read up further in the below reference. The architecture we eventually ended up with is a minimal file system that can hold map/prog objects. The file system is a per mount namespace singleton, and the default mount point is /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. The objects for maps/progs are created/fetched through bpf(2) with two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor along with a pathname is being passed to bpf(2) that in turn creates (we call it eBPF object pinning) the file system nodes. Only the pathname is being passed to bpf(2) for getting a new BPF file descriptor to an existing node. The user can use that to access maps and progs later on, through bpf(2). 
Removal of file system nodes is being managed through normal VFS functions such as unlink(2), etc. The file system code is kept to a very minimum and can be further extended later on. The next step I'm working on is to add dump eBPF map/prog commands to bpf(2), so that a specification from a given file descriptor can be retrieved. This can be used by things like CRIU but also applications can inspect the meta data after calling BPF_OBJ_GET. Big thanks also to Alexei and Hannes who significantly contributed in the design discussion that eventually let us end up with this architecture here. Reference: https://lkml.org/lkml/2015/10/15/925 Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> Signed-off-by: Alexei Starovoitov <a...@kernel.org> Signed-off-by: Hannes Frederic Sowa <han...@stressinduktion.org> --- include/linux/bpf.h| 7 + include/uapi/linux/bpf.h | 45 +- include/uapi/linux/magic.h | 1 + kernel/bpf/Makefile| 4 +- kernel/bpf/inode.c | 387 + kernel/bpf/syscall.c | 30 +++- 6 files changed, 433 insertions(+), 41 deletions(-) create mode 100644 kernel/bpf/inode.c diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0b5fb6a..de464e6 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -167,11 +167,18 @@ struct bpf_prog *bpf_prog_get(u32 ufd); void bpf_prog_put(struct bpf_prog *prog); void bpf_prog_put_rcu(struct bpf_prog *prog); +struct bpf_map *bpf_map_get(u32 ufd); struct bpf_map *__bpf_map_get(struct fd f); void bpf_map_put(struct bpf_map *map); extern int sysctl_unprivileged_bpf_disabled; +int bpf_map_new_fd(struct bpf_map *map); +int bpf_prog_new_fd(struct bpf_prog *prog); + +int bpf_obj_pin_user(u32 ufd, const char __user *pathname); +int bpf_obj_get_user(const char __user *pathname); + /* verify correctness of eBPF program */ int bpf_check(struct bpf_prog **fp, union bpf_attr *attr); #else diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 2e03242..9ea2d22 100644 --- a/include/uapi/linux/bpf.h +++ 
b/include/uapi/linux/bpf.h @@ -63,50 +63,16 @@ struct bpf_insn { __s32 imm;/* signed immediate constant */ }; -/* BPF syscall commands */ +/* BPF syscall commands, see bpf(2) man-page for details. */ enum bpf_cmd { - /* create a map with given type and attributes -* fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size) -* returns fd or negative error -* map is deleted when fd is closed -*/ BPF_MAP_CREATE, - - /* lookup key in a given map -* err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) -* Using attr->map_fd, attr->key, attr->value -* returns zero and stores found elem into value -* or negative error -*/ BPF_MAP_LOOKUP_ELEM, - - /* create or update key/value pair in a given map -* err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) -* Using at
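The pin/get semantics described in the cover text, where BPF_OBJ_PIN associates an existing descriptor's object with a path and BPF_OBJ_GET later returns a fresh descriptor for it, can be modeled with a simple path-to-object table. This is only a userspace model of the semantics, not the implementation: the names (obj_pin, obj_get, pin_tab) are made up, and the real mechanism is the bpf(2) syscall backed by inodes in the bpffs mount.

```c
#include <assert.h>
#include <string.h>

#define MAX_PINS 8

/* Each entry maps a filesystem path to a pinned object (here just an
 * int id standing in for a bpf_map/bpf_prog reference). */
static struct { char path[64]; int obj; int used; } pin_tab[MAX_PINS];

/* Model of BPF_OBJ_PIN: remember the object under the given path so it
 * outlives the pinning process. */
static int obj_pin(const char *path, int obj)
{
	for (int i = 0; i < MAX_PINS; i++) {
		if (!pin_tab[i].used) {
			strncpy(pin_tab[i].path, path,
				sizeof(pin_tab[i].path) - 1);
			pin_tab[i].obj = obj;
			pin_tab[i].used = 1;
			return 0;
		}
	}
	return -1;			/* table full */
}

/* Model of BPF_OBJ_GET: hand back the object pinned under the path, as
 * a later process would obtain a new fd referencing the same object. */
static int obj_get(const char *path)
{
	for (int i = 0; i < MAX_PINS; i++)
		if (pin_tab[i].used && !strcmp(pin_tab[i].path, path))
			return pin_tab[i].obj;
	return -1;			/* -ENOENT */
}
```

The key property the model captures: a pin survives the pinning "process", and a get on the same path yields a handle to the same underlying object.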
[PATCH net-next v2 5/5] bpf: add sample usages for persistent maps/progs
This patch adds a couple of stand-alone examples on how BPF_OBJ_PIN and BPF_OBJ_GET commands can be used. Example with maps: # ./fds_example -F /sys/fs/bpf/m -P -m -k 1 -v 42 bpf: map fd:3 (Success) bpf: pin ret:(0,Success) bpf: fd:3 u->(1:42) ret:(0,Success) # ./fds_example -F /sys/fs/bpf/m -G -m -k 1 bpf: get fd:3 (Success) bpf: fd:3 l->(1):42 ret:(0,Success) # ./fds_example -F /sys/fs/bpf/m -G -m -k 1 -v 24 bpf: get fd:3 (Success) bpf: fd:3 u->(1:24) ret:(0,Success) # ./fds_example -F /sys/fs/bpf/m -G -m -k 1 bpf: get fd:3 (Success) bpf: fd:3 l->(1):24 ret:(0,Success) # ./fds_example -F /sys/fs/bpf/m2 -P -m bpf: map fd:3 (Success) bpf: pin ret:(0,Success) # ./fds_example -F /sys/fs/bpf/m2 -G -m -k 1 bpf: get fd:3 (Success) bpf: fd:3 l->(1):0 ret:(0,Success) # ./fds_example -F /sys/fs/bpf/m2 -G -m bpf: get fd:3 (Success) Example with progs: # ./fds_example -F /sys/fs/bpf/p -P -p bpf: prog fd:3 (Success) bpf: pin ret:(0,Success) bpf sock:4 <- fd:3 attached ret:(0,Success) # ./fds_example -F /sys/fs/bpf/p -G -p bpf: get fd:3 (Success) bpf: sock:4 <- fd:3 attached ret:(0,Success) # ./fds_example -F /sys/fs/bpf/p2 -P -p -o ./sockex1_kern.o bpf: prog fd:5 (Success) bpf: pin ret:(0,Success) bpf: sock:3 <- fd:5 attached ret:(0,Success) # ./fds_example -F /sys/fs/bpf/p2 -G -p bpf: get fd:3 (Success) bpf: sock:4 <- fd:3 attached ret:(0,Success) Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> Acked-by: Alexei Starovoitov <a...@kernel.org> --- samples/bpf/Makefile | 3 + samples/bpf/fds_example.c | 183 ++ samples/bpf/libbpf.c | 19 + samples/bpf/libbpf.h | 3 + 4 files changed, 208 insertions(+) create mode 100644 samples/bpf/fds_example.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index b305145..79b4596 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -4,6 +4,7 @@ obj- := dummy.o # List of programs to build hostprogs-y := test_verifier test_maps hostprogs-y += sock_example +hostprogs-y += fds_example hostprogs-y += sockex1 hostprogs-y += 
sockex2 hostprogs-y += sockex3 @@ -19,6 +20,7 @@ hostprogs-y += lathist test_verifier-objs := test_verifier.o libbpf.o test_maps-objs := test_maps.o libbpf.o sock_example-objs := sock_example.o libbpf.o +fds_example-objs := bpf_load.o libbpf.o fds_example.o sockex1-objs := bpf_load.o libbpf.o sockex1_user.o sockex2-objs := bpf_load.o libbpf.o sockex2_user.o sockex3-objs := bpf_load.o libbpf.o sockex3_user.o @@ -49,6 +51,7 @@ always += lathist_kern.o HOSTCFLAGS += -I$(objtree)/usr/include HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable +HOSTLOADLIBES_fds_example += -lelf HOSTLOADLIBES_sockex1 += -lelf HOSTLOADLIBES_sockex2 += -lelf HOSTLOADLIBES_sockex3 += -lelf diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c new file mode 100644 index 000..e2fd16c --- /dev/null +++ b/samples/bpf/fds_example.c @@ -0,0 +1,183 @@ +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "bpf_load.h" +#include "libbpf.h" + +#define BPF_F_PIN (1 << 0) +#define BPF_F_GET (1 << 1) +#define BPF_F_PIN_GET (BPF_F_PIN | BPF_F_GET) + +#define BPF_F_KEY (1 << 2) +#define BPF_F_VAL (1 << 3) +#define BPF_F_KEY_VAL (BPF_F_KEY | BPF_F_VAL) + +#define BPF_M_UNSPEC 0 +#define BPF_M_MAP 1 +#define BPF_M_PROG 2 + +static void usage(void) +{ + printf("Usage: fds_example [...]\n"); + printf(" -FFile to pin/get object\n"); + printf(" -P |- pin object\n"); + printf(" -G `- get object\n"); + printf(" -m eBPF map mode\n"); + printf(" -k |- map key\n"); + printf(" -v `- map value\n"); + printf(" -p eBPF prog mode\n"); + printf(" -o `- object file\n"); + printf(" -h Display this help.\n"); +} + +static int bpf_map_create(void) +{ + return bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(uint32_t), + sizeof(uint32_t), 1024); +} + +static int bpf_prog_create(const char *object) +{ + static const struct bpf_insn insns[] = { + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }; + + if (object) { + 
assert(!load_bpf_file((char *)object)); + return prog_fd[0]; + } else { + return bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, +insns, sizeof(insns), "GPL", 0); + } +} + +static int bpf_do_map(const char *file, uint32_t flags, uint32_t key, +
[PATCH net-next v2 2/5] bpf: align and clean bpf_{map,prog}_get helpers
Add a bpf_map_get() function that we're going to use later on and align/clean the remaining helpers a bit so that we have them a bit more consistent: - __bpf_map_get() and __bpf_prog_get() that both work on the fd struct, check whether the descriptor is eBPF and return the pointer to the map/prog stored in the private data. Also, we can return f.file->private_data directly, the function signature is enough of a documentation already. - bpf_map_get() and bpf_prog_get() that both work on u32 user fd, call their respective __bpf_map_get()/__bpf_prog_get() variants, and take a reference. Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> Acked-by: Alexei Starovoitov <a...@kernel.org> --- include/linux/bpf.h | 2 +- kernel/bpf/syscall.c | 41 +++-- kernel/bpf/verifier.c | 3 +-- 3 files changed, 25 insertions(+), 21 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 75718fa..0b5fb6a 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -167,7 +167,7 @@ struct bpf_prog *bpf_prog_get(u32 ufd); void bpf_prog_put(struct bpf_prog *prog); void bpf_prog_put_rcu(struct bpf_prog *prog); -struct bpf_map *bpf_map_get(struct fd f); +struct bpf_map *__bpf_map_get(struct fd f); void bpf_map_put(struct bpf_map *map); extern int sysctl_unprivileged_bpf_disabled; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 2b89ef0..3fff82c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -162,19 +162,29 @@ free_map: /* if error is returned, fd is released. 
* On success caller should complete fd access with matching fdput() */ -struct bpf_map *bpf_map_get(struct fd f) +struct bpf_map *__bpf_map_get(struct fd f) { - struct bpf_map *map; - if (!f.file) return ERR_PTR(-EBADF); - if (f.file->f_op != &bpf_map_fops) { fdput(f); return ERR_PTR(-EINVAL); } - map = f.file->private_data; + return f.file->private_data; +} + +static struct bpf_map *bpf_map_get(u32 ufd) +{ + struct fd f = fdget(ufd); + struct bpf_map *map; + + map = __bpf_map_get(f); + if (IS_ERR(map)) + return map; + + atomic_inc(&map->refcnt); + fdput(f); return map; } @@ -202,7 +212,7 @@ static int map_lookup_elem(union bpf_attr *attr) return -EINVAL; f = fdget(ufd); - map = bpf_map_get(f); + map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); @@ -261,7 +271,7 @@ static int map_update_elem(union bpf_attr *attr) return -EINVAL; f = fdget(ufd); - map = bpf_map_get(f); + map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); @@ -314,7 +324,7 @@ static int map_delete_elem(union bpf_attr *attr) return -EINVAL; f = fdget(ufd); - map = bpf_map_get(f); + map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); @@ -355,7 +365,7 @@ static int map_get_next_key(union bpf_attr *attr) return -EINVAL; f = fdget(ufd); - map = bpf_map_get(f); + map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); @@ -549,21 +559,16 @@ static int bpf_prog_new_fd(struct bpf_prog *prog) O_RDWR | O_CLOEXEC); } -static struct bpf_prog *get_prog(struct fd f) +static struct bpf_prog *__bpf_prog_get(struct fd f) { - struct bpf_prog *prog; - if (!f.file) return ERR_PTR(-EBADF); - if (f.file->f_op != &bpf_prog_fops) { fdput(f); return ERR_PTR(-EINVAL); } - prog = f.file->private_data; - - return prog; + return f.file->private_data; } /* called by sockets/tracing/seccomp before attaching program to an event @@ -574,13 +579,13 @@ struct bpf_prog *bpf_prog_get(u32 ufd) struct fd f = fdget(ufd); struct bpf_prog *prog; - prog = get_prog(f); - + prog = __bpf_prog_get(f); if (IS_ERR(prog)) 
return prog; atomic_inc(&prog->aux->refcnt); fdput(f); + return prog; } EXPORT_SYMBOL_GPL(bpf_prog_get); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b56cf51..fdc88c5 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1989,8 +1989,7 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env) } f = fdget(insn->imm); - - map = bpf_map_get(f); + map = __bpf_map_get(f); if (IS_ERR(map)) { verbose("fd %d is not pointing to valid bpf_map\n", insn->imm); -- 1.9.3
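The split this patch establishes, a bare __get() that only validates the descriptor and returns the object, plus a get() wrapper that additionally takes a long-lived reference before the fd reference is dropped, can be sketched in userspace like so. The names (toy_map, fd_table, __toy_map_get, toy_map_get) and the array-based fd table are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

struct toy_map { int refcnt; };

#define FD_MAX 4
static struct toy_map *fd_table[FD_MAX];

/* Like __bpf_map_get(): validate the descriptor and return the object
 * without touching its refcount; the caller still "holds" the fd. */
static struct toy_map *__toy_map_get(int fd)
{
	if (fd < 0 || fd >= FD_MAX || !fd_table[fd])
		return NULL;	/* ERR_PTR(-EBADF/-EINVAL) in the kernel */
	return fd_table[fd];
}

/* Like bpf_map_get(): resolve the fd, then take an independent
 * reference so the object stays alive after the fd ref is dropped. */
static struct toy_map *toy_map_get(int fd)
{
	struct toy_map *map = __toy_map_get(fd);

	if (map)
		map->refcnt++;	/* caller now holds its own reference */
	return map;
}

static int demo_refcnt(void)
{
	static struct toy_map m = { .refcnt = 1 };

	fd_table[0] = &m;
	if (!toy_map_get(0))
		return -1;
	return m.refcnt;	/* 2: original ref + the one we took */
}
```

This mirrors why the verifier and the map_*_elem() paths use the double-underscore variant: they hold the fd for the whole operation and do not need an extra reference.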
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/22/2015 12:44 AM, Alexei Starovoitov wrote: ... all users) When you have to hack drivers/base/core.c to get there it should have been a warning sign that something is wrong with this cdev approach. Hmm, you know, this had nothing to do with it, merely to save ~20 LoC that I can do just as well inside the BPF framework. No changes in driver API needed. I've read the discussion passively and my takeaway is that, frankly, I think the differences are somewhat minor. Both architectures can scale to what we need. Both will do the job. I'm slightly worried about exposing uAPI as a FS, I think that didn't work too well for sysfs. It's pretty much a define-the-format-once-and-never-touch-it-again kind of deal. It's even worse in cdev style since it piggybacks on sysfs. I don't mind which approach we go with in the end, but this kind of discussion is really tiring, and not going anywhere. Let's just make a beer call, so we can hash out a way forward that works for everyone. On that note: cheers! ;) Daniel
Re: [PATCH, REPORT] bpf_trace: build error without PERF_EVENTS
On 11/10/2015 06:14 PM, Alexei Starovoitov wrote: On Tue, Nov 10, 2015 at 09:25:01AM -0500, Steven Rostedt wrote: On Tue, 10 Nov 2015 14:31:38 +0100 Daniel Borkmann <dan...@iogearbox.net> wrote: On 11/10/2015 01:55 PM, Arnd Bergmann wrote: In my ARM randconfig tests, I'm getting a build error for newly added code in bpf_perf_event_read and bpf_perf_event_output whenever CONFIG_PERF_EVENTS is disabled: kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read': kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu' if (event->oncpu != smp_processor_id() || ^ kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu' event->pmu->count) This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT is disabled. I'm not sure if that is a configuration we care about, otherwise we could prevent this case from occurring by adding Kconfig dependencies. I think that seems better than spreading #if IS_ENABLEDs into the code. Probably enough to add a 'depends on PERF_EVENTS' to config BPF_EVENTS, so it's also explicitly documented. So just do the following then? -- Steve diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 8d6363f42169..f5aecff2d243 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -434,7 +434,7 @@ config UPROBE_EVENT config BPF_EVENTS depends on BPF_SYSCALL - depends on KPROBE_EVENT || UPROBE_EVENT + depends on KPROBE_EVENT && UPROBE_EVENT yeah that's definitely cleaner and avoids ifdef creep in the future. Agreed, that's better.
Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction
On 11/11/2015 11:24 AM, Will Deacon wrote: On Wed, Nov 11, 2015 at 09:49:48AM +0100, Arnd Bergmann wrote: On Tuesday 10 November 2015 18:52:45 Z Lim wrote: On Tue, Nov 10, 2015 at 4:42 PM, Alexei Starovoitov wrote: On Tue, Nov 10, 2015 at 04:26:02PM -0800, Shi, Yang wrote: On 11/10/2015 4:08 PM, Eric Dumazet wrote: On Tue, 2015-11-10 at 14:41 -0800, Yang Shi wrote: aarch64 doesn't have native support for XADD instruction, implement it by the below instruction sequence: aarch64 supports atomic add in ARMv8.1. For ARMv8(.0), please consider using LDXR/STXR sequence. Is it worth optimizing for the 8.1 case? It would add a bit of complexity to make the code depend on the CPU feature, but it's certainly doable. What's the atomicity required for? Put another way, what are we racing with (I thought bpf was single-threaded)? Do we need to worry about memory barriers? Apologies if these are stupid questions, but all I could find was samples/bpf/sock_example.c and it didn't help much :( The equivalent code, more readable in restricted C syntax (that can be compiled by llvm), can be found in samples/bpf/sockex1_kern.c. So the built-in __sync_fetch_and_add() will be translated into a BPF_XADD insn variant. What you can race against is that an eBPF map can be _shared_ by multiple eBPF programs that are attached somewhere in the system, and they could all update a particular entry/counter from the map at the same time. Best, Daniel
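The race Daniel describes, multiple programs bumping the same shared map entry concurrently, can be demonstrated in ordinary userspace C with the very builtin LLVM maps to BPF_XADD. This is a hedged illustration only (the thread count and iteration count are arbitrary); with a plain `shared_counter++` instead of the atomic builtin, the final value would be unreliable.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4
#define NITERS   100000

/* Stands in for a shared eBPF map entry that several attached programs
 * update concurrently. */
static long long shared_counter;

static void *bump(void *arg)
{
	(void)arg;
	for (int i = 0; i < NITERS; i++)
		/* The same builtin the LLVM eBPF backend turns into
		 * BPF_XADD; here it compiles to a host atomic add. */
		__sync_fetch_and_add(&shared_counter, 1);
	return NULL;
}

static long long run_demo(void)
{
	pthread_t tid[NTHREADS];

	shared_counter = 0;
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, bump, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return shared_counter;
}
```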
Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction
On 11/11/2015 01:58 PM, Peter Zijlstra wrote: On Wed, Nov 11, 2015 at 12:38:31PM +0000, Will Deacon wrote: Hmm, gcc doesn't have an eBPF compiler backend, so this won't work on gcc at all. The eBPF backend in LLVM recognizes the __sync_fetch_and_add() keyword and maps that to a BPF_XADD version (BPF_W or BPF_DW). In the interpreter (__bpf_prog_run()), as Eric mentioned, this maps to atomic_add() and atomic64_add(), respectively. So the struct bpf_insn prog[] you saw from sock_example.c can be regarded as one possible equivalent program section output from the compiler. Ok, so if I understand you correctly, then __sync_fetch_and_add() has different semantics depending on the backend target. That seems counter to the LLVM atomics Documentation: http://llvm.org/docs/Atomics.html which specifically calls out the __sync_* primitives as being sequentially-consistent and requiring barriers on ARM (which isn't the case for atomic[64]_add in the kernel). If we re-use the __sync_* naming scheme in the source language, I don't think we can overlay our own semantics in the backend. The __sync_fetch_and_add primitive is also expected to return the old value, which doesn't appear to be the case for BPF_XADD. Yikes. That's double fail. Please don't do this. If you use the __sync stuff (and I agree with Will, you should not) it really _SHOULD_ be sequentially consistent, which means full barriers all over the place. And if you name something XADD (exchange and add, or fetch-add) then it had better return the previous value. atomic*_add() does neither. unsigned int ui; unsigned long long ull; void foo(void) { (void) __sync_fetch_and_add(&ui, 1); (void) __sync_fetch_and_add(&ull, 1); } So clang front-end translates this snippet into intermediate representation of ... clang test.c -S -emit-llvm -o - [...] define void @foo() #0 { %1 = atomicrmw add i32* @ui, i32 1 seq_cst %2 = atomicrmw add i64* @ull, i64 1 seq_cst ret void } [...] ... 
which, if I see this correctly, then maps atomicrmw add {i32,i64} in the BPF target into BPF_XADD as mentioned: // Atomics class XADD<bits<2> SizeOp, string OpcodeStr, PatFrag OpNode> : InstBPF<(outs GPR:$dst), (ins MEMri:$addr, GPR:$val), !strconcat(OpcodeStr, "\t$dst, $addr, $val"), [(set GPR:$dst, (OpNode ADDRri:$addr, GPR:$val))]> { bits<3> mode; bits<2> size; bits<4> src; bits<20> addr; let Inst{63-61} = mode; let Inst{60-59} = size; let Inst{51-48} = addr{19-16}; // base reg let Inst{55-52} = src; let Inst{47-32} = addr{15-0}; // offset let mode = 6; // BPF_XADD let size = SizeOp; let BPFClass = 3; // BPF_STX } let Constraints = "$dst = $val" in { def XADD32 : XADD<0, "xadd32", atomic_load_add_32>; def XADD64 : XADD<3, "xadd64", atomic_load_add_64>; // undefined def XADD16 : XADD<1, "xadd16", atomic_load_add_16>; // undefined def XADD8 : XADD<2, "xadd8", atomic_load_add_8>; } I played a bit around with eBPF code to assign the __sync_fetch_and_add() return value to a var and dump it to trace pipe, or use it as return code. llvm compiles it (with the result assignment) and it looks like: [...] 206: (b7) r3 = 3 207: (db) lock *(u64 *)(r0 +0) += r3 208: (bf) r1 = r10 209: (07) r1 += -16 210: (b7) r2 = 10 211: (85) call 6 // r3 dumped here [...] [...] 206: (b7) r5 = 3 207: (db) lock *(u64 *)(r0 +0) += r5 208: (bf) r1 = r10 209: (07) r1 += -16 210: (b7) r2 = 10 211: (b7) r3 = 43 212: (b7) r4 = 42 213: (85) call 6 // r5 dumped here [...] [...] 11: (b7) r0 = 3 12: (db) lock *(u64 *)(r1 +0) += r0 13: (95) exit // r0 returned here [...] It seems that we 'get back' the value (== 3 here in r3, r5, r0) that we're adding, at least judging by the generated register assignments. Hmm, the semantic differences of the bpf target should be documented somewhere so that people writing eBPF programs are aware of them. 
Best, Daniel
Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction
On 11/11/2015 12:58 PM, Will Deacon wrote: On Wed, Nov 11, 2015 at 11:42:11AM +0100, Daniel Borkmann wrote: On 11/11/2015 11:24 AM, Will Deacon wrote: On Wed, Nov 11, 2015 at 09:49:48AM +0100, Arnd Bergmann wrote: On Tuesday 10 November 2015 18:52:45 Z Lim wrote: On Tue, Nov 10, 2015 at 4:42 PM, Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote: On Tue, Nov 10, 2015 at 04:26:02PM -0800, Shi, Yang wrote: On 11/10/2015 4:08 PM, Eric Dumazet wrote: On Tue, 2015-11-10 at 14:41 -0800, Yang Shi wrote: aarch64 doesn't have native support for XADD instruction, implement it by the below instruction sequence: aarch64 supports atomic add in ARMv8.1. For ARMv8(.0), please consider using LDXR/STXR sequence. Is it worth optimizing for the 8.1 case? It would add a bit of complexity to make the code depend on the CPU feature, but it's certainly doable. What's the atomicity required for? Put another way, what are we racing with (I thought bpf was single-threaded)? Do we need to worry about memory barriers? Apologies if these are stupid questions, but all I could find was samples/bpf/sock_example.c and it didn't help much :( The equivalent code more readable in restricted C syntax (that can be compiled by llvm) can be found in samples/bpf/sockex1_kern.c. So the built-in __sync_fetch_and_add() will be translated into a BPF_XADD insn variant. Yikes, so the memory-model for BPF is based around the deprecated GCC __sync builtins, that inherit their semantics from ia64? Any reason not to use the C11-compatible __atomic builtins[1] as a base? Hmm, gcc doesn't have an eBPF compiler backend, so this won't work on gcc at all. The eBPF backend in LLVM recognizes the __sync_fetch_and_add() keyword and maps that to a BPF_XADD version (BPF_W or BPF_DW). In the interpreter (__bpf_prog_run()), as Eric mentioned, this maps to atomic_add() and atomic64_add(), respectively. 
So the struct bpf_insn prog[] you saw from sock_example.c can be regarded as one possible equivalent program section output from the compiler. What you can race against is that an eBPF map can be _shared_ by multiple eBPF programs that are attached somewhere in the system, and they could all update a particular entry/counter from the map at the same time. Ok, so it does sound like eBPF needs to define/choose a memory-model and I worry that riding on the back of __sync isn't necessarily the right thing to do, particularly as it's fallen out of favour with the compiler folks. On weakly-ordered architectures, it's also going to result in heavy-weight barriers for all atomic operations. Will [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
Re: [PATCH] tools/net: Use include/uapi with __EXPORTED_HEADERS__
On 11/11/2015 11:24 PM, Kamal Mostafa wrote: Use the local uapi headers to keep in sync with "recently" added #define's (e.g. SKF_AD_VLAN_TPID). Refactored CFLAGS, and bpf_asm doesn't need -I. Fixes: 3f356385e8a449e1d7cfc6b6f8d634ac4f5581a0 Signed-off-by: Kamal Mostafa <ka...@canonical.com> Acked-by: Daniel Borkmann <dan...@iogearbox.net>
Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction
On 11/11/2015 07:31 PM, Peter Zijlstra wrote: On Wed, Nov 11, 2015 at 10:11:33AM -0800, Alexei Starovoitov wrote: On Wed, Nov 11, 2015 at 06:57:41PM +0100, Peter Zijlstra wrote: On Wed, Nov 11, 2015 at 12:35:48PM -0500, David Miller wrote: From: Alexei Starovoitov Date: Wed, 11 Nov 2015 09:27:00 -0800 BPF_XADD == atomic_add() in kernel. period. we are not going to deprecate it or introduce something else. Agreed, it makes no sense to try and tie C99 or whatever atomic semantics to something that is already clearly defined to have exactly kernel atomic_add() semantics. Dave, this really doesn't make any sense to me. __sync primitives have well defined semantics and (e)BPF is violating this. bpf_xadd was never meant to be a __sync_fetch_and_add equivalent. From day one it was meant to be atomic_add() as the kernel does it. I did piggyback on __sync in the llvm backend because it was the quick and dirty way to move forward. In retrospect I should have introduced a clean intrinsic for that instead, but it's not too late to do it now. user space we can change at any time, unlike the kernel. I would argue that breaking userspace (the language in this case) is equally bad. Programs that used to work will now no longer work. Well, on that note, it's not like you just change the target to bpf in your Makefile and can compile (& load into the kernel) anything you want with it. You do have to write small, restricted programs from scratch for a specific use-case with the limited set of helper functions and intrinsics that are available from the kernel. So I don't think that "Programs that used to work will now no longer work." holds if you regard it as such. Furthermore, the fetch_and_add (or XADD) name has well defined semantics, which (e)BPF also violates. bpf_xadd was also never meant to be a 'fetch'. It was void return from the beginning. Then why the 'X'? The XADD name does and always has meant eXchange-ADD, which means it must have a return value.
You using the XADD name for something that is not in fact XADD is just wrong. Atomicity is hard enough as it is; backends giving random interpretations to these names aren't helping anybody. no randomness. You mean every other backend translating __sync_fetch_and_add() differently than you isn't random on your part? bpf_xadd == atomic_add() in kernel. imo that is the simplest and cleanest interpretation one can have, no? Wrong though; if you'd named it BPF_ADD, sure, XADD, not so much. That is 'randomly' co-opting something that has well defined meaning and semantics with something else. It also baffles me that Alexei is seemingly unwilling to change/rev the (e)BPF instructions, which would be invisible to the regular user, yet he does want to change the language itself, which will impact all 'scripts'. well, we cannot change it in kernel because it's ABI. You can always rev it. Introduce a new set, and wait for users of the old set to die, then remove it. We do that all the time with Linux ABI. I'm not against adding new insns. We definitely can, but let's figure out why? Is anything broken? No. Yes, __sync_fetch_and_add() is broken when pulled through the eBPF backend. So what new insns make sense? Depends a bit on how fancy you want to go. If you want to support weakly ordered architectures at full speed you'll need more (and more complexity) than if you decide to not go that way. The simplest option would be a fully ordered compare-and-swap operation. That is enough to implement everything else (at a cost). The other extreme is a weak ll/sc with an optimizer pass recognising various forms to translate into 'better' native instructions. Add a new one that does 'fetch_and_add'? What is the real use case it will be used for? Look at all the atomic_{add,dec}_return*() users in the kernel. A typical example would be a reader-writer lock implementation. See include/asm-generic/rwsem.h for examples. Adding a new intrinsic to llvm is not a big deal.
I'll add it as soon as I have time to work on it or if somebody beats me to it I would be glad to test it and apply it. This isn't a speed coding contest. You want to think about this properly.
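Peter's naming objection can be illustrated with a toy, single-threaded model — the real operations are of course atomic; this sketch only shows the return-value difference between the two semantics being conflated:

```c
/*
 * Toy model of the two semantics under discussion (atomicity elided,
 * single-threaded sketch only):
 *  - "XADD" / fetch_and_add conventionally returns the value the
 *    location held *before* the add, like __sync_fetch_and_add().
 *  - BPF_XADD behaves like the kernel's atomic_add(): same update,
 *    but the old value is discarded (void return).
 */

/* eXchange-ADD: the name promises the prior value back. */
static int xadd(int *p, int v)
{
	int old = *p;
	*p = old + v;
	return old;
}

/* atomic_add-style: identical update, no result. */
static void add_only(int *p, int v)
{
	*p += v;
}
```

The kernel's atomic_add_return() family is what you would reach for when the prior (or new) value is actually needed, e.g. in the rwsem fast paths mentioned above.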
Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction
On 11/11/2015 08:23 PM, Peter Zijlstra wrote: On Wed, Nov 11, 2015 at 07:50:15PM +0100, Daniel Borkmann wrote: Well, on that note, it's not like you just change the target to bpf in your Makefile and can compile (& load into the kernel) anything you want with it. You do have to write small, restricted programs from scratch for a specific use-case with the limited set of helper functions and intrinsics that are available from the kernel. So I don't think that "Programs that used to work will now no longer work." holds if you regard it as such. So I don't get this argument. If everything is so targeted, then why are the BPF instructions an ABI. If OTOH you're expected to be able to transfer these small proglets, then too I would expect to transfer the source of these proglets. You cannot argue both ways. Ohh, I think we were talking past each other. ;) So, yeah, you'd likely need to add new intrinsics that then map to the existing BPF_XADD instructions, and perhaps spill a warning when __sync_fetch_and_add() is being used to advise the developer to switch to the new intrinsics instead. From a kernel ABI PoV nothing would change.
Re: [PATCH] net-scm: Delete an unnecessary check before the function call "kfree"
On 11/17/2015 05:43 PM, SF Markus Elfring wrote: From: Markus Elfring Date: Tue, 17 Nov 2015 17:37:22 +0100 The kfree() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring --- net/core/scm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index 3b6899b..4f64173 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -193,7 +193,7 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
 		}
 	}
 
-	if (p->fp && !p->fp->count)
+	if (likely(!p->fp->count))
 	{
 		kfree(p->fp);
 		p->fp = NULL;

Really, I don't like your blind, silly removals everywhere throughout the kernel tree for these tests. Eric already mentioned that in some situations where it's critical, it actually slows down the code, i.e. you'll have an extra function call to get there and inside kfree() / kfree_skb() / etc, the test is actually marked as unlikely(). Anyway, I think this one in particular could throw a NULL pointer deref. You even say in your commit message that the "kfree() function tests whether its argument [p->fp] is NULL", and yet if that is the case, you have already deref'ed on the p->fp->count test ???
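A toy illustration of the bug being pointed out — dropping the "p->fp &&" guard turns the test into an unconditional dereference of a pointer that may be NULL. The struct and helper names here are hypothetical stand-ins, not the actual net/core/scm.c types:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fp list hanging off scm_cookie. */
struct fp_list {
	int count;
};

/*
 * Original, safe form of the condition: the short-circuit "fp &&"
 * guard ensures fp->count is only read when fp is non-NULL. The
 * patched form likely(!fp->count) would dereference a NULL fp here
 * before kfree() ever gets a chance to do its own NULL check.
 */
static bool should_free(const struct fp_list *fp)
{
	return fp && !fp->count;
}
```

The point of the review comment: kfree()'s internal NULL check protects its argument, but it cannot protect an expression like fp->count that is evaluated before the call.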
Re: net-scm: Delete an unnecessary check before the function call "kfree"
On 11/17/2015 07:05 PM, SF Markus Elfring wrote: Eric already mentioned that in some situations where it's critical, it actually slows down the code, i.e. you'll have an extra function call to get there and inside kfree() / kfree_skb() / etc, the test is actually marked as unlikely(). What do you think about adding similar annotations to more source code places? You mean this likely() annotation of yours? It doesn't make any sense to me to place it there, and since you spent a second thinking about it when adding it to the diff, you should have already realized that your code was buggy ...
Re: [PATCH, REPORT] bpf_trace: build error without PERF_EVENTS
On 11/10/2015 01:55 PM, Arnd Bergmann wrote: In my ARM randconfig tests, I'm getting a build error for newly added code in bpf_perf_event_read and bpf_perf_event_output whenever CONFIG_PERF_EVENTS is disabled:

kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
  if (event->oncpu != smp_processor_id() ||
           ^
kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
      event->pmu->count)

This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT is disabled. I'm not sure if that is a configuration we care about, otherwise we could prevent this case from occurring by adding Kconfig dependencies. I think that seems better than spreading #if IS_ENABLEDs into the code. Probably enough to add a 'depends on PERF_EVENTS' to config BPF_EVENTS, so it's also explicitly documented. Simply hiding the broken code inside #ifdef CONFIG_PERF_EVENTS as this patch does seems to reliably fix the error as well, I have built thousands of randconfig kernels since I started seeing this and added the workaround. Signed-off-by: Arnd Bergmann Fixes: 62544ce8e01c ("bpf: fix bpf_perf_event_read() helper") Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") Thanks, Daniel
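The 'depends on PERF_EVENTS' suggestion would look roughly like the following in kernel/trace/Kconfig — a sketch of the idea only; the exact symbol layout and whichever hunk was eventually merged may differ:

```
config BPF_EVENTS
	bool
	default y
	depends on BPF_SYSCALL
	depends on (KPROBE_EVENT || UPROBE_EVENT) && PERF_EVENTS
```

With the dependency expressed here, bpf_trace.c can only be built when perf_event members like 'oncpu' and 'pmu' exist, and the requirement is documented in one visible place instead of scattered #ifdefs.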
Re: [PATCH] bpf: fix trivial comment typo
On 11/02/2015 10:48 PM, Matthew Fernandez wrote: On 03/11/15 08:31, David Miller wrote: From: Matthew Fernandez Date: Mon, 2 Nov 2015 11:59:03 +1100 bpf: fix trivial comment typo Signed-off-by: Matthew Fernandez This doesn't apply to any tree. I'm sorry, I think I must be missing something. This seems to apply cleanly to the current tip of mainline (e86328c489d7) to me. Was this not in the expected format? It wasn't my intention to waste your time, so I apologise for any newbie errors. You might want to check Documentation/networking/netdev-FAQ.txt ;), and rebase your spelling fix f.e. to the latest net-next tree.
[PATCH v2] tracefs: fix refcount imbalance in start_creating
In tracefs' start_creating(), we pin the file system to safely access its root. When we fail to create a file, we unpin the file system via failed_creating() to release the mount count and eventually the reference of the singleton vfsmount. However, when we run into an error during lookup_one_len() while still in start_creating(), we only release the parent's mutex but not the reference on the mount. F.e., in securityfs_create_file(), after doing simple_pin_fs() when lookup_one_len() fails there, we in fact do simple_release_fs(). This seems necessary here as well. Same issue seen in debugfs due to 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off"), which seems to have gotten carried over into tracefs, too. Noticed during code review. Fixes: 4282d60689d4 ("tracefs: Add new tracefs file system") Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> Acked-by: Steven Rostedt <rost...@goodmis.org> --- v1 -> v2: - Split into two patches (tracefs and debugfs, will send debugfs one separately) - Kept Steven's Acked-by

 fs/tracefs/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index cbc8d5d..c66f242 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -340,8 +340,12 @@ static struct dentry *start_creating(const char *name, struct dentry *parent)
 		dput(dentry);
 		dentry = ERR_PTR(-EEXIST);
 	}
-	if (IS_ERR(dentry))
+
+	if (IS_ERR(dentry)) {
 		mutex_unlock(&parent->d_inode->i_mutex);
+		simple_release_fs(&tracefs_mount, &tracefs_mount_count);
+	}
+
 	return dentry;
 }
-- 
1.9.3
Re: [PATCH] tracefs, debugfs: fix refcount imbalance in start_creating
On 11/02/2015 08:07 PM, Steven Rostedt wrote: On Fri, 9 Oct 2015 20:30:10 +0200 Daniel Borkmann <dan...@iogearbox.net> wrote: [...] Fixes: 4282d60689d4 ("tracefs: Add new tracefs file system") Fixes: 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off") Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> Cc: Steven Rostedt <rost...@goodmis.org> [...] Fyi, I'll respin and split this one into two, so they can be routed through their individual trees. (Keeping Steven's Acked-by on the tracefs one.) Thanks & sorry for the noise, Daniel
[PATCH v2] debugfs: fix refcount imbalance in start_creating
In debugfs' start_creating(), we pin the file system to safely access its root. When we fail to create a file, we unpin the file system via failed_creating() to release the mount count and eventually the reference of the vfsmount. However, when we run into an error during lookup_one_len() while still in start_creating(), we only release the parent's mutex but not the reference on the mount. Looks like it was done in the past, but after splitting portions of __create_file() into start_creating() and end_creating() via 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off"), this seems to have been missed. Noticed during code review. Fixes: 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off") Signed-off-by: Daniel Borkmann <dan...@iogearbox.net> --- v1 -> v2: - Split original patch into two patches (tracefs and debugfs)

 fs/debugfs/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index c711be8..9c8d233 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -271,8 +271,12 @@ static struct dentry *start_creating(const char *name, struct dentry *parent)
 		dput(dentry);
 		dentry = ERR_PTR(-EEXIST);
 	}
-	if (IS_ERR(dentry))
+
+	if (IS_ERR(dentry)) {
 		mutex_unlock(&d_inode(parent)->i_mutex);
+		simple_release_fs(&debugfs_mount, &debugfs_mount_count);
+	}
+
 	return dentry;
 }
-- 
1.9.3
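The invariant both patches restore can be modelled with a toy pin/release counter — illustrative only; the real simple_pin_fs()/simple_release_fs() live in fs/libfs.c and take actual vfsmount arguments:

```c
/*
 * Toy model of the refcount imbalance: pin_fs() bumps a mount count
 * that every exit path must balance. Before the fix, the early
 * lookup/-EEXIST error path of start_creating() returned without
 * dropping the pin; the success path intentionally keeps it until a
 * later end_creating()/failed_creating() call.
 */
static int mount_count;

static void pin_fs(void)
{
	mount_count++;
}

static void release_fs(void)
{
	mount_count--;
}

/* Models start_creating(): returns 0 on success (pin still held),
 * -1 on lookup failure (pin must be dropped before returning). */
static int create_file(int lookup_fails)
{
	pin_fs();
	if (lookup_fails) {
		release_fs();	/* the hunk the patches add */
		return -1;
	}
	return 0;
}
```

Without the release_fs() call on the error path, repeated failed creations would leak one mount-count reference each, keeping the filesystem pinned forever.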
Re: [PATCH] bpf: add mod default A and X test cases
On 11/04/2015 08:36 PM, Yang Shi wrote: When running a "mod X" operation, if X is 0 the filter has to halt. Add new test cases to cover A = A mod X if X is 0, and A = A mod 1. CC: Xi Wang <xi.w...@gmail.com> CC: Zi Shen Lim <zlim@gmail.com> Signed-off-by: Yang Shi <yang@linaro.org> LGTM! Acked-by: Daniel Borkmann <dan...@iogearbox.net>
Re: [PATCH] arm64: bpf: fix div-by-zero case
On 11/04/2015 07:41 PM, Shi, Yang wrote: ... Agreed, and we may need to add one more test case in the test_bpf module to cover MOD? Let me know if you have a test case ready :) Does the below change look like a valid test?

+	{
+		"MOD default X",
+		.u.insns = {
+			/*
+			 * A = 0x42
+			 * A = A mod X ; this halts the filter execution if X is 0
+			 * ret 0x42
+			 */
+			BPF_STMT(BPF_LD | BPF_IMM, 0x42),
+			BPF_STMT(BPF_ALU | BPF_MOD | BPF_X, 0),
+			BPF_STMT(BPF_RET | BPF_K, 0x42),
+		},
+		CLASSIC | FLAG_NO_DATA,
+		{},
+		{ {0x1, 0x0 } },
+	},
+	{
+		"MOD default A",
+		.u.insns = {
+			/*
+			 * A = A mod 1
+			 * ret A
+			 */
+			BPF_STMT(BPF_ALU | BPF_MOD | BPF_K, 0x1),
+			BPF_STMT(BPF_RET | BPF_A, 0x0),
+		},
+		CLASSIC | FLAG_NO_DATA,
+		{},
+		{ {0x1, 0x0 } },
+	},

My test result with it:

test_bpf: #284 MOD default X jited:1 ret 66 != 0 FAIL (1 times)
test_bpf: #285 MOD default A jited:1 102 PASS

If it looks right, I will post a patch to add the test cases. Looks good to me. Thanks, Daniel
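For reference, the classic-BPF semantics these tests exercise — mod (or div) with a zero divisor aborts the program with a return value of 0 — can be sketched as follows. This is a simplified model of the interpreter behavior, not the kernel's actual __bpf_prog_run() code; it mirrors why "MOD default X" expects 0 and why the jited arm64 result of 66 (0x42) in the FAIL line above indicates a JIT bug:

```c
/*
 * Simplified model of the "MOD default X" test program:
 *   A = 0x42
 *   A = A mod X   ; the interpreter halts with return 0 if X == 0
 *   ret 0x42
 */
static unsigned int run_mod_prog(unsigned int a, unsigned int x)
{
	if (x == 0)
		return 0;	/* divisor is zero: abort the filter */
	a %= x;
	return 0x42;		/* mirrors the final BPF_RET | BPF_K */
}
```

A JIT that forgets the zero-divisor check falls through to the final return instruction, which is exactly the 66-instead-of-0 failure reported above.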