Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields

2015-03-13 Thread Daniel Borkmann

On 03/14/2015 02:46 AM, Daniel Borkmann wrote:
...

Previously, it was much more consistent, which I like better. And only
because of the simple BUILD_BUG_ON()? :/


Alternative is to move all of them into a central place, something like
in twsk_build_assert() or __mld2_query_bugs[].
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields

2015-03-13 Thread Daniel Borkmann

On 03/14/2015 03:08 AM, Alexei Starovoitov wrote:

On 3/13/15 7:06 PM, Daniel Borkmann wrote:

On 03/14/2015 02:46 AM, Daniel Borkmann wrote:
...

Previously, it was much more consistent, which I like better. And only
because of the simple BUILD_BUG_ON()? :/


Alternative is to move all of them into a central place, something like
in twsk_build_assert() or __mld2_query_bugs[].


nope. that defeats the purpose of bug_on.


Well, it doesn't. ;) It throws a build error, thus the user is forced to
investigate that further.


Re: [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields

2015-03-14 Thread Daniel Borkmann

On 03/14/2015 05:59 AM, Alexei Starovoitov wrote:

On 3/13/15 7:27 PM, Alexei Starovoitov wrote:

On 3/13/15 7:16 PM, Daniel Borkmann wrote:

On 03/14/2015 03:08 AM, Alexei Starovoitov wrote:

On 3/13/15 7:06 PM, Daniel Borkmann wrote:

On 03/14/2015 02:46 AM, Daniel Borkmann wrote:
...

Previously, it was much more consistent, which I like better. And only
because of the simple BUILD_BUG_ON()? :/


Alternative is to move all of them into a central place, something like
in twsk_build_assert() or __mld2_query_bugs[].


nope. that defeats the purpose of bug_on.


Well, it doesn't. ;) It throws a build error, thus the user is forced to
investigate that further.


according to this distorted logic all build_bug_on can be in one file
across the whole tree, since 'user is forced to investigate' ?!


That was not what I was suggesting, and I assume you know that ...


also note that this case and twsk_build_assert are different.
twsk_build_assert has no other choice then to have one function
that covers logic in the whole file, whereas in this patch:
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
+   *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+ offsetof(struct sk_buff, mark));

the build_bug_on protects the line directly below.
Separating them just doesn't make sense at all.
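The pattern being defended here, a compile-time size assertion placed directly above the instruction it guards, can be sketched in plain user-space C. This is a toy mock-up, not the kernel code: the struct is hypothetical and the kernel's BUILD_BUG_ON is approximated with C11's `_Static_assert` (the example also assumes a 4-byte `unsigned int`):

```c
#include <stddef.h>

/* user-space stand-ins for the kernel helpers (simplified sketch) */
#define FIELD_SIZEOF(t, f)  (sizeof(((t *)0)->f))
#define BUILD_BUG_ON(cond)  _Static_assert(!(cond), "BUILD_BUG_ON: " #cond)

struct sk_buff_toy {		/* hypothetical mirror, not the real sk_buff */
	unsigned int len;
	unsigned int mark;
};

unsigned int load_mark(const struct sk_buff_toy *skb)
{
	/* the assertion sits directly above the load it protects: if
	 * 'mark' ever changes size, the build breaks exactly here */
	BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff_toy, mark) != 4);
	return *(const unsigned int *)((const char *)skb +
				       offsetof(struct sk_buff_toy, mark));
}
```

Moving the assertion to a central file would still fail the build, but the failure would no longer point at the load that depends on it.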


I also like the above approach better, I only suggested that as a
possible alternative since you were saying earlier in this thread:

  I thought about it, but didn't add it, since we already have them
  in the same file several lines above this spot. I think one build
  error per .c file should be enough to attract attention. Though
  I'll add a comment to convert_bpf_extensions() that build_bug_on
  errors should be addressed in two places.

Cheers,
Daniel


Re: [PATCH net-next 1/2] bpf: allow extended BPF programs access skb fields

2015-03-13 Thread Daniel Borkmann

On 03/13/2015 03:21 AM, Alexei Starovoitov wrote:

introduce user accessible mirror of in-kernel 'struct sk_buff':
struct __sk_buff {
 __u32 len;
 __u32 pkt_type;
 __u32 mark;
 __u32 ifindex;
 __u32 queue_mapping;
};

bpf programs can do:
struct __sk_buff *ptr;
var = ptr->pkt_type;

which will be compiled to bpf assembler as:
dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)

bpf verifier will check validity of access and will convert it to:
dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
dst_reg &= 7

since 'pkt_type' is a bitfield.
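The two-instruction rewrite above (byte load from the real offset, then an AND with 7 to isolate the bitfield) can be mimicked in plain C. The layout here is a toy assumption for illustration, the only grounded part being the mask of 7 from the commit message:

```c
/* toy: assume pkt_type lives in the low 3 bits of a flags byte,
 * mirroring the verifier's rewrite of the virtual field access */
struct skb_toy {
	unsigned char __pkt_type_offset;	/* bitfield byte */
};

unsigned int read_pkt_type(const struct skb_toy *skb)
{
	/* first emitted insn: byte load from the real offset */
	unsigned int v = skb->__pkt_type_offset;
	/* second emitted insn: dst_reg &= 7 */
	return v & 7;
}
```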

When the pkt_type field is moved around, goes into a different structure, is
removed, or its size changes, the function sk_filter_convert_ctx_access() would
need to be updated. Just like the function convert_bpf_extensions() in case of
classic bpf.


For each member, I'd also add BUILD_BUG_ON()s similarly as we have in
convert_bpf_extensions(). That way, people won't forget to adjust the
code.

General idea for this offset map looks good, imho. Well defined members
that are already exported to uapi e.g. through classic socket filters or
other socket api places could be used here.


Signed-off-by: Alexei Starovoitov a...@plumgrid.com

...

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3fa1af8a58d7..66a82d6cd75b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -168,4 +168,12 @@ enum bpf_func_id {
__BPF_FUNC_MAX_ID,
  };

+struct __sk_buff {
+   __u32 len;
+   __u32 pkt_type;
+   __u32 mark;
+   __u32 ifindex;
+   __u32 queue_mapping;
+};


I'd add a comment saying that fields may _only_ be safely added at
the end of the structure. Rearranging or removing members here,
naturally would break user space.

The remaining fields we export in classic BPF would be skb->hash,
skb->protocol, skb->vlan_tci, are we adding them as well to match
up functionality with classic BPF? For example, I can see hash being
useful as a key to be used with eBPF maps, etc.

...

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e6b522496250..c22ebd36fa4b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c

...

+/* convert load instructions that access fields of 'struct __sk_buff'
+ * into sequence of instructions that access fields of 'struct sk_buff'
+ */
+static int convert_ctx_accesses(struct verifier_env *env)
+{
+   struct bpf_insn *insn = env->prog->insnsi;
+   int insn_cnt = env->prog->len;
+   struct bpf_insn insn_buf[16];
+   struct bpf_prog *new_prog;
+   u32 cnt;
+   int i;
+
+   if (!env->prog->aux->ops->convert_ctx_access)
+   return 0;
+
+   for (i = 0; i < insn_cnt; i++, insn++) {
+   if (insn->code != (BPF_LDX | BPF_MEM | BPF_W))
+   continue;
+
+   if (insn->imm != PTR_TO_CTX) {
+   /* clear internal mark */
+   insn->imm = 0;
+   continue;
+   }
+
+   cnt = env->prog->aux->ops->
+   convert_ctx_access(insn->dst_reg, insn->src_reg,
+  insn->off, insn_buf);
+   if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) {
+   verbose("bpf verifier is misconfigured\n");
+   return -EINVAL;
+   }
+
+   if (cnt == 1) {
+   memcpy(insn, insn_buf, sizeof(*insn));
+   continue;
+   }
+
+   /* several new insns need to be inserted. Make room for them */
+   insn_cnt += cnt - 1;
+   new_prog = bpf_prog_realloc(env->prog,
+   bpf_prog_size(insn_cnt),
+   GFP_USER);
+   if (!new_prog)
+   return -ENOMEM;


Seems a bit expensive; do you think we could speculatively allocate a
bit more space in bpf_prog_load() when we detect that we have access
to ctx that we need to convert?


+   new_prog->len = insn_cnt;
+
+   memmove(new_prog->insnsi + i + cnt, new_prog->insnsi + i + 1,
+   sizeof(*insn) * (insn_cnt - i - cnt));
+
+   /* copy substitute insns in place of load instruction */
+   memcpy(new_prog-insnsi + i, insn_buf, sizeof(*insn) * cnt);
+
+   /* adjust branches in the whole program */
+   adjust_branches(new_prog, i, cnt - 1);
+
+   /* keep walking new program and skip insns we just inserted */
+   env->prog = new_prog;
+   insn = new_prog->insnsi + i + cnt - 1;
+   i += cnt - 1;
+   }
+
+   return 0;
+}
+
  static void free_states(struct verifier_env *env)
  {
struct verifier_state_list *sl, *sln;

...

diff --git a/net/core/filter.c b/net/core/filter.c
index 7a4eb7030dba..b5fcc7e2b608 100644
--- a/net/core/filter.c
+++ 
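The make-room rewrite quoted above (bpf_prog_realloc(), memmove() of the tail, memcpy() of the substitute insns into the gap) is a generic array-splice pattern. A minimal user-space sketch, using a toy int array in place of struct bpf_insn:

```c
#include <stdlib.h>
#include <string.h>

/* Replace element a[i] with the cnt elements in buf, growing the array.
 * Returns the (possibly moved) array, or NULL on allocation failure.
 * Toy version of what the verifier does with its bpf_insn array. */
int *splice_one(int *a, size_t *len, size_t i, const int *buf, size_t cnt)
{
	size_t new_len = *len + cnt - 1;
	int *p = realloc(a, new_len * sizeof(*p));

	if (!p)
		return NULL;
	/* shift the tail right to open a gap of cnt slots at i */
	memmove(p + i + cnt, p + i + 1, (*len - i - 1) * sizeof(*p));
	/* drop the replacement into the gap */
	memcpy(p + i, buf, cnt * sizeof(*p));
	*len = new_len;
	return p;
}
```

The kernel version additionally has to fix up relative branch offsets that now jump across the inserted instructions, which is what adjust_branches() does.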

Re: [PATCH net-next 0/2] bpf: allow extended BPF programs access skb fields

2015-03-13 Thread Daniel Borkmann

On 03/13/2015 03:21 AM, Alexei Starovoitov wrote:
...

Daniel,
patch 1 includes a bit of code that does prog_realloc and branch adjustment
to make room for new instructions. I think you'd need the same for your
'constant blinding' work. If indeed that would be the case, we'll make it
into a helper function.


Yes, thanks will take care of that.

Cheers,
Daniel


Re: [PATCH net-next 1/2] bpf: allow extended BPF programs access skb fields

2015-03-13 Thread Daniel Borkmann

On 03/13/2015 05:22 PM, Alexei Starovoitov wrote:

On 3/13/15 2:57 AM, Daniel Borkmann wrote:

On 03/13/2015 03:21 AM, Alexei Starovoitov wrote:

introduce user accessible mirror of in-kernel 'struct sk_buff':


For each member, I'd also add BUILD_BUG_ON()s similarly as we have in
convert_bpf_extensions(). That way, people won't forget to adjust the
code.


I thought about it, but didn't add it, since we already have them
in the same file several lines above this spot.
I think one build error per .c file should be enough to attract
attention.
Though I'll add a comment to convert_bpf_extensions() that build_bug_on
errors should be addressed in two places.


My concern would be that in case the code gets changed, this spot
could easily be overlooked by someone possibly unfamiliar with the
code, since no build error triggers there.

So I guess it wouldn't hurt or cost us much to also add the
BUILD_BUG_ON()s there instead of a comment.

...

The remaining fields we export in classic BPF would be skb->hash,
skb->protocol, skb->vlan_tci, are we adding them as well to match
up functionality with classic BPF? For example, I can see hash being
useful as a key to be used with eBPF maps, etc.


I want to do skb->protocol and skb->vlan_tci differently and hopefully
more generically than classic did them, so they will come in the
follow-on patches. skb->hash also should be done differently.
So yes. All of them are subjects of future patches/discussions.


Ok, sounds good.

Thanks,
Daniel


Re: [RFC] bpf: Suggestion on bpf syscall interface

2015-03-28 Thread Daniel Borkmann

On 03/28/2015 06:21 PM, Alexei Starovoitov wrote:

On 3/28/15 4:36 AM, He Kuang wrote:

Hi, Alexei

In our end-end IO module project, we use bpf maps to record
configurations. According to current bpf syscall interface, we
should specify map_fd to lookup/update bpf maps, so we are
restricted to do config in the same user program.


you can pass map_fd and prog_fd from one process to another via normal
scm_rights mechanism.


+1, I've just tried that out in the context of a different work and
works like a charm.
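The scm_rights mechanism referenced here — handing an open fd (e.g. a map_fd) to another process over an AF_UNIX socket, with the kernel duplicating the descriptor into the receiver's fd table — looks roughly like this (sketch with error handling trimmed):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send file descriptor 'fd' over Unix socket 'sock' via SCM_RIGHTS. */
int send_fd(int sock, int fd)
{
	char data = 'x';
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor sent with send_fd(); -1 on error. */
int recv_fd(int sock)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd = -1;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg && cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The fd number in the receiver will generally differ from the sender's, but both refer to the same open file description, which is exactly what makes it work for bpf map/prog fds.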


My suggestion is to export these kinds of operations to sysfs, so
we can load/attach bpf progs and configure them separately. We
implement this feature in our demo project. What's your opinion
on this?


Eventually we may use single sysfs file for lsmod-like listings, but I
definitely don't want to create parallel interface to maps via sysfs.


Yes, that would be a bad design decision. Btw, even more lightweight
for the kernel side would be to just implement .show_fdinfo() for the anon
inodes on the map/prog store and have some meta information exported
from there. You can then grab that via /proc/<pid>/fdinfo/<fd>; I
would consider such a thing a slow-path operation anyway, and you would
also get the app info using it for free.


It's way too expensive and not really suitable for binary key/values.


+1


Re: [PATCH v5 tip 1/7] bpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs

2015-03-02 Thread Daniel Borkmann
On 03/02/2015 12:51 PM, Masami Hiramatsu wrote:
 (2015/03/02 20:10), Daniel Borkmann wrote:
 On 03/02/2015 11:53 AM, Masami Hiramatsu wrote:
 ...
 Hmm, it seems that this still doesn't hide some APIs which is provided
 only when CONFIG_BPF_SYSCALL. For example bpf_register_map_type etc.
 I think all those APIs should be hidden in #ifdef or at least be commented
 so that the user doesn't refer that without the kconfig.
 (I don't think we need to provide dummy functions for those APIs,
but it's better to clarify which API we can use with which kconfig)

 Well, currently all possible map types (hash table, array map) that
 would actually call into bpf_register_map_type() are only built when
 CONFIG_BPF_SYSCALL is enabled (see kernel/bpf/Makefile). I don't think
 new map additions should be added that are not under kernel/bpf/ and/or
 enabled outside the CONFIG_BPF_SYSCALL, as it should be considered
 part of the eBPF core code.

 The difference here (this patch) is simply that we don't want users to
 build ifdef spaghetti constructs in user code, so the API that is
 actually used by eBPF _users_ is being properly ifdef'ed in the headers.

 So, I don't think this is a big problem, but we could move these bits
 under the ifdef CONFIG_BPF_SYSCALL w/o providing a dummy in the else part.
 I can do that outside of the scope of this set.
 
 Or, maybe we'd better move them into new include/linux/bpf_prog.h which
 includes basic include/linux/bpf.h. Then, user can include the bpf_prog.h
 instead of bpf.h. Also, we can check CONFIG_BPF_SYSCAL=y at the top of
 bpf_prog.h. This makes things clearer :)

I'm preferring the 1st variant, though. We have currently two native eBPF
users, that is, socket filters and tc's cls_bpf (queued in net-next) and
looking at the code/API usage, it's really not that hard, where it would
justify to move this to an extra header file, really.

I'm cooking a patch for net-next right now with the first variant (which is
on top of this patch that resides in net-next as-is as well).

Thanks,
Daniel


Re: [PATCH v5 tip 1/7] bpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs

2015-03-02 Thread Daniel Borkmann
On 03/02/2015 11:53 AM, Masami Hiramatsu wrote:
...
 Hmm, it seems that this still doesn't hide some APIs which is provided
 only when CONFIG_BPF_SYSCALL. For example bpf_register_map_type etc.
 I think all those APIs should be hidden in #ifdef or at least be commented
 so that the user doesn't refer that without the kconfig.
 (I don't think we need to provide dummy functions for those APIs,
   but it's better to clarify which API we can use with which kconfig)

Well, currently all possible map types (hash table, array map) that
would actually call into bpf_register_map_type() are only built when
CONFIG_BPF_SYSCALL is enabled (see kernel/bpf/Makefile). I don't think
new map additions should be added that are not under kernel/bpf/ and/or
enabled outside the CONFIG_BPF_SYSCALL, as it should be considered
part of the eBPF core code.

The difference here (this patch) is simply that we don't want users to
build ifdef spaghetti constructs in user code, so the API that is
actually used by eBPF _users_ is being properly ifdef'ed in the headers.

So, I don't think this is a big problem, but we could move these bits
under the ifdef CONFIG_BPF_SYSCALL w/o providing a dummy in the else part.
I can do that outside of the scope of this set.


[PATCH net-next 08/10] x86: unexport set_memory_ro and set_memory_rw

2015-02-27 Thread Daniel Borkmann
This effectively unexports set_memory_ro and set_memory_rw functions, and
thus reverts a03352d2c1dc ("x86: export set_memory_ro and set_memory_rw").

They have been introduced for debugging purposes in e1000e, but no module
user is in mainline kernel (anymore?) and we explicitly do not want modules
to use these functions, as they i.e. protect eBPF (interpreted & JIT'ed)
images from malicious modifications or bugs.

Outside of eBPF scope, I believe also other set_memory_* functions should
be unexported on x86 for modules.

Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Cc: Bruce Allan bruce.w.al...@intel.com
Cc: Jesse Brandeburg jesse.brandeb...@intel.com
Cc: Ingo Molnar mi...@kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Alexei Starovoitov a...@plumgrid.com
---
 arch/x86/mm/pageattr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 536ea2f..81e8282 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1654,13 +1654,11 @@ int set_memory_ro(unsigned long addr, int numpages)
 {
return change_page_attr_clear(addr, numpages, __pgprot(_PAGE_RW), 0);
 }
-EXPORT_SYMBOL_GPL(set_memory_ro);
 
 int set_memory_rw(unsigned long addr, int numpages)
 {
return change_page_attr_set(addr, numpages, __pgprot(_PAGE_RW), 0);
 }
-EXPORT_SYMBOL_GPL(set_memory_rw);
 
 int set_memory_np(unsigned long addr, int numpages)
 {
-- 
1.9.3



[PATCH net-next 09/10] arm64: unexport set_memory_ro and set_memory_rw

2015-02-27 Thread Daniel Borkmann
This effectively unexports set_memory_ro and set_memory_rw functions from
commit 11d91a770f1f ("arm64: Add CONFIG_DEBUG_SET_MODULE_RONX support").

No module user of those is in mainline kernel and we explicitly do not want
modules to use these functions, as they i.e. protect eBPF (interpreted and
JIT'ed) images from malicious modifications or bugs.

Outside of eBPF scope, I believe also other set_memory_* functions should
be unexported on arm64 for modules.

Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Cc: Laura Abbott lau...@codeaurora.org
Cc: Will Deacon will.dea...@arm.com
Cc: linux-kernel@vger.kernel.org
Acked-by: Alexei Starovoitov a...@plumgrid.com
---
 arch/arm64/mm/pageattr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index bb0ea94..8659357 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -70,7 +70,6 @@ int set_memory_ro(unsigned long addr, int numpages)
__pgprot(PTE_RDONLY),
__pgprot(PTE_WRITE));
 }
-EXPORT_SYMBOL_GPL(set_memory_ro);
 
 int set_memory_rw(unsigned long addr, int numpages)
 {
@@ -78,7 +77,6 @@ int set_memory_rw(unsigned long addr, int numpages)
__pgprot(PTE_WRITE),
__pgprot(PTE_RDONLY));
 }
-EXPORT_SYMBOL_GPL(set_memory_rw);
 
 int set_memory_nx(unsigned long addr, int numpages)
 {
-- 
1.9.3



Re: [PATCH net-next 09/10] arm64: unexport set_memory_ro and set_memory_rw

2015-02-27 Thread Daniel Borkmann

On 02/27/2015 08:54 PM, Will Deacon wrote:
...

Looks good to me. Can this be applied independently, or does it need to
remain part of your series?


Ideally, it should be seen as part of this series, but I have no problem
if this one goes via arm64 tree, instead. What Dave and you prefer. ;)

Thanks,
Daniel


Re: Linux XIA - merge proposal

2015-03-03 Thread Daniel Borkmann

On 03/03/2015 06:29 PM, Michel Machado wrote:
...

We're fine with clearly marking Linux XIA as being under staging as well as 
helping to define this review process for network stacks.


With regard to staging, the code there is usually horrible and I'm
not sure anyone really looks there; that would just defer the review
problem to the time when you try to get it out of there, so I'm
not sure it brings anything. ;)

+1 on what Eric said, would have also been nice if you had clearly
described in your mail (w/o buzz words) what it is and what it does.

Are you trying to introduce a new network stack as an alternative
to the current one, e.g. something like FreeBSD's netgraph?


Re: [PATCH net-next 09/10] arm64: unexport set_memory_ro and set_memory_rw

2015-03-01 Thread Daniel Borkmann

Hi Will,

On 02/27/2015 09:05 PM, Daniel Borkmann wrote:

On 02/27/2015 08:54 PM, Will Deacon wrote:
...

Looks good to me. Can this be applied independently, or does it need to
remain part of your series?


Ideally, it should be seen as part of this series, but I have no problem
if this one goes via arm64 tree, instead. What Dave and you prefer. ;)


I'll resend you this one directly as a stand-alone patch to arm64,
with Acked-by's preserved.

Thanks,
Daniel


[PATCH] arm64: mm: unexport set_memory_ro and set_memory_rw

2015-03-01 Thread Daniel Borkmann
This effectively unexports set_memory_ro and set_memory_rw functions from
commit 11d91a770f1f ("arm64: Add CONFIG_DEBUG_SET_MODULE_RONX support").

No module user of those is in mainline kernel and we explicitly do not want
modules to use these functions, as they i.e. RO-protect eBPF (interpreted and
JIT'ed) images from malicious modifications/bugs.

Outside of eBPF scope, I believe also other set_memory_* functions should
be unexported on arm64 due to non-existent mainline module users. Laura
mentioned that they have some uses for modules doing set_memory_*, but
none that are in mainline and it's unclear if they would ever get there.

Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Acked-by: Alexei Starovoitov a...@plumgrid.com
Acked-by: Laura Abbott lau...@codeaurora.org
---
 arch/arm64/mm/pageattr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index bb0ea94..8659357 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -70,7 +70,6 @@ int set_memory_ro(unsigned long addr, int numpages)
__pgprot(PTE_RDONLY),
__pgprot(PTE_WRITE));
 }
-EXPORT_SYMBOL_GPL(set_memory_ro);
 
 int set_memory_rw(unsigned long addr, int numpages)
 {
@@ -78,7 +77,6 @@ int set_memory_rw(unsigned long addr, int numpages)
__pgprot(PTE_WRITE),
__pgprot(PTE_RDONLY));
 }
-EXPORT_SYMBOL_GPL(set_memory_rw);
 
 int set_memory_nx(unsigned long addr, int numpages)
 {
-- 
1.9.3



Re: [PATCH] rhashtable: initialize all rhashtable walker members

2015-02-22 Thread Daniel Borkmann

On 02/21/2015 09:55 PM, Sasha Levin wrote:

Commit "rhashtable: Introduce rhashtable_walk_*" forgot to initialize the
members of struct rhashtable_walker after allocating it, which caused
an undefined value for 'resize' which is used later on.

Signed-off-by: Sasha Levin sasha.le...@oracle.com
---
  lib/rhashtable.c |3 +++
  1 file changed, 3 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9cc4c4a..030484c 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -894,6 +894,9 @@ int rhashtable_walk_init(struct rhashtable *ht, struct 
rhashtable_iter *iter)
if (!iter->walker)
return -ENOMEM;

+   INIT_LIST_HEAD(&iter->walker->list);
+   iter->walker->resize = false;
+


The change seems fine to me, although the INIT_LIST_HEAD() is unnecessary
due to the list_add() below?

Anyway, setting resize to false is definitely correct. In practice this
shouldn't cause much of an issue though, as rhashtable_walk_start() would only
reset iterator meta data and set resize to false, but let's fix it.


mutex_lock(&ht->mutex);
list_add(&iter->walker->list, &ht->walkers);
mutex_unlock(&ht->mutex);
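The point about INIT_LIST_HEAD() being redundant is that list insertion overwrites both link pointers of the node unconditionally, so self-initializing them first is wasted work. A minimal doubly-linked list mimic (toy code, not the kernel's list.h) makes that visible:

```c
struct list_node { struct list_node *next, *prev; };

/* toy list_add: insert 'new' right after 'head'; note that both
 * fields of 'new' are written unconditionally, which is why an
 * earlier INIT_LIST_HEAD(new) would be redundant */
void toy_list_add(struct list_node *new, struct list_node *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}
```

The walker's 'resize' flag is different: nothing else writes it before it is first read, so it genuinely needs the explicit initialization.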


Thanks,
Daniel


Re: Use-after-free oops in next-20150204 - probably nelink related

2015-02-21 Thread Daniel Borkmann

On 02/21/2015 04:36 PM, Shachar Raindel wrote:
...

This is most likely rhashtable related. The fixes for the
use-after-free issues have been merged Feb 6 so they are probably
not included in the Feb 04 snapshot that you use.

The relevant net-next commits are:
commit 020219a69d40a205dad12b0ea1e6a46153793368
commit cf52d52f9ccb9966ac019d9f79824195583e3e6c
commit 2af4b52988fd4f7ae525fcada29d4db8680033d6
commit a5ec68e3b8f2c95ea1a5d23dd543abbe0c8d0624


Most likely so - haven't seen this reproducing so far on next-20150219,
which contains the relevant commits.


Thanks for double checking!


BTW, why is there no MAINTAINERS entry for the netlink subsystem?


Well, if in doubt where to send a bug report, then :

  scripts/get_maintainer.pl -f net/netlink/

Make sure to Cc netdev in any case.


Re: [PATCH 2/4] seccomp: rework seccomp_prepare_filter().

2015-04-29 Thread Daniel Borkmann

On 04/29/2015 03:37 PM, Nicolas Schichan wrote:

- Try to use the classic BPF JIT via bpf_jit_compile().

- Use bpf_migrate_filter() from NET filter code instead of the double
   bpf_convert_filter() followed by bpf_prog_select_runtime() if
   classic bpf_jit_compile() did not succeed in producing native code.

Signed-off-by: Nicolas Schichan nschic...@freebox.fr


[ I had to look that one up manually, would be good if you keep
  people in Cc, also netdev for BPF in general. ]

I see, you need that to make it available to the old bpf_jit_compile()
for probing on classic JITs. Actually, I really would prefer, if instead
of duplicating that code, you could export bpf_prepare_filter() and
pass seccomp_check_filter() as an argument to bpf_prepare_filter().

Otherwise, in case bpf_prepare_filter() changes, people will easily
forget to update seccomp related code, really.
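The suggested shape — one shared prepare path that takes the caller-specific checker as a function pointer — can be shown in miniature. All names and signatures below are hypothetical stand-ins, not the actual kernel API:

```c
#include <stddef.h>

typedef int (*check_fn)(const int *prog, size_t len);

/* hypothetical shared prepare step: run the generic validation, then
 * the caller-supplied checker (e.g. a seccomp-specific one), if any */
int prepare_filter(const int *prog, size_t len, check_fn extra_check)
{
	if (!prog || len == 0)
		return -1;		/* generic validation */
	if (extra_check)
		return extra_check(prog, len);
	return 0;
}

/* stand-in for seccomp_check_filter(): reject negative "opcodes" */
int toy_seccomp_check(const int *prog, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (prog[i] < 0)
			return -1;
	return 0;
}
```

With this shape there is a single prepare path to maintain; callers that need no extra validation simply pass NULL.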

Thanks,
Daniel


Re: [PATCH] compiler-intel: fix wrong compiler barrier() macro

2015-04-29 Thread Daniel Borkmann

On 04/29/2015 06:40 PM, Pranith Kumar wrote:

On Wed, Apr 29, 2015 at 10:59 AM, mancha security manc...@zoho.com wrote:


The problem is that ICC defines __GNUC__ so barrier() gets defined
in compiler-gcc.h. Your commit removed an #undef from compiler-intel.h
so compiler.h will never define barrier to __memory_barrier().


OK, I see your point. But, ICC has support for GCC inline assembly. So
the change does not seem to be making any difference. We are using our
own asm barrier rather than the inbuilt one provided by ICC.


It does make a difference: gcc inline assembly is not supported by
/ecc/, see that it's wrapped within the ifdef __ECC part. I believe,
that should be for ia64 which we have under arch/, no?


Re: [PATCH 0/4] Seccomp filter JIT support on ARM.

2015-04-29 Thread Daniel Borkmann

On 04/29/2015 03:37 PM, Nicolas Schichan wrote:
...

The fourth and final patch fixes a bug in the emit_udiv() function
when used to convert a BPF_ALU | BPF_DIV | BPF_K instruction in the
ARM BPF JIT code.


Shouldn't that fix go separately, so it can be included into 4.1
resp. -stable?

Would be good if you also Cc Mircea, who wrote the code. Was that
caught by lib/test_bpf.c suite (if not, would be good to add a test
case for it) ?


[PATCH v2] compiler-intel: fix wrong compiler barrier() macro

2015-04-30 Thread Daniel Borkmann
Cleanup commit 73679e508201 ("compiler-intel.h: Remove duplicate
definition") removed the double definition of __memory_barrier()
intrinsics.

However, in doing so, it also removed the preceding #undef barrier
by accident, meaning, the actual barrier() macro from compiler-gcc.h
with inline asm is still in place as __GNUC__ is provided.

Subsequently, barrier() can never be defined as __memory_barrier()
from compiler.h since it already has a definition in place and if
we trust the comment in compiler-intel.h, ecc doesn't support gcc
specific asm statements.
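The underlying preprocessor rule is that a later header can only install its own body for an already-defined macro by #undef'ing it first; otherwise the earlier definition stays in effect (and redefining with a different body is ill-formed). A toy reduction, with stand-in macros rather than the real compiler headers:

```c
/* stands in for the compiler-gcc.h definition (asm-based barrier) */
#define barrier_impl() 1

/* stands in for the compiler-intel.h fix: the #undef must precede
 * the redefinition, exactly as in the patch below */
#undef barrier_impl
#define barrier_impl() 2	/* stands in for __memory_barrier() */

int which_barrier(void)
{
	return barrier_impl();
}
```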

I don't have an ecc at hand, so a revert of that cleanup would be
the safest option, imho, as it has been like this since pre git
times.

Fixes: 73679e508201 ("compiler-intel.h: Remove duplicate definition")
Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Reviewed-by: Pranith Kumar bobby.pr...@gmail.com
Cc: Pranith Kumar bobby.pr...@gmail.com
Cc: H. Peter Anvin h...@linux.intel.com
Cc: Ingo Molnar mi...@kernel.org
Cc: mancha security manc...@zoho.com
---
 v1->v2:
  - fixed the wrong commit hash in the 1st occurrence of the commit msg
  - rest as is, also kept Pranith's Reviewed-by tag

 include/linux/compiler-intel.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/compiler-intel.h b/include/linux/compiler-intel.h
index ba147a1..5529c52 100644
--- a/include/linux/compiler-intel.h
+++ b/include/linux/compiler-intel.h
@@ -13,9 +13,12 @@
 /* Intel ECC compiler doesn't support gcc specific asm stmts.
  * It uses intrinsics to do the equivalent things.
  */
+#undef barrier
 #undef RELOC_HIDE
 #undef OPTIMIZER_HIDE_VAR
 
+#define barrier() __memory_barrier()
+
 #define RELOC_HIDE(ptr, off)   \
   ({ unsigned long __ptr;  \
  __ptr = (unsigned long) (ptr);\
-- 
1.9.3



Re: [PATCH 0/4] Seccomp filter JIT support on ARM.

2015-04-30 Thread Daniel Borkmann

On 04/30/2015 02:35 PM, Nicolas Schichan wrote:

On 04/29/2015 06:37 PM, Daniel Borkmann wrote:

On 04/29/2015 03:37 PM, Nicolas Schichan wrote:
...

The fourth and final patch fixes a bug in the emit_udiv() function
when used to convert a BPF_ALU | BPF_DIV | BPF_K instruction in the
ARM BPF JIT code.


Shouldn't that fix go separately, so it can be included into 4.1
resp. -stable?


Sure, shall I resend that separately from the v2 of the series or is it fine in
its current form ?


Would be great if you could send that as an individual patch, since
it's a stand-alone fix and independent from the (feature) patch set.


Would be good if you also Cc Mircea, who wrote the code. Was that
caught by lib/test_bpf.c suite (if not, would be good to add a test
case for it) ?


It was detected by an internal test suite we have here. I see that there are
some BPF_ALU | BPF_DIV | BPF_K instructions so it might also be caught by
lib/test_bpf.c as well.


If there are some additional tests that are not yet covered by lib/test_bpf.c,
I'd be happy if you could add them there. This can also be as a follow-up,
but if we can increase coverage for others as well, the better.

Thanks again,
Daniel


Re: [PATCH 2/4] seccomp: rework seccomp_prepare_filter().

2015-04-30 Thread Daniel Borkmann

On 04/30/2015 02:27 PM, Nicolas Schichan wrote:
...

I'll take more care about the receiver list for the v2 of this series.


Ok, cool.


I see, you need that to make it available to the old bpf_jit_compile()
for probing on classic JITs. Actually, I really would prefer, if instead
of duplicating that code, you could export bpf_prepare_filter() and
pass seccomp_check_filter() as an argument to bpf_prepare_filter().


Just to be sure you want me to pass a pointer to seccomp_check_filter to
bpf_prepare_filter so that it can run it between bpf_check_classic() and
bpf_jit_compile ?


For example, what comes to mind is something along these lines:

struct bpf_prog *
bpf_prepare_filter(struct bpf_prog *fp,
		   int (*aux_trans_classic)(struct sock_filter *filter,
					    unsigned int flen))
{
	int err;

	fp->bpf_func = NULL;
	fp->jited = false;

	err = bpf_check_classic(fp->insns, fp->len);
	if (err) {
		__bpf_prog_release(fp);
		return ERR_PTR(err);
	}

	/* There might be additional checks and transformations
	 * needed on classic filters, f.e. in case of seccomp.
	 */
	if (aux_trans_classic) {
		err = aux_trans_classic(fp->insns, fp->len);
		if (err) {
			__bpf_prog_release(fp);
			return ERR_PTR(err);
		}
	}

	/* Probe if we can JIT compile the filter and if so, do
	 * the compilation of the filter.
	 */
	bpf_jit_compile(fp);

	/* JIT compiler couldn't process this filter, so do the
	 * internal BPF translation for the optimized interpreter.
	 */
	if (!fp->jited)
		fp = bpf_migrate_filter(fp);

	return fp;
}

From seccomp side, you invoke:

... bpf_prepare_filter(fp, seccomp_check_filter);

I would actually like to see that customized code in seccomp_prepare_filter()
further reduced instead of increased, would certainly make it easier to
maintain and understand.

Do you think something like the above is possible?


Otherwise, in case bpf_prepare_filter() changes, people will easily
forget to update seccomp related code, really.


Fair point.

Thanks,


Thanks a lot,
Daniel


Re: [BUG/PATCH] kernel RNG and its secrets

2015-04-27 Thread Daniel Borkmann

On 04/27/2015 10:41 PM, Stephan Mueller wrote:
...

It seems you have the code already in mind, so please if you could write it
:-)


Ok, sure. I'll cook something by tomorrow morning.

Cheers,
Daniel


Re: [BUG/PATCH] kernel RNG and its secrets

2015-04-27 Thread Daniel Borkmann

On 04/27/2015 09:10 PM, Stephan Mueller wrote:
...

I posted the issue on the clang mailing list on April 10 -- no word so far. I
would interpret this as a sign that it is a no-issue for them.


Hm. ;)

Here's a bug report on the topic, gcc vs llvm:

  https://llvm.org/bugs/show_bug.cgi?id=15495

Lets add a new barrier macro to linux/compiler{,-gcc}.h, f.e.

  #define barrier_data(ptr) __asm__ __volatile__("" : : "r" (ptr) : "memory")

or the version Mancha proposed. You could wrap that ...

  #define OPTIMIZER_HIDE(ptr)   barrier_data(ptr)

... and use that one for memzero_explicit() instead:

  void memzero_explicit(void *s, size_t count)
  {
 memset(s, 0, count);
 OPTIMIZER_HIDE(s);
  }

It certainly needs comments explaining in what situations to use
which OPTIMIZER_HIDE* variants, etc.

Do you want to send a patch?


Re: [PATCH 3.19 176/177] netfilter: x_tables: fix cgroup matching on non-full sks

2015-05-03 Thread Daniel Borkmann

Hi Greg, hi Pablo,

On 05/03/2015 08:45 PM, Greg Kroah-Hartman wrote:

On Sun, May 03, 2015 at 12:47:17AM +0300, Thomas Backlund wrote:

Den 02.05.2015 22:03, Greg Kroah-Hartman skrev:

3.19-stable review patch.  If anyone has any objections, please let me know.

--

From: Daniel Borkmann dan...@iogearbox.net

commit afb7718016fcb0370ac29a83b2839c78b76c2960 upstream.

While originally only being intended for outgoing traffic, commit
a00e76349f35 (netfilter: x_tables: allow to use cgroup match for
LOCAL_IN nf hooks) enabled xt_cgroups for the NF_INET_LOCAL_IN hook
as well, in order to allow for nfacct accounting.

Besides being currently limited to early demuxes only, commit
a00e76349f35 forgot to add a check if we deal with full sockets,
i.e. in this case not with time wait sockets. TCP time wait sockets
do not have the same memory layout as full sockets, a lower memory
footprint and consequently also don't have a sk_classid member;
probing for sk_classid member there could potentially lead to a
crash.

Fixes: a00e76349f35 (netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf 
hooks)
Cc: Alexey Perevalov a.pereva...@samsung.com
Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Signed-off-by: Pablo Neira Ayuso pa...@netfilter.org
Signed-off-by: Greg Kroah-Hartman gre...@linuxfoundation.org

---
  net/netfilter/xt_cgroup.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -39,7 +39,7 @@ cgroup_mt(const struct sk_buff *skb, str
  {
	const struct xt_cgroup_info *info = par->matchinfo;
 
-	if (skb->sk == NULL)
+	if (skb->sk == NULL || !sk_fullsock(skb->sk))
 		return false;
 
 	return (info->id == skb->sk->sk_classid) ^ info->invert;


This one breaks the build with:

net/netfilter/xt_cgroup.c: In function 'cgroup_mt':
net/netfilter/xt_cgroup.c:42:2: error: implicit declaration of function
'sk_fullsock' [-Werror=implicit-function-declaration]


In order to fix it, you also need to add:

 From 1d0ab253872cdd3d8e7913f59c266c7fd01771d0 Mon Sep 17 00:00:00 2001
From: Eric Dumazet eduma...@google.com
Date: Sun, 15 Mar 2015 21:12:12 -0700
Subject: [PATCH] net: add sk_fullsock() helper

which in turn needs this one:

 From 10feb428a5045d5eb18a5d755fbb8f0cc9645626 Mon Sep 17 00:00:00 2001
From: Eric Dumazet eduma...@google.com
Date: Thu, 12 Mar 2015 16:44:04 -0700
Subject: [PATCH] inet: add TCP_NEW_SYN_RECV state


I've just dropped the patch, thanks for letting me know, odd that my
build tests missed it.


If you're nevertheless interested in this fix, you could use this version,
which should apply/build just fine:

  http://patchwork.ozlabs.org/patch/455546/

I believe Pablo usually sends netfilter patches in bundles to you.

Thanks a lot,
Daniel


[PATCH] compiler-intel: fix wrong compiler barrier() macro

2015-04-29 Thread Daniel Borkmann
Cleanup commit 23ebdedc67e (compiler-intel.h: Remove duplicate
definition) removed the double definition of __memory_barrier()
intrinsics.

However, in doing so, it also removed the preceding #undef barrier,
meaning, the actual barrier() macro from compiler-gcc.h with inline
asm is still in place when __GNUC__ is provided.

Subsequently, barrier() can never be defined as __memory_barrier()
from compiler.h since it already has a definition in place and if
we trust the comment in compiler-intel.h, ecc doesn't support gcc
specific asm statements. I don't have an ecc at hand, so a revert
of that cleanup would be the safest option, imho, as it has been
like this since pre-git times.

Fixes: 73679e508201 (compiler-intel.h: Remove duplicate definition)
Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Cc: Pranith Kumar bobby.pr...@gmail.com
Cc: H. Peter Anvin h...@linux.intel.com
Cc: Ingo Molnar mi...@kernel.org
Cc: mancha security manc...@zoho.com
---
 include/linux/compiler-intel.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/compiler-intel.h b/include/linux/compiler-intel.h
index ba147a1..5529c52 100644
--- a/include/linux/compiler-intel.h
+++ b/include/linux/compiler-intel.h
@@ -13,9 +13,12 @@
 /* Intel ECC compiler doesn't support gcc specific asm stmts.
  * It uses intrinsics to do the equivalent things.
  */
+#undef barrier
 #undef RELOC_HIDE
 #undef OPTIMIZER_HIDE_VAR
 
+#define barrier() __memory_barrier()
+
 #define RELOC_HIDE(ptr, off)   \
   ({ unsigned long __ptr;  \
  __ptr = (unsigned long) (ptr);\
-- 
1.9.3



Re: [PATCH] compiler-intel: fix wrong compiler barrier() macro

2015-04-29 Thread Daniel Borkmann

On 04/29/2015 04:51 PM, Pranith Kumar wrote:
...

message says in 73679e508201 (your commit message has the wrong hash).


Sorry for that, the Fixes tag actually got it right.


Re: [PATCH] ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.

2015-05-07 Thread Daniel Borkmann

On 05/07/2015 05:14 PM, Nicolas Schichan wrote:

The ARM JIT code emits ldr rX, [pc, #offset] to access the literal
pool. #offset maximum value is 4095 and if the generated code is too
large, the #offset value can overflow and not point to the expected
slot in the literal pool. Additionally, when overflow occurs, bits of
the overflow can end up changing the destination register of the ldr
instruction.

Fix that by detecting the overflow in imm_offset() and setting a flag
that is checked for each BPF instruction converted in
build_body(). As of now it can only be detected in the second pass. As
a result the second build_body() call can now fail, so add the
corresponding cleanup code in that case.

Using multiple literal pools in the JITed code is going to require
lots of intrusive changes to the JIT code (which would better be done
as a feature instead of fix), just delegating to the kernel BPF
interpreter in that case is a more straightforward, minimal fix and
easy to backport.

Signed-off-by: Nicolas Schichan nschic...@freebox.fr


Fix looks good to me.

Fixes: ddecdfcea0ae (ARM: 7259/3: net: JIT compiler for packet filters)
Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH net-next] test: bpf: extend load 64-bit immediate testcase

2015-05-09 Thread Daniel Borkmann

On 05/09/2015 10:14 AM, Xi Wang wrote:

Extend the testcase to catch a signedness bug in the arm64 JIT:

test_bpf: #58 load 64-bit immediate jited:1 ret -1 != 1 FAIL (1 times)

This is useful to ensure other JITs won't have a similar bug.

Link: https://lkml.org/lkml/2015/5/8/458
Cc: Alexei Starovoitov a...@plumgrid.com
Cc: Will Deacon will.dea...@arm.com
Signed-off-by: Xi Wang xi.w...@gmail.com


Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-16 Thread Daniel Borkmann

On 04/16/2015 10:56 AM, George Dunlap wrote:

On 04/15/2015 07:19 PM, Eric Dumazet wrote:

On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:


Maybe you should stop wasting all of our time and just tell us what
you're thinking.


I think you make me wasting my time.

I already gave all the hints in prior discussions.


Right, and I suggested these two options:


So mid term, it would be much more beneficial if you attempt fix the
underlying driver issues that actually cause high tx completion delays,
instead of reintroducing bufferbloat. So that we all can move forward
and not backwards in time.


Obviously one solution would be to allow the drivers themselves to set
the tcp_limit_output_bytes, but that seems like a maintenance
nightmare.


Possible, but very hacky, as you penalize globally.


Another simple solution would be to allow drivers to indicate whether
they have a high transmit latency, and have the kernel use a higher
value by default when that's the case. [1]

Neither of which you commented on.  Instead you pointed me to a comment


What Eric described to you was that you introduce a new netdev member
like netdev->needs_bufferbloat, set that indication from the driver side,
and cache that in the socket that binds to it, so you can adjust the
test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
for such devices. Hmm?


that only partially described what the limitations were. (I.e., it
described the two packets or 1ms, but not how they related, nor how
they related to the max of 2 64k packets outstanding of the default
tcp_limit_output_bytes setting.)

  -George

[1]
http://marc.info/?i=CAFLBxZYt7-v29ysm=f+5qmow64_qhesjzj98udba+1cs-pf...@mail.gmail.com


Re: linux-next: manual merge of the tip tree with the net-next tree

2015-04-07 Thread Daniel Borkmann

On 04/07/2015 09:11 AM, Stephen Rothwell wrote:
...

Today's linux-next merge of the tip tree got a conflict in
include/uapi/linux/bpf.h between commit 96be4325f443 (ebpf: add
sched_cls_type and map it to sk_filter's verifier ops), 03e69b508b6f
(ebpf: add prandom helper for packet sampling), c04167ce2ca0 (ebpf:
add helper for obtaining current processor id), 94caee8c312d (ebpf:
add sched_act_type and map it to sk_filter's verifier ops),
608cd71a9c7c (tc: bpf: generalize pedit action) and 91bc4822c3d6
(tc: bpf: add checksum helpers) from the net-next tree and commit
2541517c32be (tracing, perf: Implement BPF programs attached to
kprobes), d9847d310ab4 (tracing: Allow BPF programs to call
bpf_ktime_get_ns()) and 9c959c863f82 (tracing: Allow BPF programs to
call bpf_trace_printk()) from the tip tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).


Looks good to me, thanks!


Re: linux-next: manual merge of the tip tree with the net-next tree

2015-04-07 Thread Daniel Borkmann

On 04/07/2015 09:00 AM, Stephen Rothwell wrote:
...

Today's linux-next merge of the tip tree got a conflict in
include/linux/bpf.h between commit 0fc174dea545 (ebpf: make internal
bpf API independent of CONFIG_BPF_SYSCALL ifdefs) from the net-next
tree and commit 4e537f7fbdce (bpf: Make internal bpf API independent
of CONFIG_BPF_SYSCALL #ifdefs) from the tip tree.

I fixed it up (they are the same patch and there are other changes in
the net-next tree, so I just used that version) and can carry the fix
as necessary (no action is required).


Thanks, Stephen! That is correct.


Re: linux-next: build failure after merge of the tip tree

2015-04-07 Thread Daniel Borkmann

On 04/07/2015 10:48 AM, Ingo Molnar wrote:


* Stephen Rothwell s...@canb.auug.org.au wrote:


Hi all,

After merging the tip tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

kernel/events/core.c: In function 'perf_event_set_bpf_prog':
kernel/events/core.c:6732:15: error: 'struct bpf_prog_aux' has no member named 
'prog_type'
   if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) {
^

Caused by commit 2541517c32be (tracing, perf: Implement BPF programs
attached to kprobes).


Note, this must be some (rarely triggered) aspect of the ppc64
defconfig that neither x86 randconfigs nor most other arch defconfigs
expose?


Note, this is a merge conflict with the work that went via net-next tree,
i.e. 24701ecea76b (ebpf: move read-only fields to bpf_prog and shrink
bpf_prog_aux). I believe that is why it didn't trigger on tip tree.

You should be able to resolve it in linux-next by changing the test to:

  if (prog->prog_type != BPF_PROG_TYPE_KPROBE) {

Thanks,
Daniel


Re: linux-next: build failure after merge of the tip tree

2015-04-07 Thread Daniel Borkmann

[ Cc'ing Dave, fyi ]

On 04/07/2015 11:05 AM, Stephen Rothwell wrote:

On Tue, 07 Apr 2015 10:56:13 +0200 Daniel Borkmann dan...@iogearbox.net wrote:

On 04/07/2015 10:48 AM, Ingo Molnar wrote:

* Stephen Rothwell s...@canb.auug.org.au wrote:


After merging the tip tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

kernel/events/core.c: In function 'perf_event_set_bpf_prog':
kernel/events/core.c:6732:15: error: 'struct bpf_prog_aux' has no member named 
'prog_type'
if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) {
 ^

Caused by commit 2541517c32be (tracing, perf: Implement BPF programs
attached to kprobes).


Note, this must be some (rarely triggered) aspect of the ppc64
defconfig that neither x86 randconfigs nor most other arch defconfigs
expose?


Note, this is a merge conflict with the work that went via net-next tree,
i.e. 24701ecea76b (ebpf: move read-only fields to bpf_prog and shrink
bpf_prog_aux). I believe that is why it didn't trigger on tip tree.

You should be able to resolve it in linux-next by changing the test to:

if (prog->prog_type != BPF_PROG_TYPE_KPROBE) {


Thanks Daniel, I will do that tomorrow.  Someone will have to remember
to tell Linus.


Yes, indeed, depending which tree is merged first.

Thanks,
Daniel


Re: linux-next: build failure after merge of the tip tree

2015-04-07 Thread Daniel Borkmann

On 04/07/2015 06:18 PM, Alexei Starovoitov wrote:

On 4/7/15 4:13 AM, Daniel Borkmann wrote:

[ Cc'ing Dave, fyi ]

On 04/07/2015 11:05 AM, Stephen Rothwell wrote:

On Tue, 07 Apr 2015 10:56:13 +0200 Daniel Borkmann
dan...@iogearbox.net wrote:

On 04/07/2015 10:48 AM, Ingo Molnar wrote:

* Stephen Rothwell s...@canb.auug.org.au wrote:


After merging the tip tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

kernel/events/core.c: In function 'perf_event_set_bpf_prog':
kernel/events/core.c:6732:15: error: 'struct bpf_prog_aux' has no
member named 'prog_type'
if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) {
 ^

Caused by commit 2541517c32be (tracing, perf: Implement BPF programs
attached to kprobes).


Note, this must be some (rarely triggered) aspect of the ppc64
defconfig that neither x86 randconfigs nor most other arch defconfigs
expose?


Note, this is a merge conflict with the work that went via net-next
tree,
i.e. 24701ecea76b (ebpf: move read-only fields to bpf_prog and shrink
bpf_prog_aux). I believe that is why it didn't trigger on tip tree.

You should be able to resolve it in linux-next by changing the test to:

if (prog->prog_type != BPF_PROG_TYPE_KPROBE) {


Thanks Daniel, I will do that tomorrow.  Someone will have to remember
to tell Linus.


Yes, indeed, depending which tree is merged first.


Daniel analysis is correct, but the fix for kernel/events/core.c
should be:
- if (prog->aux->prog_type != BPF_PROG_TYPE_KPROBE) {
+ if (prog->type != BPF_PROG_TYPE_KPROBE) {
instead of 'prog->prog_type'


Yes, absolutely, thanks!


[PATCH akpm] compiler-intel: fix wrong compiler barrier() macro

2015-05-19 Thread Daniel Borkmann
Cleanup commit 73679e508201 (compiler-intel.h: Remove duplicate
definition) removed the double definition of __memory_barrier()
intrinsics.

However, in doing so, it also removed the preceding #undef barrier
by accident, meaning, the actual barrier() macro from compiler-gcc.h
with inline asm is still in place as __GNUC__ is provided.

Subsequently, barrier() can never be defined as __memory_barrier()
from compiler.h since it already has a definition in place and if
we trust the comment in compiler-intel.h, ecc doesn't support gcc
specific asm statements.

I don't have an ecc at hand (unsure if that's still used in the
field?) and only found this by accident during code review, a revert
of that cleanup would be simplest option.

Fixes: 73679e508201 (compiler-intel.h: Remove duplicate definition)
Signed-off-by: Daniel Borkmann dan...@iogearbox.net
Reviewed-by: Pranith Kumar bobby.pr...@gmail.com
Cc: Pranith Kumar bobby.pr...@gmail.com
Cc: H. Peter Anvin h...@zytor.com
Cc: mancha security manc...@zoho.com
---
 Only resending fix for Andrew's tree, rebased.

 include/linux/compiler-intel.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/compiler-intel.h b/include/linux/compiler-intel.h
index 0c9a2f2..d4c7113 100644
--- a/include/linux/compiler-intel.h
+++ b/include/linux/compiler-intel.h
@@ -13,10 +13,12 @@
 /* Intel ECC compiler doesn't support gcc specific asm stmts.
  * It uses intrinsics to do the equivalent things.
  */
+#undef barrier
 #undef barrier_data
 #undef RELOC_HIDE
 #undef OPTIMIZER_HIDE_VAR
 
+#define barrier() __memory_barrier()
 #define barrier_data(ptr) barrier()
 
 #define RELOC_HIDE(ptr, off)   \
-- 
1.9.3



Re: [PATCH v2] bpf: BPF based latency tracing

2015-06-20 Thread Daniel Borkmann
 - 131071   : 2||
   131072 - 262143   : 0||
   262144 - 524287   : 1||
   524288 - 1048575  : 174  ||
CPU 3
   latency: count distribution
1 - 1: 0||
2 - 3: 0||
4 - 7: 0||
8 - 15   : 0||
   16 - 31   : 0||
   32 - 63   : 0||
   64 - 127  : 0||
  128 - 255  : 0||
  256 - 511  : 0||
  512 - 1023 : 0||
 1024 - 2047 : 0||
 2048 - 4095 : 29626|*** |
 4096 - 8191 : 2704 |**  |
 8192 - 16383: 1090 ||
16384 - 32767: 160  ||
32768 - 65535: 72   ||
65536 - 131071   : 32   ||
   131072 - 262143   : 26   ||
   262144 - 524287   : 12   ||
   524288 - 1048575  : 298  ||

All this is based on the trace3 examples written by
Alexei Starovoitov a...@plumgrid.com.

Signed-off-by: Daniel Wagner daniel.wag...@bmw-carit.de


I think it would be useful to perhaps have two options:

1) User specifies a specific CPU and gets one such an output above.
2) Summary view, i.e. to have the samples of each CPU for comparison
   next to each other in columns and maybe the histogram view a bit
   more compressed (perhaps summary of all CPUs).

Anyway, it's sample code people can go with and modify individually.

Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH net-next] test_bpf: extend tests for 32-bit endianness conversion

2015-06-27 Thread Daniel Borkmann

On 06/26/2015 05:25 PM, Xi Wang wrote:

Currently ALU_END_FROM_BE 32 and ALU_END_FROM_LE 32 do not test if
the upper bits of the result are zeros (the arm64 JIT had such bugs).
Extend the two tests to catch this.

Cc: Alexei Starovoitov a...@plumgrid.com
Signed-off-by: Xi Wang xi.w...@gmail.com


Thanks for extending the test suite!

Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH] selftests: add seccomp suite

2015-06-16 Thread Daniel Borkmann

On 06/16/2015 07:54 PM, Kees Cook wrote:

This imports the existing seccomp test suite into the kernel's selftests
tree. It contains extensive testing of seccomp features and corner cases.
There remain additional tests to move into the kernel tree, but they have
not yet been ported to all the architectures seccomp supports:
https://github.com/redpig/seccomp/tree/master/tests

Signed-off-by: Kees Cook keesc...@chromium.org


Awesome, thanks a bunch, Kees!


Re: [PATCH v2 net-next 0/3] bpf: share helpers between tracing and networking

2015-06-18 Thread Daniel Borkmann

On 06/16/2015 07:10 PM, Alexei Starovoitov wrote:
...

Ideally we would allow a blend of tracing and networking programs,
then the best solution would be one or two stable tracepoints in
networking stack where skb is visible and receiving/transmitting task
is also visible, then skb->len and task->pid together would give nice
foundation for accurate stats.


I think combining both seems interesting anyway, we need to find
a way to make this gluing of both worlds easy to use, though. It's
certainly interesting for stats/diagnostics, but one wouldn't be
able to use the current/future skb eBPF helpers from {cls,act}_bpf
in that context.


Re: [PATCH v2 net-next 3/3] bpf: let kprobe programs use bpf_get_smp_processor_id() helper

2015-06-13 Thread Daniel Borkmann

On 06/13/2015 04:39 AM, Alexei Starovoitov wrote:

It's useful to do per-cpu histograms.

Suggested-by: Daniel Wagner daniel.wag...@bmw-carit.de
Signed-off-by: Alexei Starovoitov a...@plumgrid.com


Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH v2 net-next 0/3] bpf: share helpers between tracing and networking

2015-06-16 Thread Daniel Borkmann

On 06/16/2015 05:28 AM, Alexei Starovoitov wrote:

On 6/15/15 4:01 PM, David Miller wrote:


Although I agree with the sentiment that this thing can cause
surprising results and can be asking for trouble.

If someone wants to filter traffic by UID they might make
a simple ingress TC ebpf program using these new interfaces
and expect it to work.

But the UID their program will see will be the UID of whatever
randomly happened to be executing when the packet was received
and processed.


yes, you're right. Such tc filters will be incorrect.
Will send a partial revert disallowing them in tc.


Sorry for late reply [on vacation]; if you really want to, you
could go via skb->sk->sk_socket->file and then retrieve credentials
from there for egress side (you can have a look at xt_owner). You'd
need a different *_proto helper for tc in that case, which would
then map to BPF_FUNC_get_current_uid_gid, etc. But that doesn't work
for ingress however, even if you would have early demux, so you
would need to let the eBPF helper function return an error code in
that case.


Re: Edited draft of bpf(2) man page for review/enhancement

2015-05-27 Thread Daniel Borkmann

On 05/27/2015 10:43 AM, Michael Kerrisk (man-pages) wrote:

Hello Alexei,

I took the draft 3 of the bpf(2) man page that you sent back in March
and did some substantial editing to clarify the language and add a
few technical details. Could you please check the revised  version
below, to ensure I did not inject any errors.

I also added a number of FIXMEs for pieces of the page that need
further work. Could you take a look at these and let me know your
thoughts, please.


That's great, thanks! Minor comments:

...

.TH BPF 2 2015-03-10 Linux Linux Programmer's Manual
.SH NAME
bpf - perform a command on an extended BPF map or program
.SH SYNOPSIS
.nf
.B #include <linux/bpf.h>
.sp
.BI int bpf(int cmd, union bpf_attr *attr, unsigned int size);

.SH DESCRIPTION
The
.BR bpf ()
system call performs a range of operations related to extended
Berkeley Packet Filters.
Extended BPF (or eBPF) is similar to
the original BPF (or classic BPF) used to filter network packets.
For both BPF and eBPF programs,
the kernel statically analyzes the programs before loading them,
in order to ensure that they cannot harm the running system.
.P
eBPF extends classic BPF in multiple ways including the ability to call
in-kernel helper functions (via the
.B BPF_CALL
opcode extension provided by eBPF)
and access shared data structures such as BPF maps.


I would perhaps emphasize that maps can be shared among in-kernel
eBPF programs, but also between kernel and user space.


The programs can be written in a restricted C that is compiled into
.\" FIXME In the next line, what is a "restricted C"? Where does
.\"   one get further information about it?


So far only from the kernel samples directory and for tc classifier
and action, from the tc man page and/or examples/bpf/ in the tc git
tree.


eBPF bytecode and executed on the in-kernel virtual machine or
just-in-time compiled into native code.
.SS Extended BPF Design/Architecture
.P
.\" FIXME In the following line, what does "different data types" mean?
.\"   Are the values in a map not just blobs?


Sort of, currently, these blobs can have different sizes of keys
and values (you can even have structs as keys). For the map itself
they are treated as blob internally. However, recently, bpf tail call
got added where you can lookup another program from an array map and
call into it. Here, that particular type of map can only have entries
of type of eBPF program fd. I think, if needed, adding a paragraph to
the tail call could be done as follow-up after we have an initial man
page in the tree included.


BPF maps are a generic data structure for storage of different data types.
A user process can create multiple maps (with key/value-pairs being
opaque bytes of data) and access them via file descriptors.
BPF programs can access maps from inside the kernel in parallel.
It's up to the user process and BPF program to decide what they store
inside maps.
.P
BPF programs are similar to kernel modules.
They are loaded by the user
process and automatically unloaded when the process exits.


Generally that's true. Btw, in 4.1 kernel, tc(8) also got support for
eBPF classifier and actions, and here it's slightly different: in tc,
we load the programs, maps etc, and push down the eBPF program fd in
order to let the kernel hold reference on the program itself.

Thus, there, the program fd that the application owns is gone when the
application terminates, but the eBPF program itself still lives on
inside the kernel. But perhaps it's already too much detail to mention
here ...


Each BPF program is a set of instructions that is safe to run until
its completion.
The in-kernel BPF verifier statically determines that the program
terminates and is safe to execute.
.\" FIXME In the following sentence, what does "takes hold" mean?


Takes a reference. Meaning, that maps cannot disappear under us while
the eBPF program that is using them in the kernel is still alive.


During verification, the program takes hold of maps that it intends to use,
so selected maps cannot be removed until the program is unloaded.

BPF programs can be attached to different events.
.\" FIXME: In the next sentence, packets are not events. What
.\" do you really mean to say here? (the arrival of a network packet?)


Socket filters is meant here: setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, ...);


These events can be packets, tracing
events, and other types that may be added in the future.


There's already one more type worth mentioning: tc classifier and actions
(tc/tc_bpf.{c,h}).


A new event triggers execution of the BPF program, which
may store information about the event in the maps.
Beyond storing data, BPF programs may call into in-kernel helper functions.


I would mention that these in-kernel helpers are a fixed set of functions
provided by the kernel (linux/bpf.h: BPF_FUNC_*), so you cannot call into
arbitrary ones.


The same program can be attached to multiple events and different programs can
access the same map:
.\" FIXME Can maps 

Re: [PATCH net-next 1/4] bpf: allow bpf programs to tail-call other bpf programs

2015-05-21 Thread Daniel Borkmann

On 05/20/2015 01:59 AM, Alexei Starovoitov wrote:

introduce bpf_tail_call(ctx, jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
   ...
   bpf_tail_call(ctx, jmp_table, index);
   ...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
   ...
   if (jmp_table[index])
 return (*jmp_table[index])(ctx);
   ...
}
The important detail is that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.


...

Signed-off-by: Alexei Starovoitov a...@plumgrid.com


LGTM, thanks!

Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH 2/6] test_bpf: allow tests to specify an skb fragment.

2015-08-03 Thread Daniel Borkmann

On 08/03/2015 04:02 PM, Nicolas Schichan wrote:

This introduces a new test->aux flag (FLAG_SKB_FRAG) to tell the
populate_skb() function to add a fragment to the test skb containing
the data specified in test->frag_data.

Signed-off-by: Nicolas Schichan nschic...@freebox.fr
Acked-by: Alexei Starovoitov a...@plumgrid.com


Acked-by: Daniel Borkmann dan...@iogearbox.net

I'm good with this change here, just a comment below in general.


  enum {
CLASSIC  = BIT(6),  /* Old BPF instructions only. */
@@ -81,6 +83,7 @@ struct bpf_test {
__u32 result;
} test[MAX_SUBTESTS];
int (*fill_helper)(struct bpf_test *self);
+   __u8 frag_data[MAX_DATA];
  };


We now have 286 tests, which is awesome!

Perhaps, we need to start thinking of a better test description method
soonish as the test_bpf.ko module grew to ~1.6M, i.e. whenever we add
to struct bpf_test, it adds memory overhead upon all test cases.


  /* Large test cases need separate allocation and fill handler. */
@@ -4525,6 +4528,10 @@ static struct sk_buff *populate_skb(char *buf, int size)

  static void *generate_test_data(struct bpf_test *test, int sub)
  {
+   struct sk_buff *skb;
+   struct page *page;
+   void *ptr;
+
	if (test->aux & FLAG_NO_DATA)
return NULL;

@@ -4532,7 +4539,36 @@ static void *generate_test_data(struct bpf_test *test, int sub)
 * subtests generate skbs of different sizes based on
 * the same data.
 */
-   return populate_skb(test->data, test->test[sub].data_size);
+   skb = populate_skb(test->data, test->test[sub].data_size);
+   if (!skb)
+   return NULL;
+
+   if (test->aux & FLAG_SKB_FRAG) {


Really minor nit: declaration of page, ptr could have been only in this block.


Re: [PATCH 4/6] test_bpf: add module parameters to filter the tests to run.

2015-08-03 Thread Daniel Borkmann

On 08/03/2015 04:02 PM, Nicolas Schichan wrote:

When developing on the interpreter or a particular JIT, it can be
insteresting to restrict the test list to a specific test or a


s/insteresting/interesting/


particular range of tests.

This patch adds the following module parameters to the test_bpf module:

* test_name=string: only the specified named test will be run.

* test_id=number: only the test with the specified id will be run
   (see the output of test_pbf without parameters to get the test id).


s/test_pbf/test_bpf/


* test_range=number,number: only the tests with IDs in the
   specified id range are run (see the output of test_pbf without


s/test_pbf/test_bpf/


   parameters to get the test ids).

Any invalid range, test id or test name will result in -EINVAL being
returned and no tests being run.

Signed-off-by: Nicolas Schichan nschic...@freebox.fr


Seems very useful for the test suite, thanks.

Acked-by: Daniel Borkmann dan...@iogearbox.net


---
  lib/test_bpf.c | 73 ++
  1 file changed, 73 insertions(+)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index f5606fb..abd0507 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -4870,10 +4870,72 @@ static int run_one(const struct bpf_prog *fp, struct bpf_test *test)
return err_cnt;
  }

+static char test_name[64];
+module_param_string(test_name, test_name, sizeof(test_name), 0);
+
+static int test_id = -1;
+module_param(test_id, int, 0);
+
+static int test_range[2] = { -1, -1 };
+module_param_array(test_range, int, NULL, 0);
+
+static __init int find_test_index(const char *test_name)
+{
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(tests); i++) {
+   if (!strcmp(tests[i].descr, test_name))
+   return i;
+   }
+   return -1;
+}
+
  static __init int prepare_bpf_tests(void)
  {
int i;

+   if (test_id >= 0) {
+   /*
+* if a test_id was specified, use test_range to
+* conver only that test.


s/conver/cover/


+*/
+   if (test_id >= ARRAY_SIZE(tests)) {
+   pr_err("test_bpf: invalid test_id specified.\n");
+   return -EINVAL;
+   }

[...]

@@ -4893,6 +4955,14 @@ static __init void destroy_bpf_tests(void)
}
  }

+static bool exclude_test(int test_id)
+{
+   if (test_range[0] >= 0 &&
+   (test_id < test_range[0] || test_id > test_range[1]))
+   return true;
+   return false;


Minor nit: could directly return it, f.e.:

return test_range[0] >= 0 && (test_id < test_range[0] ||
  test_id > test_range[1]);

Btw, for the range test in prepare_bpf_tests(), you could also reject
a negative lower bound index right there.


Re: [PATCH 1/6] test_bpf: avoid oopsing the kernel when generate_test_data() fails.

2015-08-03 Thread Daniel Borkmann

On 08/03/2015 04:02 PM, Nicolas Schichan wrote:

Signed-off-by: Nicolas Schichan nschic...@freebox.fr
Acked-by: Alexei Starovoitov a...@plumgrid.com


Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH v4 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event

2015-07-29 Thread Daniel Borkmann

On 07/28/2015 01:17 PM, Kaixu Xia wrote:

Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'.
This map only stores the pointer to struct perf_event. The
user space event FDs from perf_event_open() syscall are converted
to the pointer to struct perf_event and stored in map.

Signed-off-by: Kaixu Xia xiaka...@huawei.com
---
  include/linux/bpf.h|  1 +
  include/linux/perf_event.h |  2 ++
  include/uapi/linux/bpf.h   |  1 +
  kernel/bpf/arraymap.c  | 57 ++
  kernel/bpf/verifier.c  | 15 
  kernel/events/core.c   | 17 ++
  6 files changed, 93 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 610b730..3c9c0eb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -10,6 +10,7 @@
  #include <uapi/linux/bpf.h>
  #include <linux/workqueue.h>
  #include <linux/file.h>
+#include <linux/perf_event.h>

  struct bpf_map;

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..2ea4067 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -641,6 +641,7 @@ extern int perf_event_init_task(struct task_struct *child);
  extern void perf_event_exit_task(struct task_struct *child);
  extern void perf_event_free_task(struct task_struct *task);
  extern void perf_event_delayed_put(struct task_struct *task);
+extern struct perf_event *perf_event_get(unsigned int fd);
  extern void perf_event_print_debug(void);
  extern void perf_pmu_disable(struct pmu *pmu);
  extern void perf_pmu_enable(struct pmu *pmu);
@@ -979,6 +980,7 @@ static inline int perf_event_init_task(struct task_struct *child) { return 0; }
  static inline void perf_event_exit_task(struct task_struct *child) { }
  static inline void perf_event_free_task(struct task_struct *task) { }
  static inline void perf_event_delayed_put(struct task_struct *task) { }
+static struct perf_event *perf_event_get(unsigned int fd) { return NULL; }
  static inline void perf_event_print_debug(void) { }
  static inline int perf_event_task_disable(void) { return -EINVAL; }
  static inline int perf_event_task_enable(void) { return -EINVAL; }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 29ef6f9..69a1f6b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -114,6 +114,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_HASH,
BPF_MAP_TYPE_ARRAY,
BPF_MAP_TYPE_PROG_ARRAY,
+   BPF_MAP_TYPE_PERF_EVENT_ARRAY,
  };

  enum bpf_prog_type {
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 4784cdc..a9e0235 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -297,3 +297,60 @@ static int __init register_prog_array_map(void)
return 0;
  }
  late_initcall(register_prog_array_map);
+
+static void *perf_event_fd_array_get_ptr(struct bpf_array *array, int fd)
+{
+   struct perf_event *event;
+
+   event = perf_event_get(fd);
+   if (IS_ERR(event))
+   return event;


So in case of !CONFIG_PERF_EVENTS, you define perf_event_get() to return NULL.

Seems like this will give a nice NULL ptr deref below. ;)


+   /*
+* prevent some crazy events so we can make our life easier
+*/
+   if (event->attr.type != PERF_TYPE_RAW &&
+   event->attr.type != PERF_TYPE_HARDWARE) {
+   perf_event_release_kernel(event);
+   return ERR_PTR(-EINVAL);
+   }
+   return event;
+}
+
+static void perf_event_fd_array_put_ptr(struct bpf_array *array, void *ptr)
+{
+   struct perf_event *event = (struct perf_event *)ptr;


( Same comment as in previous patch. )


+   perf_event_release_kernel(event);
+}
+
+static const struct fd_array_map_ops perf_event_fd_array_map_ops = {
+   .get_ptr= perf_event_fd_array_get_ptr,
+   .put_ptr= perf_event_fd_array_put_ptr,
+};

...

+late_initcall(register_perf_event_array_map);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 039d866..c70f7e7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -924,6 +924,21 @@ static int check_call(struct verifier_env *env, int func_id)
 */
return -EINVAL;

+   if (map && map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
+   func_id != BPF_FUNC_perf_event_read)
+   /* perf_event_array map type needs extra care:
+* only allow to pass it into bpf_perf_event_read() for now.
+* bpf_map_update/delete_elem() must only be done via syscall
+*/
+   return -EINVAL;
+
+   if (func_id == BPF_FUNC_perf_event_read &&
+   map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
+   /* don't allow any other map type to be passed into
+* bpf_perf_event_read()
+*/
+   return -EINVAL;


Maybe a 

Re: [PATCH v4 4/4] samples/bpf: example of get selected PMU counter value

2015-07-29 Thread Daniel Borkmann

On 07/28/2015 01:17 PM, Kaixu Xia wrote:

This is a simple example and shows how to use the new ability
to get the selected Hardware PMU counter value.

Signed-off-by: Kaixu Xia xiaka...@huawei.com

...

diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
new file mode 100644
index 000..e607eac
--- /dev/null
+++ b/samples/bpf/tracex6_user.c
@@ -0,0 +1,67 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <sys/ioctl.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+static void test_bpf_perf_event(void)
+{
+   int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+   int *pmu_fd = malloc(nr_cpus * sizeof(int));
+   unsigned long value;
+   int i;
+
+   struct perf_event_attr attr_insn_pmu = {
+   .freq = 0,
+   .sample_period = 0x7fffULL,


Nit: #define ...


+   .inherit = 0,
+   .type = PERF_TYPE_HARDWARE,
+   .read_format = 0,
+   .sample_type = 0,
+   .config = 0,/* PMU: cycles */
+   };
+
+   for (i = 0; i < nr_cpus; i++) {
+   pmu_fd[i] = perf_event_open(&attr_insn_pmu, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
+   if (pmu_fd[i] < 0)
+   printf("event syscall failed\n");

+   bpf_update_elem(map_fd[0], &i, (pmu_fd + i), BPF_ANY);


Nit: you use pmu_fd[i], but here (pmu_fd + i)? Maybe better &pmu_fd[i]?


+   ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0);
+   }



Re: [PATCH v4 1/4] bpf: Make the bpf_prog_array_map more generic

2015-07-29 Thread Daniel Borkmann

On 07/28/2015 01:17 PM, Kaixu Xia wrote:

From: Wang Nan wangn...@huawei.com

According to the comments from Daniel, rewrite part of
the bpf_prog_array map code and make it more generic.
So the new perf_event_array map type can reuse most of
code with bpf_prog_array map and add fewer lines of
special code.

Tested the samples/bpf/tracex5 after this patch:
$ sudo ./tracex5
...
dd-1051  [000] d...26.682903: : mmap
dd-1051  [000] d...26.698348: : syscall=102 (one of get/set 
uid/pid/gid)
dd-1051  [000] d...26.703892: : read(fd=0, buf=0078c010, 
size=512)
dd-1051  [000] d...26.705847: : write(fd=1, buf=0078c010, 
size=512)
dd-1051  [000] d...26.707914: : read(fd=0, buf=0078c010, 
size=512)
dd-1051  [000] d...26.710988: : write(fd=1, buf=0078c010, 
size=512)
dd-1051  [000] d...26.711865: : read(fd=0, buf=0078c010, 
size=512)
dd-1051  [000] d...26.712704: : write(fd=1, buf=0078c010, 
size=512)
...

Signed-off-by: Wang Nan wangn...@huawei.com
Signed-off-by: Kaixu Xia xiaka...@huawei.com
---
  include/linux/bpf.h   |   6 ++-
  kernel/bpf/arraymap.c | 104 +++---
  kernel/bpf/syscall.c  |   4 +-
  3 files changed, 80 insertions(+), 34 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4383476..610b730 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -130,6 +130,8 @@ struct bpf_prog_aux {
};
  };

+struct fd_array_map_ops;
+
  struct bpf_array {
struct bpf_map map;
u32 elem_size;
@@ -140,15 +142,17 @@ struct bpf_array {
 */
enum bpf_prog_type owner_prog_type;
bool owner_jited;
+   const struct fd_array_map_ops* fd_ops;
union {
char value[0] __aligned(8);
+   void *ptrs[0] __aligned(8);
struct bpf_prog *prog[0] __aligned(8);


After your conversion, prog member from the union is not used here anymore
(only as offsetof(...) in JITs). We should probably get rid of it then.


};
  };
  #define MAX_TAIL_CALL_CNT 32

  u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
-void bpf_prog_array_map_clear(struct bpf_map *map);
+void bpf_fd_array_map_clear(struct bpf_map *map);
  bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
  const struct bpf_func_proto *bpf_get_trace_printk_proto(void);

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index cb31229..4784cdc 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -150,15 +150,62 @@ static int __init register_array_map(void)
  }
  late_initcall(register_array_map);

-static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
+struct fd_array_map_ops {
+   void *(*get_ptr)(struct bpf_array *array, int fd);
+   void (*put_ptr)(struct bpf_array *array, void *ptr);
+};
+
+static void *prog_fd_array_get_ptr(struct bpf_array *array, int fd)
+{
+   struct bpf_prog *prog = bpf_prog_get(fd);
+   if (IS_ERR(prog))
+   return prog;
+
+   if (!bpf_prog_array_compatible(array, prog)) {
+   bpf_prog_put(prog);
+   return ERR_PTR(-EINVAL);
+   }
+   return prog;
+}
+
+static void prog_fd_array_put_ptr(struct bpf_array *array __maybe_unused,
+ void *ptr)


The array member seems not to be used in either implementation. It should then
probably not be part of the API?


+{
+   struct bpf_prog *prog = (struct bpf_prog *)ptr;


No cast on void * needed.


+
+   bpf_prog_put_rcu(prog);
+}
+
+static const struct fd_array_map_ops prog_fd_array_map_ops = {
+   .get_ptr= prog_fd_array_get_ptr,
+   .put_ptr= prog_fd_array_put_ptr,
+};
+
+static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr,
+ const struct fd_array_map_ops *ops)
  {
-   /* only bpf_prog file descriptors can be stored in prog_array map */
+   struct bpf_map *map;
+   struct bpf_array *array;
+
+   /* only file descriptors can be stored in this type of map */
	if (attr->value_size != sizeof(u32))
return ERR_PTR(-EINVAL);
-   return array_map_alloc(attr);
+
+   map = array_map_alloc(attr);
+   if (IS_ERR(map))
+   return map;
+
+   array = container_of(map, struct bpf_array, map);
+   array-fd_ops = ops;
+   return map;
+}
+
+static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
+{
+   return fd_array_map_alloc(attr, prog_fd_array_map_ops);
  }

-static void prog_array_map_free(struct bpf_map *map)
+static void fd_array_map_free(struct bpf_map *map)
  {
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;
@@ -167,21 +214,21 @@ static void prog_array_map_free(struct bpf_map *map)

/* make sure it's empty */

Re: [PATCH v4 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter

2015-07-29 Thread Daniel Borkmann

On 07/28/2015 01:17 PM, Kaixu Xia wrote:

According to the perf_event_map_fd and index, the function
bpf_perf_event_read() can convert the corresponding map
value to the pointer to struct perf_event and return the
Hardware PMU counter value.

Signed-off-by: Kaixu Xia xiaka...@huawei.com
---
  include/linux/bpf.h|  1 +
  include/linux/perf_event.h |  3 ++-
  include/uapi/linux/bpf.h   |  1 +
  kernel/bpf/helpers.c   | 36 
  kernel/events/core.c   |  4 ++--
  kernel/trace/bpf_trace.c   |  2 ++
  6 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3c9c0eb..516992c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -190,6 +190,7 @@ extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
  extern const struct bpf_func_proto bpf_map_update_elem_proto;
  extern const struct bpf_func_proto bpf_map_delete_elem_proto;

+extern const struct bpf_func_proto bpf_perf_event_read_proto;
  extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
  extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
  extern const struct bpf_func_proto bpf_tail_call_proto;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2ea4067..899abcb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -662,7 +662,8 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
  extern u64 perf_event_read_value(struct perf_event *event,
 u64 *enabled, u64 *running);
-
+extern void __perf_event_read(void *info);
+extern u64 perf_event_count(struct perf_event *event);

  struct perf_sample_data {
/*
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 69a1f6b..b9b13ce 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -250,6 +250,7 @@ enum bpf_func_id {
 * Return: 0 on success
 */
BPF_FUNC_get_current_comm,
+   BPF_FUNC_perf_event_read,   /* u64 bpf_perf_event_read(map, index) */
__BPF_FUNC_MAX_ID,
  };

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1447ec0..c40c5ea 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -182,3 +182,39 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
.arg1_type  = ARG_PTR_TO_STACK,
.arg2_type  = ARG_CONST_STACK_SIZE,
  };
+
+static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
+{
+   struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+   struct bpf_array *array = container_of(map, struct bpf_array, map);
+   struct perf_event *event;
+
+   if (index >= array->map.max_entries)
+   return -E2BIG;


Maybe unlikely(...), likewise below.


+   event = (struct perf_event *)array->ptrs[index];
+   if (!event)
+   return -ENOENT;
+
+   if (event->state != PERF_EVENT_STATE_ACTIVE)
+   return -EINVAL;
+
+   if (event->oncpu != raw_smp_processor_id() &&
+   event->ctx->task != current)
+   return -EINVAL;
+
+   if (event->attr.inherit)
+   return -EINVAL;
+
+   __perf_event_read(event);
+
+   return perf_event_count(event);


I believe this helper should rather go somewhere such as bpf_trace.c
(or under kernel/events/ ?), wouldn't we otherwise get a build error
when perf events are compiled out?

Anyway, I let perf folks comment on that (and the helper in general).


+}
+
+const struct bpf_func_proto bpf_perf_event_read_proto = {
+   .func   = bpf_perf_event_read,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_CONST_MAP_PTR,
+   .arg2_type  = ARG_ANYTHING,
+};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 58f0d47..c926c6d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3177,7 +3177,7 @@ void perf_event_exec(void)
  /*
   * Cross CPU call to read the hardware event
   */
-static void __perf_event_read(void *info)
+void __perf_event_read(void *info)


Does this need to be declared in a header file, no?


  {
struct perf_event *event = info;
struct perf_event_context *ctx = event-ctx;
@@ -3204,7 +3204,7 @@ static void __perf_event_read(void *info)
raw_spin_unlock(ctx-lock);
  }

-static inline u64 perf_event_count(struct perf_event *event)
+u64 perf_event_count(struct perf_event *event)


Likewise? Should the inlining be preserved?


  {
	if (event->pmu->count)
	return event->pmu->count(event);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 88a041a..9cf094f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -183,6 +183,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func


Btw, I'm wondering if the perf event map portions should actually better go

Re: [PATCH] packet: tpacket_snd(): fix signed/unsigned comparison

2015-07-28 Thread Daniel Borkmann

On 07/28/2015 12:57 PM, Alexander Drozdov wrote:

tpacket_fill_skb() can return a negative value (-errno) which
is stored in the tp_len variable. In that case the following
condition will be (but shouldn't be) true:

tp_len > dev->mtu + dev->hard_header_len

as dev->mtu and dev->hard_header_len are both unsigned.

That may lead to just returning an incorrect EMSGSIZE errno
to the user.

Signed-off-by: Alexander Drozdov al.droz...@gmail.com


Looks good to me, thanks!

Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [BUG] net/ipv4: inconsistent routing table

2015-08-05 Thread Daniel Borkmann

[ please cc netdev ]

On 08/05/2015 10:56 AM, Zang MingJie wrote:

Hi:

I found a bug when remove an ip address which is referenced by a routing entry.

step to reproduce:

ip li add type dummy
ip li set dummy0 up
ip ad add 10.0.0.1/24 dev dummy0
ip ad add 10.0.0.2/24 dev dummy0
ip ro add default via 10.0.0.2/24
ip ad del 10.0.0.2/24 dev dummy0

after deleting the secondary ip address, the routing entry still
pointing to 10.0.0.2

# ip ro
default via 10.0.0.2 dev dummy0
10.0.0.0/24 dev dummy0  proto kernel  scope link  src 10.0.0.1

but actually, kernel considers the default route is directly connected.

# ip ro get 1.1.1.1
1.1.1.1 dev dummy0  src 10.0.0.1
 cache





Re: [PATCH 3/6] test_bpf: test LD_ABS and LD_IND instructions on fragmented skbs.

2015-08-03 Thread Daniel Borkmann

On 08/03/2015 04:02 PM, Nicolas Schichan wrote:

These new tests exercise various load sizes and offsets crossing the
head/fragment boundary.

Signed-off-by: Nicolas Schichan nschic...@freebox.fr
Acked-by: Alexei Starovoitov a...@plumgrid.com


Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH 5/6] test_bpf: add more tests for LD_ABS and LD_IND.

2015-08-03 Thread Daniel Borkmann

On 08/03/2015 04:02 PM, Nicolas Schichan wrote:

This exerces the LD_ABS and LD_IND instructions for various sizes and
alignments. This also checks that X when used as an offset to a
BPF_IND instruction first in a filter is correctly set to 0.

Signed-off-by: Nicolas Schichan nschic...@freebox.fr
Acked-by: Alexei Starovoitov a...@plumgrid.com


Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH 6/6] test_bpf: add tests checking that JIT/interpreter sets A and X to 0.

2015-08-03 Thread Daniel Borkmann

On 08/03/2015 04:02 PM, Nicolas Schichan wrote:

It is mandatory for the JIT or interpreter to reset the A and X
registers to 0 before running the filter. Check that it is the case on
various ALU and JMP instructions.

Signed-off-by: Nicolas Schichan nschic...@freebox.fr
Acked-by: Alexei Starovoitov a...@plumgrid.com


Acked-by: Daniel Borkmann dan...@iogearbox.net


Re: [PATCH 4/6] test_bpf: add module parameters to filter the tests to run.

2015-08-03 Thread Daniel Borkmann

On 08/03/2015 06:23 PM, Nicolas Schichan wrote:
...

Btw, for the range test in prepare_bpf_tests(), you could also reject
a negative lower bound index right there.


I thought it was better to have all the sanity checks grouped in
prepare_bpf_tests() (with the checking of the test_name and test_id parameters
nearby) ? Also a negative lower bound is meaning that no range has been set so
all tests should be run.


I just got a bit confused when loading test_range=-100,1 was not rejected, but
they do indeed all run in this case.

Thanks,
Daniel


Re: [PATCH 2/6] test_bpf: allow tests to specify an skb fragment.

2015-08-03 Thread Daniel Borkmann

On 08/03/2015 06:38 PM, Nicolas Schichan wrote:

On 08/03/2015 05:29 PM, Daniel Borkmann wrote:

On 08/03/2015 04:02 PM, Nicolas Schichan wrote:
We now have 286 tests, which is awesome!

Perhaps, we need to start thinking of a better test description method
soonish as the test_bpf.ko module grew to ~1.6M, i.e. whenever we add
to struct bpf_test, it adds memory overhead upon all test cases.


Indeed, test_bpf.ko is turning quite large (1.4M when compiled for ARM).

It looks like gzip is able to do wonders on the module though as I end up with
a 94.7K test_bpf.ko.gz file and if the modutils are compiled with
--enable-zlib, it will be gunziped automatically before being loaded to the
kernel.


I think it just contains a lot of zero blocks, which then compress nicely.


I think that marking tests[] array as __initdata will help with the runtime
memory use if someone forgets to rmmod the test_bpf module after a completely
successful run.


Can be done, too, yep. Do you want to send a patch? ;)


Re: [PATCH v4 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

2015-07-29 Thread Daniel Borkmann

On 07/28/2015 01:17 PM, Kaixu Xia wrote:

Previous patch v3 url:
https://lkml.org/lkml/2015/7/23/203

...

Kaixu Xia (3):
   bpf: Add new bpf map type to store the pointer to struct perf_event
   bpf: Implement function bpf_perf_event_read() that gets the selected
 hardware PMU counter
   samples/bpf: example of get selected PMU counter value

Wang Nan (1):
   bpf: Make the bpf_prog_array_map more generic


So kernel/bpf/ patches are usually going via netdev tree as main development
happens there and to avoid cross tree conflicts, etc. Thus, please Cc also
netdev in the next iterations.

Maybe when these patches are in a perfect shape at some point, perf folks
provide their Acked-by(s) to the series to give their consent, and then this
should go via net-next? Or will this be organized differently?


  include/linux/bpf.h|   8 ++-
  include/linux/perf_event.h |   5 +-
  include/uapi/linux/bpf.h   |   2 +
  kernel/bpf/arraymap.c  | 161 -
  kernel/bpf/helpers.c   |  36 ++
  kernel/bpf/syscall.c   |   4 +-
  kernel/bpf/verifier.c  |  15 +
  kernel/events/core.c   |  21 +-
  kernel/trace/bpf_trace.c   |   2 +
  samples/bpf/Makefile   |   4 ++
  samples/bpf/bpf_helpers.h  |   2 +
  samples/bpf/tracex6_kern.c |  26 
  samples/bpf/tracex6_user.c |  67 +++
  13 files changed, 316 insertions(+), 37 deletions(-)
  create mode 100644 samples/bpf/tracex6_kern.c
  create mode 100644 samples/bpf/tracex6_user.c





Re: [PATCH v2 net-next] bpf: s390: Fix build error caused by the struct bpf_array member name changed

2015-08-11 Thread Daniel Borkmann

On 08/11/2015 08:53 AM, Kaixu Xia wrote:

There is a build error that 'struct bpf_array' has no member
named 'prog' on s390. In commit 2a36f0b, the member 'prog' of
struct bpf_array is replaced by 'ptrs'. So this patch fixes it.

Signed-off-by: Kaixu Xia xiaka...@huawei.com


You were also asked to add a proper Fixes tag:

Fixes: 2a36f0b92eb6 (bpf: Make the bpf_prog_array_map more generic)

I guess this bug was reported by Fengguang as you have him in Cc?

If that should be the case, then please also add a Reported-by: tag
for his bug report.

Code looks good to me.


Re: [PATCH 26/31] net/sched: use kmemdup rather than duplicating its implementation

2015-08-07 Thread Daniel Borkmann

On 08/07/2015 09:59 AM, Andrzej Hajda wrote:

The patch was generated using fixed coccinelle semantic patch
scripts/coccinelle/api/memdup.cocci [1].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2014320

Signed-off-by: Andrzej Hajda a.ha...@samsung.com
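
For readers following along: the conversion such a memdup patch performs is
mechanical — an open-coded kmalloc()+memcpy() pair becomes a single
kmemdup() call. Since kmemdup() is kernel-only, here is a hypothetical
userspace analogue of the pattern, for illustration only:

```c
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for kmemdup(): allocate a buffer and copy the
 * source into it in one call. In the kernel, a GFP flags argument
 * would additionally be passed through to the allocator. */
void *memdup(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}
```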


Acked-by: Daniel Borkmann dan...@iogearbox.net

Not sure where the rest of this series went, but if you want this patch
to be routed via net-next tree (which I recommend, to avoid cross tree
conflicts), then you would need to send these patches separately, rebased
to that tree, and also mention [PATCH net-next XX/YY] in the subject.
  
Thanks,
Daniel


Re: Edited draft of bpf(2) man page for review/enhancement

2015-07-22 Thread Daniel Borkmann

On 07/22/2015 04:49 PM, Michael Kerrisk (man-pages) wrote:

Hi Daniel,

Sorry for the long delay in following up


No worries, eBPF is quite some material. ;)


On 05/27/2015 11:26 AM, Daniel Borkmann wrote:

On 05/27/2015 10:43 AM, Michael Kerrisk (man-pages) wrote:

Hello Alexei,

I took the draft 3 of the bpf(2) man page that you sent back in March
and did some substantial editing to clarify the language and add a
few technical details. Could you please check the revised  version
below, to ensure I did not inject any errors.

I also added a number of FIXMEs for pieces of the page that need
further work. Could you take a look at these and let me know your
thoughts, please.


That's great, thanks! Minor comments:

...

.TH BPF 2 2015-03-10 Linux Linux Programmer's Manual
.SH NAME
bpf - perform a command on an extended BPF map or program
.SH SYNOPSIS
.nf
.B #include linux/bpf.h
.sp
.BI int bpf(int cmd, union bpf_attr *attr, unsigned int size);

.SH DESCRIPTION
The
.BR bpf ()
system call performs a range of operations related to extended
Berkeley Packet Filters.
Extended BPF (or eBPF) is similar to
the original BPF (or classic BPF) used to filter network packets.
For both BPF and eBPF programs,
the kernel statically analyzes the programs before loading them,
in order to ensure that they cannot harm the running system.
.P
eBPF extends classic BPF in multiple ways including the ability to call
in-kernel helper functions (via the
.B BPF_CALL
opcode extension provided by eBPF)
and access shared data structures such as BPF maps.


I would perhaps emphasize that maps can be shared among in-kernel
eBPF programs, but also between kernel and user space.


This is covered later in the page, under the BPF maps subheading.
Maybe you missed that? (Or did you think it doesn't suffice?)


Okay, I presume you mean:

  Maps are a generic data structure for storage of different types
  and sharing data between the kernel and user-space programs.

Maybe, to emphasize both options a bit (not sure if it's better in
my words, though):

  Maps are a generic data structure for storage of different types
  and allow for sharing data among eBPF kernel programs, but also
  between kernel and user-space applications.


The programs can be written in a restricted C that is compiled into
.\ FIXME In the next line, what is a restricted C? Where does
.\   one get further information about it?


So far only from the kernel samples directory and for tc classifier
and action, from the tc man page and/or examples/bpf/ in the tc git
tree.


So, given that we are several weeks down the track, and things may have
changed, I'll re-ask the questions ;-) :

* Is this restricted C documented anywhere?


Not (yet) that I'm aware of. We were thinking that short-mid term
to polish the stuff that resides in the kernel documentation, that
is, Documentation/networking/filter.txt, to get it in a better
shape, which I presume, would also include a documentation on the
restricted C. So far, examples are provided in the tc-bpf man page
(see link below).

The set of available helper functions callable from eBPF resides
under (enum bpf_func_id):

  
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h


* Is the procedure for compiling this restricted C documented anywhere?
   (Yes, it's LLVM, but are the suitable pipelines/options documented
   somewhere?)


eBPF bytecode and executed on the in-kernel virtual machine or
just-in-time compiled into native code.
.SS Extended BPF Design/Architecture
.P
.\ FIXME In the following line, what does different data types mean?
.\   Are the values in a map not just blobs?


Sort of, currently, these blobs can have different sizes of keys
and values (you can even have structs as keys). For the map itself
they are treated as blob internally. However, recently, bpf tail call
got added where you can lookup another program from an array map and
call into it. Here, that particular type of map can only have entries
of type eBPF program fd. I think, if needed, adding a paragraph to
the tail call could be done as follow-up after we have an initial man
page in the tree included.


Okay -- I've added a FIXME placeholder for this, so we can revisit.


Okay.


BPF maps are a generic data structure for storage of different data types.
A user process can create multiple maps (with key/value-pairs being
opaque bytes of data) and access them via file descriptors.
BPF programs can access maps from inside the kernel in parallel.
It's up to the user process and BPF program to decide what they store
inside maps.
.P
BPF programs are similar to kernel modules.
They are loaded by the user
process and automatically unloaded when the process exits.


Generally that's true. Btw, in 4.1 kernel, tc(8) also got support for
eBPF classifier and actions, and here it's slightly different: in tc,
we load the programs, maps etc, and push down the eBPF program fd in
order to let the kernel hold

Re: [PATCH v2 0/5] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

2015-07-23 Thread Daniel Borkmann

On 07/22/2015 10:09 AM, Kaixu Xia wrote:

Previous patch v1 url:
https://lkml.org/lkml/2015/7/17/287


[ Sorry to chime in late, just noticed this series now as I wasn't in Cc for
  the core BPF changes. More below ... ]


This patchset allows users to read PMU events in the following way:
  1. Open the PMU using perf_event_open() (for each CPU or for
 each process he/she'd like to watch);
  2. Create a BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map;
  3. Insert FDs into the map with some key-value mapping scheme
 (i.e. cpuid - event on that CPU);
  4. Load and attach eBPF programs as usual;
  5. In eBPF program, get the perf_event_map_fd and key (i.e.
 cpuid get from bpf_get_smp_processor_id()) then use
 bpf_perf_event_read() to read from it.
  6. Do anything he/she wants.

changes in V2:
  - put atomic_long_inc_not_zero() between fdget() and fdput();
  - limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
  - Only read the event counter on current CPU or on current
process;
  - add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
pointer to the struct perf_event;
  - according to the perf_event_map_fd and key, the function
bpf_perf_event_read() can get the Hardware PMU counter value;

Patch 5/5 is a simple example and shows how to use this new eBPF
programs ability. The PMU counter data can be found in
/sys/kernel/debug/tracing/trace(trace_pipe).(the cycles PMU
value when 'kprobe/sys_write' sampling)

   $ cat /sys/kernel/debug/tracing/trace_pipe
   $ ./tracex6
...
  cat-677   [002] d..1   210.299270: : bpf count: CPU-2  5316659
  cat-677   [002] d..1   210.299316: : bpf count: CPU-2  5378639
  cat-677   [002] d..1   210.299362: : bpf count: CPU-2  5440654
  cat-677   [002] d..1   210.299408: : bpf count: CPU-2  5503211
  cat-677   [002] d..1   210.299454: : bpf count: CPU-2  5565438
  cat-677   [002] d..1   210.299500: : bpf count: CPU-2  5627433
  cat-677   [002] d..1   210.299547: : bpf count: CPU-2  5690033
  cat-677   [002] d..1   210.299593: : bpf count: CPU-2  5752184
  cat-677   [002] d..1   210.299639: : bpf count: CPU-2  5814543
...-548   [009] d..1   210.299667: : bpf count: CPU-9  605418074
...-548   [009] d..1   210.299692: : bpf count: CPU-9  605452692
  cat-677   [002] d..1   210.299700: : bpf count: CPU-2  5896319
...-548   [009] d..1   210.299710: : bpf count: CPU-9  605477824
...-548   [009] d..1   210.299728: : bpf count: CPU-9  605501726
...-548   [009] d..1   210.299745: : bpf count: CPU-9  605525279
...-548   [009] d..1   210.299762: : bpf count: CPU-9  605547817
...-548   [009] d..1   210.299778: : bpf count: CPU-9  605570433
...-548   [009] d..1   210.299795: : bpf count: CPU-9  605592743
...

The detail of patches is as follow:

Patch 1/5 introduces a new bpf map type. This map only stores the
pointer to struct perf_event;

Patch 2/5 introduces a map_traverse_elem() function for further use;

Patch 3/5 converts event file descriptors into perf_event structures when
adding a new element to the map;


So far all the map backends are of generic nature, knowing absolutely nothing
about a particular consumer/subsystem of eBPF (tc, socket filters, etc). The
tail call is a bit special, but nevertheless generic for each user and [very]
useful, so it makes sense to inherit from the array map and move the code there.

I don't really like that we start to add new _special_-cased maps here into
the eBPF core code, it seems quite hacky. :( From your rather terse commit
description where you introduce the maps, I failed to see a detailed
elaboration on this, i.e. why it cannot be abstracted any differently?


Patch 4/5 implements function bpf_perf_event_read() that gets the selected
hardware PMU counter;

Patch 5/5 give a simple example.

Kaixu Xia (5):
   bpf: Add new bpf map type to store the pointer to struct perf_event
   bpf: Add function map->ops->map_traverse_elem() to traverse map elems
   bpf: Save the pointer to struct perf_event to map
   bpf: Implement function bpf_perf_event_read() that gets the selected
 hardware PMU counter
   samples/bpf: example of get selected PMU counter value

  include/linux/bpf.h|   6 +++
  include/linux/perf_event.h |   5 ++-
  include/uapi/linux/bpf.h   |   3 ++
  kernel/bpf/arraymap.c  | 110 +
  kernel/bpf/helpers.c   |  42 +
  kernel/bpf/syscall.c   |  26 +++
  kernel/events/core.c   |  30 -
  kernel/trace/bpf_trace.c   |   2 +
  samples/bpf/Makefile   |   4 ++
  samples/bpf/bpf_helpers.h  |   2 +
  samples/bpf/tracex6_kern.c |  27 +++
  samples/bpf/tracex6_user.c |  67 +++
  12 files changed, 321 insertions(+), 3 deletions(-)
  create mode 100644 

Re: Draft 3 of bpf(2) man page for review

2015-07-23 Thread Daniel Borkmann

On 07/23/2015 01:23 PM, Michael Kerrisk (man-pages) wrote:
...

Btw, a user obviously can close() the map fds if he
wants to, but ultimately they're freed when the program unloads.


Okay. (Not sure if you meant that something should be added to the page.)


I think not necessary.

[...]

The attributes key_size and value_size will be used by the


attribute's?


Nope. But I changed this to The key_size and value_size attributes will be,
which may read clearer.


Sorry, true, I was a bit confused. :)

[...]

The type __u64 is kernel internal, so if there's no strict reason to use it,
we should just use what's provided by stdint.h.


Agreed. Done. (By the way, what about all the __u32 and __u64 elements in the
bpf_attr union?)


I wouldn't change the bpf_attr from the uapi.

Just the provided example code here, I presume people might copy from here when
they build their own library and in userspace uint64_t seems to be more natural.

[...]

*  map_update_elem()  replaces  elements  in a non-atomic
   fashion; for atomic updates, a hash-table map should be
   used instead.


This point here is most important, i.e. to not have false user expectations.
Maybe it's also worth mentioning that when you have a value_size of
sizeof(long), you can however use the __sync_fetch_and_add() atomic builtin
from the LLVM backend.


I think I'll leave out that detail for the moment.


Ok, I guess we could revisit/clarify that at a later point in time. I'd add
a TODO comment to the source or the like, as this also is related to the 2nd
below use case (aggregation/accounting), where an array is typically used.


Among the uses for array maps are the following:

*  As global eBPF variables: an array of 1 element whose
   key is (index) 0 and where the value is a collection of
   'global'  variables which eBPF programs can use to keep
   state between events.

*  Aggregation of tracing events into a fixed set of buck‐
   ets.


[...]

 *  license is a license string, which must be GPL  compatible  to
call helper functions marked gpl_only.


Not strictly. So here, the same rules apply as with kernel modules. I.e. what
the kernel checks for are the following license strings:

static inline int license_is_gpl_compatible(const char *license)
{
return (strcmp(license, GPL) == 0
|| strcmp(license, GPL v2) == 0
|| strcmp(license, GPL and additional rights) == 0
|| strcmp(license, Dual BSD/GPL) == 0
|| strcmp(license, Dual MIT/GPL) == 0
|| strcmp(license, Dual MPL/GPL) == 0);
}

With any of them, the eBPF program is declared GPL compatible. Maybe of interest
for those that want to use dual licensing of some sort.


So, I'm a little unclear here. What text do you suggest for the page?


Maybe we should mention in addition that the same licensing rules apply as
in case with kernel modules, so also dual licenses could be used.


 *  log_buf is a pointer to a caller-allocated buffer in which the
in-kernel verifier can store the verification log.   This  log
is  a  multi-line  string  that  can be checked by the program
author in order to understand how the  verifier  came  to  the
conclusion  that the BPF program is unsafe.  The format of the
output can change at any time as the verifier evolves.

 *  log_size size of the buffer pointed to  by  log_buf.   If  the
size  of  the buffer is not large enough to store all verifier
messages, -1 is returned and errno is set to ENOSPC.

 *  log_level verbosity level of the verifier.  A  value  of  zero
means that the verifier will not provide a log.


Note that the log buffer is optional as mentioned here log_level = 0. The
above example code of bpf_prog_load() suggests that it always needs to be
provided.

I once ran indeed into an issue where the program itself was correct, but
it got rejected by the kernel, because my log buffer size was too small, so
in tc, we now have it larger as bpf_log_buf[65536] ...


So, I'm not clear. Do you mean that some piece of text here in the page
should be changed? If so, could elaborate?


I'd maybe only mention in addition that in log_level=0 case, we also must not
provide a log_buf and log_size, otherwise we get EINVAL.

[...]

I had to read this twice. ;) Maybe this needs to be reworded slightly.

It just means that depending on the program type that the author selects,
you might end up with a different subset of helper functions, and a
different program input/context. For example tracing does not have the
exact same helpers as socket filters (it might have some that can be used
by both). Also, the eBPF program input (context) for socket filters is a
network 

Re: Draft 3 of bpf(2) man page for review

2015-07-23 Thread Daniel Borkmann

On 07/23/2015 03:36 PM, Michael Kerrisk (man-pages) wrote:
...

Ok, I guess we could revisit/clarify that at a later point in time. I'd add
a TODO comment to the source or the like, as this also is related to the 2nd
below use case (aggregation/accounting), where an array is typically used.


Okay. FIXME added.


Great.

[...]

I'd maybe only mention in addition that in log_level=0 case, we also must not
provide a log_buf and log_size, otherwise we get EINVAL.


I changed the text to:

*  log_level  verbosity  level  of the verifier.  A value of zero
   means that the verifier will not provide a log; in this  case,
   log_buf must be a NULL pointer, and log_size must be zero.


Ok.

[...]

Thanks for clearing that up -- I added the following sentence

 JIT compiler for eBPF is currently available for the x86-64, arm64,
 and s390 architectures.

Okay?


Yep.

Thanks,
Daniel


Re: Draft 3 of bpf(2) man page for review

2015-07-23 Thread Daniel Borkmann

Hi Michael,

looks good already, a couple of comments inline, on top of Alexei's feedback:

On 07/22/2015 10:10 PM, Michael Kerrisk (man-pages) wrote:
...

NAME
bpf - perform a command on an extended eBPF map or program


'extended eBPF' should perhaps just say 'eBPF' or 'extended BPF' (the
'e' itself stands for 'extended')


SYNOPSIS
#include linux/bpf.h

int bpf(int cmd, union bpf_attr *attr, unsigned int size);

DESCRIPTION
The  bpf()  system call performs a range of operations related to
extended Berkeley Packet Filters.  Extended BPF (or eBPF) is sim‐
ilar  to  the original (classic) BPF (cBPF) used to filter net‐
work packets.  For both cBPF and eBPF programs, the kernel stati‐
cally  analyzes  the  programs  before  loading them, in order to
ensure that they cannot harm the running system.

eBPF extends cBPF in multiple ways, including the ability to call
a  fixed  set  of  in-kernel  helper  functions (via the BPF_CALL
opcode extension provided by eBPF) and access shared data  struc‐
tures such as eBPF maps.

Extended BPF Design/Architecture
BPF  maps  are  a generic data structure for storage of different


Maybe s/BPF/eBPF/ as we introduced its definition above and used 'eBPF maps'
just in the previous sentence. (I would from the onwards just use either eBPF
or cBPF, makes it probably more clear).


data types.  A  user  process  can  create  multiple  maps  (with
key/value-pairs  being  opaque bytes of data) and access them via
file descriptors.  Different eBPF programs  can  access  the  same
maps  in  parallel.  It's up to the user process and eBPF program
to decide what they store inside maps.

eBPF programs are similar to kernel modules.  They are loaded  by
the  user  process  and  automatically  unloaded when the process
exits.  Each program is a set of instructions that is safe to run


The 1st and 2nd sentence in that order/combination may sounds a bit weird.
Maybe I would just drop the first sentence? I would argue that there might
be a few similarities, but more differences overall. So I guess we'd either
need to elaborate on the 1st sentence or just leave it out (could perhaps
be a FIXME comment to later on introduce a new section that elaborates on
both?).


until  its  completion.   An in-kernel verifier statically deter‐
mines that the eBPF program terminates and is  safe  to  execute.
During  verification,  the kernel increments reference counts for
each of the maps that the eBPF program uses, so that the selected
maps cannot be removed until the program is unloaded.


s/selected/attached/ ? Btw, a user obviously can close() the map fds if he
wants to, but ultimately they're freed when the program unloads.


eBPF  programs can be attached to different events.  These events
can be the arrival of network packets, tracing events,  classifi‐
cation  event  by  qdisc  (for  eBPF programs attached to a tc(8)
classifier), and other types that may be added in the future.   A


Maybe: classification events by network queuing disciplines


new event triggers execution of the eBPF program, which may store
information about the event in eBPF maps.  Beyond  storing  data,
eBPF programs may call a fixed set of in-kernel helper functions.


I think this was mentioned before, but ok.


The same eBPF program can be attached to multiple events and dif‐
ferent eBPF programs can access the same map:

        tracing     tracing     tracing     packet      packet
        event A     event B     event C     on eth0     on eth1
           |           |           |           |           |
           |           |           |           |           |
           --> tracing <--         tracing     socket      socket
                prog_1             prog_2      prog_3      prog_4
                |  |                  |           |
             |---  -----|  |----------|          map_3
           map_1      map_2


Maybe prog_4 example could also be: s/socket/tc ingress classifier/ ;)


Arguments
The  operation to be performed by the bpf() system call is deter‐
mined by the cmd argument.  Each operation takes an  accompanying
argument,  provided  via  attr,  which is a pointer to a union of
type bpf_attr (see below).  The size argument is the size of  the
union pointed to by attr.

The value provided in cmd is one of the following:

BPF_MAP_CREATE
   Create a map with and return a file descriptor that refers
   to the map.


'Create a map with and'


BPF_MAP_LOOKUP_ELEM
   Look up an element by key in a specified  map  and  return
   its value.

BPF_MAP_UPDATE_ELEM
   

Re: [PATCH v2] cgroup: net_cls: fix false-positive suspicious RCU usage

2015-07-22 Thread Daniel Borkmann

On 07/22/2015 11:23 AM, Konstantin Khlebnikov wrote:

In dev_queue_xmit(), net_cls is protected with rcu-bh.

...

Signed-off-by: Konstantin Khlebnikov khlebni...@yandex-team.ru
---
  net/core/netclassid_cgroup.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 1f2a126f4ffa..6441f47b1a8f 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -23,7 +23,8 @@ static inline struct cgroup_cls_state *css_cls_state(struct 
cgroup_subsys_state

  struct cgroup_cls_state *task_cls_state(struct task_struct *p)
  {
-   return css_cls_state(task_css(p, net_cls_cgrp_id));
+   return css_cls_state(task_css_check(p, net_cls_cgrp_id,
+   rcu_read_lock_bh_held()));


Did you also check that after your patch this doesn't trigger on ingress
either? There, this code path could be invoked under rcu_read_lock(). So,
perhaps you need to check for both.


  }
  EXPORT_SYMBOL_GPL(task_cls_state);



Re: [PATCH v2] cgroup: net_cls: fix false-positive suspicious RCU usage

2015-07-22 Thread Daniel Borkmann

On 07/22/2015 02:03 PM, Konstantin Khlebnikov wrote:

On 22.07.2015 14:56, Daniel Borkmann wrote:

On 07/22/2015 11:23 AM, Konstantin Khlebnikov wrote:

In dev_queue_xmit(), net_cls is protected with rcu-bh.

...

Signed-off-by: Konstantin Khlebnikov khlebni...@yandex-team.ru
---
  net/core/netclassid_cgroup.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 1f2a126f4ffa..6441f47b1a8f 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -23,7 +23,8 @@ static inline struct cgroup_cls_state
*css_cls_state(struct cgroup_subsys_state

  struct cgroup_cls_state *task_cls_state(struct task_struct *p)
  {
-return css_cls_state(task_css(p, net_cls_cgrp_id));
+return css_cls_state(task_css_check(p, net_cls_cgrp_id,
+rcu_read_lock_bh_held()));


Did you also check that after your patch this doesn't trigger on ingress
either? There, this code path could be invoked under rcu_read_lock(). So,
perhaps you need to check for both.


I haven't seen warnings with this patch. rcu_read_lock_held() is
checked inside rcu_dereference_check() inside task_css_check().


Thanks, agreed.


Re: Further work on bpf(2)

2015-07-24 Thread Daniel Borkmann

On 07/24/2015 11:11 AM, Michael Kerrisk (man-pages) wrote:

Hello Alexei, Daniel

Now that we have a version of the bpf(2) man page out the door, I'd
like to see it refined, since there are a number of missing pieces.
In particular:

* The page lacks documentation of the BPF_MAP_TYPE_PROG_ARRAY
   map type.

* The page needs more info on the BPF_PROG_TYPE_SOCKET_FILTER
   program type.

* The page lacks documentation of the BPF_PROG_TYPE_KPROBE
   program type.

 commit 2541517c32be2531e0da59dfd7efc1ce844644f5
 Author: Alexei Starovoitov a...@plumgrid.com
 Date:   Wed Mar 25 12:49:20 2015 -0700

 tracing, perf: Implement BPF programs attached to kprobes

* The page lacks documentation of the BPF_PROG_TYPE_SCHED_CLS
   program type.

 commit 96be4325f443dbbfeb37d2a157675ac0736531a1
 Author: Daniel Borkmann dan...@iogearbox.net
 Date:   Sun Mar 1 12:31:46 2015 +0100

 ebpf: add sched_cls_type and map it to sk_filter's verifier ops

 commit e2e9b6541dd4b31848079da80fe2253daaafb549
 Author: Daniel Borkmann dan...@iogearbox.net
 Date:   Sun Mar 1 12:31:48 2015 +0100

 cls_bpf: add initial eBPF support for programmable classifiers

* The page lacks documentation of the BPF_PROG_TYPE_SCHED_ACT
   program type.

 commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c
 Author: Daniel Borkmann dan...@iogearbox.net
 Date:   Fri Mar 20 15:11:11 2015 +0100

 ebpf: add sched_act_type and map it to sk_filter's verifier ops

 commit a8cb5f556b567974d75ea29c15181c445c541b1f
 Author: Daniel Borkmann dan...@iogearbox.net
 Date:   Fri Mar 20 15:11:12 2015 +0100

 act_bpf: add initial eBPF support for actions

Between the two of you, could you propose some text for each of the
above, ideally as man-page patches.

There are also various other pieces of the page that need work,
which I've marked with FIXMEs. Could you take a look at these please?


Ok, sure. We will sort this out and get back to you with a couple
of patches some time next week.

Thanks,
Daniel


Re: [rhashtable] WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:301 __debug_object_init()

2015-07-14 Thread Daniel Borkmann

On 07/14/2015 07:21 AM, Fengguang Wu wrote:

Sorry, please ignore -- this no longer happens in linux-next, so should be fine.


Seen this before, this fixed it back then:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/lib/test_rhashtable.c?id=b7f5e5c7f8cedf6b69c9702d448cdf78ffc45c7b


Re: [PATCH] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Daniel Borkmann

On 07/16/2015 02:15 PM, Denys Vlasenko wrote:

On 07/16/2015 12:41 PM, Thomas Graf wrote:

On 07/16/15 at 12:02pm, Denys Vlasenko wrote:

+/* jhash - hash an arbitrary key
+ * @k: sequence of bytes as key
+ * @length: the length of the key
+ * @initval: the previous hash, or an arbitrary value
+ *
+ * The generic version, hashes an arbitrary sequence of bytes.
+ * No alignment or length assumptions are made about the input key.
+ *
+ * Returns the hash value of the key. The result depends on endianness.
+ */
+u32 jhash(const void *key, u32 length, u32 initval)


Shouldn't these live in lib/jhash.c or something? Otherwise
everyone needs to depend on CONFIG_RHASHTABLE


There is no CONFIG_RHASHTABLE, rhashtable.c is compiled unconditionally.

I will send an alternative patch, which creates jhash.c;
apply whichever version you like most.


Please also Cc netdev as networking subsys is one of the main users
of jhash in various critical paths. Did you run e.g. some *_RR workload
benchmarks as well to make sure there's no regression?


Re: Edited draft of bpf(2) man page for review/enhancement

2015-07-21 Thread Daniel Borkmann

Hi Michael,

is there any update on the bpf(2) man-page since last time, wrt
having an initial version in your tree?

Thanks again,
Daniel


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-21 Thread Daniel Borkmann
 dev_t devt, const struct device_type *type,
+ void *drvdata, const char *fmt, ...);
+extern __printf(6, 7)
 struct device *device_create_with_groups(struct class *cls,
 struct device *parent, dev_t devt, void *drvdata,
 const struct attribute_group **groups,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f0..55e5aad 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,50 +63,17 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };

-/* BPF syscall commands */
+/* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
-   /* create a map with given type and attributes
-* fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
-* returns fd or negative error
-* map is deleted when fd is closed
-*/
BPF_MAP_CREATE,
-
-   /* lookup key in a given map
-* err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
-* Using attr->map_fd, attr->key, attr->value
-* returns zero and stores found elem into value
-* or negative error
-*/
BPF_MAP_LOOKUP_ELEM,
-
-   /* create or update key/value pair in a given map
-* err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-* Using attr->map_fd, attr->key, attr->value, attr->flags
-* returns zero or negative error
-*/
BPF_MAP_UPDATE_ELEM,
-
-   /* find and delete elem by key in a given map
-* err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
-* Using attr->map_fd, attr->key
-* returns zero or negative error
-*/
BPF_MAP_DELETE_ELEM,
-
-   /* lookup key in a given map and return next key
-* err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
-* Using attr->map_fd, attr->key, attr->next_key
-* returns zero and stores next key or negative error
-*/
BPF_MAP_GET_NEXT_KEY,
-
-   /* verify and load eBPF program
-* prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size)
-* Using attr->prog_type, attr->insns, attr->license
-* returns fd or negative error
-*/
BPF_PROG_LOAD,
+   BPF_DEV_CREATE,
+   BPF_DEV_DESTROY,
+   BPF_DEV_CONNECT,
 };

 enum bpf_map_type {
@@ -160,6 +127,10 @@ union bpf_attr {
__aligned_u64   log_buf;/* user supplied buffer */
	__u32   kern_version;   /* checked when prog_type=kprobe */
};
+
+   struct { /* anonymous struct used by BPF_DEV_* commands */
+   __u32   fd;
+   };
 } __attribute__((aligned(8)));

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e6983be..f871ca6 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,2 +1,4 @@
 obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o
+
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o device.o helpers.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8086471..334b1bd 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -92,6 +92,7 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)

fp->pages = size / PAGE_SIZE;
fp->aux = aux;
+   fp->aux->prog = fp;

return fp;
 }
@@ -116,6 +117,7 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,

memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
fp->pages = size / PAGE_SIZE;
+   fp->aux->prog = fp;

/* We keep fp->aux from fp_old around in the new
 * reallocated structure.
@@ -726,7 +728,6 @@ void bpf_prog_free(struct bpf_prog *fp)
struct bpf_prog_aux *aux = fp->aux;

	INIT_WORK(&aux->work, bpf_prog_free_deferred);
-	aux->prog = fp;
	schedule_work(&aux->work);
 }
 EXPORT_SYMBOL_GPL(bpf_prog_free);
diff --git a/kernel/bpf/device.c b/kernel/bpf/device.c
new file mode 100644
index 000..711c9a4
--- /dev/null
+++ b/kernel/bpf/device.c
@@ -0,0 +1,441 @@
+/*
+ * Special file backend for persistent eBPF maps and programs, used by
+ * bpf() system call.
+ *
+ * (C) 2015 Daniel Borkmann <dan...@iogearbox.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define BPF_MAX_DEVS   (1UL << MINORBITS)
+#define BPF_MODE_DEF   (S_IRUSR

Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-23 Thread Daniel Borkmann

On 10/22/2015 09:35 PM, Eric W. Biederman wrote:

Daniel Borkmann <dan...@iogearbox.net> writes:

On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
...

Just FYI:  Using a device for this kind of interface is pretty
much a non-starter as that quickly gets you into situations where
things do not work in containers.  If someone gets a version of device
namespaces past GregKH it might be up for discussion to use character
devices.


Okay, you are referring to this discussion here:

   http://thread.gmane.org/gmane.linux.kernel.containers/26760


That is a piece of it.  It is an old old discussion (which generally has
been handled poorly).  For the foreseeable future device namespaces have
a firm NACK by GregKH.  Which means that dynamic character device based
interfaces do not work in containers.  Which means if you are not
talking about physical hardware, character devices are a poor fit.


Yes, it breaks down with real namespace support. Reworking the set with
an improved version of the fs code is already in progress.

Thanks,
Daniel


Re: [PATCH v8] seccomp, ptrace: add support for dumping seccomp filters

2015-10-21 Thread Daniel Borkmann

On 10/21/2015 10:12 PM, Kees Cook wrote:

On Wed, Oct 21, 2015 at 12:15 PM, Tycho Andersen
 wrote:

Hi Oleg,

On Wed, Oct 21, 2015 at 08:51:46PM +0200, Oleg Nesterov wrote:

On 10/20, Tycho Andersen wrote:


Hi Kees, Oleg,

On Tue, Oct 20, 2015 at 10:20:24PM +0200, Oleg Nesterov wrote:


No, you can't do copy_to_user() from atomic context. You need to pin this
filter, drop the lock/irq, then copy_to_user().


Attached is a patch which addresses this.


Looks good to me, feel free to add my reviewed-by.


a couple of questions, I am just curious...


+long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
+   void __user *data)
+{
+   struct seccomp_filter *filter;
+   struct sock_fprog_kern *fprog;
+   long ret;
+   unsigned long count = 0;
+
+   if (!capable(CAP_SYS_ADMIN) ||
+   current->seccomp.mode != SECCOMP_MODE_DISABLED) {
+   return -EACCES;
+   }
+
+   spin_lock_irq(&task->sighand->siglock);
+   if (task->seccomp.mode != SECCOMP_MODE_FILTER) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   filter = task->seccomp.filter;
+   while (filter) {
+   filter = filter->prev;
+   count++;
+   }
+
+   if (filter_off >= count) {
+   ret = -ENOENT;
+   goto out;
+   }
+   count -= filter_off;
+
+   filter = task->seccomp.filter;
+   while (filter && count > 1) {
+   filter = filter->prev;
+   count--;
+   }
+
+   if (WARN_ON(count != 1)) {
+   /* The filter tree shouldn't shrink while we're using it. */
+   ret = -ENOENT;


Yes, but this looks a bit confusing. If we want this WARN_ON() check
because we are paranoid, then we should do

   WARN_ON(count != 1 || filter);


I guess you mean !filter here? We want filter to be non-null, because
we use it later.


And "while we're using it" look misleading, we rely on ->siglock.

Plus, if we could be shrunk, the additional check can't help anyway;
we could use the freed filter. So I don't really understand this check
and "filter != NULL" in the previous "while (filter && count > 1)".
Nevermind...


Just paranoia. You're right that we could get rid of WARN_ON and the
null check. I can send an updated patch to drop these bits if
necessary. Kees?


I like being really paranoid when dealing with the filters. Let's keep
the WARN_ON (with the "|| !filter" added) but maybe wrap it in
"unlikely"?


Btw, the conditions inside the WARN_ON() macro would already resolve
to unlikely().

Best,
Daniel


Re: [PATCH v8] seccomp, ptrace: add support for dumping seccomp filters

2015-10-26 Thread Daniel Borkmann

Hi Tycho,

On 10/27/2015 01:04 AM, Tycho Andersen wrote:

On Mon, Oct 26, 2015 at 04:07:01PM +0900, Kees Cook wrote:

On Mon, Oct 26, 2015 at 3:46 PM, Kees Cook  wrote:

Cool, thanks. I'll get this into my tree after kernel summit. Thanks
for suffering through all this Tycho!


Actually, since this depends on changes in net, could this get pulled
in from that direction?

Acked-by: Kees Cook 


Can we get the attached patch into net-next?


You need to make a fresh, formal submission of your patch to netdev,
not as an attachment (otherwise patchwork cannot properly pick it up).

Also, indicate the right tree in the subject as: [PATCH net-next] ...

Thanks,
Daniel


[PATCH net-next v2 1/5] bpf: abstract anon_inode_getfd invocations

2015-10-29 Thread Daniel Borkmann
Since we're going to use anon_inode_getfd() invocations in more than just
the current places, make a helper function for both, so that we only need
to pass a map/prog pointer to the helper itself in order to get a fd. The
new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().

Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 kernel/bpf/syscall.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 687dd6c..2b89ef0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -111,6 +111,12 @@ static const struct file_operations bpf_map_fops = {
.release = bpf_map_release,
 };
 
+static int bpf_map_new_fd(struct bpf_map *map)
+{
+   return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
+   O_RDWR | O_CLOEXEC);
+}
+
 /* helper macro to check that unused fields 'union bpf_attr' are zero */
 #define CHECK_ATTR(CMD) \
memchr_inv((void *) &attr->CMD##_LAST_FIELD + \
@@ -141,8 +147,7 @@ static int map_create(union bpf_attr *attr)
if (err)
goto free_map;
 
-   err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
-
+   err = bpf_map_new_fd(map);
if (err < 0)
/* failed to allocate fd */
goto free_map;
@@ -538,6 +543,12 @@ static const struct file_operations bpf_prog_fops = {
 .release = bpf_prog_release,
 };
 
+static int bpf_prog_new_fd(struct bpf_prog *prog)
+{
+   return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
+   O_RDWR | O_CLOEXEC);
+}
+
 static struct bpf_prog *get_prog(struct fd f)
 {
struct bpf_prog *prog;
@@ -647,7 +658,7 @@ static int bpf_prog_load(union bpf_attr *attr)
if (err < 0)
goto free_used_maps;
 
-   err = anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog, O_RDWR | O_CLOEXEC);
+   err = bpf_prog_new_fd(prog);
if (err < 0)
/* failed to allocate fd */
goto free_used_maps;
-- 
1.9.3



[PATCH net-next v2 3/5] bpf: consolidate bpf_prog_put{,_rcu} dismantle paths

2015-10-29 Thread Daniel Borkmann
We currently have duplicated cleanup code in bpf_prog_put() and
bpf_prog_put_rcu() cleanup paths. Back then we decided that it was
not worth it to make it a common helper called by both, but with
the recent addition of resource charging, we could have avoided
the fix in commit ac00737f4e81 ("bpf: Need to call bpf_prog_uncharge_memlock
from bpf_prog_put") if we would have had only a single, common path.
We can simplify it further by assigning aux->prog only once during
allocation time.

Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 kernel/bpf/core.c|  3 ++-
 kernel/bpf/syscall.c | 15 +--
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8086471..334b1bd 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -92,6 +92,7 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
 
fp->pages = size / PAGE_SIZE;
fp->aux = aux;
+   fp->aux->prog = fp;
 
return fp;
 }
@@ -116,6 +117,7 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
 
memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
fp->pages = size / PAGE_SIZE;
+   fp->aux->prog = fp;
 
/* We keep fp->aux from fp_old around in the new
 * reallocated structure.
@@ -726,7 +728,6 @@ void bpf_prog_free(struct bpf_prog *fp)
struct bpf_prog_aux *aux = fp->aux;
 
	INIT_WORK(&aux->work, bpf_prog_free_deferred);
-	aux->prog = fp;
	schedule_work(&aux->work);
 }
 EXPORT_SYMBOL_GPL(bpf_prog_free);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3fff82c..d7783cb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -513,7 +513,7 @@ static void bpf_prog_uncharge_memlock(struct bpf_prog *prog)
free_uid(user);
 }
 
-static void __prog_put_rcu(struct rcu_head *rcu)
+static void __prog_put_common(struct rcu_head *rcu)
 {
struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);
 
@@ -525,19 +525,14 @@ static void __prog_put_rcu(struct rcu_head *rcu)
 /* version of bpf_prog_put() that is called after a grace period */
 void bpf_prog_put_rcu(struct bpf_prog *prog)
 {
-   if (atomic_dec_and_test(&prog->aux->refcnt)) {
-   prog->aux->prog = prog;
-   call_rcu(&prog->aux->rcu, __prog_put_rcu);
-   }
+   if (atomic_dec_and_test(&prog->aux->refcnt))
+   call_rcu(&prog->aux->rcu, __prog_put_common);
 }
 
 void bpf_prog_put(struct bpf_prog *prog)
 {
-   if (atomic_dec_and_test(&prog->aux->refcnt)) {
-   free_used_maps(prog->aux);
-   bpf_prog_uncharge_memlock(prog);
-   bpf_prog_free(prog);
-   }
+   if (atomic_dec_and_test(&prog->aux->refcnt))
+   __prog_put_common(&prog->aux->rcu);
 }
 EXPORT_SYMBOL_GPL(bpf_prog_put);
 
-- 
1.9.3



[PATCH net-next v2 0/5] BPF updates

2015-10-29 Thread Daniel Borkmann
This set adds support for persistent maps/progs. Please see
individual patches for further details. A man-page update
to bpf(2) will be sent later on, as well as an iproute2 patch for
support in tc.

Thanks!

v1 -> v2:
  - Reworked most of patch 4 and 5
  - Rebased to latest net-next

Daniel Borkmann (5):
  bpf: abstract anon_inode_getfd invocations
  bpf: align and clean bpf_{map,prog}_get helpers
  bpf: consolidate bpf_prog_put{,_rcu} dismantle paths
  bpf: add support for persistent maps/progs
  bpf: add sample usages for persistent maps/progs

 include/linux/bpf.h|   9 +-
 include/uapi/linux/bpf.h   |  45 +-
 include/uapi/linux/magic.h |   1 +
 kernel/bpf/Makefile|   4 +-
 kernel/bpf/core.c  |   3 +-
 kernel/bpf/inode.c | 387 +
 kernel/bpf/syscall.c   |  95 +++
 kernel/bpf/verifier.c  |   3 +-
 samples/bpf/Makefile   |   3 +
 samples/bpf/fds_example.c  | 183 +
 samples/bpf/libbpf.c   |  19 +++
 samples/bpf/libbpf.h   |   3 +
 12 files changed, 683 insertions(+), 72 deletions(-)
 create mode 100644 kernel/bpf/inode.c
 create mode 100644 samples/bpf/fds_example.c

-- 
1.9.3



[PATCH net-next v2 4/5] bpf: add support for persistent maps/progs

2015-10-29 Thread Daniel Borkmann
This work adds support for "persistent" eBPF maps/programs. The term
"persistent" is to be understood that maps/programs have a facility
that lets them survive process termination. This is desired by various
eBPF subsystem users.

Just to name one example: tc classifier/action. Whenever tc parses
the ELF object, extracts and loads maps/progs into the kernel, these
file descriptors will be out of reach after the tc instance exits.
So a subsequent tc invocation won't be able to access/relocate on this
resource, and therefore maps cannot easily be shared, f.e. between the
ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be
instrumented in order to pass the created eBPF map/program file
descriptors to a third party management daemon through UDS' socket
passing facility. This makes it a bit complicated to deploy shared
eBPF maps or programs (programs f.e. for tail calls) among various
processes.

We've been brainstorming on how we could tackle this issue and various
approaches have been tried out so far; they can be read up on further
in the reference below.

The architecture we eventually ended up with is a minimal file system
that can hold map/prog objects. The file system is a per mount namespace
singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
mounts within a given namespace will point to the same instance. The
file system allows for creating a user-defined directory structure.
The objects for maps/progs are created/fetched through bpf(2) with
two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
along with a pathname is being passed to bpf(2) that in turn creates
(we call it eBPF object pinning) the file system nodes. Only the pathname
is being passed to bpf(2) for getting a new BPF file descriptor to an
existing node. The user can use that to access maps and progs later on,
through bpf(2). Removal of file system nodes is being managed through
normal VFS functions such as unlink(2), etc. The file system code is
kept to a very minimum and can be further extended later on.

The next step I'm working on is to add dump eBPF map/prog commands
to bpf(2), so that a specification from a given file descriptor can
be retrieved. This can be used by things like CRIU but also applications
can inspect the meta data after calling BPF_OBJ_GET.

Big thanks also to Alexei and Hannes who significantly contributed
to the design discussion that eventually led us to this
architecture.

Reference: https://lkml.org/lkml/2015/10/15/925
Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Signed-off-by: Alexei Starovoitov <a...@kernel.org>
Signed-off-by: Hannes Frederic Sowa <han...@stressinduktion.org>
---
 include/linux/bpf.h|   7 +
 include/uapi/linux/bpf.h   |  45 +-
 include/uapi/linux/magic.h |   1 +
 kernel/bpf/Makefile|   4 +-
 kernel/bpf/inode.c | 387 +
 kernel/bpf/syscall.c   |  30 +++-
 6 files changed, 433 insertions(+), 41 deletions(-)
 create mode 100644 kernel/bpf/inode.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0b5fb6a..de464e6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -167,11 +167,18 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
 void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_put_rcu(struct bpf_prog *prog);
 
+struct bpf_map *bpf_map_get(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
 void bpf_map_put(struct bpf_map *map);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+int bpf_map_new_fd(struct bpf_map *map);
+int bpf_prog_new_fd(struct bpf_prog *prog);
+
+int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
+int bpf_obj_get_user(const char __user *pathname);
+
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
 #else
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2e03242..9ea2d22 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,50 +63,16 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
-/* BPF syscall commands */
+/* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
-   /* create a map with given type and attributes
-* fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
-* returns fd or negative error
-* map is deleted when fd is closed
-*/
BPF_MAP_CREATE,
-
-   /* lookup key in a given map
-* err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
-* Using attr->map_fd, attr->key, attr->value
-* returns zero and stores found elem into value
-* or negative error
-*/
BPF_MAP_LOOKUP_ELEM,
-
-   /* create or update key/value pair in a given map
-* err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-* Using at

[PATCH net-next v2 5/5] bpf: add sample usages for persistent maps/progs

2015-10-29 Thread Daniel Borkmann
This patch adds a couple of stand-alone examples on how BPF_OBJ_PIN
and BPF_OBJ_GET commands can be used.

Example with maps:

  # ./fds_example -F /sys/fs/bpf/m -P -m -k 1 -v 42
  bpf: map fd:3 (Success)
  bpf: pin ret:(0,Success)
  bpf: fd:3 u->(1:42) ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):42 ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1 -v 24
  bpf: get fd:3 (Success)
  bpf: fd:3 u->(1:24) ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):24 ret:(0,Success)

  # ./fds_example -F /sys/fs/bpf/m2 -P -m
  bpf: map fd:3 (Success)
  bpf: pin ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m2 -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):0 ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m2 -G -m
  bpf: get fd:3 (Success)

Example with progs:

  # ./fds_example -F /sys/fs/bpf/p -P -p
  bpf: prog fd:3 (Success)
  bpf: pin ret:(0,Success)
  bpf sock:4 <- fd:3 attached ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/p -G -p
  bpf: get fd:3 (Success)
  bpf: sock:4 <- fd:3 attached ret:(0,Success)

  # ./fds_example -F /sys/fs/bpf/p2 -P -p -o ./sockex1_kern.o
  bpf: prog fd:5 (Success)
  bpf: pin ret:(0,Success)
  bpf: sock:3 <- fd:5 attached ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/p2 -G -p
  bpf: get fd:3 (Success)
  bpf: sock:4 <- fd:3 attached ret:(0,Success)

Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 samples/bpf/Makefile  |   3 +
 samples/bpf/fds_example.c | 183 ++
 samples/bpf/libbpf.c  |  19 +
 samples/bpf/libbpf.h  |   3 +
 4 files changed, 208 insertions(+)
 create mode 100644 samples/bpf/fds_example.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b305145..79b4596 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -4,6 +4,7 @@ obj- := dummy.o
 # List of programs to build
 hostprogs-y := test_verifier test_maps
 hostprogs-y += sock_example
+hostprogs-y += fds_example
 hostprogs-y += sockex1
 hostprogs-y += sockex2
 hostprogs-y += sockex3
@@ -19,6 +20,7 @@ hostprogs-y += lathist
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
+fds_example-objs := bpf_load.o libbpf.o fds_example.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
 sockex3-objs := bpf_load.o libbpf.o sockex3_user.o
@@ -49,6 +51,7 @@ always += lathist_kern.o
 HOSTCFLAGS += -I$(objtree)/usr/include
 
 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
+HOSTLOADLIBES_fds_example += -lelf
 HOSTLOADLIBES_sockex1 += -lelf
 HOSTLOADLIBES_sockex2 += -lelf
 HOSTLOADLIBES_sockex3 += -lelf
diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c
new file mode 100644
index 000..e2fd16c
--- /dev/null
+++ b/samples/bpf/fds_example.c
@@ -0,0 +1,183 @@
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "bpf_load.h"
+#include "libbpf.h"
+
+#define BPF_F_PIN  (1 << 0)
+#define BPF_F_GET  (1 << 1)
+#define BPF_F_PIN_GET  (BPF_F_PIN | BPF_F_GET)
+
+#define BPF_F_KEY  (1 << 2)
+#define BPF_F_VAL  (1 << 3)
+#define BPF_F_KEY_VAL  (BPF_F_KEY | BPF_F_VAL)
+
+#define BPF_M_UNSPEC   0
+#define BPF_M_MAP  1
+#define BPF_M_PROG 2
+
+static void usage(void)
+{
+   printf("Usage: fds_example [...]\n");
+   printf("   -F <file>   File to pin/get object\n");
+   printf("   -P  |- pin object\n");
+   printf("   -G  `- get object\n");
+   printf("   -m  eBPF map mode\n");
+   printf("   -k <key>    |- map key\n");
+   printf("   -v <value>  `- map value\n");
+   printf("   -p  eBPF prog mode\n");
+   printf("   -o <object> `- object file\n");
+   printf("   -h  Display this help.\n");
+}
+
+static int bpf_map_create(void)
+{
+   return bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(uint32_t),
+ sizeof(uint32_t), 1024);
+}
+
+static int bpf_prog_create(const char *object)
+{
+   static const struct bpf_insn insns[] = {
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   };
+
+   if (object) {
+   assert(!load_bpf_file((char *)object));
+   return prog_fd[0];
+   } else {
+   return bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER,
+insns, sizeof(insns), "GPL", 0);
+   }
+}
+
+static int bpf_do_map(const char *file, uint32_t flags, uint32_t key,
+

[PATCH net-next v2 2/5] bpf: align and clean bpf_{map,prog}_get helpers

2015-10-29 Thread Daniel Borkmann
Add a bpf_map_get() function that we're going to use later on and
align/clean the remaining helpers a bit so that we have them a bit
more consistent:

  - __bpf_map_get() and __bpf_prog_get() that both work on the fd
struct, check whether the descriptor is eBPF and return the
pointer to the map/prog stored in the private data.

Also, we can return f.file->private_data directly; the function
signature is documentation enough already.

  - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
call their respective __bpf_map_get()/__bpf_prog_get() variants,
and take a reference.

Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/bpf.h   |  2 +-
 kernel/bpf/syscall.c  | 41 +++--
 kernel/bpf/verifier.c |  3 +--
 3 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 75718fa..0b5fb6a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -167,7 +167,7 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
 void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_put_rcu(struct bpf_prog *prog);
 
-struct bpf_map *bpf_map_get(struct fd f);
+struct bpf_map *__bpf_map_get(struct fd f);
 void bpf_map_put(struct bpf_map *map);
 
 extern int sysctl_unprivileged_bpf_disabled;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2b89ef0..3fff82c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -162,19 +162,29 @@ free_map:
 /* if error is returned, fd is released.
  * On success caller should complete fd access with matching fdput()
  */
-struct bpf_map *bpf_map_get(struct fd f)
+struct bpf_map *__bpf_map_get(struct fd f)
 {
-   struct bpf_map *map;
-
if (!f.file)
return ERR_PTR(-EBADF);
-
	if (f.file->f_op != &bpf_map_fops) {
fdput(f);
return ERR_PTR(-EINVAL);
}
 
-   map = f.file->private_data;
+   return f.file->private_data;
+}
+
+static struct bpf_map *bpf_map_get(u32 ufd)
+{
+   struct fd f = fdget(ufd);
+   struct bpf_map *map;
+
+   map = __bpf_map_get(f);
+   if (IS_ERR(map))
+   return map;
+
+   atomic_inc(&map->refcnt);
+   fdput(f);
 
return map;
 }
@@ -202,7 +212,7 @@ static int map_lookup_elem(union bpf_attr *attr)
return -EINVAL;
 
f = fdget(ufd);
-   map = bpf_map_get(f);
+   map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
 
@@ -261,7 +271,7 @@ static int map_update_elem(union bpf_attr *attr)
return -EINVAL;
 
f = fdget(ufd);
-   map = bpf_map_get(f);
+   map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
 
@@ -314,7 +324,7 @@ static int map_delete_elem(union bpf_attr *attr)
return -EINVAL;
 
f = fdget(ufd);
-   map = bpf_map_get(f);
+   map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
 
@@ -355,7 +365,7 @@ static int map_get_next_key(union bpf_attr *attr)
return -EINVAL;
 
f = fdget(ufd);
-   map = bpf_map_get(f);
+   map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
 
@@ -549,21 +559,16 @@ static int bpf_prog_new_fd(struct bpf_prog *prog)
O_RDWR | O_CLOEXEC);
 }
 
-static struct bpf_prog *get_prog(struct fd f)
+static struct bpf_prog *__bpf_prog_get(struct fd f)
 {
-   struct bpf_prog *prog;
-
if (!f.file)
return ERR_PTR(-EBADF);
-
	if (f.file->f_op != &bpf_prog_fops) {
fdput(f);
return ERR_PTR(-EINVAL);
}
 
-   prog = f.file->private_data;
-
-   return prog;
+   return f.file->private_data;
 }
 
 /* called by sockets/tracing/seccomp before attaching program to an event
@@ -574,13 +579,13 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
struct fd f = fdget(ufd);
struct bpf_prog *prog;
 
-   prog = get_prog(f);
-
+   prog = __bpf_prog_get(f);
if (IS_ERR(prog))
return prog;
 
	atomic_inc(&prog->aux->refcnt);
fdput(f);
+
return prog;
 }
 EXPORT_SYMBOL_GPL(bpf_prog_get);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b56cf51..fdc88c5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1989,8 +1989,7 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
}
 
f = fdget(insn->imm);
-
-   map = bpf_map_get(f);
+   map = __bpf_map_get(f);
if (IS_ERR(map)) {
			verbose("fd %d is not pointing to valid bpf_map\n",
insn->imm);
-- 
1.9.3


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-22 Thread Daniel Borkmann

On 10/22/2015 12:44 AM, Alexei Starovoitov wrote:
...

all users) When you have to hack drivers/base/core.c to get there it
should have been a warning sign that something is wrong with
this cdev approach.


Hmm, you know, this had nothing to do with it, merely to save ~20 LoC
that I can do just as well inside BPF framework. No changes in driver
API needed.


I've read the discussion passively and my take away is that, frankly,
I think the differences are somewhat minor. Both architectures can
scale to what we need. Both will do the job. I'm slightly worried about
exposing uAPI as a FS; I think that didn't work too well for sysfs. It's
pretty much a define the format once and never touch it again kind of
deal.


It's even worse in cdev style since it piggybacks on sysfs.


I don't mind which approach we go with in the end, but this kind
of discussion is really tiring, and not going anywhere.

Let's just make a beer call, so we can hash out a way forward that works
for everyone.

On that note: cheers! ;)
Daniel


Re: [PATCH, REPORT] bpf_trace: build error without PERF_EVENTS

2015-11-10 Thread Daniel Borkmann

On 11/10/2015 06:14 PM, Alexei Starovoitov wrote:

On Tue, Nov 10, 2015 at 09:25:01AM -0500, Steven Rostedt wrote:

On Tue, 10 Nov 2015 14:31:38 +0100
Daniel Borkmann <dan...@iogearbox.net> wrote:


On 11/10/2015 01:55 PM, Arnd Bergmann wrote:

In my ARM randconfig tests, I'm getting a build error for
newly added code in bpf_perf_event_read and bpf_perf_event_output
whenever CONFIG_PERF_EVENTS is disabled:

kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
if (event->oncpu != smp_processor_id() ||
   ^
kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
event->pmu->count)

This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT
is disabled. I'm not sure if that is a configuration we care
about; otherwise we could prevent this case from occurring by
adding Kconfig dependencies.


I think that seems better than spreading #if IS_ENABLEDs into the code.
Probably enough to add a 'depends on PERF_EVENTS' to config BPF_EVENTS,
so it's also explicitly documented.



So just do the following then?

-- Steve

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8d6363f42169..f5aecff2d243 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -434,7 +434,7 @@ config UPROBE_EVENT

  config BPF_EVENTS
depends on BPF_SYSCALL
-   depends on KPROBE_EVENT || UPROBE_EVENT
+   depends on KPROBE_EVENT && UPROBE_EVENT


yeah that's definitely cleaner and avoids ifdef creep in the future.


Agreed, that's better.


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Daniel Borkmann

On 11/11/2015 11:24 AM, Will Deacon wrote:

On Wed, Nov 11, 2015 at 09:49:48AM +0100, Arnd Bergmann wrote:

On Tuesday 10 November 2015 18:52:45 Z Lim wrote:

On Tue, Nov 10, 2015 at 4:42 PM, Alexei Starovoitov
 wrote:

On Tue, Nov 10, 2015 at 04:26:02PM -0800, Shi, Yang wrote:

On 11/10/2015 4:08 PM, Eric Dumazet wrote:

On Tue, 2015-11-10 at 14:41 -0800, Yang Shi wrote:

aarch64 doesn't have native support for XADD instruction, implement it by
the below instruction sequence:


aarch64 supports atomic add in ARMv8.1.
For ARMv8(.0), please consider using LDXR/STXR sequence.


Is it worth optimizing for the 8.1 case? It would add a bit of complexity
to make the code depend on the CPU feature, but it's certainly doable.


What's the atomicity required for? Put another way, what are we racing
with (I thought bpf was single-threaded)? Do we need to worry about
memory barriers?

Apologies if these are stupid questions, but all I could find was
samples/bpf/sock_example.c and it didn't help much :(


The equivalent code, more readable in restricted C syntax (that can be
compiled by llvm) can be found in samples/bpf/sockex1_kern.c. So the
built-in __sync_fetch_and_add() will be translated into a BPF_XADD
insn variant.

What you can race against is that an eBPF map can be _shared_ by
multiple eBPF programs that are attached somewhere in the system, and
they could all update a particular entry/counter from the map at the
same time.
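That sharing is exactly the situation __sync_fetch_and_add() addresses; a minimal userspace C sketch of the same race (hypothetical helper names, plain pthreads rather than eBPF code):

```c
#include <pthread.h>

static unsigned long long counter; /* stands in for a shared map value */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++)
		__sync_fetch_and_add(&counter, 1); /* atomic, like BPF_XADD */
	return NULL;
}

/* Two updaters hammer the same counter, as two eBPF programs might
 * hammer the same map element; with the atomic add the total is exact,
 * whereas a plain counter++ would lose updates. */
static unsigned long long run_race(void)
{
	pthread_t t1, t2;

	counter = 0;
	pthread_create(&t1, NULL, worker, NULL);
	pthread_create(&t2, NULL, worker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return counter;
}
```

Compile with -pthread; this only illustrates why the map update must be atomic, not how the kernel implements BPF_XADD.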

Best,
Daniel


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Daniel Borkmann

On 11/11/2015 01:58 PM, Peter Zijlstra wrote:

On Wed, Nov 11, 2015 at 12:38:31PM +, Will Deacon wrote:

Hmm, gcc doesn't have an eBPF compiler backend, so this won't work on
gcc at all. The eBPF backend in LLVM recognizes the __sync_fetch_and_add()
keyword and maps that to a BPF_XADD version (BPF_W or BPF_DW). In the
interpreter (__bpf_prog_run()), as Eric mentioned, this maps to atomic_add()
and atomic64_add(), respectively. So the struct bpf_insn prog[] you saw
from sock_example.c can be regarded as one possible equivalent program
section output from the compiler.


Ok, so if I understand you correctly, then __sync_fetch_and_add() has
different semantics depending on the backend target. That seems counter
to the LLVM atomics Documentation:

   http://llvm.org/docs/Atomics.html

which specifically calls out the __sync_* primitives as being
sequentially-consistent and requiring barriers on ARM (which isn't the
case for atomic[64]_add in the kernel).

If we re-use the __sync_* naming scheme in the source language, I don't
think we can overlay our own semantics in the backend. The
__sync_fetch_and_add primitive is also expected to return the old value,
which doesn't appear to be the case for BPF_XADD.


Yikes. That's double fail. Please don't do this.

If you use the __sync stuff (and I agree with Will, you should not) it
really _SHOULD_ be sequentially consistent, which means full barriers
all over the place.

And if you name something XADD (exchange and add, or fetch-add) then it
had better return the previous value.

atomic*_add() does neither.


unsigned int ui;
unsigned long long ull;

void foo(void)
{
  (void) __sync_fetch_and_add(&ui, 1);
  (void) __sync_fetch_and_add(&ull, 1);
}

So clang front-end translates this snippet into intermediate
representation of ...

clang test.c -S -emit-llvm -o -
[...]
define void @foo() #0 {
  %1 = atomicrmw add i32* @ui, i32 1 seq_cst
  %2 = atomicrmw add i64* @ull, i64 1 seq_cst
  ret void
}
[...]

... which, if I see this correctly, then maps atomicrmw add {i32,i64}
in the BPF target into BPF_XADD as mentioned:

// Atomics
class XADD<bits<2> SizeOp, string OpcodeStr, PatFrag OpNode>
: InstBPF<(outs GPR:$dst), (ins MEMri:$addr, GPR:$val),
  !strconcat(OpcodeStr, "\t$dst, $addr, $val"),
  [(set GPR:$dst, (OpNode ADDRri:$addr, GPR:$val))]> {
  bits<3> mode;
  bits<2> size;
  bits<4> src;
  bits<20> addr;

  let Inst{63-61} = mode;
  let Inst{60-59} = size;
  let Inst{51-48} = addr{19-16}; // base reg
  let Inst{55-52} = src;
  let Inst{47-32} = addr{15-0}; // offset

  let mode = 6; // BPF_XADD
  let size = SizeOp;
  let BPFClass = 3; // BPF_STX
}

let Constraints = "$dst = $val" in {
def XADD32 : XADD<0, "xadd32", atomic_load_add_32>;
def XADD64 : XADD<3, "xadd64", atomic_load_add_64>;
// undefined def XADD16 : XADD<1, "xadd16", atomic_load_add_16>;
// undefined def XADD8  : XADD<2, "xadd8", atomic_load_add_8>;
}

I played a bit around with eBPF code to assign the __sync_fetch_and_add()
return value to a var and dump it to trace pipe, or use it as return code.
llvm compiles it (with the result assignment) and it looks like:

[...]
206: (b7) r3 = 3
207: (db) lock *(u64 *)(r0 +0) += r3
208: (bf) r1 = r10
209: (07) r1 += -16
210: (b7) r2 = 10
211: (85) call 6 // r3 dumped here
[...]

[...]
206: (b7) r5 = 3
207: (db) lock *(u64 *)(r0 +0) += r5
208: (bf) r1 = r10
209: (07) r1 += -16
210: (b7) r2 = 10
211: (b7) r3 = 43
212: (b7) r4 = 42
213: (85) call 6 // r5 dumped here
[...]

[...]
11: (b7) r0 = 3
12: (db) lock *(u64 *)(r1 +0) += r0
13: (95) exit // r0 returned here
[...]

What it seems is that we 'get back' the value (== 3 here in r3, r5, r0)
that we're adding, at least that's what seems to be generated wrt
register assignments. Hmm, the semantic differences of bpf target
should be documented somewhere for people writing eBPF programs to
be aware of.
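For comparison, on ordinary targets the builtin returns the value the location held before the add, which is precisely what the generated BPF code above does not preserve; a minimal userspace C sketch:

```c
/* __sync_fetch_and_add() is specified to return the *old* value of
 * the target; BPF_XADD has no destination for it, so that value is
 * discarded and what leaks through is the addend, as seen above. */
static unsigned long long fetch_add_demo(void)
{
	unsigned long long v = 40;
	unsigned long long old = __sync_fetch_and_add(&v, 2);

	/* old is 40 (the pre-add value), v is now 42 */
	return old * 100 + v;
}
```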

Best,
Daniel


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Daniel Borkmann

On 11/11/2015 12:58 PM, Will Deacon wrote:

On Wed, Nov 11, 2015 at 11:42:11AM +0100, Daniel Borkmann wrote:

On 11/11/2015 11:24 AM, Will Deacon wrote:

On Wed, Nov 11, 2015 at 09:49:48AM +0100, Arnd Bergmann wrote:

On Tuesday 10 November 2015 18:52:45 Z Lim wrote:

On Tue, Nov 10, 2015 at 4:42 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:

On Tue, Nov 10, 2015 at 04:26:02PM -0800, Shi, Yang wrote:

On 11/10/2015 4:08 PM, Eric Dumazet wrote:

On Tue, 2015-11-10 at 14:41 -0800, Yang Shi wrote:

aarch64 doesn't have native support for XADD instruction, implement it by
the below instruction sequence:


aarch64 supports atomic add in ARMv8.1.
For ARMv8(.0), please consider using LDXR/STXR sequence.


Is it worth optimizing for the 8.1 case? It would add a bit of complexity
to make the code depend on the CPU feature, but it's certainly doable.


What's the atomicity required for? Put another way, what are we racing
with (I thought bpf was single-threaded)? Do we need to worry about
memory barriers?

Apologies if these are stupid questions, but all I could find was
samples/bpf/sock_example.c and it didn't help much :(


The equivalent code, more readable in restricted C syntax (that can be
compiled by llvm) can be found in samples/bpf/sockex1_kern.c. So the
built-in __sync_fetch_and_add() will be translated into a BPF_XADD
insn variant.


Yikes, so the memory-model for BPF is based around the deprecated GCC
__sync builtins, that inherit their semantics from ia64? Any reason not
to use the C11-compatible __atomic builtins[1] as a base?


Hmm, gcc doesn't have an eBPF compiler backend, so this won't work on
gcc at all. The eBPF backend in LLVM recognizes the __sync_fetch_and_add()
keyword and maps that to a BPF_XADD version (BPF_W or BPF_DW). In the
interpreter (__bpf_prog_run()), as Eric mentioned, this maps to atomic_add()
and atomic64_add(), respectively. So the struct bpf_insn prog[] you saw
from sock_example.c can be regarded as one possible equivalent program
section output from the compiler.


What you can race against is that an eBPF map can be _shared_ by
multiple eBPF programs that are attached somewhere in the system, and
they could all update a particular entry/counter from the map at the
same time.


Ok, so it does sound like eBPF needs to define/choose a memory-model and
I worry that riding on the back of __sync isn't necessarily the right
thing to do, particularly as its fallen out of favour with the compiler
folks. On weakly-ordered architectures, it's also going to result in
heavy-weight barriers for all atomic operations.

Will

[1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html



Re: [PATCH] tools/net: Use include/uapi with __EXPORTED_HEADERS__

2015-11-12 Thread Daniel Borkmann

On 11/11/2015 11:24 PM, Kamal Mostafa wrote:

Use the local uapi headers to keep in sync with "recently" added #define's
(e.g. SKF_AD_VLAN_TPID).  Refactored CFLAGS, and bpf_asm doesn't need -I.

Fixes: 3f356385e8a449e1d7cfc6b6f8d634ac4f5581a0
Signed-off-by: Kamal Mostafa <ka...@canonical.com>


Acked-by: Daniel Borkmann <dan...@iogearbox.net>


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Daniel Borkmann

On 11/11/2015 07:31 PM, Peter Zijlstra wrote:

On Wed, Nov 11, 2015 at 10:11:33AM -0800, Alexei Starovoitov wrote:

On Wed, Nov 11, 2015 at 06:57:41PM +0100, Peter Zijlstra wrote:

On Wed, Nov 11, 2015 at 12:35:48PM -0500, David Miller wrote:

From: Alexei Starovoitov 
Date: Wed, 11 Nov 2015 09:27:00 -0800


BPF_XADD == atomic_add() in kernel. period.
we are not going to deprecate it or introduce something else.


Agreed, it makes no sense to try and tie C99 or whatever atomic
semantics to something that is already clearly defined to have
exactly kernel atomic_add() semantics.


Dave, this really doesn't make any sense to me. __sync primitives have
well defined semantics and (e)BPF is violating this.


bpf_xadd was never meant to be a __sync_fetch_and_add equivalent.
From day one it was meant to be atomic_add() as the kernel does it.
I did piggyback on __sync in the llvm backend because it was the quick
and dirty way to move forward.
In retrospect I should have introduced a clean intrinsic for that instead,
but it's not too late to do it now. user space we can change at any time
unlike kernel.


I would argue that breaking userspace (language in this case) is equally
bad. Programs that used to work will now no longer work.


Well, on that note, it's not like you just change the target to bpf in your
Makefile and can compile (& load into the kernel) anything you want with it.
You do have to write small, restricted programs from scratch for a specific
use-case with the limited set of helper functions and intrinsics that are
available from the kernel. So I don't think that "Programs that used to work
will now no longer work." holds if you regard it as such.


Furthermore, the fetch_and_add (or XADD) name has well defined
semantics, which (e)BPF also violates.


bpf_xadd also was never meant to be a 'fetch'. It was a void return from the beginning.


Then why the 'X'? The XADD name, does and always has meant: eXchange-ADD,
this means it must have a return value.

You using the XADD name for something that is not in fact XADD is just
wrong.


Atomicity is hard enough as it is, backends giving random interpretations
to them isn't helping anybody.


no randomness.


You mean every other backend translating __sync_fetch_and_add()
differently than you isn't random on your part?


bpf_xadd == atomic_add() in kernel.
imo that is the simplest and cleanest interpretation one can have, no?


Wrong though, if you'd named it BPF_ADD, sure, XADD, not so much. That
is 'randomly' co-opting something that has well defined meaning and
semantics with something else.


It also baffles me that Alexei is seemingly unwilling to change/rev the
(e)BPF instructions, which would be invisible to the regular user, he
does want to change the language itself, which will impact all
'scripts'.


well, we cannot change it in kernel because it's ABI.


You can always rev it. Introduce a new set, and wait for users of the
old set to die, then remove it. We do that all the time with Linux ABI.


I'm not against adding new insns. We definitely can, but let's figure out why?
Is anything broken? No.


Yes, __sync_fetch_and_add() is broken when pulled through the eBPF
backend.


So what new insns make sense?


Depends a bit on how fancy you want to go. If you want to support weakly
ordered architectures at full speed you'll need more (and more
complexity) than if you decide to not go that way.

The simplest option would be a fully ordered compare-and-swap operation.
That is enough to implement everything else (at a cost). The other
extreme is a weak ll/sc with an optimizer pass recognising various forms
to translate into 'better' native instructions.
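A fully ordered compare-and-swap is indeed enough to build the rest; for example, a fetch_and_add composed from a C11 CAS loop (a userspace sketch under that assumption, not kernel code):

```c
#include <stdatomic.h>

/* fetch_and_add built from compare-and-swap: retry until nobody else
 * changed *p between the load and the exchange. */
static long fetch_add_via_cas(_Atomic long *p, long inc)
{
	long old = atomic_load(p);

	/* On failure, atomic_compare_exchange_weak() reloads *p into
	 * old, so each retry starts from the fresh value. */
	while (!atomic_compare_exchange_weak(p, &old, old + inc))
		;
	return old; /* fetch semantics: the previous value */
}
```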


Add a new one that does 'fetch_and_add'? What is the real use case it
will be used for?


Look at all the atomic_{add,dec}_return*() users in the kernel. A typical
example would be a reader-writer lock implementations. See
include/asm-generic/rwsem.h for examples.


Adding new intrinsic to llvm is not a big deal. I'll add it as soon
as I have time to work on it or if somebody beats me to it I would be
glad to test it and apply it.


This isn't a speed coding contest. You want to think about this
properly.





Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Daniel Borkmann

On 11/11/2015 08:23 PM, Peter Zijlstra wrote:

On Wed, Nov 11, 2015 at 07:50:15PM +0100, Daniel Borkmann wrote:

Well, on that note, it's not like you just change the target to bpf in your
Makefile and can compile (& load into the kernel) anything you want with it.
You do have to write small, restricted programs from scratch for a specific
use-case with the limited set of helper functions and intrinsics that are
available from the kernel. So I don't think that "Programs that used to work
will now no longer work." holds if you regard it as such.


So I don't get this argument. If everything is so targeted, then why are
the BPF instructions an ABI.

If OTOH you're expected to be able to transfer these small proglets,
then too I would expect to transfer the source of these proglets.

You cannot argue both ways.


Ohh, I think we were talking past each other. ;) So, yeah, you'd likely need
to add new intrinsics that then map to the existing BPF_XADD instructions,
and perhaps spill a warning when __sync_fetch_and_add() is being used to
advise the developer to switch to the new intrinsics instead. From kernel
ABI PoV nothing would change.


Re: [PATCH] net-scm: Delete an unnecessary check before the function call "kfree"

2015-11-17 Thread Daniel Borkmann

On 11/17/2015 05:43 PM, SF Markus Elfring wrote:

From: Markus Elfring 
Date: Tue, 17 Nov 2015 17:37:22 +0100

The kfree() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
  net/core/scm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index 3b6899b..4f64173 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -193,7 +193,7 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
}
}

-   if (p->fp && !p->fp->count)
+   if (likely(!p->fp->count))
{
kfree(p->fp);
p->fp = NULL;



Really, I don't like your blind, silly removals everywhere throughout
the kernel tree for these tests. Eric already mentioned that in some
situations where it's critical, it actually slows down the code, i.e.
you'll have an extra function call to get there and inside kfree() /
kfree_skb() / etc, the test is actually marked as unlikely().

Anyway, I think this one in particular could throw a NULL pointer deref.
You even say in your commit message "kfree() function tests whether its
argument [p->fp] is NULL" and yet if that is the case, you have already
deref'ed p->fp on the p->fp->count test ???
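The short-circuit in the original test is what makes it safe; a userspace sketch with minimal stand-in structs (free() standing in for kfree(), for illustration only):

```c
#include <stdlib.h>

/* Minimal stand-ins for the kernel structs, illustration only. */
struct scm_fp_list { short count; };
struct scm_cookie { struct scm_fp_list *fp; };

/* The existing check in __scm_send(): `p->fp &&` short-circuits, so
 * p->fp->count is never evaluated when no fds were attached (fp is
 * NULL). Dropping that half, as the proposed patch does, dereferences
 * a NULL pointer on that common path. */
static void drop_empty_fp(struct scm_cookie *p)
{
	if (p->fp && !p->fp->count) {
		free(p->fp); /* kfree() in the kernel */
		p->fp = NULL;
	}
}
```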


Re: net-scm: Delete an unnecessary check before the function call "kfree"

2015-11-17 Thread Daniel Borkmann

On 11/17/2015 07:05 PM, SF Markus Elfring wrote:

Eric already mentioned that in some situations where it's critical,
it actually slows down the code, i.e. you'll have an extra
function call to get there and inside kfree() / kfree_skb() / etc,
the test is actually marked as unlikely().


How do you think about to add similar annotations to any more
source code places?


You mean this likely() annotation of yours? It doesn't make any sense
to me to place it there, and since you've spent a second thinking
about it when adding it to the diff, you should have already realized
that your code was buggy ...


Re: [PATCH, REPORT] bpf_trace: build error without PERF_EVENTS

2015-11-10 Thread Daniel Borkmann

On 11/10/2015 01:55 PM, Arnd Bergmann wrote:

In my ARM randconfig tests, I'm getting a build error for
newly added code in bpf_perf_event_read and bpf_perf_event_output
whenever CONFIG_PERF_EVENTS is disabled:

kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
if (event->oncpu != smp_processor_id() ||
  ^
kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
   event->pmu->count)

This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT
is disabled. I'm not sure if that is a configuration we care
about, otherwise we could prevent this case from occurring by
adding Kconfig dependencies.


I think that seems better than spreading #if IS_ENABLEDs into the code.
Probably enough to add a 'depends on PERF_EVENTS' to config BPF_EVENTS,
so it's also explicitly documented.


Simply hiding the broken code inside #ifdef CONFIG_PERF_EVENTS
as this patch does seems to reliably fix the error as well,
I have built thousands of randconfig kernels since I started
seeing this and added the workaround.

Signed-off-by: Arnd Bergmann 
Fixes: 62544ce8e01c ("bpf: fix bpf_perf_event_read() helper")
Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")


Thanks,
Daniel


Re: [PATCH] bpf: fix trivial comment typo

2015-11-03 Thread Daniel Borkmann

On 11/02/2015 10:48 PM, Matthew Fernandez wrote:

On 03/11/15 08:31, David Miller wrote:

From: Matthew Fernandez 
Date: Mon, 2 Nov 2015 11:59:03 +1100


bpf: fix trivial comment typo

Signed-off-by: Matthew Fernandez 


This doesn't apply to any tree.


I'm sorry, I think I must be missing something. This seems to apply cleanly to 
the current tip of mainline (e86328c489d7) to me. Was this not in the expected 
format? It wasn't my intention to waste your time, so I apologise for any 
newbie errors.


You might want to check Documentation/networking/netdev-FAQ.txt ;),
and rebase your spelling fix f.e. to the latest net-next tree.


[PATCH v2] tracefs: fix refcount imbalance in start_creating

2015-11-04 Thread Daniel Borkmann
In tracefs' start_creating(), we pin the file system to safely access
its root. When we failed to create a file, we unpin the file system via
failed_creating() to release the mount count and eventually the reference
of the singleton vfsmount.

However, when we run into an error during lookup_one_len() when still
in start_creating(), we only release the parent's mutex but not so the
reference on the mount.

F.e., in securityfs_create_file(), after doing simple_pin_fs() when
lookup_one_len() fails there, we in fact do simple_release_fs(). This
seems necessary here as well.

Same issue seen in debugfs due to 190afd81e4a5 ("debugfs: split the
beginning and the end of __create_file() off"), which seemed to got
carried over into tracefs, too. Noticed during code review.

Fixes: 4282d60689d4 ("tracefs: Add new tracefs file system")
Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Acked-by: Steven Rostedt <rost...@goodmis.org>
---
 v1 -> v2:
 - Split into two patches (tracefs and debugfs, will send debugfs one separately)
 - Kept Steven's Acked-by

 fs/tracefs/inode.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index cbc8d5d..c66f242 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -340,8 +340,12 @@ static struct dentry *start_creating(const char *name, struct dentry *parent)
dput(dentry);
dentry = ERR_PTR(-EEXIST);
}
-   if (IS_ERR(dentry))
+
+   if (IS_ERR(dentry)) {
mutex_unlock(&parent->d_inode->i_mutex);
+   simple_release_fs(&tracefs_mount, &tracefs_mount_count);
+   }
+
return dentry;
 }
 
-- 
1.9.3



Re: [PATCH] tracefs, debugfs: fix refcount imbalance in start_creating

2015-11-04 Thread Daniel Borkmann

On 11/02/2015 08:07 PM, Steven Rostedt wrote:

On Fri,  9 Oct 2015 20:30:10 +0200
Daniel Borkmann <dan...@iogearbox.net> wrote:


[...]


Fixes: 4282d60689d4 ("tracefs: Add new tracefs file system")
Fixes: 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off")
Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Cc: Steven Rostedt <rost...@goodmis.org>

[...]

Fyi, I'll respin and split this one into two, so they can be routed through
their individual trees. (Keeping Steven's Acked-by on the tracefs one.)

Thanks & sorry for the noise,
Daniel


[PATCH v2] debugfs: fix refcount imbalance in start_creating

2015-11-04 Thread Daniel Borkmann
In debugfs' start_creating(), we pin the file system to safely access
its root. When we failed to create a file, we unpin the file system via
failed_creating() to release the mount count and eventually the reference
of the vfsmount.

However, when we run into an error during lookup_one_len() when still
in start_creating(), we only release the parent's mutex but not so the
reference on the mount. Looks like it was done in the past, but after
splitting portions of __create_file() into start_creating() and
end_creating() via 190afd81e4a5 ("debugfs: split the beginning and the
end of __create_file() off"), this seemed missed. Noticed during code
review.

Fixes: 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off")
Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
---
 v1 -> v2:
 - Split original patch into two patches (tracefs and debugfs)

 fs/debugfs/inode.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index c711be8..9c8d233 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -271,8 +271,12 @@ static struct dentry *start_creating(const char *name, struct dentry *parent)
dput(dentry);
dentry = ERR_PTR(-EEXIST);
}
-   if (IS_ERR(dentry))
+
+   if (IS_ERR(dentry)) {
mutex_unlock(&d_inode(parent)->i_mutex);
+   simple_release_fs(&debugfs_mount, &debugfs_mount_count);
+   }
+
return dentry;
 }
 
-- 
1.9.3



Re: [PATCH] bpf: add mod default A and X test cases

2015-11-04 Thread Daniel Borkmann

On 11/04/2015 08:36 PM, Yang Shi wrote:

When running a "mod X" operation, if X is 0 the filter has to halt.
Add new test cases to cover A = A mod X if X is 0, and A = A mod 1.

CC: Xi Wang <xi.w...@gmail.com>
CC: Zi Shen Lim <zlim@gmail.com>
Signed-off-by: Yang Shi <yang@linaro.org>


LGTM!

Acked-by: Daniel Borkmann <dan...@iogearbox.net>


Re: [PATCH] arm64: bpf: fix div-by-zero case

2015-11-04 Thread Daniel Borkmann

On 11/04/2015 07:41 PM, Shi, Yang wrote:
...

Agreed, and we may need to add one more test case in the test_bpf module
to cover MOD?


Let me know if you have a test case ready :)


Does the below change look like a valid test?

+   "MOD default X",
+   .u.insns = {
+   /*
+* A = 0x42
+* A = A mod X ; this halts the filter execution if X is 0
+* ret 0x42
+*/
+   BPF_STMT(BPF_LD | BPF_IMM, 0x42),
+   BPF_STMT(BPF_ALU | BPF_MOD | BPF_X, 0),
+   BPF_STMT(BPF_RET | BPF_K, 0x42),
+   },
+   CLASSIC | FLAG_NO_DATA,
+   {},
+   { {0x1, 0x0 } },
+   },
+   {
+   "MOD default A",
+   .u.insns = {
+   /*
+* A = A mod 1
+* ret A
+*/
+   BPF_STMT(BPF_ALU | BPF_MOD | BPF_K, 0x1),
+   BPF_STMT(BPF_RET | BPF_A, 0x0),
+   },
+   CLASSIC | FLAG_NO_DATA,
+   {},
+   { {0x1, 0x0 } },
+   },
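The expected results encode classic BPF's divide/modulus-by-zero rule: a zero divisor halts the filter, and a halted filter returns 0. A C sketch of what the two tests above expect (hypothetical helpers, not the kernel interpreter):

```c
/* "MOD default X": X defaults to 0, so the mod halts the filter
 * before the final `ret 0x42` is ever reached. */
static unsigned int mod_default_x(void)
{
	unsigned int a = 0x42;
	volatile unsigned int x = 0;	/* X register defaults to 0 */

	if (x == 0)
		return 0;	/* BPF_ALU | BPF_MOD with zero divisor: halt */
	a %= x;
	return 0x42;		/* BPF_RET | BPF_K, 0x42 */
}

/* "MOD default A": A defaults to 0; 0 mod 1 is 0, returned as-is. */
static unsigned int mod_default_a(void)
{
	unsigned int a = 0;

	a %= 1;			/* BPF_ALU | BPF_MOD | BPF_K, 0x1 */
	return a;		/* BPF_RET | BPF_A */
}
```

The arm64 JIT failure below (ret 66, i.e. 0x42) corresponds to the first case taking the not-halted path.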

My test result with it:
test_bpf: #284 MOD default X jited:1 ret 66 != 0 FAIL (1 times)
test_bpf: #285 MOD default A jited:1 102 PASS

If it looks right, I will post a patch to add the test cases.


Looks good to me.

Thanks,
Daniel

