Re: [PATCH bpf-next] libbpf: clarify flags in ringbuf helpers

2021-04-07 Thread Joe Stringer
Hi Pedro,

On Tue, Apr 6, 2021 at 11:58 AM Pedro Tammela  wrote:
>
> In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment.
>
> For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a
> notification to the process if needed.
>
> Signed-off-by: Pedro Tammela 
> ---
>  include/uapi/linux/bpf.h   | 7 +++
>  tools/include/uapi/linux/bpf.h | 7 +++
>  2 files changed, 14 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 49371eba98ba..8c5c7a893b87 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4061,12 +4061,15 @@ union bpf_attr {
>   * of new data availability is sent.
> * If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, notification
> * of new data availability is sent unconditionally.
> + * If **0** is specified in *flags*, notification
> + * of new data availability is sent if needed.

Maybe a trivial question, but what does "if needed" mean? Does that
mean "when the buffer is full"?


Re: [PATCH] openvswitch: perform refragmentation for packets which pass through conntrack

2021-03-21 Thread Joe Stringer
Hey Aaron, long time no chat :)

On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole  wrote:
>
> When a user instructs a flow pipeline to perform connection tracking,
> there is an implicit L3 operation that occurs - namely the IP fragments
> are reassembled and then processed as a single unit.  After this, new
> fragments are generated and then transmitted, with the hint that they
> should be fragmented along the max rx unit boundary.  In general, this
> behavior works well to forward packets along when the MTUs are congruent
> across the datapath.
>
> However, if using a protocol such as UDP on a network with mismatching
> MTUs, it is possible that the refragmentation will still produce an
> invalid fragment, and that fragmented packet will not be delivered.
> Such a case shouldn't happen because the user explicitly requested a
> layer 3+4 function (conntrack), and that function generates new fragments,
> so we should perform the needed actions in that case (namely, refragment
> IPv4 along a correct boundary, or send a packet too big in the IPv6 case).
>
> Additionally, introduce a test suite for openvswitch with a test case
> that ensures this MTU behavior, with the expectation that new tests are
> added when needed.
>
> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Signed-off-by: Aaron Conole 
> ---
> NOTE: checkpatch reports a whitespace error with the openvswitch.sh
>   script - this is due to using tab as the IFS value.  This part
>   of the script was copied from
>   tools/testing/selftests/net/pmtu.sh so I think it should be
>   permissible.
>
>  net/openvswitch/actions.c  |   2 +-
>  tools/testing/selftests/net/.gitignore |   1 +
>  tools/testing/selftests/net/Makefile   |   1 +
>  tools/testing/selftests/net/openvswitch.sh | 394 +
>  4 files changed, 397 insertions(+), 1 deletion(-)
>  create mode 100755 tools/testing/selftests/net/openvswitch.sh
>
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 92a0b67b2728..d858ea580e43 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
> if (likely(!mru ||
>(skb->len <= mru + vport->dev->hard_header_len))) {
> ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
> -   } else if (mru <= vport->dev->mtu) {
> +   } else if (mru) {
> struct net *net = read_pnet(&dp->net);
>
> ovs_fragment(net, vport, skb, mru, key);

I thought about this for a while. For a bit of context, my
recollection is that in the initial design, there was an attempt to
minimize the set of assumptions around L3 behaviour: despite
performing this pseudo-L3 action of connection tracking, OVS attempts
a "bump-in-the-wire" approach where it serves as an L2 switch, and if
you want L3 features, you need to build them on top or explicitly
define that you're looking for L3 semantics.

In this case, you're interpreting the combination of the conntrack
action and an output action to imply that L3 routing is being
performed. Hence, OVS should act like a router and either refragment
or generate an ICMP PTB in the case where the MTU differs. According
to the flow table, the rest of the routing functionality (MAC
handling, for instance) may or may not have been performed at this
point, but we basically leave that up to the SDN controller to
implement the right behaviour.

In relation to this particular check, the idea was to retain the
original geometry of the packet such that it's as though no
functionality were performed in the middle at all. OVS happened to do
connection tracking (which implicitly involved queueing fragments),
but if you treat it as an opaque box, you have ports connected and OVS
is simply performing forwarding between the ports.

One of the related implications is the contrast between what happens
in this case if you have a conntrack action injected or not when
outputting to another port. If you didn't put a connection tracking
action into the flows when redirecting here, then there would be no
defragmentation or refragmentation. In that case, OVS is just
attempting to forward to another device and if the MTU check fails,
then bad luck, packets will be dropped. Now, with the interpretation
in this patch, it seems like we're trying to say that, well, actually,
if the controller injects a connection tracking action, then the
controller implicitly switches OVS into a sort of half-L3 mode for
this particular flow. This makes the behaviour a bit inconsistent.

Another thought that occurs here: say you have three interfaces
attached to the switch, one with MTU 1500 and two with MTU 1450, and
the OVS flows are configured to conntrack and clone the packets from
the higher-MTU interface to the lower-MTU interfaces. If you
receive

Re: [PATCH bpf-next] bpf: fix missing * in bpf.h

2021-03-02 Thread Joe Stringer
On Fri, Feb 26, 2021 at 8:51 AM Quentin Monnet  wrote:
>
> 2021-02-24 10:59 UTC-0800 ~ Andrii Nakryiko 
> > On Wed, Feb 24, 2021 at 7:55 AM Daniel Borkmann  wrote:
> >>
> >> On 2/23/21 3:43 PM, Jesper Dangaard Brouer wrote:
> >>> On Tue, 23 Feb 2021 20:45:54 +0800
> >>> Hangbin Liu  wrote:
> >>>
>  Commit 34b2021cc616 ("bpf: Add BPF-helper for MTU checking") lost a *
>  in bpf.h. This will make bpf_helpers_doc.py stop building
 bpf_helper_defs.h immediately after bpf_check_mtu, which will affect
 functions added in the future.
> 
>  Fixes: 34b2021cc616 ("bpf: Add BPF-helper for MTU checking")
>  Signed-off-by: Hangbin Liu 
>  ---
>    include/uapi/linux/bpf.h   | 2 +-
>    tools/include/uapi/linux/bpf.h | 2 +-
>    2 files changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> Thanks for fixing that!
> >>>
> >>> Acked-by: Jesper Dangaard Brouer 
> >>
> >> Thanks guys, applied!
> >>
> >>> I thought I had already fixed that, but I must have missed it or
> >>> reintroduced it when rolling back broken ideas in v13.
> >>>
> >>> I usually run this command to check the man-page (before submitting):
> >>>
> >>>   ./scripts/bpf_helpers_doc.py | rst2man | man -l -
> >>
> >> [+ Andrii] maybe this could be included to run as part of CI to catch such
> >> things in advance?
> >
> > We do something like that as part of bpftool build, so there is no
> > reason we can't add this to selftests/bpf/Makefile as well.
>
> Hi, pretty sure this is the case already? [0]
>
> This helps catching RST formatting issues, for example if a description
> is using invalid markup, and reported by rst2man. My understanding is
> that in the current case, the missing star simply ends the block for the
> helpers documentation from the parser point of view, it's not considered
> an error.
>
> I see two possible workarounds:
>
> 1) Check that the number of helpers found ("len(self.helpers)") is equal
> to the number of helpers in the file, but that requires knowing how many
> helpers we have in the first place (e.g. parsing "__BPF_FUNC_MAPPER(FN)").

This is not so difficult as long as we stick to one symbol per line:

diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index e2ffac2b7695..74cdcc2bbf18 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -183,25 +183,51 @@ class HeaderParser(object):
 self.reader.readline()
 self.line = self.reader.readline()

+def get_elem_count(self, target):
+self.seek_to(target, 'Could not find symbol "%s"' % target)
+end_re = re.compile('^$')
+count = 0
+while True:
+capture = end_re.match(self.line)
+if capture:
+break
+self.line = self.reader.readline()
+count += 1
+
+# The last line (either '};' or '/* */') doesn't count.
+return count
+

I can either roll this into my docs update v2, or hold onto it for
another dedicated patch fixup. Either way I'm trialing this out
locally to regression-test my own docs update PR and make sure I'm not
breaking one of the various output formats.
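
As a sketch of how that check could be wired in (hypothetical wiring,
not part of the diff above; the target string and method name are my
assumptions):

  def check_helper_count(self):
      # Hypothetical cross-check: compare the number of parsed helper
      # descriptions against the __BPF_FUNC_MAPPER(FN) symbol list.
      # Assumes parse_helpers() has already populated self.helpers.
      expected = self.get_elem_count('#define __BPF_FUNC_MAPPER(FN)')
      if len(self.helpers) != expected:
          raise Exception('Parsed %d helper descriptions but '
                          '__BPF_FUNC_MAPPER lists %d symbols; check '
                          'for a missing "*" in the header comments'
                          % (len(self.helpers), expected))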


Re: [PATCH bpf-next 00/17] Improve BPF syscall command documentation

2021-02-18 Thread Joe Stringer
On Thu, Feb 18, 2021 at 11:49 AM Jonathan Corbet  wrote:
>
> Joe Stringer  writes:
> > * The changes in patch 16 here extended Documentation/bpf/index.rst,
> > but to assist in improving the overall kernel documentation
> > organisation / hierarchy, you would prefer to instead introduce a
> > dedicated Documentation/userspace-api/bpf/ directory where the bpf
> > uAPI portions can be documented.
>
> An objective I've been working on for some years is reorienting the
> documentation with a focus on who the readers are.  We've tended to
> organize it by subsystem, requiring people to wade through a lot of
> stuff that isn't useful to them.  So yes, my preference would be to
> document the kernel's user-space API in the relevant manual.
>
> That said, I do tend to get pushback here at times, and the BPF API is
> arguably a bit different that much of the rest.  So while the above
> preference exists and is reasonably strong, the higher priority is to
> get good, current documentation in *somewhere* so that it's available to
> users.  I don't want to make life too difficult for people working
> toward that goal, even if I would paint it a different color.

Sure, I'm all for it. Unless I hear alternative feedback I'll roll it
under Documentation/userspace-api/bpf in the next revision.

> > In addition to this, today the bpf helpers documentation is built
> > through the bpftool build process as well as the runtime bpf
> > selftests, mostly as a way to ensure that the API documentation
> > conforms to a particular style, which then assists with the generation
> > of ReStructured Text and troff output. I can probably simplify the
> > make infrastructure involved in triggering the bpf docs build for bpf
> > subsystem developers and maintainers. I think there's likely still
> > interest from bpf folks to keep that particular dependency in the
> > selftests like today and even extend it to include this new
> > Documentation, so that we don't either introduce text that fails
> > against the parser or in some other way break the parser. Whether that
> > validation is done by scripts/kernel-doc or scripts/bpf_helpers_doc.py
> > doesn't make a big difference to me, other than I have zero experience
> > with Perl. My first impressions are that the bpf_helpers_doc.py is
> > providing stricter formatting requirements than what "DOC: " +
> > kernel-doc would provide, so my baseline inclination would be to keep
> > those patches to enhance that script and use that for the validation
> > side (help developers with stronger linting feedback), then use
> > kernel-doc for the actual html docs generation side, which would help
> > to satisfy your concern around duplication of the documentation build
> > systems.
>
> This doesn't sound entirely unreasonable.  I wonder if the BPF helper
> could be built into an sphinx extension to make it easy to pull that
> information into the docs build.  The advantage there is that it can be
> done in Python :)

Probably doable, it's already written in python. One thing at a time
though... :)

Cheers,
Joe


Re: [PATCH bpf-next 00/17] Improve BPF syscall command documentation

2021-02-17 Thread Joe Stringer
On Wed, Feb 17, 2021 at 9:32 AM Jonathan Corbet  wrote:
>
> [CC += linux-doc]
>
> Joe Stringer  writes:
>
> > From: Joe Stringer 
> >
> > The state of bpf(2) manual pages today is not exactly ideal. For the
> > most part, it was written several years ago and has not kept up with the
> > pace of development in the kernel tree. For instance, out of a total of
> > ~35 commands to the BPF syscall available today, when I pull the
> > kernel-man-pages tree today I find just 6 documented commands: The very
> > basics of map interaction and program load.
> >
> > In contrast, looking at bpf-helpers(7), I am able today to run one
> > command[0] to fetch API documentation of the very latest eBPF helpers
> > that have been added to the kernel. This documentation is up to date
> > because kernel maintainers enforce documenting the APIs as part of
> > the feature submission process. As far as I can tell, we rely on manual
> > synchronization from the kernel tree to the kernel-man-pages tree to
> > distribute these more widely, so all locations may not be completely up
> > to date. That said, the documentation does in fact exist in the first
> > place which is a major initial hurdle to overcome.
> >
> > Given the relative success of the process around bpf-helpers(7) to
> > encourage developers to document their user-facing changes, in this
> > patch series I explore applying this technique to bpf(2) as well.
>
> So I am totally in favor of improving the BPF docs, this is great work.
>
> That said, I am a bit less thrilled about creating a new, parallel
> documentation-build system in the kernel.  I don't think that BPF is so
> special that it needs to do its own thing here.
>
> If you started that way, you'd get the whole existing build system for
> free.  You would also have started down a path that could, some bright
> shining day, lead to this kind of documentation for *all* of our system
> calls.  That would be a huge improvement in how we do things.
>
> The troff output would still need implementation, but we'd like to have
> that anyway.  We used to create man pages for internal kernel APIs; that
> was lost in the sphinx transition and hasn't been a priority since
> people haven't been screaming, but it could still be nice to have it
> back.
>
> So...could I ask you to have a look at doing this within the kernel's
> docs system instead of in addition to it?  Even if it means digging into
> scripts/kernel-doc, which isn't all that high on my list of Fun Things
> To Do either?  I'm willing to try to help, and maybe we can get some
> other assistance too - I'm ever the optimist.

Hey Jon, thanks for the feedback. Absolutely, what you say makes
sense. The intent here wasn't to come up with something new. Based on
your prompt from this email (and a quick look at your KR '19
presentation), I'm hearing a few observations:
* Storing the documentation in the code next to the things that
contributors edit is a reasonable approach to documentation of this
kind.
* This series currently proposes adding some new Makefile
infrastructure. However, good use of the "kernel-doc" sphinx directive
+ "DOC: " incantations in the header should be able to achieve the
same without adding such dedicated build system logic to the tree.
* The changes in patch 16 here extended Documentation/bpf/index.rst,
but to assist in improving the overall kernel documentation
organisation / hierarchy, you would prefer to instead introduce a
dedicated Documentation/userspace-api/bpf/ directory where the bpf
uAPI portions can be documented.

From the above, there are a couple of clear actionable items I can look
into for a series v2 which should tidy things up.

In addition to this, today the bpf helpers documentation is built
through the bpftool build process as well as the runtime bpf
selftests, mostly as a way to ensure that the API documentation
conforms to a particular style, which then assists with the generation
of ReStructured Text and troff output. I can probably simplify the
make infrastructure involved in triggering the bpf docs build for bpf
subsystem developers and maintainers. I think there's likely still
interest from bpf folks to keep that particular dependency in the
selftests like today and even extend it to include this new
Documentation, so that we don't either introduce text that fails
against the parser or in some other way break the parser. Whether that
validation is done by scripts/kernel-doc or scripts/bpf_helpers_doc.py
doesn't make a big difference to me, other than I have zero experience
with Perl. My first impressions are that the bpf_helpers_doc.py is
providing stricter formatting requirements than what "DOC: " +
kernel-doc would provide, so my baseline inclination would be to keep
those patches to enhance that script and use that for the validation
side (help developers with stronger linting feedback), then use
kernel-doc for the actual html docs generation side, which would help
to satisfy your concern around duplication of the documentation build
systems.

Re: [PATCH bpf-next 00/17] Improve BPF syscall command documentation

2021-02-17 Thread Joe Stringer
On Wed, Feb 17, 2021 at 5:55 AM Toke Høiland-Jørgensen  wrote:
>
> Joe Stringer  writes:
> > Given the relative success of the process around bpf-helpers(7) to
> > encourage developers to document their user-facing changes, in this
> > patch series I explore applying this technique to bpf(2) as well.
> > Unfortunately, even with bpf(2) being so out-of-date, there is still a
> > lot of content to convert over. In particular, I've identified at least
> > the following aspects of the bpf syscall which could individually be
> > generated from separate documentation in the header:
> > * BPF syscall commands
> > * BPF map types
> > * BPF program types
> > * BPF attachment points
>
> Does this also include program subtypes (AKA expected_attach_type?)

I seem to have left my lawyerly "including, but not limited to..."
language at home today ;-) . Of course, I can add that to the list.

> > At this point I'd like to put this out for comments. In my mind, the
> > ideal eventuation of this work would be to extend kernel UAPI headers
> > such that each of the categories I had listed above (commands, maps,
> > progs, hooks) have dedicated documentation in the kernel tree, and that
> > developers must update the comments in the headers to document the APIs
> > prior to patch acceptance, and that we could auto-generate the latest
> > version of the bpf(2) manual pages based on a few static description
> > sections combined with the dynamically-generated output from the header.
>
> I like the approach, and I don't think it's too onerous to require
> updates to the documentation everywhere like we (as you note) already do
> for helpers.
>
> So with that, please feel free to add my enthusiastic:
>
> Acked-by: Toke Høiland-Jørgensen 

Thanks Toke.


[PATCH bpf-next 11/17] scripts/bpf: Add syscall commands printer

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Add a new target to bpf_doc.py to support generating the list of syscall
commands directly from the UAPI headers. Assuming that developer
submissions keep the main header up to date, this should allow the man
pages to be automatically generated based on the latest API changes
rather than requiring someone to separately go back through the API and
describe each command.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 scripts/bpf_doc.py | 98 +-
 1 file changed, 89 insertions(+), 9 deletions(-)

diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 5a4f68aab335..72a2ba323692 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -14,6 +14,9 @@ import sys, os
 class NoHelperFound(BaseException):
 pass
 
+class NoSyscallCommandFound(BaseException):
+pass
+
 class ParsingError(BaseException):
 def __init__(self, line='', reader=None):
 if reader:
@@ -23,18 +26,27 @@ class ParsingError(BaseException):
 else:
 BaseException.__init__(self, 'Error parsing line: %s' % line)
 
-class Helper(object):
+
+class APIElement(object):
 """
-An object representing the description of an eBPF helper function.
-@proto: function prototype of the helper function
-@desc: textual description of the helper function
-@ret: description of the return value of the helper function
+An object representing the description of an aspect of the eBPF API.
+@proto: prototype of the API symbol
+@desc: textual description of the symbol
+@ret: (optional) description of any associated return value
 """
 def __init__(self, proto='', desc='', ret=''):
 self.proto = proto
 self.desc = desc
 self.ret = ret
 
+
+class Helper(APIElement):
+"""
+An object representing the description of an eBPF helper function.
+@proto: function prototype of the helper function
+@desc: textual description of the helper function
+@ret: description of the return value of the helper function
+"""
 def proto_break_down(self):
 """
 Break down helper function protocol into smaller chunks: return type,
@@ -61,6 +73,7 @@ class Helper(object):
 
 return res
 
+
 class HeaderParser(object):
 """
 An object used to parse a file in order to extract the documentation of a
@@ -73,6 +86,13 @@ class HeaderParser(object):
 self.reader = open(filename, 'r')
 self.line = ''
 self.helpers = []
+self.commands = []
+
+def parse_element(self):
+proto= self.parse_symbol()
+desc = self.parse_desc()
+ret  = self.parse_ret()
+return APIElement(proto=proto, desc=desc, ret=ret)
 
 def parse_helper(self):
 proto= self.parse_proto()
@@ -80,6 +100,18 @@ class HeaderParser(object):
 ret  = self.parse_ret()
 return Helper(proto=proto, desc=desc, ret=ret)
 
+def parse_symbol(self):
+p = re.compile(' \* ?(.+)$')
+capture = p.match(self.line)
+if not capture:
+raise NoSyscallCommandFound
+end_re = re.compile(' \* ?NOTES$')
+end = end_re.match(self.line)
+if end:
+raise NoSyscallCommandFound
+self.line = self.reader.readline()
+return capture.group(1)
+
 def parse_proto(self):
 # Argument can be of shape:
 #   - "void"
@@ -141,16 +173,29 @@ class HeaderParser(object):
 break
 return ret
 
-def run(self):
-# Advance to start of helper function descriptions.
-offset = self.reader.read().find('* Start of BPF helper function descriptions:')
+def seek_to(self, target, help_message):
+self.reader.seek(0)
+offset = self.reader.read().find(target)
 if offset == -1:
-raise Exception('Could not find start of eBPF helper descriptions list')
+raise Exception(help_message)
 self.reader.seek(offset)
 self.reader.readline()
 self.reader.readline()
 self.line = self.reader.readline()
 
+def parse_syscall(self):
+self.seek_to('* Start of BPF syscall commands:',
+ 'Could not find start of eBPF syscall descriptions list')
+while True:
+try:
+command = self.parse_element()
+self.commands.append(command)
+except NoSyscallCommandFound:
+break
+
+def parse_helpers(self):
+self.seek_to('* Start of BPF helper function descriptions:',
+ 'Could not find start of eBPF helper descriptions list')
 while True:
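
For reference, the intended workflow with the new target mirrors the
existing bpf-helpers flow documented in the header; the 'syscall'
positional argument below matches the invocation added to the
Documentation/bpf Makefile later in this series:

  $ ./scripts/bpf_doc.py syscall \
          --filename include/uapi/linux/bpf.h > /tmp/bpf-syscall.rst
  $ rst2man /tmp/bpf-syscall.rst > /tmp/bpf-syscall.2
  $ man /tmp/bpf-syscall.2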

[PATCH bpf-next 17/17] tools: Sync uapi bpf.h header with latest changes

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Synchronize the header after all of the recent changes.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/include/uapi/linux/bpf.h | 707 -
 1 file changed, 706 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 16f2f0d2338a..4abf54327612 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -93,7 +93,712 @@ union bpf_iter_link_info {
} map;
 };
 
-/* BPF syscall commands, see bpf(2) man-page for details. */
+/* BPF syscall commands, see bpf(2) man-page for more details.
+ *
+ * The operation to be performed by the **bpf**\ () system call is determined
+ * by the *cmd* argument. Each operation takes an accompanying argument,
+ * provided via *attr*, which is a pointer to a union of type *bpf_attr* (see
+ * below). The size argument is the size of the union pointed to by *attr*.
+ *
+ * Start of BPF syscall commands:
+ *
+ * BPF_MAP_CREATE
+ * Description
+ * Create a map and return a file descriptor that refers to the
+ * map. The close-on-exec file descriptor flag (see **fcntl**\ (2))
+ * is automatically enabled for the new file descriptor.
+ *
+ * Applying **close**\ (2) to the file descriptor returned by
+ * **BPF_MAP_CREATE** will delete the map (but see NOTES).
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_MAP_LOOKUP_ELEM
+ * Description
+ * Look up an element with a given *key* in the map referred to
+ * by the file descriptor *map_fd*.
+ *
+ * The *flags* argument may be specified as one of the
+ * following:
+ *
+ * **BPF_F_LOCK**
+ * Look up the value of a spin-locked map without
+ * returning the lock. This must be specified if the
+ * elements contain a spinlock.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_MAP_UPDATE_ELEM
+ * Description
+ * Create or update an element (key/value pair) in a specified map.
+ *
+ * The *flags* argument should be specified as one of the
+ * following:
+ *
+ * **BPF_ANY**
+ * Create a new element or update an existing element.
+ * **BPF_NOEXIST**
+ * Create a new element only if it did not exist.
+ * **BPF_EXIST**
+ * Update an existing element.
+ * **BPF_F_LOCK**
+ * Update a spin_lock-ed map element.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * May set *errno* to **EINVAL**, **EPERM**, **ENOMEM**,
+ * **E2BIG**, **EEXIST**, or **ENOENT**.
+ *
+ * **E2BIG**
+ * The number of elements in the map reached the
+ * *max_entries* limit specified at map creation time.
+ * **EEXIST**
+ * If *flags* specifies **BPF_NOEXIST** and the element
+ * with *key* already exists in the map.
+ * **ENOENT**
+ * If *flags* specifies **BPF_EXIST** and the element with
+ * *key* does not exist in the map.
+ *
+ * BPF_MAP_DELETE_ELEM
+ * Description
+ * Look up and delete an element by key in a specified map.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_MAP_GET_NEXT_KEY
+ * Description
+ * Look up an element by key in a specified map and return the key
+ * of the next element. Can be used to iterate over all elements
+ * in the map.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * The following cases can be used to iterate over all elements of
+ * the map:
+ *
+ * * If *key* is not found, the operation returns zero and sets
+ *   the *next_key* pointer to the key of the first element.
+ * * If *key* is found, the operation returns zero and sets the
+ *   *next_key* pointer to the key of the next element.
+ * * If *key* is the last element, returns -1 and *errno* is set
+ *   to **ENOENT**.
+ *
+ * May set *errno* to **ENOMEM**, **EFAULT**, **EPERM**, or
+ * **EINVAL** on error.
+ *
+ * BPF_PROG_LOAD
+ * Description
+ * Verify and load an eBPF program, returning a new file
+ * descriptor associated with the program.
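
As a hedged illustration of the map commands documented above, a
userspace caller drives them through the raw bpf(2) syscall roughly as
follows (a sketch with error handling elided, not part of the patch):

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
  {
          return syscall(__NR_bpf, cmd, attr, size);
  }

  int main(void)
  {
          union bpf_attr attr;
          __u32 key = 1;
          __u64 value = 42, out;
          int map_fd;

          memset(&attr, 0, sizeof(attr));
          attr.map_type = BPF_MAP_TYPE_HASH;
          attr.key_size = sizeof(key);
          attr.value_size = sizeof(value);
          attr.max_entries = 16;
          /* BPF_MAP_CREATE: close(map_fd) would delete the map. */
          map_fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));

          memset(&attr, 0, sizeof(attr));
          attr.map_fd = map_fd;
          attr.key = (__u64)(unsigned long)&key;
          attr.value = (__u64)(unsigned long)&value;
          attr.flags = BPF_ANY;   /* create or update */
          sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

          memset(&attr, 0, sizeof(attr));
          attr.map_fd = map_fd;
          attr.key = (__u64)(unsigned long)&key;
          attr.value = (__u64)(unsigned long)&out;
          if (!sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)))
                  printf("value = %llu\n", (unsigned long long)out);
          return 0;
  }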

[PATCH bpf-next 15/17] selftests/bpf: Add docs target

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

This docs target will run scripts/bpf_doc.py against the BPF UAPI
headers to ensure that the parser used for generating manual pages from
the headers doesn't trip on any newly added API documentation.

While we're at it, remove the bpftool-specific docs check target since
that would just be duplicated with the new target anyhow.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/Makefile  | 20 +-
 .../selftests/bpf/test_bpftool_build.sh   | 21 ---
 tools/testing/selftests/bpf/test_doc_build.sh | 13 
 3 files changed, 28 insertions(+), 26 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/test_doc_build.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 044bfdcf5b74..e1a76444670c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -68,6 +68,7 @@ TEST_PROGS := test_kmod.sh \
test_bpftool_build.sh \
test_bpftool.sh \
test_bpftool_metadata.sh \
+   test_doc_build.sh \
test_xsk.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
@@ -103,6 +104,7 @@ override define CLEAN
$(call msg,CLEAN)
$(Q)$(RM) -r $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) $(EXTRA_CLEAN)
$(Q)$(MAKE) -C bpf_testmod clean
+   $(Q)$(MAKE) docs-clean
 endef
 
 include ../lib.mk
@@ -180,6 +182,7 @@ $(OUTPUT)/runqslower: $(BPFOBJ) | $(DEFAULT_BPFTOOL)
cp $(SCRATCH_DIR)/runqslower $@
 
 $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/test_stub.o $(BPFOBJ)
+$(TEST_GEN_FILES): docs
 
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
 $(OUTPUT)/test_skb_cgroup_id_user: cgroup_helpers.c
@@ -200,11 +203,16 @@ $(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile)\
CC=$(HOSTCC) LD=$(HOSTLD)  \
OUTPUT=$(HOST_BUILD_DIR)/bpftool/  \
prefix= DESTDIR=$(HOST_SCRATCH_DIR)/ install
-   $(Q)mkdir -p $(BUILD_DIR)/bpftool/Documentation
-   $(Q)RST2MAN_OPTS="--exit-status=1" $(MAKE) $(submake_extras)   \
-   -C $(BPFTOOLDIR)/Documentation \
-   OUTPUT=$(BUILD_DIR)/bpftool/Documentation/ \
-   prefix= DESTDIR=$(SCRATCH_DIR)/ install
+
+docs:
+   $(Q)RST2MAN_OPTS="--exit-status=1" $(MAKE) $(submake_extras)\
+   -C $(TOOLSDIR)/bpf -f Makefile.docs \
+   prefix= OUTPUT=$(OUTPUT)/ DESTDIR=$(OUTPUT)/ $@
+
+docs-clean:
+   $(Q)$(MAKE) $(submake_extras)   \
+   -C $(TOOLSDIR)/bpf -f Makefile.docs \
+   prefix= OUTPUT=$(OUTPUT)/ DESTDIR=$(OUTPUT)/ $@
 
 $(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile)\
   ../../../include/uapi/linux/bpf.h   \
@@ -476,3 +484,5 @@ EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)  \
prog_tests/tests.h map_tests/tests.h verifier/tests.h   \
feature \
$(addprefix $(OUTPUT)/,*.o *.skel.h no_alu32 bpf_gcc bpf_testmod.ko)
+
+.PHONY: docs docs-clean
diff --git a/tools/testing/selftests/bpf/test_bpftool_build.sh b/tools/testing/selftests/bpf/test_bpftool_build.sh
index 2db3c60e1e61..ac349a5cea7e 100755
--- a/tools/testing/selftests/bpf/test_bpftool_build.sh
+++ b/tools/testing/selftests/bpf/test_bpftool_build.sh
@@ -85,23 +85,6 @@ make_with_tmpdir() {
echo
 }
 
-make_doc_and_clean() {
-   echo -e "\$PWD:$PWD"
-   echo -e "command: make -s $* doc >/dev/null"
-   RST2MAN_OPTS="--exit-status=1" make $J -s $* doc
-   if [ $? -ne 0 ] ; then
-   ERROR=1
-   printf "FAILURE: Errors or warnings when building 
documentation\n"
-   fi
-   (
-   if [ $# -ge 1 ] ; then
-   cd ${@: -1}
-   fi
-   make -s doc-clean
-   )
-   echo
-}
-
 echo "Trying to build bpftool"
 echo -e "... through kbuild\n"
 
@@ -162,7 +145,3 @@ make_and_clean
 make_with_tmpdir OUTPUT
 
 make_with_tmpdir O
-
-echo -e "Checking documentation build\n"
-# From tools/bpf/bpftool
-make_doc_and_clean
diff --git a/tools/testing/selftests/bpf/test_doc_build.sh b/tools/testing/selftests/bpf/test_doc_build.sh
new file mode 100755
index ..7eb940a7b2eb
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_doc_build.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+# Assume script is located under tools/testing/selftests/bpf/. We want to start
+# build attempts from the

[PATCH bpf-next 13/17] tools/bpf: Templatize man page generation

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Previously, the Makefile here was only targeting a single manual page so
it just hardcoded a bunch of individual rules to specifically handle
build, clean, install, uninstall for that particular page.

Upcoming commits will generate manual pages for an additional section,
so this commit prepares the makefile first by converting the existing
targets into an evaluated set of targets based on the manual page name
and section.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/bpf/Makefile.docs  | 52 
 tools/bpf/bpftool/Documentation/Makefile |  8 ++--
 2 files changed, 39 insertions(+), 21 deletions(-)

diff --git a/tools/bpf/Makefile.docs b/tools/bpf/Makefile.docs
index dc4ce82ada33..7111888ca5d8 100644
--- a/tools/bpf/Makefile.docs
+++ b/tools/bpf/Makefile.docs
@@ -29,32 +29,50 @@ MAN7_RST = $(HELPERS_RST)
 _DOC_MAN7 = $(patsubst %.rst,%.7,$(MAN7_RST))
 DOC_MAN7 = $(addprefix $(OUTPUT),$(_DOC_MAN7))
 
+DOCTARGETS := helpers
+
+docs: $(DOCTARGETS)
 helpers: man7
 man7: $(DOC_MAN7)
 
 RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
 
-$(OUTPUT)$(HELPERS_RST): $(UP2DIR)../../include/uapi/linux/bpf.h
-   $(QUIET_GEN)$(UP2DIR)../../scripts/bpf_doc.py --filename $< > $@
+# Configure make rules for the man page bpf-$1.$2.
+# $1 - target for scripts/bpf_doc.py
+# $2 - man page section to generate the troff file
+define DOCS_RULES =
+$(OUTPUT)bpf-$1.rst: $(UP2DIR)../../include/uapi/linux/bpf.h
+   $$(QUIET_GEN)$(UP2DIR)../../scripts/bpf_doc.py $1 \
+   --filename $$< > $$@
 
-$(OUTPUT)%.7: $(OUTPUT)%.rst
+$(OUTPUT)%.$2: $(OUTPUT)%.rst
 ifndef RST2MAN_DEP
-   $(error "rst2man not found, but required to generate man pages")
+   $$(error "rst2man not found, but required to generate man pages")
 endif
-   $(QUIET_GEN)rst2man $< > $@
+   $$(QUIET_GEN)rst2man $$< > $$@
+
+docs-clean-$1:
+   $$(call QUIET_CLEAN, eBPF_$1-manpage)
+   $(Q)$(RM) $$(DOC_MAN$2) $(OUTPUT)bpf-$1.rst
+
+docs-install-$1: docs
+   $$(call QUIET_INSTALL, eBPF_$1-manpage)
+   $(Q)$(INSTALL) -d -m 755 $(DESTDIR)$$(man$2dir)
+   $(Q)$(INSTALL) -m 644 $$(DOC_MAN$2) $(DESTDIR)$$(man$2dir)
+
+docs-uninstall-$1:
+   $$(call QUIET_UNINST, eBPF_$1-manpage)
+   $(Q)$(RM) $$(addprefix $(DESTDIR)$$(man$2dir)/,$$(_DOC_MAN$2))
+   $(Q)$(RMDIR) $(DESTDIR)$$(man$2dir)
 
-helpers-clean:
-   $(call QUIET_CLEAN, eBPF_helpers-manpage)
-   $(Q)$(RM) $(DOC_MAN7) $(OUTPUT)$(HELPERS_RST)
+.PHONY: $1 docs-clean-$1 docs-install-$1 docs-uninstall-$1
+endef
 
-helpers-install: helpers
-   $(call QUIET_INSTALL, eBPF_helpers-manpage)
-   $(Q)$(INSTALL) -d -m 755 $(DESTDIR)$(man7dir)
-   $(Q)$(INSTALL) -m 644 $(DOC_MAN7) $(DESTDIR)$(man7dir)
+# Create the make targets to generate manual pages by name and section
+$(eval $(call DOCS_RULES,helpers,7))
 
-helpers-uninstall:
-   $(call QUIET_UNINST, eBPF_helpers-manpage)
-   $(Q)$(RM) $(addprefix $(DESTDIR)$(man7dir)/,$(_DOC_MAN7))
-   $(Q)$(RMDIR) $(DESTDIR)$(man7dir)
+docs-clean: $(foreach doctarget,$(DOCTARGETS), docs-clean-$(doctarget))
+docs-install: $(foreach doctarget,$(DOCTARGETS), docs-install-$(doctarget))
+docs-uninstall: $(foreach doctarget,$(DOCTARGETS), docs-uninstall-$(doctarget))
 
-.PHONY: helpers helpers-clean helpers-install helpers-uninstall
+.PHONY: docs docs-clean docs-install docs-uninstall man7
diff --git a/tools/bpf/bpftool/Documentation/Makefile b/tools/bpf/bpftool/Documentation/Makefile
index bb7842efffd6..f60b800584a5 100644
--- a/tools/bpf/bpftool/Documentation/Makefile
+++ b/tools/bpf/bpftool/Documentation/Makefile
@@ -24,7 +24,7 @@ MAN8_RST = $(wildcard bpftool*.rst)
 _DOC_MAN8 = $(patsubst %.rst,%.8,$(MAN8_RST))
 DOC_MAN8 = $(addprefix $(OUTPUT),$(_DOC_MAN8))
 
-man: man8 helpers
+man: man8 docs
 man8: $(DOC_MAN8)
 
 RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
@@ -46,16 +46,16 @@ ifndef RST2MAN_DEP
 endif
$(QUIET_GEN)( cat $< ; printf "%b" $(call see_also,$<) ) | rst2man $(RST2MAN_OPTS) > $@
 
-clean: helpers-clean
+clean: docs-clean
$(call QUIET_CLEAN, Documentation)
$(Q)$(RM) $(DOC_MAN8)
 
-install: man helpers-install
+install: man docs-install
$(call QUIET_INSTALL, Documentation-man)
$(Q)$(INSTALL) -d -m 755 $(DESTDIR)$(man8dir)
$(Q)$(INSTALL) -m 644 $(DOC_MAN8) $(DESTDIR)$(man8dir)
 
-uninstall: helpers-uninstall
+uninstall: docs-uninstall
$(call QUIET_UNINST, Documentation-man)
$(Q)$(RM) $(addprefix $(DESTDIR)$(man8dir)/,$(_DOC_MAN8))
$(Q)$(RMDIR) $(DESTDIR)$(man8dir)
-- 
2.27.0
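
To make the template pattern concrete, here is a toy version of the
define/eval technique used above (rule names invented for
illustration):

  # One parameterized rule set, stamped out per (name, section) pair.
  define TOY_RULE =
  toy-$1:
  	@echo "would build man$2 pages for target $1"
  .PHONY: toy-$1
  endef

  $(eval $(call TOY_RULE,helpers,7))
  $(eval $(call TOY_RULE,syscall,2))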



[PATCH bpf-next 09/17] scripts/bpf: Rename bpf_helpers_doc.py -> bpf_doc.py

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Rename this file in anticipation of it being used for generating more
than just helper man pages.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 2 +-
 scripts/{bpf_helpers_doc.py => bpf_doc.py} | 4 ++--
 tools/bpf/Makefile.helpers | 2 +-
 tools/include/uapi/linux/bpf.h | 2 +-
 tools/lib/bpf/Makefile | 2 +-
 tools/perf/MANIFEST| 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)
 rename scripts/{bpf_helpers_doc.py => bpf_doc.py} (99%)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 893803f69a64..4abf54327612 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1425,7 +1425,7 @@ union bpf_attr {
  * parsed and used to produce a manual page. The workflow is the following,
  * and requires the rst2man utility:
  *
- * $ ./scripts/bpf_helpers_doc.py \
+ * $ ./scripts/bpf_doc.py \
  * --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
  * $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
  * $ man /tmp/bpf-helpers.7
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_doc.py
similarity index 99%
rename from scripts/bpf_helpers_doc.py
rename to scripts/bpf_doc.py
index 867ada23281c..ca6e7559d696 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_doc.py
@@ -221,7 +221,7 @@ class PrinterRST(Printer):
 .. 
 .. Please do not edit this file. It was generated from the documentation
 .. located in file include/uapi/linux/bpf.h of the Linux kernel sources
-.. (helpers description), and from scripts/bpf_helpers_doc.py in the same
+.. (helpers description), and from scripts/bpf_doc.py in the same
 .. repository (header and footer).
 
 ===
@@ -511,7 +511,7 @@ class PrinterHelpers(Printer):
 
 def print_header(self):
 header = '''\
-/* This is auto-generated file. See bpf_helpers_doc.py for details. */
+/* This is auto-generated file. See bpf_doc.py for details. */
 
 /* Forward declarations of BPF structs */'''
 
diff --git a/tools/bpf/Makefile.helpers b/tools/bpf/Makefile.helpers
index 854d084026dd..a26599022fd6 100644
--- a/tools/bpf/Makefile.helpers
+++ b/tools/bpf/Makefile.helpers
@@ -35,7 +35,7 @@ man7: $(DOC_MAN7)
 RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
 
 $(OUTPUT)$(HELPERS_RST): $(UP2DIR)../../include/uapi/linux/bpf.h
-   $(QUIET_GEN)$(UP2DIR)../../scripts/bpf_helpers_doc.py --filename $< > $@
+   $(QUIET_GEN)$(UP2DIR)../../scripts/bpf_doc.py --filename $< > $@
 
 $(OUTPUT)%.7: $(OUTPUT)%.rst
 ifndef RST2MAN_DEP
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4c24daa43bac..16f2f0d2338a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -720,7 +720,7 @@ union bpf_attr {
  * parsed and used to produce a manual page. The workflow is the following,
  * and requires the rst2man utility:
  *
- * $ ./scripts/bpf_helpers_doc.py \
+ * $ ./scripts/bpf_doc.py \
  * --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
  * $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
  * $ man /tmp/bpf-helpers.7
diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 887a494ad5fc..8170f88e8ea6 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -158,7 +158,7 @@ $(BPF_IN_STATIC): force $(BPF_HELPER_DEFS)
$(Q)$(MAKE) $(build)=libbpf OUTPUT=$(STATIC_OBJDIR)
 
 $(BPF_HELPER_DEFS): $(srctree)/tools/include/uapi/linux/bpf.h
-   $(QUIET_GEN)$(srctree)/scripts/bpf_helpers_doc.py --header \
+   $(QUIET_GEN)$(srctree)/scripts/bpf_doc.py --header \
--file $(srctree)/tools/include/uapi/linux/bpf.h > $(BPF_HELPER_DEFS)
 
 $(OUTPUT)libbpf.so: $(OUTPUT)libbpf.so.$(LIBBPF_VERSION)
diff --git a/tools/perf/MANIFEST b/tools/perf/MANIFEST
index 5d7b947320fb..f05c4d48fd7e 100644
--- a/tools/perf/MANIFEST
+++ b/tools/perf/MANIFEST
@@ -20,4 +20,4 @@ tools/lib/bitmap.c
 tools/lib/str_error_r.c
 tools/lib/vsprintf.c
 tools/lib/zalloc.c
-scripts/bpf_helpers_doc.py
+scripts/bpf_doc.py
-- 
2.27.0



[PATCH bpf-next 08/17] bpf: Document BPF_MAP_*_BATCH syscall commands

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Based roughly on the following commits:
* Commit cb4d03ab499d ("bpf: Add generic support for lookup batch op")
* Commit 057996380a42 ("bpf: Add batch ops to all htab bpf map")
* Commit aa2e93b8e58e ("bpf: Add generic support for update and delete
  batch ops")

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Brian Vazquez 
CC: Yonghong Song 

@Yonghong, would you mind double-checking whether the text is accurate for the
case where BPF_MAP_LOOKUP_AND_DELETE_BATCH returns -EFAULT?
---
 include/uapi/linux/bpf.h | 114 +--
 1 file changed, 111 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a07cecfd2148..893803f69a64 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -550,13 +550,55 @@ union bpf_iter_link_info {
  * Description
  * Iterate and fetch multiple elements in a map.
  *
+ * Two opaque values are used to manage batch operations,
+ * *in_batch* and *out_batch*. Initially, *in_batch* must be set
+ * to NULL to begin the batched operation. After each subsequent
+ * **BPF_MAP_LOOKUP_BATCH**, the caller should pass the resultant
+ * *out_batch* as the *in_batch* for the next operation to
+ * continue iteration from the current point.
+ *
+ * The *keys* and *values* are output parameters which must point
+ * to memory large enough to hold *count* items based on the key
+ * and value size of the map *map_fd*. The *keys* buffer must be
+ * of *key_size* * *count*. The *values* buffer must be of
+ * *value_size* * *count*.
+ *
+ * The *elem_flags* argument may be specified as one of the
+ * following:
+ *
+ * **BPF_F_LOCK**
+ * Look up the value of a spin-locked map without
+ * returning the lock. This must be specified if the
+ * elements contain a spinlock.
+ *
+ * On success, *count* elements from the map are copied into the
+ * user buffer, with the keys copied into *keys* and the values
+ * copied into the corresponding indices in *values*.
+ *
+ * If an error is returned and *errno* is not **EFAULT**, *count*
+ * is set to the number of successfully processed elements.
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
  *
+ * May set *errno* to **ENOSPC** to indicate that *keys* or
+ * *values* is too small to dump an entire bucket during
+ * iteration of a hash-based map type.
+ *
  * BPF_MAP_LOOKUP_AND_DELETE_BATCH
  * Description
- * Iterate and delete multiple elements in a map.
+ * Iterate and delete all elements in a map.
+ *
+ * This operation has the same behavior as
+ * **BPF_MAP_LOOKUP_BATCH** with two exceptions:
+ *
+ * * Every element that is successfully returned is also deleted
+ *   from the map. This is at least *count* elements. Note that
+ *   *count* is both an input and an output parameter.
+ * * Upon returning with *errno* set to **EFAULT**, up to
+ *   *count* elements may be deleted without returning the keys
+ *   and values of the deleted elements.
  *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
@@ -564,15 +606,81 @@ union bpf_iter_link_info {
  *
  * BPF_MAP_UPDATE_BATCH
  * Description
- * Iterate and update multiple elements in a map.
+ * Update multiple elements in a map by *key*.
+ *
+ * The *keys* and *values* are input parameters which must point
+ * to memory large enough to hold *count* items based on the key
+ * and value size of the map *map_fd*. The *keys* buffer must be
+ * of *key_size* * *count*. The *values* buffer must be of
+ * *value_size* * *count*.
+ *
+ * Each element specified in *keys* is sequentially updated to the
+ * value in the corresponding index in *values*. The *in_batch*
+ * and *out_batch* parameters are ignored and should be zeroed.
+ *
+ * The *elem_flags* argument should be specified as one of the
+ * following:
+ *
+ * **BPF_ANY**
+ * Create new elements or update existing elements.
+ * **BPF_NOEXIST**
+ * Create new elements only if they do not exist.
+ * **BPF_EXIST**
+ * Update existing elements.
+ * **BPF_F_LOCK**
+ * Update spin_lock-ed map elements. This must be
+ * specified if the elements contain a spinlock.
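
To make the *in_batch*/*out_batch* protocol above concrete, here is a
hedged userspace sketch using libbpf's bpf_map_lookup_batch() wrapper
(the token type is assumed to be __u32, which fits hash maps; other
map types may use differently sized tokens):

  #include <errno.h>
  #include <bpf/bpf.h>

  static int dump_map(int map_fd)
  {
          __u32 keys[64];
          __u64 values[64];
          __u32 in_tok = 0, out_tok = 0;
          void *in = NULL;                /* NULL begins the batch */
          int err;

          do {
                  __u32 count = 64;       /* in: capacity, out: copied */

                  err = bpf_map_lookup_batch(map_fd, in, &out_tok,
                                             keys, values, &count, NULL);
                  if (err && errno != ENOENT)
                          return -errno;  /* e.g. ENOSPC, see above */
                  /* ... consume 'count' key/value pairs here ... */
                  in_tok = out_tok;       /* resume from returned token */
                  in = &in_tok;
          } while (!err);                 /* ENOENT: iteration finished */
          return 0;
  }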

[PATCH bpf-next 16/17] docs/bpf: Add bpf() syscall command reference

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Generate the syscall command reference from the UAPI header file and
include it in the main bpf docs page.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 Documentation/Makefile |  2 ++
 Documentation/bpf/Makefile | 28 
 Documentation/bpf/bpf_commands.rst |  5 +
 Documentation/bpf/index.rst| 14 +++---
 4 files changed, 46 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/bpf/Makefile
 create mode 100644 Documentation/bpf/bpf_commands.rst

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 9c42dde97671..408542825cc2 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -73,6 +73,7 @@ loop_cmd = $(echo-cmd) $(cmd_$(1)) || exit;
 
 quiet_cmd_sphinx = SPHINX  $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
  cmd_sphinx = $(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/userspace-api/media $2 && \
+   $(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/bpf $2 && \
PYTHONDONTWRITEBYTECODE=1 \
BUILDDIR=$(abspath $(BUILDDIR)) SPHINX_CONF=$(abspath $(srctree)/$(src)/$5/$(SPHINX_CONF)) \
$(PYTHON3) $(srctree)/scripts/jobserver-exec \
@@ -133,6 +134,7 @@ refcheckdocs:
 
 cleandocs:
$(Q)rm -rf $(BUILDDIR)
+   $(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/bpf clean
$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/userspace-api/media clean
 
 dochelp:
diff --git a/Documentation/bpf/Makefile b/Documentation/bpf/Makefile
new file mode 100644
index ..4f14db0891cc
--- /dev/null
+++ b/Documentation/bpf/Makefile
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# Rules to convert a .h file to inline RST documentation
+
+SRC_DIR = $(srctree)/Documentation/bpf
+PARSER = $(srctree)/scripts/bpf_doc.py
+UAPI = $(srctree)/include/uapi/linux
+
+TARGETS = $(BUILDDIR)/bpf/bpf_syscall.rst
+
+$(BUILDDIR)/bpf/bpf_syscall.rst: $(UAPI)/bpf.h
+   $(PARSER) syscall > $@
+
+.PHONY: all html epub xml latex linkcheck clean
+
+all: $(IMGDOT) $(BUILDDIR)/bpf $(TARGETS)
+
+html: all
+epub: all
+xml: all
+latex: $(IMGPDF) all
+linkcheck:
+
+clean:
+   -rm -f -- $(TARGETS) 2>/dev/null
+
+$(BUILDDIR)/bpf:
+   $(Q)mkdir -p $@
diff --git a/Documentation/bpf/bpf_commands.rst b/Documentation/bpf/bpf_commands.rst
new file mode 100644
index ..da388ffac85b
--- /dev/null
+++ b/Documentation/bpf/bpf_commands.rst
@@ -0,0 +1,5 @@
+**************************
+bpf() subcommand reference
+**************************
+
+.. kernel-include:: $BUILDDIR/bpf/bpf_syscall.rst
diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index 4f2874b729c3..631d02d4dc49 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -12,9 +12,6 @@ BPF instruction-set.
 The Cilium project also maintains a `BPF and XDP Reference Guide`_
 that goes into great technical depth about the BPF Architecture.
 
-The primary info for the bpf syscall is available in the `man-pages`_
-for `bpf(2)`_.
-
 BPF Type Format (BTF)
 =
 
@@ -35,6 +32,17 @@ Two sets of Questions and Answers (Q&A) are maintained.
bpf_design_QA
bpf_devel_QA
 
+Syscall API
+===
+
+The primary info for the bpf syscall is available in the `man-pages`_
+for `bpf(2)`_.
+
+.. toctree::
+   :maxdepth: 1
+
+   bpf_commands
+
 
 Helper functions
 
-- 
2.27.0



[PATCH bpf-next 12/17] tools/bpf: Rename Makefile.{helpers,docs}

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

In anticipation of including make targets for other manual pages in this
makefile, rename it to something a bit more generic.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/bpf/{Makefile.helpers => Makefile.docs} | 2 +-
 tools/bpf/bpftool/Documentation/Makefile  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
 rename tools/bpf/{Makefile.helpers => Makefile.docs} (95%)

diff --git a/tools/bpf/Makefile.helpers b/tools/bpf/Makefile.docs
similarity index 95%
rename from tools/bpf/Makefile.helpers
rename to tools/bpf/Makefile.docs
index a26599022fd6..dc4ce82ada33 100644
--- a/tools/bpf/Makefile.helpers
+++ b/tools/bpf/Makefile.docs
@@ -3,7 +3,7 @@ ifndef allow-override
   include ../scripts/Makefile.include
   include ../scripts/utilities.mak
 else
-  # Assume Makefile.helpers is being run from bpftool/Documentation
+  # Assume Makefile.docs is being run from bpftool/Documentation
   # subdirectory. Go up two more directories to fetch bpf.h header and
   # associated script.
   UP2DIR := ../../
diff --git a/tools/bpf/bpftool/Documentation/Makefile b/tools/bpf/bpftool/Documentation/Makefile
index f33cb02de95c..bb7842efffd6 100644
--- a/tools/bpf/bpftool/Documentation/Makefile
+++ b/tools/bpf/bpftool/Documentation/Makefile
@@ -16,8 +16,8 @@ prefix ?= /usr/local
 mandir ?= $(prefix)/man
 man8dir = $(mandir)/man8
 
-# Load targets for building eBPF helpers man page.
-include ../../Makefile.helpers
+# Load targets for building eBPF man page.
+include ../../Makefile.docs
 
 MAN8_RST = $(wildcard bpftool*.rst)
 
-- 
2.27.0



[PATCH bpf-next 14/17] tools/bpf: Build bpf-syscall.2 in Makefile.docs

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Add building of the bpf(2) syscall commands documentation as part of
the docs build step. This allows us to pick up on potential parse
errors from the docs generator script as part of selftests.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 tools/bpf/Makefile.docs | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/Makefile.docs b/tools/bpf/Makefile.docs
index 7111888ca5d8..47da582cdaf2 100644
--- a/tools/bpf/Makefile.docs
+++ b/tools/bpf/Makefile.docs
@@ -21,18 +21,27 @@ endif
 
 prefix ?= /usr/local
 mandir ?= $(prefix)/man
+man2dir = $(mandir)/man2
 man7dir = $(mandir)/man7
 
+SYSCALL_RST = bpf-syscall.rst
+MAN2_RST = $(SYSCALL_RST)
+
 HELPERS_RST = bpf-helpers.rst
 MAN7_RST = $(HELPERS_RST)
 
+_DOC_MAN2 = $(patsubst %.rst,%.2,$(MAN2_RST))
+DOC_MAN2 = $(addprefix $(OUTPUT),$(_DOC_MAN2))
+
 _DOC_MAN7 = $(patsubst %.rst,%.7,$(MAN7_RST))
 DOC_MAN7 = $(addprefix $(OUTPUT),$(_DOC_MAN7))
 
-DOCTARGETS := helpers
+DOCTARGETS := helpers syscall
 
 docs: $(DOCTARGETS)
+syscall: man2
 helpers: man7
+man2: $(DOC_MAN2)
 man7: $(DOC_MAN7)
 
 RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
@@ -70,9 +79,10 @@ endef
 
 # Create the make targets to generate manual pages by name and section
 $(eval $(call DOCS_RULES,helpers,7))
+$(eval $(call DOCS_RULES,syscall,2))
 
 docs-clean: $(foreach doctarget,$(DOCTARGETS), docs-clean-$(doctarget))
 docs-install: $(foreach doctarget,$(DOCTARGETS), docs-install-$(doctarget))
 docs-uninstall: $(foreach doctarget,$(DOCTARGETS), docs-uninstall-$(doctarget))
 
-.PHONY: docs docs-clean docs-install docs-uninstall man7
+.PHONY: docs docs-clean docs-install docs-uninstall man2 man7
-- 
2.27.0
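
With these rules in place, both manual sections can be built from the
tree with the standalone makefile (a sketch; output lands under the
configured OUTPUT directory):

  $ make -C tools/bpf -f Makefile.docs docs
  $ make -C tools/bpf -f Makefile.docs docs-clean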



[PATCH bpf-next 10/17] scripts/bpf: Abstract eBPF API target parameter

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Abstract out the target parameter so that upcoming commits, more than
just the existing "helpers" target can be called to generate specific
portions of docs from the eBPF UAPI headers.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 scripts/bpf_doc.py | 87 --
 1 file changed, 61 insertions(+), 26 deletions(-)

diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index ca6e7559d696..5a4f68aab335 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -2,6 +2,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 #
 # Copyright (C) 2018-2019 Netronome Systems, Inc.
+# Copyright (C) 2021 Isovalent, Inc.
 
 # In case user attempts to run with Python 2.
 from __future__ import print_function
@@ -165,10 +166,11 @@ class Printer(object):
 """
 A generic class for printers. Printers should be created with an array of
 Helper objects, and implement a way to print them in the desired fashion.
-@helpers: array of Helper objects to print to standard output
+@parser: A HeaderParser with objects to print to standard output
 """
-def __init__(self, helpers):
-self.helpers = helpers
+def __init__(self, parser):
+self.parser = parser
+self.elements = []
 
 def print_header(self):
 pass
@@ -181,19 +183,23 @@ class Printer(object):
 
 def print_all(self):
 self.print_header()
-for helper in self.helpers:
-self.print_one(helper)
+for elem in self.elements:
+self.print_one(elem)
 self.print_footer()
 
+
 class PrinterRST(Printer):
 """
-A printer for dumping collected information about helpers as a ReStructured
-Text page compatible with the rst2man program, which can be used to
-generate a manual page for the helpers.
-@helpers: array of Helper objects to print to standard output
+A generic class for printers that print ReStructured Text. Printers should
+be created with a HeaderParser object, and implement a way to print API
+elements in the desired fashion.
+@parser: A HeaderParser with objects to print to standard output
 """
-def print_header(self):
-header = '''\
+def __init__(self, parser):
+self.parser = parser
+
+def print_license(self):
+license = '''\
 .. Copyright (C) All BPF authors and contributors from 2014 to present.
 .. See git log include/uapi/linux/bpf.h in kernel tree for details.
 .. 
@@ -223,7 +229,37 @@ class PrinterRST(Printer):
 .. located in file include/uapi/linux/bpf.h of the Linux kernel sources
 .. (helpers description), and from scripts/bpf_doc.py in the same
 .. repository (header and footer).
+'''
+print(license)
+
+def print_elem(self, elem):
+if (elem.desc):
+print('\tDescription')
+# Do not strip all newline characters: formatted code at the end of
+# a section must be followed by a blank line.
+for line in re.sub('\n$', '', elem.desc, count=1).split('\n'):
+print('{}{}'.format('\t\t' if line else '', line))
+
+if (elem.ret):
+print('\tReturn')
+for line in elem.ret.rstrip().split('\n'):
+print('{}{}'.format('\t\t' if line else '', line))
+
+print('')
 
+
+class PrinterHelpersRST(PrinterRST):
+"""
+A printer for dumping collected information about helpers as a ReStructured
+Text page compatible with the rst2man program, which can be used to
+generate a manual page for the helpers.
+@parser: A HeaderParser with Helper objects to print to standard output
+"""
+def __init__(self, parser):
+self.elements = parser.helpers
+
+def print_header(self):
+header = '''\
 ===
 BPF-HELPERS
 ===
@@ -264,6 +300,7 @@ kernel at the top).
 HELPERS
 ===
 '''
+PrinterRST.print_license(self)
 print(header)
 
 def print_footer(self):
@@ -380,27 +417,19 @@ SEE ALSO
 
 def print_one(self, helper):
 self.print_proto(helper)
+self.print_elem(helper)
 
-if (helper.desc):
-print('\tDescription')
-# Do not strip all newline characters: formatted code at the end of
-# a section must be followed by a blank line.
-for line in re.sub('\n$', '', helper.desc, count=1).split('\n'):
-print('{}{}'.format('\t\t' if line else '', line))
 
-if (helper.ret):
-print('\tReturn')
-for line in helper.ret.rstrip().split('\n'):

[PATCH bpf-next 07/17] bpf: Document BPF_PROG_QUERY syscall command

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Commit 468e2f64d220 ("bpf: introduce BPF_PROG_QUERY command") originally
introduced this, but there have been several additions since then.
Unlike BPF_PROG_ATTACH, it appears that sockmap progs cannot be
queried so far.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 37 +
 1 file changed, 37 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 86fe0445c395..a07cecfd2148 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -386,6 +386,43 @@ union bpf_iter_link_info {
  * Obtain information about eBPF programs associated with the
  * specified *attach_type* hook.
  *
+ * The *target_fd* must be a valid file descriptor for a kernel
+ * object which depends on the attach type of *attach_bpf_fd*:
+ *
+ * **BPF_PROG_TYPE_CGROUP_DEVICE**,
+ * **BPF_PROG_TYPE_CGROUP_SKB**,
+ * **BPF_PROG_TYPE_CGROUP_SOCK**,
+ * **BPF_PROG_TYPE_CGROUP_SOCK_ADDR**,
+ * **BPF_PROG_TYPE_CGROUP_SOCKOPT**,
+ * **BPF_PROG_TYPE_CGROUP_SYSCTL**,
+ * **BPF_PROG_TYPE_SOCK_OPS**
+ *
+ * Control Group v2 hierarchy with the eBPF controller
+ * enabled. Requires the kernel to be compiled with
+ * **CONFIG_CGROUP_BPF**.
+ *
+ * **BPF_PROG_TYPE_FLOW_DISSECTOR**
+ *
+ * Network namespace (eg /proc/self/ns/net).
+ *
+ * **BPF_PROG_TYPE_LIRC_MODE2**
+ *
+ * LIRC device path (eg /dev/lircN). Requires the kernel
+ * to be compiled with **CONFIG_BPF_LIRC_MODE2**.
+ *
+ * **BPF_PROG_QUERY** always fetches the number of programs
+ * attached and the *attach_flags* which were used to attach those
+ * programs. Additionally, if *prog_ids* is nonzero and the number
+ * of attached programs is less than *prog_cnt*, populates
+ * *prog_ids* with the eBPF program ids of the programs attached
+ * at *target_fd*.
+ *
+ * The following flags may alter the result:
+ *
+ * **BPF_F_QUERY_EFFECTIVE**
+ * Only return information regarding programs which are
+ * currently effective at the specified *target_fd*.
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
-- 
2.27.0
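
For reference, a minimal userspace sketch of the query flow documented
above, using the raw bpf(2) syscall. The cgroup path is hypothetical
and error handling is trimmed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(void)
{
	union bpf_attr attr;
	__u32 prog_ids[64], cnt, i;
	int cg_fd;

	cg_fd = open("/sys/fs/cgroup/mygroup", O_RDONLY); /* hypothetical */
	if (cg_fd < 0)
		return 1;

	memset(&attr, 0, sizeof(attr));
	attr.query.target_fd = cg_fd;
	attr.query.attach_type = BPF_CGROUP_INET_INGRESS;
	attr.query.prog_ids = (__u64)(unsigned long)prog_ids;
	attr.query.prog_cnt = 64;

	if (syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr)) < 0)
		return 1;

	/* The kernel writes the attached program count back to prog_cnt. */
	cnt = attr.query.prog_cnt < 64 ? attr.query.prog_cnt : 64;
	for (i = 0; i < cnt; i++)
		printf("attached prog id: %u\n", prog_ids[i]);
	return 0;
}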



[PATCH bpf-next 06/17] bpf: Document BPF_PROG_TEST_RUN syscall command

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Based on a brief read of the corresponding source code.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 603605c5ca03..86fe0445c395 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -303,14 +303,22 @@ union bpf_iter_link_info {
  *
  * BPF_PROG_TEST_RUN
  * Description
- * Run an eBPF program a number of times against a provided
- * program context and return the modified program context and
- * duration of the test run.
+ * Run the eBPF program associated with the *prog_fd* a *repeat*
+ * number of times against a provided program context *ctx_in* and
+ * data *data_in*, and return the modified program context
+ * *ctx_out*, *data_out* (for example, packet data), result of the
+ * execution *retval*, and *duration* of the test run.
  *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
  *
+ * **ENOSPC**
+ * Either *data_size_out* or *ctx_size_out* is too small.
+ * **ENOTSUPP**
+ * This command is not supported by the program type of
+ * the program referred to by *prog_fd*.
+ *
  * BPF_PROG_GET_NEXT_ID
  * Description
  * Fetch the next eBPF program currently loaded into the kernel.
-- 
2.27.0
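
As a usage illustration of the fields mentioned above, a minimal
sketch that runs a loaded program once against a caller-supplied
packet buffer (raw bpf(2) syscall; prog_fd is assumed to refer to a
program type that supports test runs):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int test_run_once(int prog_fd, void *pkt, __u32 pkt_len)
{
	char data_out[1500];
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.test.prog_fd = prog_fd;
	attr.test.repeat = 1;
	attr.test.data_in = (__u64)(unsigned long)pkt;
	attr.test.data_size_in = pkt_len;
	attr.test.data_out = (__u64)(unsigned long)data_out;
	attr.test.data_size_out = sizeof(data_out); /* too small => ENOSPC */

	if (syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr)) < 0)
		return -1;

	/* The kernel fills in attr.test.retval and attr.test.duration. */
	return attr.test.retval;
}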



[PATCH bpf-next 05/17] bpf: Document BPF_PROG_ATTACH syscall command

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Document the prog attach command in more detail, based on git commits:
* commit f4324551489e ("bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH
  commands")
* commit 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor
  socket TX/RX data")
* commit f4364dcfc86d ("media: rc: introduce BPF_PROG_LIRC_MODE2")
* commit d58e468b1112 ("flow_dissector: implements flow dissector BPF
  hook")

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Daniel Mack 
CC: John Fastabend 
CC: Sean Young 
CC: Petar Penkov 
---
 include/uapi/linux/bpf.h | 37 +
 1 file changed, 37 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8301a19c97de..603605c5ca03 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -250,6 +250,43 @@ union bpf_iter_link_info {
  * Attach an eBPF program to a *target_fd* at the specified
  * *attach_type* hook.
  *
+ * The *attach_type* specifies the eBPF attachment point to
+ * attach the program to, and must be one of *bpf_attach_type*
+ * (see below).
+ *
+ * The *attach_bpf_fd* must be a valid file descriptor for a
+ * loaded eBPF program of a cgroup, flow dissector, LIRC, sockmap
+ * or sock_ops type corresponding to the specified *attach_type*.
+ *
+ * The *target_fd* must be a valid file descriptor for a kernel
+ * object which depends on the attach type of *attach_bpf_fd*:
+ *
+ * **BPF_PROG_TYPE_CGROUP_DEVICE**,
+ * **BPF_PROG_TYPE_CGROUP_SKB**,
+ * **BPF_PROG_TYPE_CGROUP_SOCK**,
+ * **BPF_PROG_TYPE_CGROUP_SOCK_ADDR**,
+ * **BPF_PROG_TYPE_CGROUP_SOCKOPT**,
+ * **BPF_PROG_TYPE_CGROUP_SYSCTL**,
+ * **BPF_PROG_TYPE_SOCK_OPS**
+ *
+ * Control Group v2 hierarchy with the eBPF controller
+ * enabled. Requires the kernel to be compiled with
+ * **CONFIG_CGROUP_BPF**.
+ *
+ * **BPF_PROG_TYPE_FLOW_DISSECTOR**
+ *
+ * Network namespace (eg /proc/self/ns/net).
+ *
+ * **BPF_PROG_TYPE_LIRC_MODE2**
+ *
+ * LIRC device path (eg /dev/lircN). Requires the kernel
+ * to be compiled with **CONFIG_BPF_LIRC_MODE2**.
+ *
+ * **BPF_PROG_TYPE_SK_SKB**,
+ * **BPF_PROG_TYPE_SK_MSG**
+ *
+ * eBPF map of socket type (eg **BPF_MAP_TYPE_SOCKHASH**).
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
-- 
2.27.0
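
A minimal sketch of the cgroup case documented above (raw bpf(2)
syscall; the cgroup path is hypothetical and prog_fd is assumed to
refer to a loaded BPF_PROG_TYPE_CGROUP_SKB program):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int attach_ingress(int prog_fd)
{
	union bpf_attr attr;
	int cg_fd;

	cg_fd = open("/sys/fs/cgroup/mygroup", O_RDONLY); /* hypothetical */
	if (cg_fd < 0)
		return -1;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cg_fd;
	attr.attach_bpf_fd = prog_fd;
	attr.attach_type = BPF_CGROUP_INET_INGRESS;

	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}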



[PATCH bpf-next 04/17] bpf: Document BPF_PROG_PIN syscall command

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Commit b2197755b263 ("bpf: add support for persistent maps/progs")
contains the original implementation and git logs, used as reference for
this documentation.

Also pull in the filename restriction as documented in commit 6d8cb045cde6
("bpf: comment why dots in filenames under BPF virtual FS are not allowed")

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Daniel Borkmann 
---
 include/uapi/linux/bpf.h | 34 +++---
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d02259458fd6..8301a19c97de 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -216,6 +216,22 @@ union bpf_iter_link_info {
  * Pin an eBPF program or map referred by the specified *bpf_fd*
  * to the provided *pathname* on the filesystem.
  *
+ * The *pathname* argument must not contain a dot (".").
+ *
+ * On success, *pathname* retains a reference to the eBPF object,
+ * preventing deallocation of the object when the original
+ * *bpf_fd* is closed. This allows the eBPF object to live beyond
+ * **close**\ (\ *bpf_fd*\ ), and hence the lifetime of the parent
+ * process.
+ *
+ * Applying **unlink**\ (2) or similar calls to the *pathname*
+ * unpins the object from the filesystem, removing the reference.
+ * If no other file descriptors or filesystem nodes refer to the
+ * same object, it will be deallocated (see NOTES).
+ *
+ * The filesystem type for the parent directory of *pathname* must
+ * be **BPF_FS_MAGIC**.
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
@@ -581,13 +597,17 @@ union bpf_iter_link_info {
  *
  * NOTES
  * eBPF objects (maps and programs) can be shared between processes.
- * For example, after **fork**\ (2), the child inherits file descriptors
- * referring to the same eBPF objects. In addition, file descriptors
- * referring to eBPF objects can be transferred over UNIX domain sockets.
- * File descriptors referring to eBPF objects can be duplicated in the
- * usual way, using **dup**\ (2) and similar calls. An eBPF object is
- * deallocated only after all file descriptors referring to the object
- * have been closed.
+ * * After **fork**\ (2), the child inherits file descriptors
+ *   referring to the same eBPF objects.
+ * * File descriptors referring to eBPF objects can be transferred over
+ *   **unix**\ (7) domain sockets.
+ * * File descriptors referring to eBPF objects can be duplicated in the
+ *   usual way, using **dup**\ (2) and similar calls.
+ * * File descriptors referring to eBPF objects can be pinned to the
+ *   filesystem using the **BPF_OBJ_PIN** command of **bpf**\ (2).
+ * An eBPF object is deallocated only after all file descriptors referring
+ * to the object have been closed and no references remain pinned to the
+ * filesystem or attached (for example, bound to a program or device).
  */
 enum bpf_cmd {
BPF_MAP_CREATE,
-- 
2.27.0
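
A minimal sketch of the pin/re-open lifecycle described above (raw
bpf(2) syscall; the path must live under a bpffs mount and, per the
text above, must not contain a dot):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int pin_fd(int fd, const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (__u64)(unsigned long)path;
	attr.bpf_fd = fd;

	/* On success the object stays alive after close(fd). */
	return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

static int get_pinned_fd(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (__u64)(unsigned long)path;

	/* Re-open a reference to the pinned object, e.g. after restart. */
	return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}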



[PATCH bpf-next 03/17] bpf: Document BPF_F_LOCK in syscall commands

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Document the meaning of the BPF_F_LOCK flag for the map lookup/update
descriptions. Based on commit 96049f3afd50 ("bpf: introduce BPF_F_LOCK
flag").

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
CC: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ac6880d7b01b..d02259458fd6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -120,6 +120,14 @@ union bpf_iter_link_info {
  * Look up an element with a given *key* in the map referred to
  * by the file descriptor *map_fd*.
  *
+ * The *flags* argument may be specified as one of the
+ * following:
+ *
+ * **BPF_F_LOCK**
+ * Look up the value of a spin-locked map without
+ * returning the lock. This must be specified if the
+ * elements contain a spinlock.
+ *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
  * is set appropriately.
@@ -137,6 +145,8 @@ union bpf_iter_link_info {
  * Create a new element only if it did not exist.
  * **BPF_EXIST**
  * Update an existing element.
+ * **BPF_F_LOCK**
+ * Update a spin_lock-ed map element.
  *
  * Return
  * Returns zero on success. On error, -1 is returned and *errno*
-- 
2.27.0
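
A minimal sketch of a locked lookup as documented above (raw bpf(2)
syscall; map_fd is assumed to refer to a map whose value contains a
struct bpf_spin_lock):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int lookup_locked(int map_fd, const void *key, void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)key;
	attr.value = (__u64)(unsigned long)value;
	attr.flags = BPF_F_LOCK; /* copy the value with the lock held */

	return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}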



[PATCH bpf-next 01/17] bpf: Import syscall arg documentation

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

These descriptions are present in the man-pages project from the
original submissions around 2015-2016. Import them so that they can be
kept up to date as developers extend the bpf syscall commands.

These descriptions follow the pattern used by scripts/bpf_helpers_doc.py
so that we can take advantage of the parser to generate more up-to-date
man page content based upon these headers.

Some minor wording adjustments were made to make the descriptions
more consistent with the description / return format.

Reviewed-by: Quentin Monnet 
Co-authored-by: Alexei Starovoitov 
Co-authored-by: Michael Kerrisk 
Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h | 119 ++-
 1 file changed, 118 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4c24daa43bac..56d7db0f3daf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -93,7 +93,124 @@ union bpf_iter_link_info {
} map;
 };
 
-/* BPF syscall commands, see bpf(2) man-page for details. */
+/* BPF syscall commands, see bpf(2) man-page for more details.
+ *
+ * The operation to be performed by the **bpf**\ () system call is determined
+ * by the *cmd* argument. Each operation takes an accompanying argument,
+ * provided via *attr*, which is a pointer to a union of type *bpf_attr* (see
+ * below). The size argument is the size of the union pointed to by *attr*.
+ *
+ * Start of BPF syscall commands:
+ *
+ * BPF_MAP_CREATE
+ * Description
+ * Create a map and return a file descriptor that refers to the
+ * map. The close-on-exec file descriptor flag (see **fcntl**\ (2))
+ * is automatically enabled for the new file descriptor.
+ *
+ * Applying **close**\ (2) to the file descriptor returned by
+ * **BPF_MAP_CREATE** will delete the map (but see NOTES).
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_MAP_LOOKUP_ELEM
+ * Description
+ * Look up an element with a given *key* in the map referred to
+ * by the file descriptor *map_fd*.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_MAP_UPDATE_ELEM
+ * Description
+ * Create or update an element (key/value pair) in a specified map.
+ *
+ * The *flags* argument should be specified as one of the
+ * following:
+ *
+ * **BPF_ANY**
+ * Create a new element or update an existing element.
+ * **BPF_NOEXIST**
+ * Create a new element only if it did not exist.
+ * **BPF_EXIST**
+ * Update an existing element.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * May set *errno* to **EINVAL**, **EPERM**, **ENOMEM**,
+ * **E2BIG**, **EEXIST**, or **ENOENT**.
+ *
+ * **E2BIG**
+ * The number of elements in the map reached the
+ * *max_entries* limit specified at map creation time.
+ * **EEXIST**
+ * If *flags* specifies **BPF_NOEXIST** and the element
+ * with *key* already exists in the map.
+ * **ENOENT**
+ * If *flags* specifies **BPF_EXIST** and the element with
+ * *key* does not exist in the map.
+ *
+ * BPF_MAP_DELETE_ELEM
+ * Description
+ * Look up and delete an element by key in a specified map.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_MAP_GET_NEXT_KEY
+ * Description
+ * Look up an element by key in a specified map and return the key
+ * of the next element. Can be used to iterate over all elements
+ * in the map.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * The following cases can be used to iterate over all elements of
+ * the map:
+ *
+ * * If *key* is not found, the operation returns zero and sets
+ *   the *next_key* pointer to the key of the first element.
+ * * If *key* is found, the operation returns zero and sets the
+ *   *next_key* pointer to the key of the next element.
+ * * If *key* is the last element, returns -1 and *errno* is set
+ *   to **ENOENT**.
+ *
+ * May set *errno* to **ENOMEM**, **EFAULT**, **EPERM**, or
+ * **EINVAL** on error.
+ *
+ * BPF_PROG_LOAD
+ * Description

[PATCH bpf-next 02/17] bpf: Add minimal bpf() command documentation

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

Introduce high-level descriptions of the intent and return codes of the
bpf() syscall commands. Subsequent patches may further flesh out the
content to provide a more useful programming reference.

Reviewed-by: Quentin Monnet 
Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h | 368 +++
 1 file changed, 368 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 56d7db0f3daf..ac6880d7b01b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -201,6 +201,374 @@ union bpf_iter_link_info {
  * A new file descriptor (a nonnegative integer), or -1 if an
  * error occurred (in which case, *errno* is set appropriately).
  *
+ * BPF_OBJ_PIN
+ * Description
+ * Pin an eBPF program or map referred by the specified *bpf_fd*
+ * to the provided *pathname* on the filesystem.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_OBJ_GET
+ * Description
+ * Open a file descriptor for the eBPF object pinned to the
+ * specified *pathname*.
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_PROG_ATTACH
+ * Description
+ * Attach an eBPF program to a *target_fd* at the specified
+ * *attach_type* hook.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_PROG_DETACH
+ * Description
+ * Detach the eBPF program associated with the *target_fd* at the
+ * hook specified by *attach_type*. The program must have been
+ * previously attached using **BPF_PROG_ATTACH**.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_PROG_TEST_RUN
+ * Description
+ * Run an eBPF program a number of times against a provided
+ * program context and return the modified program context and
+ * duration of the test run.
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_PROG_GET_NEXT_ID
+ * Description
+ * Fetch the next eBPF program currently loaded into the kernel.
+ *
+ * Looks for the eBPF program with an id greater than *start_id*
+ * and updates *next_id* on success. If no other eBPF programs
+ * remain with ids higher than *start_id*, returns -1 and sets
+ * *errno* to **ENOENT**.
+ *
+ * Return
+ * Returns zero on success. On error, or when no id remains, -1
+ * is returned and *errno* is set appropriately.
+ *
+ * BPF_MAP_GET_NEXT_ID
+ * Description
+ * Fetch the next eBPF map currently loaded into the kernel.
+ *
+ * Looks for the eBPF map with an id greater than *start_id*
+ * and updates *next_id* on success. If no other eBPF maps
+ * remain with ids higher than *start_id*, returns -1 and sets
+ * *errno* to **ENOENT**.
+ *
+ * Return
+ * Returns zero on success. On error, or when no id remains, -1
+ * is returned and *errno* is set appropriately.
+ *
+ * BPF_PROG_GET_FD_BY_ID
+ * Description
+ * Open a file descriptor for the eBPF program corresponding to
+ * *prog_id*.
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_MAP_GET_FD_BY_ID
+ * Description
+ * Open a file descriptor for the eBPF map corresponding to
+ * *map_id*.
+ *
+ * Return
+ * A new file descriptor (a nonnegative integer), or -1 if an
+ * error occurred (in which case, *errno* is set appropriately).
+ *
+ * BPF_OBJ_GET_INFO_BY_FD
+ * Description
+ * Obtain information about the eBPF object corresponding to
+ * *bpf_fd*.
+ *
+ * Populates up to *info_len* bytes of *info*, which will be in
+ * one of the following formats depending on the eBPF object type
+ * of *bpf_fd*:
+ *
+ * * **struct bpf_prog_info**
+ * * **struct bpf_map_info**
+ * * **struct bpf_btf_info**
+ * * **struct bpf_link_info**
+ *
+ * Return
+ * Returns zero on success. On error, -1 is returned and *errno*
+ * is set appropriately.
+ *
+ * BPF_PROG_QUERY
+ * Description
+ * Obtain information about eBPF programs associated with the
+ * specified
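
Combining BPF_PROG_GET_NEXT_ID and BPF_PROG_GET_FD_BY_ID as described
above yields a simple way to walk every program loaded in the kernel;
a minimal sketch (raw bpf(2) syscall, privileged caller assumed):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static void walk_progs(void)
{
	union bpf_attr attr;
	__u32 id = 0;
	int fd;

	for (;;) {
		memset(&attr, 0, sizeof(attr));
		attr.start_id = id;
		if (syscall(__NR_bpf, BPF_PROG_GET_NEXT_ID, &attr, sizeof(attr))) {
			if (errno != ENOENT)
				perror("BPF_PROG_GET_NEXT_ID");
			break; /* ENOENT: no ids higher than start_id remain */
		}
		id = attr.next_id;

		memset(&attr, 0, sizeof(attr));
		attr.prog_id = id;
		fd = syscall(__NR_bpf, BPF_PROG_GET_FD_BY_ID, &attr, sizeof(attr));
		if (fd < 0)
			continue; /* the program may have been unloaded meanwhile */
		printf("prog id %u -> fd %d\n", id, fd);
		close(fd);
	}
}
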

[PATCH bpf-next 00/17] Improve BPF syscall command documentation

2021-02-16 Thread Joe Stringer
From: Joe Stringer 

The state of bpf(2) manual pages today is not exactly ideal. For the
most part, it was written several years ago and has not kept up with the
pace of development in the kernel tree. For instance, out of a total of
~35 commands to the BPF syscall available today, when I pull the
kernel-man-pages tree today I find just 6 documented commands: the very
basics of map interaction and program load.

In contrast, looking at bpf-helpers(7), I am able today to run one
command[0] to fetch API documentation of the very latest eBPF helpers
that have been added to the kernel. This documentation is up to date
because kernel maintainers enforce documenting the APIs as part of
the feature submission process. As far as I can tell, we rely on manual
synchronization from the kernel tree to the kernel-man-pages tree to
distribute these more widely, so all locations may not be completely up
to date. That said, the documentation does in fact exist in the first
place which is a major initial hurdle to overcome.

Given the relative success of the process around bpf-helpers(7) to
encourage developers to document their user-facing changes, in this
patch series I explore applying this technique to bpf(2) as well.
Unfortunately, even with bpf(2) being so out-of-date, there is still a
lot of content to convert over. In particular, I've identified at least
the following aspects of the bpf syscall which could individually be
generated from separate documentation in the header:
* BPF syscall commands
* BPF map types
* BPF program types
* BPF attachment points

Rather than tackle everything at once, I have focused in this series on
the syscall commands, "enum bpf_cmd". This series is structured to first
import what useful descriptions there are from the kernel-man-pages
tree, then piece-by-piece document a few of the syscalls in more detail
in cases where I could find useful documentation from the git tree or
from a casual read of the code. Not all documentation is comprehensive
at this point, but a basis is provided with examples that can be further
enhanced with subsequent follow-up patches. Note, the series in its
current state only includes documentation around the syscall commands
themselves, so in the short term it doesn't allow us to automate bpf(2)
generation; only one section of the man page could be replaced. Though
if there is appetite for this approach, this should be trivial to
improve on, even if just by importing the remaining static text from the
kernel-man-pages tree.

Following that, the series enhances the python scripting around parsing
the descriptions from the header files and generating dedicated
ReStructured Text and troff output. Finally, to expose the new text and
reduce the likelihood of having it get out of date or break the docs
parser, it is added to the selftests and exposed through the kernel
documentation web pages.

At this point I'd like to put this out for comments. In my mind, the
ideal outcome of this work would be to extend kernel UAPI headers
such that each of the categories I listed above (commands, maps,
progs, hooks) has dedicated documentation in the kernel tree, and that
developers must update the comments in the headers to document the APIs
prior to patch acceptance, and that we could auto-generate the latest
version of the bpf(2) manual pages based on a few static description
sections combined with the dynamically-generated output from the header.

Thanks also to Quentin Monnet for initial review.

[0]: make -C tools/bpf -f Makefile.docs bpf-helpers.7

Joe Stringer (17):
  bpf: Import syscall arg documentation
  bpf: Add minimal bpf() command documentation
  bpf: Document BPF_F_LOCK in syscall commands
  bpf: Document BPF_PROG_PIN syscall command
  bpf: Document BPF_PROG_ATTACH syscall command
  bpf: Document BPF_PROG_TEST_RUN syscall command
  bpf: Document BPF_PROG_QUERY syscall command
  bpf: Document BPF_MAP_*_BATCH syscall commands
  scripts/bpf: Rename bpf_helpers_doc.py -> bpf_doc.py
  scripts/bpf: Abstract eBPF API target parameter
  scripts/bpf: Add syscall commands printer
  tools/bpf: Rename Makefile.{helpers,docs}
  tools/bpf: Templatize man page generation
  tools/bpf: Build bpf-sycall.2 in Makefile.docs
  selftests/bpf: Add docs target
  docs/bpf: Add bpf() syscall command reference
  tools: Sync uapi bpf.h header with latest changes

 Documentation/Makefile|   2 +
 Documentation/bpf/Makefile|  28 +
 Documentation/bpf/bpf_commands.rst|   5 +
 Documentation/bpf/index.rst   |  14 +-
 include/uapi/linux/bpf.h  | 709 +-
 scripts/{bpf_helpers_doc.py => bpf_doc.py}| 189 -
 tools/bpf/Makefile.docs   |  88 +++
 tools/bpf/Makefile.helpers|  60 --
 tools/bpf/bpftool/Documentation/Makefile  |  12 +-
 tools/include/uapi/linux/bpf.h| 709 +++

[PATCHv2 iproute2 master] bpf: Fix race condition with map pinning

2019-09-19 Thread Joe Stringer
If two processes attempt to invoke bpf_map_attach() at the same time,
both will create maps; the first will then successfully pin its map to
the filesystem, while the second will fail to pin and continue
operating with a reference to its own copy of the map. As a result,
map sharing is broken between the two programs that were concurrently
loaded via loaders using this library.

Fix this by adding a retry in the case where the pinning fails because
the map already exists on the filesystem. In that case, re-attempt
opening a fd to the map on the filesystem as it shows that another
program already created and pinned a map at that location.

Signed-off-by: Joe Stringer 
---
v2: Fix close of created map in the EEXIST case.
v1: Original patch
---
 lib/bpf.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/bpf.c b/lib/bpf.c
index 01152b26e54a..86ab0698660f 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -1707,7 +1707,9 @@ static int bpf_map_attach(const char *name, struct bpf_elf_ctx *ctx,
  int *have_map_in_map)
 {
int fd, ifindex, ret, map_inner_fd = 0;
+   bool retried = false;
 
+probe:
fd = bpf_probe_pinned(name, ctx, map->pinning);
if (fd > 0) {
ret = bpf_map_selfcheck_pinned(fd, map, ext,
@@ -1756,10 +1758,14 @@ static int bpf_map_attach(const char *name, struct bpf_elf_ctx *ctx,
}
 
ret = bpf_place_pinned(fd, name, ctx, map->pinning);
-   if (ret < 0 && errno != EEXIST) {
+   if (ret < 0) {
+   close(fd);
+   if (!retried && errno == EEXIST) {
+   retried = true;
+   goto probe;
+   }
fprintf(stderr, "Could not pin %s map: %s\n", name,
strerror(errno));
-   close(fd);
return ret;
}
 
-- 
2.20.1
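
The same create-pin-retry pattern, sketched with libbpf's
bpf_obj_get()/bpf_obj_pin() wrappers rather than iproute2's internal
helpers; attach_map() and create_map() are illustrative names, with
create_map() standing in for whatever creates the map fd:

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>
#include <bpf/bpf.h>

static int attach_map(const char *path, int (*create_map)(void))
{
	bool retried = false;
	int fd;

probe:
	fd = bpf_obj_get(path);
	if (fd >= 0)
		return fd; /* someone else already created and pinned it */

	fd = create_map();
	if (fd < 0)
		return fd;

	if (bpf_obj_pin(fd, path) < 0) {
		close(fd);
		if (!retried && errno == EEXIST) {
			/* Lost the race: re-open the winner's pinned map. */
			retried = true;
			goto probe;
		}
		return -1;
	}
	return fd;
}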



Re: [PATCH iproute2 master] bpf: Fix race condition with map pinning

2019-09-19 Thread Joe Stringer
On Thu, Sep 19, 2019 at 3:07 PM Joe Stringer  wrote:
>
> If two processes attempt to invoke bpf_map_attach() at the same time,
> both will create maps; the first will then successfully pin its map to
> the filesystem, while the second will fail to pin and continue
> operating with a reference to its own copy of the map. As a result,
> map sharing is broken between the two programs that were concurrently
> loaded via loaders using this library.
>
> Fix this by adding a retry in the case where the pinning fails because
> the map already exists on the filesystem. In that case, re-attempt
> opening a fd to the map on the filesystem as it shows that another
> program already created and pinned a map at that location.
>
> Signed-off-by: Joe Stringer 
> ---
>  lib/bpf.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/lib/bpf.c b/lib/bpf.c
> index f64b58c3bb19..23eb8952cc28 100644
> --- a/lib/bpf.c
> +++ b/lib/bpf.c
> @@ -1625,7 +1625,9 @@ static int bpf_map_attach(const char *name, struct bpf_elf_ctx *ctx,
>   int *have_map_in_map)
>  {
> int fd, ifindex, ret, map_inner_fd = 0;
> +   bool retried = false;
>
> +probe:
> fd = bpf_probe_pinned(name, ctx, map->pinning);
> if (fd > 0) {
> ret = bpf_map_selfcheck_pinned(fd, map, ext,
> @@ -1674,7 +1676,11 @@ static int bpf_map_attach(const char *name, struct bpf_elf_ctx *ctx,
> }
>
> ret = bpf_place_pinned(fd, name, ctx, map->pinning);
> -   if (ret < 0 && errno != EEXIST) {
> +   if (ret < 0) {
> +   if (!retried && errno == EEXIST) {
> +   retried = true;
> +   goto probe;
> +   }

Ah, forgot to close 'fd' before the jump in this retry case. Will fix
that up in v2.


[PATCH iproute2 master] bpf: Fix race condition with map pinning

2019-09-19 Thread Joe Stringer
If two processes attempt to invoke bpf_map_attach() at the same time,
both will create maps; the first will then successfully pin its map to
the filesystem, while the second will fail to pin and continue
operating with a reference to its own copy of the map. As a result,
map sharing is broken between the two programs that were concurrently
loaded via loaders using this library.

Fix this by adding a retry in the case where the pinning fails because
the map already exists on the filesystem. In that case, re-attempt
opening a fd to the map on the filesystem as it shows that another
program already created and pinned a map at that location.

Signed-off-by: Joe Stringer 
---
 lib/bpf.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/bpf.c b/lib/bpf.c
index f64b58c3bb19..23eb8952cc28 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -1625,7 +1625,9 @@ static int bpf_map_attach(const char *name, struct bpf_elf_ctx *ctx,
  int *have_map_in_map)
 {
int fd, ifindex, ret, map_inner_fd = 0;
+   bool retried = false;
 
+probe:
fd = bpf_probe_pinned(name, ctx, map->pinning);
if (fd > 0) {
ret = bpf_map_selfcheck_pinned(fd, map, ext,
@@ -1674,7 +1676,11 @@ static int bpf_map_attach(const char *name, struct bpf_elf_ctx *ctx,
}
 
ret = bpf_place_pinned(fd, name, ctx, map->pinning);
-   if (ret < 0 && errno != EEXIST) {
+   if (ret < 0) {
+   if (!retried && errno == EEXIST) {
+   retried = true;
+   goto probe;
+   }
fprintf(stderr, "Could not pin %s map: %s\n", name,
strerror(errno));
close(fd);
-- 
2.20.1



Re: Removing skb_orphan() from ip_rcv_core()

2019-06-25 Thread Joe Stringer
On Tue, Jun 25, 2019 at 4:07 AM Jamal Hadi Salim  wrote:
>
> On 2019-06-24 11:26 p.m., Joe Stringer wrote:
> [..]
> >
> > I haven't got as far as UDP yet, but I didn't see any need for a
> > dependency on netfilter.
>
> I'd be curious to see what you did. My experience, even for TCP is
> the socket(transparent/tproxy) lookup code (to set skb->sk either
> listening or established) is entangled in
> CONFIG_NETFILTER_SOMETHING_OR_OTHER. You have to rip it out of
> there (in the tproxy tc action into that  code). Only then can you
> compile out netfilter.
> I didnt bother to rip out code for udp case.
> i.e if you needed udp to work with the tc action,
> youd have to turn on NF. But that was because we had
> no need for udp transparent proxying.
> IOW:
> There is really no reason, afaik, for tproxy code to only be
> accessed if netfilter is compiled in. Not sure i made sense.

Oh, I see. Between the existing bpf_skc_lookup_tcp() and
bpf_sk_lookup_tcp() helpers in BPF, plus a new bpf_sk_assign() helper
and a little bit of lookup code using the appropriate tproxy ports
etc. from the BPF side, I was able to get it working. One could
imagine perhaps wrapping all this logic up in a higher level
"bpf_sk_lookup_tproxy()" helper call or similar, but I didn't go that
direction given that the BPF socket primitives seemed to provide the
necessary functionality in a more generic manner.
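
A rough sketch of the TC-BPF flow described above, on the assumption
of a bpf_sk_assign() helper as proposed: look up the socket for the
packet 4-tuple and assign it to the skb so the stack delivers there,
tproxy-style. Tuple extraction and the fallback to the proxy port are
elided, so this is illustrative only:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int sk_assign_prog(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	/* ... fill tuple.ipv4 from the packet headers ... */

	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return TC_ACT_OK;

	bpf_sk_assign(skb, sk, 0); /* steer delivery to this socket */
	bpf_sk_release(sk);
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";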


Re: Removing skb_orphan() from ip_rcv_core()

2019-06-25 Thread Joe Stringer
On Mon, Jun 24, 2019 at 11:37 PM Eric Dumazet  wrote:
> On 6/24/19 8:17 PM, Joe Stringer wrote:
> > On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal  wrote:
> >> Joe Stringer  wrote:
> >>> However, if I drop these lines then I end up causing sockets to
> >>> release references too many times. Seems like if we don't orphan the
> >>> skb here, then later logic assumes that we have one more reference
> >>> than we actually have, and decrements the count when it shouldn't
> >>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> >>> to assume we always have a reference to the socket?)
> >>
> >> We might be calling the wrong destructor (i.e., the one set by tcp
> >> receive instead of the one set at tx time)?
> >
> > Hmm, interesting thought. Sure enough, with a bit of bpftrace
> > debugging we find it's tcp_wfree():
> >
> > $ cat ip_rcv.bt
> > #include 
> >
> > kprobe:ip_rcv {
> >$sk = ((struct sk_buff *)arg0)->sk;
> >$des = ((struct sk_buff *)arg0)->destructor;
> >if ($sk) {
> >if ($des) {
> >printf("received %s on %s with sk destructor %s set\n", str(arg0), str(arg1), ksym($des));
> >@ip4_stacks[kstack] = count();
> >}
> >}
> > }
> > $ sudo bpftrace ip_rcv.bt
> > Attaching 1 probe...
> > received  on eth0 with sk destructor tcp_wfree set
> > ^C
> >
> > @ip4_stacks[
> >ip_rcv+1
> >__netif_receive_skb+24
> >process_backlog+179
> >net_rx_action+304
> >__do_softirq+220
> >do_softirq_own_stack+42
> >do_softirq.part.17+70
> >__local_bh_enable_ip+101
> >ip_finish_output2+421
> >__ip_finish_output+187
> >ip_finish_output+44
> >ip_output+109
> >ip_local_out+59
> >__ip_queue_xmit+368
> >ip_queue_xmit+16
> >__tcp_transmit_skb+1303
> >tcp_connect+2758
> >tcp_v4_connect+1135
> >__inet_stream_connect+214
> >inet_stream_connect+59
> >__sys_connect+237
> >__x64_sys_connect+26
> >do_syscall_64+90
> >entry_SYSCALL_64_after_hwframe+68
> > ]: 1
> >
> > Is there a solution here where we call the destructor if it's not
> > sock_efree()? When the socket is later stolen, it will only return the
> > reference via a call to sock_put(), so presumably at that point in the
> > stack we already assume that the skb->destructor is not one of these
> > other destructors (otherwise we wouldn't release the resources
> > correctly).
> >
>
> What was the driver here ? In any case, the following patch should help.
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3699,6 +3699,7 @@ static __always_inline int dev_forward_skb(struct net_device *dev,
> return NET_RX_DROP;
> }
>
> +   skb_orphan(skb);
> skb_scrub_packet(skb, true);
> skb->priority = 0;
> return 0;

Looks like it was bridge in the end, found by attaching a similar
bpftrace program to __dev_forward_sk(). Interestingly enough, the
device attached to the skb reported its name as "eth0" despite not
having such a named link or named bridge that I could find anywhere
via "ip link" / "brctl show"..

__dev_forward_skb+1
   dev_hard_start_xmit+151
   __dev_queue_xmit+1928
   dev_queue_xmit+16
   br_dev_queue_push_xmit+123
   br_forward_finish+69
   __br_forward+327
   br_forward+204
   br_dev_xmit+598
   dev_hard_start_xmit+151
   __dev_queue_xmit+1928
   dev_queue_xmit+16
   neigh_resolve_output+339
   ip_finish_output2+402
   __ip_finish_output+187
   ip_finish_output+44
   ip_output+109
   ip_local_out+59
   __ip_queue_xmit+368
   ip_queue_xmit+16
   __tcp_transmit_skb+1303
   tcp_connect+2758
   tcp_v4_connect+1135
   __inet_stream_connect+214
   inet_stream_connect+59
   __sys_connect+237
   __x64_sys_connect+26
   do_syscall_64+90
   entry_SYSCALL_64_after_hwframe+68

So I guess something like this could be another alternative:

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 82225b8b54f5..c2de2bb35080 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -65,6 +65,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);

int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
+   skb_orphan(skb);
   skb->tstamp = 0;
   return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
  net, sk, skb, NULL, skb->dev,


Re: Removing skb_orphan() from ip_rcv_core()

2019-06-24 Thread Joe Stringer
On Mon, Jun 24, 2019 at 7:47 AM Jamal Hadi Salim  wrote:
>
> On 2019-06-21 1:58 p.m., Joe Stringer wrote:
> > Hi folks, picking this up again..
> [..]
> > During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
> > necessary in that path in the current version of the code, and that we
> > may be able to remove it. Florian, I know you weren't in the room for
> > that discussion, so raising it again now with a stack trace, Do you
> > have some sense what's going on here and whether there's a path
> > towards removing it from this path or allowing the skb->sk to be
> > retained during ip_rcv() in some conditions?
>
>
> Sorry - I havent followed the discussion but saw your email over
> the weekend and wanted to be at work to refresh my memory on some
> code. For maybe 2-3 years we have deployed the tproxy
> equivalent as a tc action on ingress (with no netfilter dependency).
>
> And, of course, we had to work around that specific code you are
> referring to - we didnt remove it. The tc action code increments
> the sk refcount and sets the tc index. The net core doesnt orphan
> the skb if a speacial tc index value is set (see attached patch)
>
> I never bothered up streaming the patch because the hack is a bit
> embarrassing (but worked ;->); and never posted the action code
> either because i thought this was just us that had this requirement.
> I am glad other people see the need for this feature. Is there effort
> to make this _not_ depend on iptables/netfilter? I am guessing if you
> want to do this from ebpf (tc or xdp) that is a requirement.
> Our need was with tcp at the time; so left udp dependency on netfilter
> alone.

I haven't got as far as UDP yet, but I didn't see any need for a
dependency on netfilter.


Re: Removing skb_orphan() from ip_rcv_core()

2019-06-24 Thread Joe Stringer
On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal  wrote:
>
> Joe Stringer  wrote:
> > As discussed during LSFMM, I've been looking at adding something like
> > an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
> > be implemented with integration into other BPF logic, however
> > currently any attempts to do so are blocked by the skb_orphan() call
> > in ip_rcv_core() (which will effectively ignore any socket assign
> > decision made by the TC BPF program).
> >
> > Recently I was attempting to remove the skb_orphan() call, and I've
> > been trying different things but there seems to be some context I'm
> > missing. Here's the core of the patch:
> >
> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index ed97724c5e33..16aea980318a 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
> >memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
> >IPCB(skb)->iif = skb->skb_iif;
> >
> > -   /* Must drop socket now because of tproxy. */
> > -   skb_orphan(skb);
> >
> >return skb;
> >
> > The statement that the socket must be dropped because of tproxy
> > doesn't make sense to me, because the PRE_ROUTING hook is hit after
> > this, which will call into the tproxy logic and eventually
> > nf_tproxy_assign_sock() which already does the skb_orphan() itself.
>
> in comment: s/tproxy/skb_steal_sock/

For reference, I was following the path like this:

ip_rcv()
( -> ip_rcv_core() for skb_orphan)
-> NF_INET_PRE_ROUTING hook
(... invoke iptables hooks)
-> iptable_mangle_hook()
-> ipt_do_table()
... -> tproxy_tg4()
... -> nf_tproxy_assign_sock()
-> skb_orphan()
(... finish iptables processing)
( -> ip_rcv_finish())
( ... -> ip_rcv_finish_core() for early demux / route lookup )
(... -> dst_input())
(... -> tcp_v4_rcv())
( -> __inet_lookup_skb())
( -> skb_steal_sock() )

> at least thats what I concluded a few years ago when I looked into
> the skb_oprhan() need.
>
> IIRC some device drivers use skb->sk for backpressure, so without this
> non-tcp socket would be stolen by skb_steal_sock.

Do you happen to recall which device drivers? Or have some idea of a
list I could try to go through? Are you referring to virtual drivers
like veth or something else?

> We also recently removed skb orphan when crossing netns:
>
> commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
> Author: Flavio Leitner 
> skbuff: preserve sock reference when scrubbing the skb.
>
> So thats another case where this orphan is needed.

Presumably the orphan is only needed in this case if the packet
crosses a namespace and then is subsequently passed back into the
stack?

> What could be done is adding some way to delay/defer the orphaning
> further, but we would need at the very least some annotation for
> skb_steal_sock to know when the skb->sk is really from TPROXY or
> if it has to orphan.

Eric mentions in another response to this thread that skb_orphan()
should be called from any ndo_start_xmit() which sends traffic back
into the stack. With that, presumably we would be pushing the
orphaning earlier such that the only way that the skb->sk ref can be
non-NULL around this point in receive would be because it was
specifically set by some kind of tproxy logic?

> Same for the safety check in the forwarding path.
> Netfilter modules need o be audited as well, they might make assumptions
> wrt. skb->sk being inet sockets (set by local stack or early demux).
>
> > However, if I drop these lines then I end up causing sockets to
> > release references too many times. Seems like if we don't orphan the
> > skb here, then later logic assumes that we have one more reference
> > than we actually have, and decrements the count when it shouldn't
> > (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
> > to assume we always have a reference to the socket?)
>
> We might be calling the wrong destructor (i.e., the one set by tcp
> receive instead of the one set at tx time)?

Hmm, interesting thought. Sure enough, with a bit of bpftrace
debugging we find it's tcp_wfree():

$ cat ip_rcv.bt
#include 

kprobe:ip_rcv {
   $sk = ((struct sk_buff *)arg0)->sk;
   $des = ((struct sk_buff *)arg0)->destructor;
   if ($sk) {
   if ($des) {
   printf("received %s on %s with sk destructor %s
set\n", str(arg0), str(arg1), ksym($des));
   @ip4_stacks[kstack] = count();
   }
   }
}
$ sudo bpftrace ip_rcv.bt
Attaching 1 prob

Removing skb_orphan() from ip_rcv_core()

2019-06-21 Thread Joe Stringer
Hi folks, picking this up again..

As discussed during LSFMM, I've been looking at adding something like
an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can
be implemented with integration into other BPF logic, however
currently any attempts to do so are blocked by the skb_orphan() call
in ip_rcv_core() (which will effectively ignore any socket assign
decision made by the TC BPF program).

Recently I was attempting to remove the skb_orphan() call, and I've
been trying different things but there seems to be some context I'm
missing. Here's the core of the patch:

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index ed97724c5e33..16aea980318a 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
   memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
   IPCB(skb)->iif = skb->skb_iif;

-   /* Must drop socket now because of tproxy. */
-   skb_orphan(skb);

   return skb;

The statement that the socket must be dropped because of tproxy
doesn't make sense to me, because the PRE_ROUTING hook is hit after
this, which will call into the tproxy logic and eventually
nf_tproxy_assign_sock() which already does the skb_orphan() itself.

However, if I drop these lines then I end up causing sockets to
release references too many times. Seems like if we don't orphan the
skb here, then later logic assumes that we have one more reference
than we actually have, and decrements the count when it shouldn't
(perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems
to assume we always have a reference to the socket?)

Splat:

refcount_t hit zero at sk_stop_timer+0x2c/0x30 in cilium-agent[16359],
uid/euid: 0/0
WARNING: CPU: 0 PID: 16359 at kernel/panic.c:686 refcount_error_report+0x9c/0xa1
...
? inet_put_port+0xa6/0xd0
inet_csk_clear_xmit_timers+0x2e/0x50
tcp_done+0x8b/0xf0
tcp_reset+0x49/0xc0
tcp_validate_incoming+0x2f7/0x410
tcp_rcv_state_process+0x250/0xdb6
? tcp_v4_connect+0x46f/0x4e0
tcp_v4_do_rcv+0xbd/0x1f0
__release_sock+0x84/0xd0
release_sock+0x30/0xa0
inet_stream_connect+0x47/0x60

(Full version: 
https://gist.github.com/joestringer/d5313e4bf4231e2c46405bd7a3053936
)

This seems potentially related to some of the socket referencing
discussion in the peer thread "[RFC bpf-next 0/7] Programming socket
lookup with BPF".

During LSFMM, it seemed like no-one knew quite why the skb_orphan() is
necessary in that path in the current version of the code, and that we
may be able to remove it. Florian, I know you weren't in the room for
that discussion, so raising it again now with a stack trace, Do you
have some sense what's going on here and whether there's a path
towards removing it from this path or allowing the skb->sk to be
retained during ip_rcv() in some conditions?


Re: [RFC bpf-next 0/7] Programming socket lookup with BPF

2019-06-21 Thread Joe Stringer
On Fri, Jun 21, 2019 at 1:44 AM Jakub Sitnicki  wrote:
>
> On Fri, Jun 21, 2019, 00:20 Joe Stringer  wrote:
>>
>> On Wed, Jun 19, 2019 at 2:14 AM Jakub Sitnicki  wrote:
>> >
>> > Hey Florian,
>> >
>> > Thanks for taking a look at it.
>> >
>> > On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
>> > > Jakub Sitnicki  wrote:
>> > >>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>> > >>find the listening socket to check for SYN cookies with TPROXY 
>> > >> redirect.
>> > >
>> > > Sorry for the question, but where is the problem?
>> > > (i.e., is it with TPROXY or bpf side)?
>> >
>> > The way I see it is that the problem is that we have mappings for
>> > steering traffic into sockets split between two places: (1) the socket
>> > lookup tables, and (2) the TPROXY rules.
>> >
>> > BPF programs that need to check if there is a socket the packet is
>> > destined for have access to the socket lookup tables, via the mentioned
>> > bpf_sk_lookup helper, but are unaware of TPROXY redirects.
>> >
>> > For TCP we're able to look up from BPF if there are any established,
>> > request, and "normal" listening sockets. The listening sockets that
>> > receive connections via TPROXY are invisible to BPF progs.
>> >
>> > Why are we interested in finding all listening sockets? To check if any
>> > of them had SYN queue overflow recently and if we should honor SYN
>> > cookies.
>>
>> Why are they invisible? Can't you look them up with bpf_skc_lookup_tcp()?
>
>
> They are invisible in that sense that you can't look them up using the packet 
> 4-tuple. You have to somehow make the XDP/TC progs aware of the TPROXY 
> redirects to find the target sockets.

Isn't that what you're doing in the example from the cover letter
(reincluded below for reference), except with the new program type
rather than XDP/TC progs?

   switch (bpf_ntohl(ctx->local_ip4) >> 8) {
   case NET1:
           ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
           ctx->local_port = 81;
           return BPF_REDIRECT;
   case NET2:
           ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
           ctx->local_port = 82;
           return BPF_REDIRECT;
   }

That said, I appreciate that even if you find the sockets from XDP,
you'd presumably need some way to retain the socket reference beyond
XDP execution to convince the stack to guide the traffic into that
socket, which would be a whole other effort. For your use case it may
or may not make the most sense.


Re: [RFC bpf-next 0/7] Programming socket lookup with BPF

2019-06-20 Thread Joe Stringer
On Wed, Jun 19, 2019 at 2:14 AM Jakub Sitnicki  wrote:
>
> Hey Florian,
>
> Thanks for taking a look at it.
>
> On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
> > Jakub Sitnicki  wrote:
> >>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
> >>find the listening socket to check for SYN cookies with TPROXY redirect.
> >
> > Sorry for the question, but where is the problem?
> > (i.e., is it with TPROXY or bpf side)?
>
> The way I see it is that the problem is that we have mappings for
> steering traffic into sockets split between two places: (1) the socket
> lookup tables, and (2) the TPROXY rules.
>
> BPF programs that need to check if there is a socket the packet is
> destined for have access to the socket lookup tables, via the mentioned
> bpf_sk_lookup helper, but are unaware of TPROXY redirects.
>
> For TCP we're able to look up from BPF if there are any established,
> request, and "normal" listening sockets. The listening sockets that
> receive connections via TPROXY are invisible to BPF progs.
>
> Why are we interested in finding all listening sockets? To check if any
> of them had SYN queue overflow recently and if we should honor SYN
> cookies.

Why are they invisible? Can't you look them up with bpf_skc_lookup_tcp()?


Re: [PATCH bpf] bpf: Check sk_fullsock() before returning from bpf_sk_lookup()

2019-05-20 Thread Joe Stringer
On Fri, May 17, 2019 at 2:21 PM Martin KaFai Lau  wrote:
>
> The BPF_FUNC_sk_lookup_xxx helpers return RET_PTR_TO_SOCKET_OR_NULL.
> Meaning a fullsock ptr and its fullsock's fields in bpf_sock can be
> accessed, e.g. type, protocol, mark and priority.
> Some new helper, like bpf_sk_storage_get(), also expects
> ARG_PTR_TO_SOCKET is a fullsock.
>
> bpf_sk_lookup() currently calls sk_to_full_sk() before returning.
> However, the ptr returned from sk_to_full_sk() is not guaranteed
> to be a fullsock.  For example, it cannot get a fullsock if sk
> is in TCP_TIME_WAIT.
>
> This patch checks for sk_fullsock() before returning. If it is not
> a fullsock, sock_gen_put() is called if needed and then returns NULL.
>
> Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> Cc: Joe Stringer 
> Signed-off-by: Martin KaFai Lau 
> ---

Acked-by: Joe Stringer 


Re: [PATCH bpf] bpf: Check sk_fullsock() before returning from bpf_sk_lookup()

2019-05-20 Thread Joe Stringer
On Sat, May 18, 2019 at 7:08 PM Martin Lau  wrote:
>
> On Sat, May 18, 2019 at 06:52:48PM -0700, Joe Stringer wrote:
> > On Sat, May 18, 2019, 09:05 Martin Lau  wrote:
> > >
> > > On Sat, May 18, 2019 at 08:38:46AM -1000, Joe Stringer wrote:
> > > > On Fri, May 17, 2019, 12:02 Martin Lau  wrote:
> > > >
> > > > > On Fri, May 17, 2019 at 02:51:48PM -0700, Eric Dumazet wrote:
> > > > > >
> > > > > >
> > > > > > On 5/17/19 2:21 PM, Martin KaFai Lau wrote:
> > > > > > > The BPF_FUNC_sk_lookup_xxx helpers return 
> > > > > > > RET_PTR_TO_SOCKET_OR_NULL.
> > > > > > > Meaning a fullsock ptr and its fullsock's fields in bpf_sock can 
> > > > > > > be
> > > > > > > accessed, e.g. type, protocol, mark and priority.
> > > > > > > Some new helper, like bpf_sk_storage_get(), also expects
> > > > > > > ARG_PTR_TO_SOCKET is a fullsock.
> > > > > > >
> > > > > > > bpf_sk_lookup() currently calls sk_to_full_sk() before returning.
> > > > > > > However, the ptr returned from sk_to_full_sk() is not guaranteed
> > > > > > > to be a fullsock.  For example, it cannot get a fullsock if sk
> > > > > > > is in TCP_TIME_WAIT.
> > > > > > >
> > > > > > > This patch checks for sk_fullsock() before returning. If it is not
> > > > > > > a fullsock, sock_gen_put() is called if needed and then returns 
> > > > > > > NULL.
> > > > > > >
> > > > > > > Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> > > > > > > Cc: Joe Stringer 
> > > > > > > Signed-off-by: Martin KaFai Lau 
> > > > > > > ---
> > > > > > >  net/core/filter.c | 16 ++--
> > > > > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > > > index 55bfc941d17a..85def5a20aaf 100644
> > > > > > > --- a/net/core/filter.c
> > > > > > > +++ b/net/core/filter.c
> > > > > > > @@ -5337,8 +5337,14 @@ __bpf_sk_lookup(struct sk_buff *skb, struct
> > > > > bpf_sock_tuple *tuple, u32 len,
> > > > > > > struct sock *sk = __bpf_skc_lookup(skb, tuple, len, 
> > > > > > > caller_net,
> > > > > > >ifindex, proto, netns_id,
> > > > > flags);
> > > > > > >
> > > > > > > -   if (sk)
> > > > > > > +   if (sk) {
> > > > > > > sk = sk_to_full_sk(sk);
> > > > > > > +   if (!sk_fullsock(sk)) {
> > > > > > > +   if (!sock_flag(sk, SOCK_RCU_FREE))
> > > > > > > +   sock_gen_put(sk);
> > > > > >
> > > > > > This looks a bit convoluted/weird.
> > > > > >
> > > > > > What about telling/asking __bpf_skc_lookup() to not return a non
> > > > > fullsock instead ?
> > > > > It is becausee some other helpers, like BPF_FUNC_skc_lookup_tcp,
> > > > > can return non fullsock
> > > > >
> > > >
> > > > FYI this is necessary for finding a transparently proxied socket for a
> > > > non-local connection (tproxy use case).
> > > You meant it is necessary to return a non fullsock from the
> > > BPF_FUNC_sk_lookup_xxx helpers?
> >
> > Yes, that's what I want to associate with the skb so that the delivery
> > to the SO_TRANSPARENT is received properly.
> >
> > For the first packet of a connection, we look up the socket using the
> > tproxy socket port as the destination, and deliver the packet there.
> > The SO_TRANSPARENT logic then kicks in and sends back the ack and
> > creates the non-full sock for the connection tuple, which can be
> > entirely unrelated to local addresses or ports.
> >
> > For the second forward-direction packet, (ie ACK in 3-way handshake)
> > then we must deliver the packet to this non-full sock as that's what
> > is negotiating the proxied connection. If you look up using the packet
> > tuple then get the full sock from it, it will go back to the
> > SO_TRANSPARENT parent socket. Delivering the ACK there will result in
> > a RST being sent back, because the SO_TRANSPARENT socket is just there
> > to accept new connections for connections to be proxied. So this is
> > the case where I need the non-full sock.
> >
> > (In practice, the lookup logic attempts the packet tuple first then if
> > that fails, uses the tproxy port for lookup to achieve the above).
> hmm...I am likely missing something.
>
> 1) The above can be done by the "BPF_FUNC_skC_lookup_tcp" which
>returns a non fullsock (RET_PTR_TO_SOCK_COMMON_OR_NULL), no?

Correct, I meant that as a response to Eric regarding use cases for
__bpf_skc_lookup() returning a non fullsock.

> 2) The bpf_func_proto of "BPF_FUNC_sk_lookup_tcp" returns
>fullsock (RET_PTR_TO_SOCKET_OR_NULL) and the bpf_prog (and
>the verifier) is expecting that.  How to address the bug here?

Your proposal seems fine to me.


Re: [PATCH bpf] bpf: Check sk_fullsock() before returning from bpf_sk_lookup()

2019-05-18 Thread Joe Stringer
On Sat, May 18, 2019, 09:05 Martin Lau  wrote:
>
> On Sat, May 18, 2019 at 08:38:46AM -1000, Joe Stringer wrote:
> > On Fri, May 17, 2019, 12:02 Martin Lau  wrote:
> >
> > > On Fri, May 17, 2019 at 02:51:48PM -0700, Eric Dumazet wrote:
> > > >
> > > >
> > > > On 5/17/19 2:21 PM, Martin KaFai Lau wrote:
> > > > > The BPF_FUNC_sk_lookup_xxx helpers return RET_PTR_TO_SOCKET_OR_NULL.
> > > > > Meaning a fullsock ptr and its fullsock's fields in bpf_sock can be
> > > > > accessed, e.g. type, protocol, mark and priority.
> > > > > Some new helper, like bpf_sk_storage_get(), also expects
> > > > > ARG_PTR_TO_SOCKET is a fullsock.
> > > > >
> > > > > bpf_sk_lookup() currently calls sk_to_full_sk() before returning.
> > > > > However, the ptr returned from sk_to_full_sk() is not guaranteed
> > > > > to be a fullsock.  For example, it cannot get a fullsock if sk
> > > > > is in TCP_TIME_WAIT.
> > > > >
> > > > > This patch checks for sk_fullsock() before returning. If it is not
> > > > > a fullsock, sock_gen_put() is called if needed and then returns NULL.
> > > > >
> > > > > Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> > > > > Cc: Joe Stringer 
> > > > > Signed-off-by: Martin KaFai Lau 
> > > > > ---
> > > > >  net/core/filter.c | 16 ++--
> > > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > index 55bfc941d17a..85def5a20aaf 100644
> > > > > --- a/net/core/filter.c
> > > > > +++ b/net/core/filter.c
> > > > > @@ -5337,8 +5337,14 @@ __bpf_sk_lookup(struct sk_buff *skb, struct
> > > bpf_sock_tuple *tuple, u32 len,
> > > > > struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> > > > >ifindex, proto, netns_id,
> > > flags);
> > > > >
> > > > > -   if (sk)
> > > > > +   if (sk) {
> > > > > sk = sk_to_full_sk(sk);
> > > > > +   if (!sk_fullsock(sk)) {
> > > > > +   if (!sock_flag(sk, SOCK_RCU_FREE))
> > > > > +   sock_gen_put(sk);
> > > >
> > > > This looks a bit convoluted/weird.
> > > >
> > > > What about telling/asking __bpf_skc_lookup() to not return a non
> > > fullsock instead ?
> > > It is becausee some other helpers, like BPF_FUNC_skc_lookup_tcp,
> > > can return non fullsock
> > >
> >
> > FYI this is necessary for finding a transparently proxied socket for a
> > non-local connection (tproxy use case).
> You meant it is necessary to return a non fullsock from the
> BPF_FUNC_sk_lookup_xxx helpers?

Yes, that's what I want to associate with the skb so that the delivery
to the SO_TRANSPARENT is received properly.

For the first packet of a connection, we look up the socket using the
tproxy socket port as the destination, and deliver the packet there.
The SO_TRANSPARENT logic then kicks in and sends back the ack and
creates the non-full sock for the connection tuple, which can be
entirely unrelated to local addresses or ports.

For the second forward-direction packet, (ie ACK in 3-way handshake)
then we must deliver the packet to this non-full sock as that's what
is negotiating the proxied connection. If you look up using the packet
tuple then get the full sock from it, it will go back to the
SO_TRANSPARENT parent socket. Delivering the ACK there will result in
a RST being sent back, because the SO_TRANSPARENT socket is just there
to accept new connections for connections to be proxied. So this is
the case where I need the non-full sock.

(In practice, the lookup logic attempts the packet tuple first and, if
that fails, uses the tproxy port for lookup to achieve the above.)
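
A hedged sketch of that two-step lookup, assuming the BPF program
drives it with bpf_skc_lookup_tcp() (which may return the non-full
request socket); lookup_for_tproxy() and PROXY_PORT are placeholder
names, and the caller must bpf_sk_release() the result:

#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

#define PROXY_PORT 9999 /* placeholder for the tproxy listener port */

static __always_inline struct bpf_sock *
lookup_for_tproxy(struct __sk_buff *skb, struct bpf_sock_tuple *t)
{
	struct bpf_sock *sk;

	/* Step 1: the packet 4-tuple matches established/request sockets
	 * for connections that are mid-handshake or already proxied.
	 */
	sk = bpf_skc_lookup_tcp(skb, t, sizeof(t->ipv4),
				BPF_F_CURRENT_NETNS, 0);
	if (sk)
		return sk;

	/* Step 2: no match, so direct the packet at the SO_TRANSPARENT
	 * listener that accepts new connections to be proxied.
	 */
	t->ipv4.dport = bpf_htons(PROXY_PORT);
	return bpf_skc_lookup_tcp(skb, t, sizeof(t->ipv4),
				  BPF_F_CURRENT_NETNS, 0);
}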


Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers

2019-05-18 Thread Joe Stringer
On Fri, May 17, 2019 at 7:15 AM Lorenz Bauer  wrote:
>
> On Thu, 16 May 2019 at 21:33, Alexei Starovoitov
>  wrote:
> >
> > On Thu, May 16, 2019 at 09:41:34AM +0100, Lorenz Bauer wrote:
> > > On Wed, 15 May 2019 at 18:16, Joe Stringer  wrote:
> > > >
> > > > On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer  
> > > > wrote:
> > > > >
> > > > > In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> > > > > that the sk_lookup_* helpers currently return inconsistent results if
> > > > > SK_REUSEPORT programs are in play.
> > > > >
> > > > > SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> > > > > to the full packet
> > > > > that triggered the look up. To support this, inet_lookup gained a new
> > > > > skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> > > > > program is skipped and instead the socket is selected by its hash.
> > > > >
> > > > > The first problem is that not all callers to inet_lookup from BPF have
> > > > > an skb, e.g. XDP. This means that a look up from XDP gives an
> > > > > incorrect result. For now that is not a huge problem. However, once we
> > > > > get sk_assign as proposed by Joe, we can end up circumventing
> > > > > SK_REUSEPORT.
> > > >
> > > > To clarify a bit, the reason this is a problem is that a
> > > > straightforward implementation may just consider passing the skb
> > > > context into the sk_lookup_*() and through to the inet_lookup() so
> > > > that it would run the SK_REUSEPORT BPF program for socket selection on
> > > > the skb when the packet-path BPF program performs the socket lookup.
> > > > However, as this paragraph describes, the skb context is not always
> > > > available.
> > > >
> > > > > At the conference, someone suggested using a similar approach to the
> > > > > work done on the flow dissector by Stanislav: create a dedicated
> > > > > context sk_reuseport which can either take an skb or a plain pointer.
> > > > > Patch up load_bytes to deal with both. Pass the context to
> > > > > inet_lookup.
> > > > >
> > > > > This is when we hit the second problem: using the skb or XDP context
> > > > > directly is incorrect, because it assumes that the relevant protocol
> > > > > headers are at the start of the buffer. In our use case, the correct
> > > > > headers are at an offset since we're inspecting encapsulated packets.
> > > > >
> > > > > The best solution I've come up with is to steal 17 bits from the flags
> > > > > argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> > > > > the offset itself.
> > > >
> > > > FYI there's also the upper 32 bits of the netns_id parameter, another
> > > > option would be to steal 16 bits from there.
> > >
> > > Or len, which is only 16 bits realistically. The offset doesn't really 
> > > fit into
> > > either of them very well, using flags seemed the cleanest to me.
> > > Is there some best practice around this?
> > >
> > > >
> > > > > Thoughts?
> > > >
> > > > Internally with skbs, we use `skb_pull()` to manage header offsets,
> > > > could we do something similar with `bpf_xdp_adjust_head()` prior to
> > > > the call to `bpf_sk_lookup_*()`?
> > >
> > > That would only work if it retained the contents of the skipped
> > > buffer, and if there
> > > was a way to undo the adjustment later. We're doing the sk_lookup to
> > > decide whether to
> > > accept or forward the packet, so at the point of the call we might still 
> > > need
> > > that data. Is that feasible with skb / XDP ctx?
> >
> > While discussing the solution for reuseport I propose to use
> > progs/test_select_reuseport_kern.c as an example of realistic program.
> > It reads tcp/udp header directly via ctx->data or via bpf_skb_load_bytes()
> > including payload after the header.
> > It also uses bpf_skb_load_bytes_relative() to fetch IP.
> > I think if we're fixing the sk_lookup from XDP the above program
> > would need to work.
>
> Agreed.
>
> > And I think we can make it work by adding new requirement that
> > 'struct bpf_sock_tuple *

Re: RFC: Fixing SK_REUSEPORT from sk_lookup_* helpers

2019-05-15 Thread Joe Stringer
On Wed, May 15, 2019 at 8:11 AM Lorenz Bauer  wrote:
>
> In the BPF-based TPROXY session with Joe Stringer [1], I mentioned
> that the sk_lookup_* helpers currently return inconsistent results if
> SK_REUSEPORT programs are in play.
>
> SK_REUSEPORT programs are a hook point in inet_lookup. They get access
> to the full packet
> that triggered the look up. To support this, inet_lookup gained a new
> skb argument to provide such context. If skb is NULL, the SK_REUSEPORT
> program is skipped and instead the socket is selected by its hash.
>
> The first problem is that not all callers to inet_lookup from BPF have
> an skb, e.g. XDP. This means that a look up from XDP gives an
> incorrect result. For now that is not a huge problem. However, once we
> get sk_assign as proposed by Joe, we can end up circumventing
> SK_REUSEPORT.

To clarify a bit, the reason this is a problem is that a
straightforward implementation may just consider passing the skb
context into the sk_lookup_*() and through to the inet_lookup() so
that it would run the SK_REUSEPORT BPF program for socket selection on
the skb when the packet-path BPF program performs the socket lookup.
However, as this paragraph describes, the skb context is not always
available.

> At the conference, someone suggested using a similar approach to the
> work done on the flow dissector by Stanislav: create a dedicated
> context sk_reuseport which can either take an skb or a plain pointer.
> Patch up load_bytes to deal with both. Pass the context to
> inet_lookup.
>
> This is when we hit the second problem: using the skb or XDP context
> directly is incorrect, because it assumes that the relevant protocol
> headers are at the start of the buffer. In our use case, the correct
> headers are at an offset since we're inspecting encapsulated packets.
>
> The best solution I've come up with is to steal 17 bits from the flags
> argument to sk_lookup_*, 1 bit for BPF_F_HEADERS_AT_OFFSET, 16bit for
> the offset itself.

FYI there's also the upper 32 bits of the netns_id parameter, another
option would be to steal 16 bits from there.

> Thoughts?

Internally with skbs, we use `skb_pull()` to manage header offsets,
could we do something similar with `bpf_xdp_adjust_head()` prior to
the call to `bpf_sk_lookup_*()`?
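
Something along these lines (a sketch only, with a placeholder OUTER_HLEN
for the outer header length; whether the bytes in front of the adjusted
head survive the round trip is exactly the open question):

#include <linux/bpf.h>
#include "bpf_helpers.h"

#define OUTER_HLEN 50	/* placeholder: outer Ethernet + encap headers */

SEC("xdp")
int lookup_inner(struct xdp_md *ctx)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	/* Move the head past the outer headers... */
	if (bpf_xdp_adjust_head(ctx, OUTER_HLEN))
		return XDP_ABORTED;

	/* ... parse the inner headers into tuple, then look up ... */
	sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (sk)
		bpf_sk_release(sk);

	/* ... and restore the original head before returning a verdict. */
	if (bpf_xdp_adjust_head(ctx, -OUTER_HLEN))
		return XDP_ABORTED;

	return XDP_PASS;
}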


Re: [ovs-dev] openvswitch crash on i386

2019-03-05 Thread Joe Stringer
On Tue, Mar 5, 2019 at 2:12 AM Christian Ehrhardt
 wrote:
>
> On Tue, Mar 5, 2019 at 10:58 AM Juerg Haefliger
>  wrote:
> >
> > Hi,
> >
> > Running the following commands in a loop will crash an i386 5.0 kernel
> > typically within a few iterations:
> >
> > ovs-vsctl add-br test
> > ovs-vsctl del-br test
> >
> > [  106.215748] BUG: unable to handle kernel paging request at e8a35f3b
> > [  106.216733] #PF error: [normal kernel read fault]
> > [  106.217464] *pdpt = 19a76001 *pde = 
> > [  106.218346] Oops:  [#1] SMP PTI
> > [  106.218911] CPU: 0 PID: 2050 Comm: systemd-udevd Tainted: GE 
> > 5.0.0 #25
> > [  106.220103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.11.1-1ubuntu1 04/01/2014
> > [  106.221447] EIP: kmem_cache_alloc_trace+0x7a/0x1b0
> > [  106.222178] Code: 01 00 00 8b 07 64 8b 50 04 64 03 05 28 61 e8 d2 8b 08 
> > 89 4d ec 85 c9 0f 84 03 01 00 00 8b 45 ec 8b 5f 14 8d 4a 01 8b 37 01 c3 
> > <33> 1b 33 9f b4 00 00 00 64 0f c7 0e 75 cb 8b 75 ec 8b 47 14 0f 18
> > [  106.224752] EAX: e8a35f3b EBX: e8a35f3b ECX: 869f EDX: 869e
> > [  106.225683] ESI: d2e96ef0 EDI: da401a00 EBP: d9b85dd0 ESP: d9b85db0
> > [  106.226662] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
> > [  106.227710] CR0: 80050033 CR2: e8a35f3b CR3: 185b8000 CR4: 06f0
> > [  106.228703] DR0:  DR1:  DR2:  DR3: 
> > [  106.229604] DR6: fffe0ff0 DR7: 0400
> > [  106.230114] Call Trace:
> > [  106.230525]  ? kernfs_fop_open+0xb4/0x390
> > [  106.231176]  kernfs_fop_open+0xb4/0x390
> > [  106.231856]  ? security_file_open+0x7c/0xc0
> > [  106.232562]  do_dentry_open+0x131/0x370
> > [  106.233229]  ? kernfs_fop_write+0x180/0x180
> > [  106.233905]  vfs_open+0x25/0x30
> > [  106.234432]  path_openat+0x2fd/0x1450
> > [  106.235084]  ? cp_new_stat64+0x115/0x140
> > [  106.235754]  ? cp_new_stat64+0x115/0x140
> > [  106.236427]  do_filp_open+0x6a/0xd0
> > [  106.237026]  ? cp_new_stat64+0x115/0x140
> > [  106.237748]  ? strncpy_from_user+0x3d/0x180
> > [  106.238539]  ? __alloc_fd+0x36/0x120
> > [  106.239256]  do_sys_open+0x175/0x210
> > [  106.239955]  sys_openat+0x1b/0x20
> > [  106.240596]  do_fast_syscall_32+0x7f/0x1e0
> > [  106.241313]  entry_SYSENTER_32+0x6b/0xbe
> > [  106.242017] EIP: 0xb7fae871
> > [  106.242559] Code: 8b 98 58 cd ff ff 89 c8 85 d2 74 02 89 0a 5b 5d c3 8b 
> > 04 24 c3 8b 14 24 c3 8b 34 24 c3 8b 3c 24 c3 51 52 55 89 e5 0f 34 cd 80 
> > <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> > [  106.245551] EAX: ffda EBX: ff9c ECX: bffdcb60 EDX: 00088000
> > [  106.246651] ESI:  EDI: b7f9e000 EBP: 00088000 ESP: bffdc970
> > [  106.247706] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 0246
> > [  106.248851] Modules linked in: openvswitch(E)
> > [  106.249621] CR2: e8a35f3b
> > [  106.250218] ---[ end trace 6a8d05679a59cda7 ]---
> >
> > I've bisected this down to the following commit that seems to have 
> > introduced
> > the issue:
> >
> > commit 120645513f55a4ac5543120d9e79925d30a0156f (refs/bisect/bad)
> > Author: Jarno Rajahalme 
> > Date:   Fri Apr 21 16:48:06 2017 -0700
> >
> > openvswitch: Add eventmask support to CT action.
> >
> > Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
> > which can be used in conjunction with the commit flag
> > (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
> > conntrack events (IPCT_*) should be delivered via the Netfilter
> > netlink multicast groups.  Default behavior depends on the system
> > configuration, but typically a lot of events are delivered.  This can be
> > very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
> > types of events are of interest.
> >
> > Netfilter core init_conntrack() adds the event cache extension, so we
> > only need to set the ctmask value.  However, if the system is
> > configured without support for events, the setting will be skipped due
> > to extension not being found.
> >
> > Signed-off-by: Jarno Rajahalme 
> > Reviewed-by: Greg Rose 
> > Acked-by: Joe Stringer 
> > Signed-off-by: David S. Miller 
>
> Hi Juerg,
> the symptom, the identified breaking commit and actually all of it
> seems to be [1] which James, Joseph and I worked on already.
> I wanted to make you aware of the past context that already exists.

Re: [PATCH bpf-next 2/4] libbpf: Support 32-bit static data loads

2019-02-14 Thread Joe Stringer
On Thu, 14 Feb 2019 at 21:39, Y Song  wrote:
>
> On Mon, Feb 11, 2019 at 4:48 PM Joe Stringer  wrote:
> >
> > Support loads of static 32-bit data when BPF writers make use of
> > convenience macros for accessing static global data variables. A later
> > patch in this series will demonstrate its usage in a selftest.
> >
> > As of LLVM-7, this technique only works with 32-bit data, as LLVM will
> > complain if this technique is attempted with data of other sizes:
> >
> > LLVM ERROR: Unsupported relocation: try to compile with -O2 or above,
> > or check your static variable usage
>
> A little bit of clarification from the compiler side.
> The above compiler error is to prevent people from using static
> variables, since current kernel/libbpf does not handle this. The
> compiler only warns if the .bss or .data section has more than one
> definition. The first definition always has section offset 0 and the
> compiler did not warn.

Ah, interesting. I observed that warning when I attempted to define
global variables of multiple sizes and, I thought, also with sizes
other than 32-bit. This clarifies things a bit, thanks.

For the .bss, my observation was that if you had a definition like:

static int a = 0;

then this will be placed into .bss, which is why I looked into the
approach from this patch for patch 3 as well.

> The restriction is a little strange. Saying it only works with 32-bit
> data is not an accurate statement. The following are some examples.
>
> The following static variable definitions will succeed:
> static int a; /* one in .bss */
> static long b = 2;  /* one in .data */
>
> The following definitions will fail as both in .bss.
> static int a;
> static int b;
>
> The following definitions will fail as both in .data:
> static char a = 2;
> static int b = 3;

Are there type restrictions or something? I've been defining multiple
static uint32_t and using them per the approach in this patch series
without hitting this compiler assertion.
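
For concreteness, the kind of definitions I mean look roughly like this
(a sketch using the __fetch macro from patch 4; the names are
illustrative):

#define __fetch(x) (__u32)(&(x))

static __u32 vip_addr = 0xfffffffe;	/* first 32-bit .data value */
static __u32 svc_port = 0xffff;		/* second one, no assertion hit */

/* ... value = __fetch(vip_addr); bpf_map_update_elem(...); ... */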

> Using global variables can prevent compiler errors.
> maps are defined as globals and the compiler does not
> check whether a particular global variable is defining a map or not.
>
> If you just use static variable like below
> static int a = 2;
> without potential assignment to a, the compiler will replace variable
> a with 2 at compile time.
> An alternative is to define like below
> static volatile int a = 2;
> You can get a "load" for variable "a" in the bpf load even if there is
> no assignment to a.

I'll take a closer look at this too.

> Maybe now is the time to remove the compiler assertions as
> libbpf/kernel starts to
> handle static variables?

I don't understand why those assertions exist in this form. The compiler
already allows code which will not load with libbpf (i.e., anything that
generates a .data/.bss section); does it help prevent unexpected
situations for developers?


Re: [PATCH bpf-next 4/4] selftests/bpf: Test static data relocation

2019-02-12 Thread Joe Stringer
On Mon, Feb 11, 2019 at 9:01 PM Alexei Starovoitov
 wrote:
>
> On Mon, Feb 11, 2019 at 04:47:29PM -0800, Joe Stringer wrote:
> > Add tests for libbpf relocation of static variable references into the
> > .data and .bss sections of the ELF.
> >
> > Signed-off-by: Joe Stringer 
> ...
> > +#define __fetch(x) (__u32)(&(x))
> > +
> > +static __u32 static_bss = 0; /* Reloc reference to .bss section */
> > +static __u32 static_data = 42;   /* Reloc reference to .data section */
> > +
> > +/**
> > + * Load a u32 value from a static variable into a map, for the userland 
> > test
> > + * program to validate.
> > + */
> > +SEC("static_data_load")
> > +int load_static_data(struct __sk_buff *skb)
> > +{
> > + __u32 key, value;
> > +
> > + key = 0;
> > + value = __fetch(static_bss);
>
> If we proceed with this approach we will not be able to add support
> for normal 'value = static_bss;' C code in the future.

Hmm, completely agree that breaking future use of standard code is a
non-starter.

Digging around a bit more, I think I could drop the .bss patch/code
here and still end up with something that will work for my use case.
Just need to ensure that all template values are non-zero when run
through the compiler.
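
For example (a sketch; the value is a placeholder):

/* Keep template values non-zero so LLVM emits them into .data rather
 * than .bss; the real value is substituted in the ELF before loading.
 */
static __u32 template_value = 0xffffffff;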

> Let's figure out the way to do it right from the start.
> Support for global and static variables is must have feature to add asap,
> but let's not cut the corner like this.
> We did such hacks in the past and every time it came back to bite us.

Do you see any value in having incremental support in libbpf that
could be used as a fallback for older kernels like in patch #2 of this
series? I could imagine libbpf probing kernel support for
global/static variables and attempting to handle references to .data
via some more comprehensive mechanism in-kernel, or falling back to
this approach if it is not available.


[PATCH bpf-next 1/4] libbpf: Refactor relocations

2019-02-11 Thread Joe Stringer
Adjust the code for relocations slightly with no functional changes, so
that upcoming patches that will introduce support for relocations into
the .data and .bss sections can be added independent of these changes.

Signed-off-by: Joe Stringer 
---
 tools/lib/bpf/libbpf.c | 62 ++
 1 file changed, 32 insertions(+), 30 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e3c39edfb9d3..1ec28d5154dc 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -852,20 +852,20 @@ static int bpf_object__elf_collect(struct bpf_object 
*obj, int flags)
obj->efile.symbols = data;
obj->efile.strtabidx = sh.sh_link;
}
-   } else if ((sh.sh_type == SHT_PROGBITS) &&
-  (sh.sh_flags & SHF_EXECINSTR) &&
-  (data->d_size > 0)) {
-   if (strcmp(name, ".text") == 0)
-   obj->efile.text_shndx = idx;
-   err = bpf_object__add_program(obj, data->d_buf,
- data->d_size, name, idx);
-   if (err) {
-   char errmsg[STRERR_BUFSIZE];
-   char *cp = libbpf_strerror_r(-err, errmsg,
-sizeof(errmsg));
-
-   pr_warning("failed to alloc program %s (%s): 
%s",
-  name, obj->path, cp);
+   } else if (sh.sh_type == SHT_PROGBITS && data->d_size > 0) {
+   if (sh.sh_flags & SHF_EXECINSTR) {
+   if (strcmp(name, ".text") == 0)
+   obj->efile.text_shndx = idx;
+   err = bpf_object__add_program(obj, data->d_buf,
+ data->d_size, 
name, idx);
+   if (err) {
+   char errmsg[STRERR_BUFSIZE];
+   char *cp = libbpf_strerror_r(-err, 
errmsg,
+
sizeof(errmsg));
+
+   pr_warning("failed to alloc program %s 
(%s): %s",
+  name, obj->path, cp);
+   }
}
} else if (sh.sh_type == SHT_REL) {
void *reloc = obj->efile.reloc;
@@ -1027,24 +1027,26 @@ bpf_program__collect_reloc(struct bpf_program *prog, 
GElf_Shdr *shdr,
return -LIBBPF_ERRNO__RELOC;
}
 
-   /* TODO: 'maps' is sorted. We can use bsearch to make it 
faster. */
-   for (map_idx = 0; map_idx < nr_maps; map_idx++) {
-   if (maps[map_idx].offset == sym.st_value) {
-   pr_debug("relocation: find map %zd (%s) for 
insn %u\n",
-map_idx, maps[map_idx].name, insn_idx);
-   break;
+   if (sym.st_shndx == maps_shndx) {
+   /* TODO: 'maps' is sorted. We can use bsearch to make 
it faster. */
+   for (map_idx = 0; map_idx < nr_maps; map_idx++) {
+   if (maps[map_idx].offset == sym.st_value) {
+   pr_debug("relocation: find map %zd (%s) 
for insn %u\n",
+map_idx, maps[map_idx].name, 
insn_idx);
+   break;
+   }
}
-   }
 
-   if (map_idx >= nr_maps) {
-   pr_warning("bpf relocation: map_idx %d large than %d\n",
-  (int)map_idx, (int)nr_maps - 1);
-   return -LIBBPF_ERRNO__RELOC;
-   }
+   if (map_idx >= nr_maps) {
+   pr_warning("bpf relocation: map_idx %d large 
than %d\n",
+  (int)map_idx, (int)nr_maps - 1);
+   return -LIBBPF_ERRNO__RELOC;
+   }
 
-   prog->reloc_desc[i].type = RELO_LD64;
-   prog->reloc_desc[i].insn_idx = insn_idx;
-   prog->reloc_desc[i].map_idx = map_idx;
+   prog->reloc_desc[i].type = RELO_LD64;
+   prog->reloc_desc[i].insn_idx = insn_idx;
+   prog->reloc_desc[i].map_idx = m

[PATCH bpf-next 2/4] libbpf: Support 32-bit static data loads

2019-02-11 Thread Joe Stringer
Support loads of static 32-bit data when BPF writers make use of
convenience macros for accessing static global data variables. A later
patch in this series will demonstrate its usage in a selftest.

As of LLVM-7, this technique only works with 32-bit data, as LLVM will
complain if this technique is attempted with data of other sizes:

LLVM ERROR: Unsupported relocation: try to compile with -O2 or above,
or check your static variable usage

Based on the proof of concept by Daniel Borkmann (presented at LPC 2018).

Signed-off-by: Joe Stringer 
---
 tools/lib/bpf/libbpf.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1ec28d5154dc..da35d5559b22 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -140,11 +140,13 @@ struct bpf_program {
enum {
RELO_LD64,
RELO_CALL,
+   RELO_DATA,
} type;
int insn_idx;
union {
int map_idx;
int text_off;
+   uint32_t data;
};
} *reloc_desc;
int nr_reloc;
@@ -210,6 +212,7 @@ struct bpf_object {
Elf *elf;
GElf_Ehdr ehdr;
Elf_Data *symbols;
+   Elf_Data *global_data;
size_t strtabidx;
struct {
GElf_Shdr shdr;
@@ -218,6 +221,7 @@ struct bpf_object {
int nr_reloc;
int maps_shndx;
int text_shndx;
+   int data_shndx;
} efile;
/*
 * All loaded bpf_object is linked in a list, which is
@@ -476,6 +480,7 @@ static void bpf_object__elf_finish(struct bpf_object *obj)
obj->efile.elf = NULL;
}
obj->efile.symbols = NULL;
+   obj->efile.global_data = NULL;
 
zfree(&obj->efile.reloc);
obj->efile.nr_reloc = 0;
@@ -866,6 +871,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj, 
int flags)
pr_warning("failed to alloc program %s 
(%s): %s",
   name, obj->path, cp);
}
+   } else if (strcmp(name, ".data") == 0) {
+   obj->efile.global_data = data;
+   obj->efile.data_shndx = idx;
}
} else if (sh.sh_type == SHT_REL) {
void *reloc = obj->efile.reloc;
@@ -962,6 +970,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, 
GElf_Shdr *shdr,
Elf_Data *symbols = obj->efile.symbols;
int text_shndx = obj->efile.text_shndx;
int maps_shndx = obj->efile.maps_shndx;
+   int data_shndx = obj->efile.data_shndx;
struct bpf_map *maps = obj->maps;
size_t nr_maps = obj->nr_maps;
int i, nrels;
@@ -1000,8 +1009,9 @@ bpf_program__collect_reloc(struct bpf_program *prog, 
GElf_Shdr *shdr,
 (long long) (rel.r_info >> 32),
 (long long) sym.st_value, sym.st_name);
 
-   if (sym.st_shndx != maps_shndx && sym.st_shndx != text_shndx) {
-   pr_warning("Program '%s' contains non-map related relo 
data pointing to section %u\n",
+   if (sym.st_shndx != maps_shndx && sym.st_shndx != text_shndx &&
+   sym.st_shndx != data_shndx) {
+   pr_warning("Program '%s' contains unrecognized relo 
data pointing to section %u\n",
   prog->section_name, sym.st_shndx);
return -LIBBPF_ERRNO__RELOC;
}
@@ -1046,6 +1056,20 @@ bpf_program__collect_reloc(struct bpf_program *prog, 
GElf_Shdr *shdr,
prog->reloc_desc[i].type = RELO_LD64;
prog->reloc_desc[i].insn_idx = insn_idx;
prog->reloc_desc[i].map_idx = map_idx;
+   } else if (sym.st_shndx == data_shndx) {
+   Elf_Data *global_data = obj->efile.global_data;
+   uint32_t *static_data;
+
+   if (sym.st_value + sizeof(uint32_t) > 
(int)global_data->d_size) {
+   pr_warning("bpf relocation: static data load 
beyond data size %lu\n",
+  global_data->d_size);
+   return -LIBBPF_ERRNO__RELOC;
+   }
+
+   static_data = global_data->d_buf + sym.st_value;
+   prog->reloc_desc[i].type = RELO_DATA;
+   

[PATCH bpf-next 4/4] selftests/bpf: Test static data relocation

2019-02-11 Thread Joe Stringer
Add tests for libbpf relocation of static variable references into the
.data and .bss sections of the ELF.

Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/Makefile  |  2 +-
 tools/testing/selftests/bpf/test_progs.c  | 44 +
 .../selftests/bpf/test_static_data_kern.c | 47 +++
 3 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_static_data_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index c7e1e3255448..ef52a58e2368 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -36,7 +36,7 @@ BPF_OBJ_FILES = \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o \
-   test_sock_fields_kern.o
+   test_sock_fields_kern.o test_static_data_kern.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index c52bd90fbb34..72899d58a77c 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -736,6 +736,49 @@ static void test_pkt_md_access(void)
bpf_object__close(obj);
 }
 
+static void test_static_data_access(void)
+{
+   const char *file = "./test_static_data_kern.o";
+   struct bpf_object *obj;
+   __u32 duration = 0, retval;
+   int i, err, prog_fd, map_fd;
+   uint32_t value;
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &prog_fd);
+   if (CHECK(err, "load program", "error %d loading %s\n", err, file))
+   return;
+
+   map_fd = bpf_find_map(__func__, obj, "result");
+   if (map_fd < 0) {
+   error_cnt++;
+   goto close_prog;
+   }
+
+   err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+   NULL, NULL, &retval, &duration);
+   CHECK(err || retval, "pass packet",
+ "err %d errno %d retval %d duration %d\n",
+ err, errno, retval, duration);
+
+   struct {
+   char *name;
+   uint32_t key;
+   uint32_t value;
+   } tests[] = {
+   { "relocate .bss reference", 0, 0 },
+   { "relocate .data reference", 1, 42 },
+   };
+   for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
+   err = bpf_map_lookup_elem(map_fd, &tests[i].key, &value);
+   CHECK (err || value != tests[i].value, tests[i].name,
+  "err %d result %d expected %d\n",
+  err, value, tests[i].value);
+   }
+
+close_prog:
+   bpf_object__close(obj);
+}
+
 static void test_obj_name(void)
 {
struct {
@@ -2138,6 +2181,7 @@ int main(void)
test_flow_dissector();
test_spinlock();
test_map_lock();
+   test_static_data_access();
 
printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_static_data_kern.c 
b/tools/testing/selftests/bpf/test_static_data_kern.c
new file mode 100644
index ..f2485af6bd0b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_static_data_kern.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Isovalent, Inc.
+
+#include 
+#include 
+
+#include 
+
+#include "bpf_helpers.h"
+
+#define NUM_CGROUP_LEVELS  4
+
+struct bpf_map_def SEC("maps") result = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(__u32),
+   .max_entries = 2,
+};
+
+#define __fetch(x) (__u32)(&(x))
+
+static __u32 static_bss = 0;   /* Reloc reference to .bss section */
+static __u32 static_data = 42; /* Reloc reference to .data section */
+
+/**
+ * Load a u32 value from a static variable into a map, for the userland test
+ * program to validate.
+ */
+SEC("static_data_load")
+int load_static_data(struct __sk_buff *skb)
+{
+   __u32 key, value;
+
+   key = 0;
+   value = __fetch(static_bss);
+   bpf_map_update_elem(&result, &key, &value, 0);
+
+   key = 1;
+   value = __fetch(static_data);
+   bpf_map_update_elem(&result, &key, &value, 0);
+
+   return TC_ACT_OK;
+}
+
+int _version SEC("version") = 1;
+
+char _license[] SEC("license") = "GPL";
-- 
2.19.1



[PATCH bpf-next 0/4] libbpf: Add support for 32-bit static data

2019-02-11 Thread Joe Stringer
This series adds support to libbpf for relocating references to 32-bit
static data inside ELF files, both for .data and .bss, similar to one of
the approaches proposed in LPC 2018[0]. This improves a common workflow
for BPF users, where the BPF program may be customised each time it is
loaded, for example to tailor IP addresses for each instance of the
loaded program. Current approaches require full recompilation of the
programs for each load; with templatized BPF programs, however, one ELF
template program may be generated, and the static data can then be
substituted prior to loading into the kernel without invoking the
compiler again.

The approach here is useful for templating limited static data for ELF
programs, and will work regardless of kernel support for static data
sections. Its main limitation is that static data must be defined as
32-bit values in the BPF C input code (or defined using macros that use
32-bit values as the underlying store). The alternative approach
proposed at LPC would be more general and is being actively explored;
however, it requires a kernel extension and so will not solve this
problem for any existing kernels that are in use today.
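
As an illustration, a template program enabled by this series might look
like the sketch below (illustrative names only; it assumes this series is
applied and a loader that rewrites the four bytes of allowed_saddr in the
ELF's .data section before handing the object to libbpf):

#include <stddef.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"

#define __fetch(x) (__u32)(&(x))

/* Placeholder, patched per instance; non-zero so it lands in .data. */
static __u32 allowed_saddr = 0xffffffff;

SEC("classifier")
int templated_filter(struct __sk_buff *skb)
{
	__u32 saddr = 0;

	if (bpf_skb_load_bytes(skb, ETH_HLEN + offsetof(struct iphdr, saddr),
			       &saddr, sizeof(saddr)))
		return TC_ACT_OK;

	/* libbpf inlines the (patched) .data value via the relocation */
	return saddr == __fetch(allowed_saddr) ? TC_ACT_OK : TC_ACT_SHOT;
}

char _license[] SEC("license") = "GPL";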

There are similar patches floating around for iproute2 which I would
like to upstream as well[1].

[0] https://linuxplumbersconf.org/event/2/contributions/115/
[1] https://github.com/joestringer/iproute2/tree/bss

Joe Stringer (4):
  libbpf: Refactor relocations
  libbpf: Support 32-bit static data loads
  libbpf: Support relocations for bss.
  selftests/bpf: Test static data relocation

 tools/lib/bpf/libbpf.c| 108 --
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  |  44 +++
 .../selftests/bpf/test_static_data_kern.c |  47 
 4 files changed, 168 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_static_data_kern.c

-- 
2.19.1



[PATCH bpf-next 3/4] libbpf: Support relocations for bss.

2019-02-11 Thread Joe Stringer
The BSS section in an ELF generated by LLVM holds uninitialized
variables and variables that are initialized to a zero value. Support
such zeroed static data by parsing the relocations that reference the
.bss section and zeroing the corresponding loads.

Signed-off-by: Joe Stringer 
---
 tools/lib/bpf/libbpf.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index da35d5559b22..ff66d7e970c9 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -141,6 +141,7 @@ struct bpf_program {
RELO_LD64,
RELO_CALL,
RELO_DATA,
+   RELO_ZERO,
} type;
int insn_idx;
union {
@@ -222,6 +223,7 @@ struct bpf_object {
int maps_shndx;
int text_shndx;
int data_shndx;
+   int bss_shndx;
} efile;
/*
 * All loaded bpf_object is linked in a list, which is
@@ -901,6 +903,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj, 
int flags)
obj->efile.reloc[n].shdr = sh;
obj->efile.reloc[n].data = data;
}
+   } else if (sh.sh_type == SHT_NOBITS && strcmp(name, ".bss") == 
0) {
+   obj->efile.bss_shndx = idx;
} else {
pr_debug("skip section(%d) %s\n", idx, name);
}
@@ -971,6 +975,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, 
GElf_Shdr *shdr,
int text_shndx = obj->efile.text_shndx;
int maps_shndx = obj->efile.maps_shndx;
int data_shndx = obj->efile.data_shndx;
+   int bss_shndx = obj->efile.bss_shndx;
struct bpf_map *maps = obj->maps;
size_t nr_maps = obj->nr_maps;
int i, nrels;
@@ -1010,7 +1015,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, 
GElf_Shdr *shdr,
 (long long) sym.st_value, sym.st_name);
 
if (sym.st_shndx != maps_shndx && sym.st_shndx != text_shndx &&
-   sym.st_shndx != data_shndx) {
+   sym.st_shndx != data_shndx && sym.st_shndx != bss_shndx) {
pr_warning("Program '%s' contains unrecognized relo 
data pointing to section %u\n",
   prog->section_name, sym.st_shndx);
return -LIBBPF_ERRNO__RELOC;
@@ -1070,6 +1075,9 @@ bpf_program__collect_reloc(struct bpf_program *prog, 
GElf_Shdr *shdr,
prog->reloc_desc[i].type = RELO_DATA;
prog->reloc_desc[i].insn_idx = insn_idx;
prog->reloc_desc[i].data = *static_data;
+   } else if (sym.st_shndx == bss_shndx) {
+   prog->reloc_desc[i].type = RELO_ZERO;
+   prog->reloc_desc[i].insn_idx = insn_idx;
}
}
return 0;
@@ -1429,6 +1437,10 @@ bpf_program__relocate(struct bpf_program *prog, struct 
bpf_object *obj)
 
insn_idx = prog->reloc_desc[i].insn_idx;
insns[insn_idx].imm = prog->reloc_desc[i].data;
+   } else if (prog->reloc_desc[i].type == RELO_ZERO) {
+   int insn_idx = prog->reloc_desc[i].insn_idx;
+
+   prog->insns[insn_idx].imm = 0;
}
}
 
-- 
2.19.1



Re: [PATCH bpf] bpf: Fix narrow load on a bpf_sock returned from sk_lookup()

2019-02-09 Thread Joe Stringer
On Fri, 8 Feb 2019 at 22:27, Martin KaFai Lau  wrote:
>
> By adding this test to test_verifier:
> {
> "reference tracking: access sk->src_ip4 (narrow load)",
> .insns = {
> BPF_SK_LOOKUP,
> BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
> BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
> BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0, offsetof(struct bpf_sock, 
> src_ip4) + 2),
> BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
> BPF_EMIT_CALL(BPF_FUNC_sk_release),
> BPF_EXIT_INSN(),
> },
> .prog_type = BPF_PROG_TYPE_SCHED_CLS,
> .result = ACCEPT,
> },
>
> The above test loads 2 bytes from sk->src_ip4 where
> sk is obtained by bpf_sk_lookup_tcp().
>
> It hits an internal verifier error from convert_ctx_accesses():
> [root@arch-fb-vm1 bpf]# ./test_verifier 665 665
> Failed to load prog 'Invalid argument'!
> 0: (b7) r2 = 0
> 1: (63) *(u32 *)(r10 -8) = r2
> 2: (7b) *(u64 *)(r10 -16) = r2
> 3: (7b) *(u64 *)(r10 -24) = r2
> 4: (7b) *(u64 *)(r10 -32) = r2
> 5: (7b) *(u64 *)(r10 -40) = r2
> 6: (7b) *(u64 *)(r10 -48) = r2
> 7: (bf) r2 = r10
> 8: (07) r2 += -48
> 9: (b7) r3 = 36
> 10: (b7) r4 = 0
> 11: (b7) r5 = 0
> 12: (85) call bpf_sk_lookup_tcp#84
> 13: (bf) r6 = r0
> 14: (15) if r0 == 0x0 goto pc+3
>  R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 
> fp-8= fp-16= fp-24= fp-32= fp-40= 
> fp-48= refs=1
> 15: (69) r2 = *(u16 *)(r0 +26)
> 16: (bf) r1 = r6
> 17: (85) call bpf_sk_release#86
> 18: (95) exit
>
> from 14 to 18: safe
> processed 20 insns (limit 131072), stack depth 48
> bpf verifier is misconfigured
> Summary: 0 PASSED, 0 SKIPPED, 1 FAILED
>
> The bpf_sock_is_valid_access() expects that src_ip4 can be narrowly
> loaded (meaning a load of any 1 or 2 bytes of src_ip4) by
> marking info->ctx_field_size.  However, this marked
> ctx_field_size is not used.  This patch fixes it.
>
> Due to the recent refactoring in test_verifier,
> this new test will be added to the bpf-next branch
> (together with the bpf_tcp_sock patchset)
> to avoid merge conflict.
>
> Fixes: c64b7983288e ("bpf: Add PTR_TO_SOCKET verifier type")
> Cc: Joe Stringer 
> Signed-off-by: Martin KaFai Lau 
> ---

Nice find, thanks.

Acked-by: Joe Stringer 
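
For reference, the C-level shape of the narrow load that this fix allows
would be roughly as follows (a sketch; the tuple parsing is elided and
the final use of the value is contrived to keep the load live):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"

SEC("classifier")
int narrow_load_example(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;
	__u16 half = 0;

	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (sk) {
		/* 2-byte load at offset +2 into the 4-byte src_ip4 field */
		half = *((__u16 *)&sk->src_ip4 + 1);
		bpf_sk_release(sk);
	}
	return half ? TC_ACT_OK : TC_ACT_SHOT;
}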


Re: [PATCHv4 bpf-next 07/13] bpf: Add reference tracking to verifier

2019-01-10 Thread Joe Stringer
[resend as apparently I untoggled the "plain text" option in gmail...]

On Tue, 8 Jan 2019 at 23:42, Alexei Starovoitov
 wrote:
>
> On Tue, Oct 02, 2018 at 01:35:35PM -0700, Joe Stringer wrote:
> > Allow helper functions to acquire a reference and return it into a
> > register. Specific pointer types such as the PTR_TO_SOCKET will
> > implicitly represent such a reference. The verifier must ensure that
> > these references are released exactly once in each path through the
> > program.
> >
> > To achieve this, this commit assigns an id to the pointer and tracks it
> > in the 'bpf_func_state', then when the function or program exits,
> > verifies that all of the acquired references have been freed. When the
> > pointer is passed to a function that frees the reference, it is removed
> > from the 'bpf_func_state` and all existing copies of the pointer in
> > registers are marked invalid.
> >
> > Signed-off-by: Joe Stringer 
> > Acked-by: Alexei Starovoitov 
> ...
> > +static void release_reg_references(struct bpf_verifier_env *env,
> > +struct bpf_func_state *state, int id)
> > +{
> > + struct bpf_reg_state *regs = state->regs, *reg;
> > + int i;
> > +
> > + for (i = 0; i < MAX_BPF_REG; i++)
> > + if (regs[i].id == id)
> > + mark_reg_unknown(env, regs, i);
> > +
> > + bpf_for_each_spilled_reg(i, state, reg) {
> > + if (!reg)
> > + continue;
> > + if (reg_is_refcounted(reg) && reg->id == id)
> > + __mark_reg_unknown(reg);
> > + }
> > +}
>
> Hi Joe,
>
> I've been looking at this function again and wondering why second reg->id == 
> id
> check needs additional reg_is_refcounted() check?
> No tests have failed when I removed it.
> If reg->id is equal to id being released we need to clear the reg regardless
> whethere it's in the registers (the first loop) or
> whether it's in the stack (second loop).
> I think when reg->id == id that reg is guaranteed to be refcounted.
> Would you agree?

Yes, the id should only ever be allocated once, so if we end up
attempting to release register references to it, that should already
imply that the id identifies a reference and not some other pointer
type.

>
> I propose to simply remove that unnecessary reg_is_refcounted(reg) check.
> We can replace it with warn_on, but imo that's overkill.
> I'm thinking to repurpose release_reg_references() function for my 
> bpf_spin_lock work
> and that check is in the way.

SGTM, looks entirely unnecessary to me.
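
As a sketch of what this reference tracking enforces on the program
side (an illustrative fragment, not taken from the patch itself):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"

SEC("classifier")
int ref_tracking_example(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return TC_ACT_OK;	/* NULL: no reference held on this path */

	/* ... use sk; every path from here must release exactly once ... */

	bpf_sk_release(sk);	/* omitting this fails verification */
	return TC_ACT_OK;
}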

Cheers,
Joe


Re: [PATCH bpf-next] selftests/bpf: Fix sk lookup usage in test_sock_addr

2018-12-13 Thread Joe Stringer
On Thu, 13 Dec 2018 at 13:19, Andrey Ignatov  wrote:
>
> The semantics of the netns_id argument of bpf_sk_lookup_tcp and
> bpf_sk_lookup_udp were changed (fixed) in f71c6143c203. Corresponding
> changes have to be applied to all call sites in selftests. This patch
> fixes the corresponding call sites in the test_sock_addr test: pass
> BPF_F_CURRENT_NETNS instead of 0 in the netns_id argument.
>
> Fixes: f71c6143c203 ("bpf: Support sk lookup in netns with id 0")
> Reported-by: Yonghong Song 
> Signed-off-by: Andrey Ignatov 

Acked-by: Joe Stringer 


Re: [PATCHv3 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-12-01 Thread Joe Stringer
On Fri, 30 Nov 2018 at 17:36, Alexei Starovoitov
 wrote:
>
> On Fri, Nov 30, 2018 at 03:32:20PM -0800, Joe Stringer wrote:
> > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > the netns with id 0, it is resolving to the current netns. This renders
> > the netns_id 0 inaccessible.
> >
> > To fix this, adjust the API for the netns to treat all negative s32
> > values as a lookup in the current netns (including u64 values which when
> > truncated to s32 become negative), while any values with a positive
> > value in the signed 32-bit integer space would result in a lookup for a
> > socket in the netns corresponding to that id. As before, if the netns
> > with that ID does not exist, no socket will be found. Any netns outside
> > of these ranges will fail to find a corresponding socket, as those
> > values are reserved for future usage.
> >
> > Signed-off-by: Joe Stringer 
> > Acked-by: Nicolas Dichtel 
>
> Applied both. Thanks everyone.
>
> Joe, please provide a cover letter 0/N next time for the series
> or if they're really separate patches submit them one by one.

OK thanks, I'll keep that in mind.


[PATCHv3 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the API for the netns to treat all negative s32
values as a lookup in the current netns (including u64 values which when
truncated to s32 become negative), while any values with a positive
value in the signed 32-bit integer space would result in a lookup for a
socket in the netns corresponding to that id. As before, if the netns
with that ID does not exist, no socket will be found. Any netns outside
of these ranges will fail to find a corresponding socket, as those
values are reserved for future usage.

Signed-off-by: Joe Stringer 
Acked-by: Nicolas Dichtel 
---
 include/uapi/linux/bpf.h  | 35 ++---
 net/core/filter.c | 11 +++---
 tools/include/uapi/linux/bpf.h| 39 ---
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 -
 5 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..ad68b472dad2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2170,7 +2170,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for TCP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2187,12 +2187,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * will be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2202,7 +2204,7 @@ union bpf_attr {
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for UDP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2219,12 +2221,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flag

[PATCHv3 bpf 2/2] bpf: Improve socket lookup reuseport documentation

2018-11-30 Thread Joe Stringer
Improve the wording around socket lookup for reuseport sockets, and
ensure that both bpf.h headers are in sync.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 4 
 tools/include/uapi/linux/bpf.h | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ad68b472dad2..47f620b5cc5c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2203,6 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2237,6 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index de2072ef475b..47f620b5cc5c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2203,8 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2239,8 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
-- 
2.19.1



Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
On Fri, 30 Nov 2018 at 15:27, Alexei Starovoitov
 wrote:
>
> On Fri, Nov 30, 2018 at 03:18:25PM -0800, Joe Stringer wrote:
> > On Fri, 30 Nov 2018 at 14:42, Alexei Starovoitov
> >  wrote:
> > >
> > > On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> > > > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > > > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > > > the netns with id 0, it is resolving to the current netns. This renders
> > > > the netns_id 0 inaccessible.
> > > >
> > > > To fix this, adjust the API for the netns to treat all negative s32
> > > > values as a lookup in the current netns, while any values with a
> > > > positive value in the signed 32-bit integer space would result in a
> > > > lookup for a socket in the netns corresponding to that id. As before, if
> > > > the netns with that ID does not exist, no socket will be found.
> > > > Furthermore, if any bits are set in the upper 32-bits, then no socket
> > > > will be found.
> > > >
> > > > Signed-off-by: Joe Stringer 
> > > ..
> > > > +/* Current network namespace */
> > > > +#define BPF_CURRENT_NETNS(-1L)
> > >
> > > I was about to apply it, but then noticed that the name doesn't match
> > > the rest of the names.
> > > Could you rename it to BPF_F_CURRENT_NETNS ?
> >
> > I skipped the F_ part since it's not really a flag, it's a value. I
> > can put it back though.
>
> BPF_F_ prefix has smaller chance of conflicts.
> I wish we did that sooner.
> In retrospect BPF_ANY, BPF_EXIST were poorly picked names.

OK, I'll send out a v3 shortly.


Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
On Fri, 30 Nov 2018 at 14:42, Alexei Starovoitov
 wrote:
>
> On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > the netns with id 0, it is resolving to the current netns. This renders
> > the netns_id 0 inaccessible.
> >
> > To fix this, adjust the API for the netns to treat all negative s32
> > values as a lookup in the current netns, while any values with a
> > positive value in the signed 32-bit integer space would result in a
> > lookup for a socket in the netns corresponding to that id. As before, if
> > the netns with that ID does not exist, no socket will be found.
> > Furthermore, if any bits are set in the upper 32-bits, then no socket
> > will be found.
> >
> > Signed-off-by: Joe Stringer 
> ..
> > +/* Current network namespace */
> > +#define BPF_CURRENT_NETNS(-1L)
>
> I was about to apply it, but then noticed that the name doesn't match
> the rest of the names.
> Could you rename it to BPF_F_CURRENT_NETNS ?

I skipped the F_ part since it's not really a flag, it's a value. I
can put it back though.

> Also reword the commit log so it's less misleading.

Can do.

Cheers,
Joe


Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer
On Thu, 29 Nov 2018 at 16:30, Joe Stringer  wrote:
>
> David Ahern and Nicolas Dichtel report that the handling of the netns id
> 0 is incorrect for the BPF socket lookup helpers: rather than finding
> the netns with id 0, it is resolving to the current netns. This renders
> the netns_id 0 inaccessible.
>
> To fix this, adjust the API for the netns to treat all negative s32
> values as a lookup in the current netns, while any values with a
> positive value in the signed 32-bit integer space would result in a
> lookup for a socket in the netns corresponding to that id. As before, if
> the netns with that ID does not exist, no socket will be found.
> Furthermore, if any bits are set in the upper 32-bits, then no socket
> will be found.

This last sentence is a little misleading; it only applies if the
highest bit in the lower 32 bits is 0.
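
Spelling out the cases (a sketch of the assumed (s32) truncation
semantics, with illustrative values):

/*
 * netns_id = 0x00000000ffffffffULL -> (s32)-1 < 0 -> current netns
 * netns_id = 0x0000000180000000ULL -> lower 32 bits negative as s32
 *                                     -> current netns, upper bits ignored
 * netns_id = 0x0000000100000005ULL -> lower half is 5 (>= 0) but the
 *                                     value exceeds S32_MAX -> no socket
 */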


[PATCHv2 bpf 2/2] bpf: Improve socket lookup reuseport documentation

2018-11-29 Thread Joe Stringer
Improve the wording around socket lookup for reuseport sockets, and
ensure that both bpf.h headers are in sync.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 4 
 tools/include/uapi/linux/bpf.h | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 38924b306e9f..b73d574356f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2203,6 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2237,6 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 465ad585c836..b73d574356f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2203,8 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2239,8 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
-- 
2.17.1



[PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-29 Thread Joe Stringer
David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the API for the netns to treat all negative s32
values as a lookup in the current netns, while any values with a
positive value in the signed 32-bit integer space would result in a
lookup for a socket in the netns corresponding to that id. As before, if
the netns with that ID does not exist, no socket will be found.
Furthermore, if any bits are set in the upper 32-bits, then no socket
will be found.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h  | 35 ++---
 net/core/filter.c | 11 +++---
 tools/include/uapi/linux/bpf.h| 39 ---
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 -
 5 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..38924b306e9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2170,7 +2170,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for TCP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2187,12 +2187,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2202,7 +2204,7 @@ union bpf_attr {
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for UDP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2219,12 +2221,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2405,6 +2409,9 @@ enum bpf_f

Re: [PATCH bpf] bpf: Support sk lookup in netns with id 0

2018-11-28 Thread Joe Stringer
On Tue, 27 Nov 2018 at 13:12, Alexei Starovoitov
 wrote:
>
> On Tue, Nov 27, 2018 at 10:01:40AM -0800, Joe Stringer wrote:
> > On Tue, 27 Nov 2018 at 06:49, Nicolas Dichtel  
> > wrote:
> > >
> > > On 26/11/2018 at 23:08, David Ahern wrote:
> > > > On 11/26/18 2:27 PM, Joe Stringer wrote:
> > > >> @@ -2405,6 +2407,9 @@ enum bpf_func_id {
> > > >>  /* BPF_FUNC_perf_event_output for sk_buff input context. */
> > > >>  #define BPF_F_CTXLEN_MASK   (0xfULL << 32)
> > > >>
> > > >> +/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
> > > >> +#define BPF_F_SK_CURRENT_NS 0x8000 /* For netns field */
> > > >> +
> > > >
> > > > I went down the nsid road because it will be needed for other use cases
> > > > (e.g., device lookups), and we should have a general API for network
> > > > namespaces. Given that, I think the _SK should be dropped from the name.
> >
> > Fair point, I'll drop _SK from the name
> >
> > > >
> > > Would it not be possible to have a s32 instead of an u32 for the coming 
> > > APIs?
> > > It would be better to match the current netlink and kernel APIs.
> >
> > Sure, I'll look into this.
> >
> > I had earlier considered whether it's worth attempting to leave the
> > upper 32 bits of this parameter open for potential future expansion,
> > but at this point I'm not taking that into consideration. If anyone
> > has preferences or thoughts on that I'd be interested to hear them.
>
> Can we keep u64 as an argument type and do
> if ((s32)netns_id < 0) {
>   net = caller_net;
> } else {
>   if (netns_id > S32_MAX)
> goto err;
>   net = get_net_ns_by_id(caller_net, netns_id);
> }
>
> No need for extra macro in such case and passing -1 would match the rest of 
> the kernel.
> Upper 32-bit would still be open for future expansion.

Sounds good.


Re: [PATCH bpf] bpf: Support sk lookup in netns with id 0

2018-11-27 Thread Joe Stringer
On Tue, 27 Nov 2018 at 06:49, Nicolas Dichtel  wrote:
>
> On 26/11/2018 at 23:08, David Ahern wrote:
> > On 11/26/18 2:27 PM, Joe Stringer wrote:
> >> @@ -2405,6 +2407,9 @@ enum bpf_func_id {
> >>  /* BPF_FUNC_perf_event_output for sk_buff input context. */
> >>  #define BPF_F_CTXLEN_MASK   (0xfULL << 32)
> >>
> >> +/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
> >> +#define BPF_F_SK_CURRENT_NS 0x80000000 /* For netns field */
> >> +
> >
> > I went down the nsid road because it will be needed for other use cases
> > (e.g., device lookups), and we should have a general API for network
> > namespaces. Given that, I think the _SK should be dropped from the name.

Fair point, I'll drop _SK from the name

> >
> Would it not be possible to have a s32 instead of an u32 for the coming APIs?
> It would be better to match the current netlink and kernel APIs.

Sure, I'll look into this.

I had earlier considered whether it's worth attempting to leave the
upper 32 bits of this parameter open for potential future expansion,
but at this point I'm not taking that into consideration. If anyone
has preferences or thoughts on that I'd be interested to hear them.


[PATCH bpf] bpf: Support sk lookup in netns with id 0

2018-11-26 Thread Joe Stringer
David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the API for the netns to treat all u32 values with
the highest bit set (BPF_F_SK_CURRENT_NS) as a lookup in the current
netns, while any lower value (including zero) results in a lookup for a
socket in the netns corresponding to that id. As
before, if the netns with that ID does not exist, no socket will be
found.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h  | 29 +---
 net/core/filter.c | 16 -
 tools/include/uapi/linux/bpf.h| 33 ---
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 +-
 4 files changed, 55 insertions(+), 41 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..543945d520b9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2187,12 +2187,13 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this is in the netns of the device in the skb. For socket hooks,
- * this is in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is **BPF_F_SK_CURRENT_NS** or greater, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is less than **BPF_F_SK_CURRENT_NS**, then it
+ * specifies the ID of the netns relative to the netns associated
+ * with the *ctx*.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2219,12 +2220,13 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this is in the netns of the device in the skb. For socket hooks,
- * this is in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is **BPF_F_SK_CURRENT_NS** or greater, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is less than **BPF_F_SK_CURRENT_NS**, then it
+ * specifies the ID of the netns relative to the netns associated
+ * with the *ctx*.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2405,6 +2407,9 @@ enum bpf_func_id {
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
 #define BPF_F_CTXLEN_MASK  (0xfULL << 32)
 
+/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
+#define BPF_F_SK_CURRENT_NS	0x80000000 /* For netns field */
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
diff --git a/net/core/filter.c b/net/core/filter.c
index 9a1327eb25fa..8c8a7ad3f5e6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4882,7 +4882,7 @@ static struct sock *sk_lookup(struct net *net, struct 
bpf_sock_tuple *tuple,
  */
 static unsigned long
 bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
- u8 proto, u64 netns_id, u64 flags)
+ u8 proto, u32 netns_id, u64 flags)
 {
struct net *caller_net;
struct sock *sk = NULL;
@@ -4890,22 +4890,22 @@ bpf_sk_lookup(struct sk_buff *skb, struct 
bpf_sock_tuple *tuple, u32 len,
struct net *net;
 
family = len == sizeof(tuple->ipv4) ? AF_INET : AF_INET6;
-   if (unlikely(family == AF_UNSPEC || netns_id > U32_MAX || flags))
+   if (unlikely(family == AF_UNSPEC || flags))
goto out;
 
if (skb->dev)
caller_net = dev_net(skb->dev);
else
caller_net = sock_net(skb->sk);
-   if (netns_id) {
+   if (netns_id & BPF_F_SK_CURRENT_NS) {
+

Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-19 Thread Joe Stringer
On Mon, 19 Nov 2018 at 12:54, Joe Stringer  wrote:
>
> On Mon, 19 Nov 2018 at 12:29, Nicolas Dichtel  wrote:
> >
> > On 19/11/2018 at 20:54, David Ahern wrote:
> > > On 11/19/18 12:47 PM, Joe Stringer wrote:
> > >> On Mon, 19 Nov 2018 at 10:39, David Ahern  wrote:
> > >>>
> > >>> On 11/19/18 11:36 AM, Joe Stringer wrote:
> > >>>> Hi David, thanks for pointing this out.
> > >>>>
> > >>>> This is more of an oversight through iterations, the runtime lookup
> > >>>> will fail to find a socket if the netns value is greater than the
> > >>>> range of a uint32 so I think it would actually make more sense to drop
> > >>>> the parameter size to u32 rather than u64 so that this would be
> > >>>> validated at load time rather than silently returning NULL because of
> > >>>> a bad parameter.
> > >>>
> > >>> ok. I was wondering if it was a u64 to handle nsid of 0 which as I
> > >>> understand it is a legal nsid. If you drop to u32, how do you know when
> > >>> nsid has been set?
> > >>
> > >> I was operating under the assumption that 0 represents the root netns
> > >> id, and cannot be assigned to another non-root netns.
> > >>
> > >> Looking at __peernet2id_alloc(), it seems to me like it attempts to
> > >> find a netns and if it cannot find one, returns 0, which then leads to
> > >> a scroll over the idr starting from 0 to INT_MAX to find a legitimate
> > >> id for the netns, so I think this is a fair assumption?
> > The NET_ID_ZERO trick is used to manage nsid 0 in net_eq_idr()
> > (idr_for_each() stops when the callback returns != 0).
> >
> > >>
> > >
> > > Maybe Nicolas can give a definitive answer; as I recall he added the
> > > NSID option. I have not had time to walk the code. But I do recall
> > > seeing an id of 0. e.g, on my dev box:
> > > $ ip netns
> > > vms (id: 0)
> > >
> > > And include/uapi/linux/net_namespace.h shows -1 as not assigned.
> > Yes, 0 is a valid value and can be assigned to any netns.
> > nsid are signed 32-bit values. Note that -1 (NETNSA_NSID_NOT_ASSIGNED) is used
> > by the kernel to express that the nsid is not assigned. It can also be used by
> > the user to let the kernel choose a nsid.
> >
> > $ ip netns add foo
> > $ ip netns add bar
> > $ ip netns
> > bar
> > foo
> > $ ip netns set foo 0
> > $ ip netns set bar auto
> > $ ip netns
> > bar (id: 1)
> > foo (id: 0)
>
> OK, I'll fix this up then.

Here's what I have in mind:

@@ -2221,12 +2222,13 @@ union bpf_attr {
 * **sizeof**\ (*tuple*\ **->ipv6**)
 * Look for an IPv6 socket.
 *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this is in the netns of the device in the skb. For socket hooks,
- * this is in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is **BPF_F_SK_CURRENT_NS** or greater, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is less than **BPF_F_SK_CURRENT_NS**, then it
+ * specifies the ID of the netns relative to the netns associated
+ * with the *ctx*.
 *
 * All values for *flags* are reserved for future usage, and must
 * be left at zero.
@@ -2409,6 +2411,9 @@ enum bpf_func_id {
/* BPF_FUNC_perf_event_output for sk_buff input context. */
#define BPF_F_CTXLEN_MASK  (0xfULL << 32)

+/* BPF_FUNC_sk_lookup_tcp and BPF_FUNC_sk_lookup_udp flags. */
+#define BPF_F_SK_CURRENT_NS	0x80000000 /* For netns argument */
+
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
   BPF_ADJ_ROOM_NET,

Plus adjusting all of the internal types and the helper headers to use
u32. With the highest bit used to specify that the netns should be the
current netns, all other netns IDs should be available.


Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-19 Thread Joe Stringer
On Mon, 19 Nov 2018 at 12:29, Nicolas Dichtel  wrote:
>
> On 19/11/2018 at 20:54, David Ahern wrote:
> > On 11/19/18 12:47 PM, Joe Stringer wrote:
> >> On Mon, 19 Nov 2018 at 10:39, David Ahern  wrote:
> >>>
> >>> On 11/19/18 11:36 AM, Joe Stringer wrote:
> >>>> Hi David, thanks for pointing this out.
> >>>>
> >>>> This is more of an oversight through iterations, the runtime lookup
> >>>> will fail to find a socket if the netns value is greater than the
> >>>> range of a uint32 so I think it would actually make more sense to drop
> >>>> the parameter size to u32 rather than u64 so that this would be
> >>>> validated at load time rather than silently returning NULL because of
> >>>> a bad parameter.
> >>>
> >>> ok. I was wondering if it was a u64 to handle nsid of 0 which as I
> >>> understand it is a legal nsid. If you drop to u32, how do you know when
> >>> nsid has been set?
> >>
> >> I was operating under the assumption that 0 represents the root netns
> >> id, and cannot be assigned to another non-root netns.
> >>
> >> Looking at __peernet2id_alloc(), it seems to me like it attempts to
> >> find a netns and if it cannot find one, returns 0, which then leads to
> >> a scan over the idr starting from 0 to INT_MAX to find a legitimate
> >> id for the netns, so I think this is a fair assumption?
> The NET_ID_ZERO trick is used to manage nsid 0 in net_eq_idr() (idr_for_each()
> stops when the callback returns != 0).
>
> >>
> >
> > Maybe Nicolas can give a definitive answer; as I recall he added the
> > NSID option. I have not had time to walk the code. But I do recall
> > seeing an id of 0. e.g, on my dev box:
> > $ ip netns
> > vms (id: 0)
> >
> > And include/uapi/linux/net_namespace.h shows -1 as not assigned.
> Yes, 0 is a valid value and can be assigned to any netns.
> nsid are signed 32 bit values. Note that -1 (NETNSA_NSID_NOT_ASSIGNED) is used
> by the kernel to express that the nsid is not assigned. It can also be used by
> the user to let the kernel choose a nsid.
>
> $ ip netns add foo
> $ ip netns add bar
> $ ip netns
> bar
> foo
> $ ip netns set foo 0
> $ ip netns set bar auto
> $ ip netns
> bar (id: 1)
> foo (id: 0)

OK, I'll fix this up then.


Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-19 Thread Joe Stringer
On Mon, 19 Nov 2018 at 10:39, David Ahern  wrote:
>
> On 11/19/18 11:36 AM, Joe Stringer wrote:
> > Hi David, thanks for pointing this out.
> >
> > This is more of an oversight through iterations, the runtime lookup
> > will fail to find a socket if the netns value is greater than the
> > range of a uint32 so I think it would actually make more sense to drop
> > the parameter size to u32 rather than u64 so that this would be
> > validated at load time rather than silently returning NULL because of
> > a bad parameter.
>
> ok. I was wondering if it was a u64 to handle nsid of 0 which as I
> understand it is a legal nsid. If you drop to u32, how do you know when
> nsid has been set?

I was operating under the assumption that 0 represents the root netns
id, and cannot be assigned to another non-root netns.

Looking at __peernet2id_alloc(), it seems to me like it attempts to
find a netns and if it cannot find one, returns 0, which then leads to
a scan over the idr starting from 0 to INT_MAX to find a legitimate
id for the netns, so I think this is a fair assumption?


Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-19 Thread Joe Stringer
Hi David, thanks for pointing this out.

This is more of an oversight through iterations, the runtime lookup
will fail to find a socket if the netns value is greater than the
range of a uint32 so I think it would actually make more sense to drop
the parameter size to u32 rather than u64 so that this would be
validated at load time rather than silently returning NULL because of
a bad parameter.

I'll send a patch to bpf tree.

Cheers,
Joe

On Sun, 18 Nov 2018 at 19:27, David Ahern  wrote:
>
> Hi Joe:
>
> The netns_id to the bpf_sk_lookup_{tcp,udp} functions in
> net/core/filter.c is a u64, yet the APIs in include/uapi/linux/bpf.h
> shows a u32. Is that intentional or an oversight through the iterations?
>
> David


[PATCH bpf-next] selftests/bpf: Fix uninitialized duration warning

2018-11-09 Thread Joe Stringer
Daniel Borkmann reports:

test_progs.c: In function ‘main’:
test_progs.c:81:3: warning: ‘duration’ may be used uninitialized in this function [-Wmaybe-uninitialized]
   printf("%s:PASS:%s %d nsec\n", __func__, tag, duration);\
   ^~
test_progs.c:1706:8: note: ‘duration’ was declared here
  __u32 duration;
^~~~

Signed-off-by: Joe Stringer 
---

I'm actually not able to reproduce this with GCC 7.3 or 8.2, so I'll
rely on review to establish that this patch works as intended.
---
 tools/testing/selftests/bpf/test_progs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 2d3c04f45530..c1e688f61061 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1703,7 +1703,7 @@ static void test_reference_tracking()
const char *file = "./test_sk_lookup_kern.o";
struct bpf_object *obj;
struct bpf_program *prog;
-   __u32 duration;
+   __u32 duration = 0;
int err = 0;
 
obj = bpf_object__open(file);
-- 
2.17.1



Re: [PATCH bpf] bpf: Fix IPv6 dport byte order in bpf_sk_lookup_udp

2018-11-07 Thread Joe Stringer
On Wed, 7 Nov 2018 at 13:37, Andrey Ignatov  wrote:
>
> Lookup functions in sk_lookup have different expectations about byte
> order of provided arguments.
>
> Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
> expect dport to be in network byte order and do ntohs(dport) internally.
>
> At the same time __inet6_lookup expects dport to be in host byte order
> and correspondingly name the argument hnum.
>
> sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
> __inet6_lookup with regard to dport. But in __udp6_lib_lookup case it
> uses host instead of expected network byte order. It makes result
> returned by bpf_sk_lookup_udp for IPv6 incorrect.
>
> The patch fixes byte order of dport passed to __udp6_lib_lookup.
>
> Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f02a
> fixes TCPv6 but breaks UDPv6.
>
> Fixes: 5ef0ae84f02a ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
> Signed-off-by: Andrey Ignatov 

Thanks for the fix, makes sense.

Acked-by: Joe Stringer 
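
With the fix, both address families take the *tuple* ports in network byte
order, so BPF callers never swap them by hand. A minimal sketch of a caller
(the section name and the selftests' bpf_helpers.h/bpf_endian.h wrappers are
assumptions, not part of the patch):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

SEC("classifier")
int lookup_udp6_example(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	/* Tuple fields stay in network byte order for v4 and v6 alike;
	 * the helper performs any ntohs() internally.
	 */
	tuple.ipv6.daddr[3] = bpf_htonl(1);	/* ::1 */
	tuple.ipv6.dport = bpf_htons(443);

	sk = bpf_sk_lookup_udp(skb, &tuple, sizeof(tuple.ipv6), 0, 0);
	if (sk)
		bpf_sk_release(sk);
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";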


Re: [PATCH bpf-next] bpf: Extend the sk_lookup() helper to XDP hookpoint.

2018-10-19 Thread Joe Stringer
On Thu, 18 Oct 2018 at 22:07, Martin Lau  wrote:
>
> On Thu, Oct 18, 2018 at 04:52:40PM -0700, Joe Stringer wrote:
> > On Thu, 18 Oct 2018 at 14:20, Daniel Borkmann  wrote:
> > >
> > > On 10/18/2018 11:06 PM, Joe Stringer wrote:
> > > > On Thu, 18 Oct 2018 at 11:54, Nitin Hande  wrote:
> > > [...]
> > > >> Open Issue
> > > >> * The underlying code relies on presence of an skb to find out the
> > > >> right sk for the case of REUSEPORT socket option. Since there is
> > > >> no skb available at XDP hookpoint, the helper function will return
> > > >> the first available sk based off the 5 tuple hash. If the desire
> > > >> is to return a particular sk matching reuseport_cb function, please
> > > >> suggest a way to tackle it, which can be addressed in a future commit.
> > >
> > > >> Signed-off-by: Nitin Hande 
> > > >
> > > > Thanks Nitin, LGTM overall.
> > > >
> > > > The REUSEPORT thing suggests that the usage of this helper from XDP
> > > > layer may lead to a different socket being selected vs. the equivalent
> > > > call at TC hook, or other places where the selection may occur. This
> > > > could be a bit counter-intuitive.
> > > >
> > > > One thought I had to work around this was to introduce a flag,
> > > > something like BPF_F_FIND_REUSEPORT_SK_BY_HASH. This flag would
> > > > effectively communicate in the API that the bpf_sk_lookup_xxx()
> > > > functions will only select a REUSEPORT socket based on the hash and
> > > > not by, for example BPF_PROG_TYPE_SK_REUSEPORT programs. The absence
> > > > of the flag would support finding REUSEPORT sockets by other
> > > > mechanisms (which would be allowed for now from TC hooks but would be
> > > > disallowed from XDP, since there's no specific plan to support this).
> > >
> > > Hmm, given skb is NULL here the only way to lookup the socket in such
> > > scenario is based on hash, that is, inet_ehashfn() / inet6_ehashfn(),
> > > perhaps an alternative is to pass this hash in from XDP itself to the
> > > helper so it could be a custom selector. Do you have a specific use case
> > > on this for XDP (just curious)?
> >
> > I don't have a use case for SO_REUSEPORT introspection from XDP, so
> > I'm primarily thinking from the perspective of making the behaviour
> > clear in the API in a way that leaves open the possibility for a
> > reasonable implementation in future. From that perspective, my main
> > concern is that it may surprise some BPF writers that the same
> > "bpf_sk_lookup_tcp()" call (with identical parameters) may have
> > different behaviour at TC vs. XDP layers, as the BPF selection of
> > sockets is respected at TC but not at XDP.
> >
> > FWIW we're already out of parameters for the actual call, so if we
> > wanted to allow passing a hash in, we'd need to either dedicate half
> > the 'flags' field for this configurable hash, or consider adding the
> > new hash parameter to 'struct bpf_sock_tuple'.
> >
> > +Martin for any thoughts on SO_REUSEPORT and XDP here.
> The XDP/TC prog has read access to the sk fields through
> 'struct bpf_sock'?
>
> A quick thought...
> Considering all sk in the same reuse->socks[] share
> many things (e.g. family,type,protocol,ip,port..etc are the same),
> I wonder whether it matters too much which particular sk is returned
> from reuse->socks[], since most of the fields from 'struct bpf_sock'
> will be the same.  Some of the fields in 'struct bpf_sock' could be
> different
> though, like priority?  Hence, another possibility is to limit the
> accessible fields for the XDP prog.  Only allow accessing the fields
> that must be the same among the sk in the same reuse->socks[].

This sounds pretty reasonable to me.


Re: [PATCH bpf-next] bpf: Extend the sk_lookup() helper to XDP hookpoint.

2018-10-18 Thread Joe Stringer
On Thu, 18 Oct 2018 at 14:20, Daniel Borkmann  wrote:
>
> On 10/18/2018 11:06 PM, Joe Stringer wrote:
> > On Thu, 18 Oct 2018 at 11:54, Nitin Hande  wrote:
> [...]
> >> Open Issue
> >> * The underlying code relies on presence of an skb to find out the
> >> right sk for the case of REUSEPORT socket option. Since there is
> >> no skb available at XDP hookpoint, the helper function will return
> >> the first available sk based off the 5 tuple hash. If the desire
> >> is to return a particular sk matching reuseport_cb function, please
> >> suggest a way to tackle it, which can be addressed in a future commit.
>
> >> Signed-off-by: Nitin Hande 
> >
> > Thanks Nitin, LGTM overall.
> >
> > The REUSEPORT thing suggests that the usage of this helper from XDP
> > layer may lead to a different socket being selected vs. the equivalent
> > call at TC hook, or other places where the selection may occur. This
> > could be a bit counter-intuitive.
> >
> > One thought I had to work around this was to introduce a flag,
> > something like BPF_F_FIND_REUSEPORT_SK_BY_HASH. This flag would
> > effectively communicate in the API that the bpf_sk_lookup_xxx()
> > functions will only select a REUSEPORT socket based on the hash and
> > not by, for example BPF_PROG_TYPE_SK_REUSEPORT programs. The absence
> > of the flag would support finding REUSEPORT sockets by other
> > mechanisms (which would be allowed for now from TC hooks but would be
> > disallowed from XDP, since there's no specific plan to support this).
>
> Hmm, given skb is NULL here the only way to lookup the socket in such
> scenario is based on hash, that is, inet_ehashfn() / inet6_ehashfn(),
> perhaps an alternative is to pass this hash in from XDP itself to the
> helper so it could be a custom selector. Do you have a specific use case
> on this for XDP (just curious)?

I don't have a use case for SO_REUSEPORT introspection from XDP, so
I'm primarily thinking from the perspective of making the behaviour
clear in the API in a way that leaves open the possibility for a
reasonable implementation in future. From that perspective, my main
concern is that it may surprise some BPF writers that the same
"bpf_sk_lookup_tcp()" call (with identical parameters) may have
different behaviour at TC vs. XDP layers, as the BPF selection of
sockets is respected at TC but not at XDP.

FWIW we're already out of parameters for the actual call, so if we
wanted to allow passing a hash in, we'd need to either dedicate half
the 'flags' field for this configurable hash, or consider adding the
new hash parameter to 'struct bpf_sock_tuple'.

+Martin for any thoughts on SO_REUSEPORT and XDP here.


Re: [PATCH bpf-next] bpf: Extend the sk_lookup() helper to XDP hookpoint.

2018-10-18 Thread Joe Stringer
On Thu, 18 Oct 2018 at 11:54, Nitin Hande  wrote:
>
>
> This patch proposes to extend the sk_lookup() BPF API to the
> XDP hookpoint. The sk_lookup() helper supports a lookup
> on incoming packet to find the corresponding socket that will
> receive this packet. Current support for this BPF API is
> at the tc hookpoint. This patch will extend this API at XDP
> hookpoint. A XDP program can map the incoming packet to the
> 5-tuple parameter and invoke the API to find the corresponding
> socket structure.
>
> Open Issue
> * The underlying code relies on presence of an skb to find out the
> right sk for the case of REUSEPORT socket option. Since there is
> no skb available at XDP hookpoint, the helper function will return
> the first available sk based off the 5 tuple hash. If the desire
> is to return a particular sk matching reuseport_cb function, please
> suggest a way to tackle it, which can be addressed in a future commit.
>
> Signed-off-by: Nitin Hande 

Thanks Nitin, LGTM overall.

The REUSEPORT thing suggests that the usage of this helper from XDP
layer may lead to a different socket being selected vs. the equivalent
call at TC hook, or other places where the selection may occur. This
could be a bit counter-intuitive.

One thought I had to work around this was to introduce a flag,
something like BPF_F_FIND_REUSEPORT_SK_BY_HASH. This flag would
effectively communicate in the API that the bpf_sk_lookup_xxx()
functions will only select a REUSEPORT socket based on the hash and
not by, for example BPF_PROG_TYPE_SK_REUSEPORT programs. The absence
of the flag would support finding REUSEPORT sockets by other
mechanisms (which would be allowed for now from TC hooks but would be
disallowed from XDP, since there's no specific plan to support this).
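
To illustrate the call pattern the proposal would enable from XDP, here is a
sketch only: the parsing, section name, and drop policy are illustrative, and
the helper declarations are assumed to come from the selftests' bpf_helpers.h:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

SEC("xdp")
int xdp_drop_unbound_tcp(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);
	struct tcphdr *th = data + sizeof(*eth) + sizeof(*iph);
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	/* One bounds check covers eth, iph and th. */
	if ((void *)(th + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP) || iph->protocol != IPPROTO_TCP)
		return XDP_PASS;
	/* Assumes no IP options for brevity. */

	tuple.ipv4.saddr = iph->saddr;
	tuple.ipv4.daddr = iph->daddr;
	tuple.ipv4.sport = th->source;
	tuple.ipv4.dport = th->dest;

	/* With no skb, a REUSEPORT group resolves by hash only. */
	sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4), 0, 0);
	if (!sk)
		return XDP_DROP;	/* nothing listening locally */
	bpf_sk_release(sk);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";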


[PATCH bpf-next 1/2] bpf: Allow sk_lookup with IPv6 module

2018-10-15 Thread Joe Stringer
This is a more complete fix than d71019b54bff ("net: core: Fix build
with CONFIG_IPV6=m"), so that IPv6 sockets may be looked up if the IPv6
module is loaded (not just if it's compiled in).

Signed-off-by: Joe Stringer 
---
 include/net/addrconf.h |  5 +
 net/core/filter.c  | 12 +++-
 net/ipv6/af_inet6.c|  1 +
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 6def0351bcc3..14b789a123e7 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -265,6 +265,11 @@ extern const struct ipv6_stub *ipv6_stub __read_mostly;
 struct ipv6_bpf_stub {
int (*inet6_bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len,
  bool force_bind_address_no_port, bool with_lock);
+	struct sock *(*udp6_lib_lookup)(struct net *net,
+					const struct in6_addr *saddr, __be16 sport,
+					const struct in6_addr *daddr, __be16 dport,
+					int dif, int sdif, struct udp_table *tbl,
+					struct sk_buff *skb);
 };
 extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index b844761b5d4c..21aba2a521c7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4842,7 +4842,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport,
   dst4, tuple->ipv4.dport,
   dif, sdif, &udp_table, skb);
-#if IS_REACHABLE(CONFIG_IPV6)
+#if IS_ENABLED(CONFIG_IPV6)
} else {
struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
@@ -4853,10 +4853,12 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
src6, tuple->ipv6.sport,
dst6, tuple->ipv6.dport,
dif, sdif, &refcounted);
-   else
-   sk = __udp6_lib_lookup(net, src6, tuple->ipv6.sport,
-  dst6, tuple->ipv6.dport,
-  dif, sdif, &udp_table, skb);
+   else if (likely(ipv6_bpf_stub))
+   sk = ipv6_bpf_stub->udp6_lib_lookup(net,
+						    src6, tuple->ipv6.sport,
+						    dst6, tuple->ipv6.dport,
+   dif, sdif,
+   &udp_table, skb);
 #endif
}
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index e9c8cfdf4b4c..3f4d61017a69 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -901,6 +901,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
 
 static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
.inet6_bind = __inet6_bind,
+   .udp6_lib_lookup = __udp6_lib_lookup,
 };
 
 static int __init inet6_init(void)
-- 
2.17.1



[PATCH bpf-next 0/2] IPv6 sk-lookup fixes

2018-10-15 Thread Joe Stringer
This series includes a couple of fixups for the IPv6 socket lookup
helper, to make the API more consistent (always supply all arguments in
network byte-order) and to allow its use when IPv6 is compiled as a
module.

Joe Stringer (2):
  bpf: Allow sk_lookup with IPv6 module
  bpf: Fix IPv6 dport byte-order in bpf_sk_lookup

 include/net/addrconf.h |  5 +
 net/core/filter.c  | 15 +--
 net/ipv6/af_inet6.c|  1 +
 3 files changed, 15 insertions(+), 6 deletions(-)

-- 
2.17.1



[PATCH bpf-next 2/2] bpf: Fix IPv6 dport byte-order in bpf_sk_lookup

2018-10-15 Thread Joe Stringer
Commit 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
mistakenly passed the destination port in network byte-order to the IPv6
TCP/UDP socket lookup functions, which meant that BPF writers would need
to either manually swap the byte-order of this field or otherwise IPv6
sockets could not be located via this helper.

Fix the issue by swapping the byte-order appropriately in the helper.
This also makes the API more consistent with the IPv4 version.

Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
Signed-off-by: Joe Stringer 
---
 net/core/filter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 21aba2a521c7..d877c4c599ce 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4846,17 +4846,18 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
} else {
struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
+   u16 hnum = ntohs(tuple->ipv6.dport);
int sdif = inet6_sdif(skb);
 
if (proto == IPPROTO_TCP)
sk = __inet6_lookup(net, &tcp_hashinfo, skb, 0,
src6, tuple->ipv6.sport,
-   dst6, tuple->ipv6.dport,
+   dst6, hnum,
dif, sdif, &refcounted);
else if (likely(ipv6_bpf_stub))
sk = ipv6_bpf_stub->udp6_lib_lookup(net,
						    src6, tuple->ipv6.sport,
-						    dst6, tuple->ipv6.dport,
+   dst6, hnum,
dif, sdif,
&udp_table, skb);
 #endif
-- 
2.17.1



[PATCH bpf-next] bpf: Fix dev pointer dereference from sk_skb

2018-10-12 Thread Joe Stringer
Dan Carpenter reports:

The patch 6acc9b432e67: "bpf: Add helper to retrieve socket in BPF"
from Oct 2, 2018, leads to the following Smatch complaint:

net/core/filter.c:4893 bpf_sk_lookup()
error: we previously assumed 'skb->dev' could be null (see line 4885)

Fix this issue by checking skb->dev before using it.

Signed-off-by: Joe Stringer 
---
 net/core/filter.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 4bbc6567fcb8..b844761b5d4c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4821,9 +4821,12 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
  struct sk_buff *skb, u8 family, u8 proto)
 {
-   int dif = skb->dev->ifindex;
bool refcounted = false;
struct sock *sk = NULL;
+   int dif = 0;
+
+   if (skb->dev)
+   dif = skb->dev->ifindex;
 
if (family == AF_INET) {
__be32 src4 = tuple->ipv4.saddr;
-- 
2.17.1



Re: [PATCH bpf-next] net: core: Fix build with CONFIG_IPV6=m

2018-10-04 Thread Joe Stringer
On Thu, 4 Oct 2018 at 01:48, Daniel Borkmann  wrote:
>
> On 10/03/2018 07:32 AM, Joe Stringer wrote:
> > Stephen Rothwell reports the following link failure with IPv6 as module:
> >
> >   x86_64-linux-gnu-ld: net/core/filter.o: in function `sk_lookup':
> >   (.text+0x19219): undefined reference to `__udp6_lib_lookup'
> >
> > Fix the build by only enabling the IPv6 socket lookup if IPv6 support is
> > compiled into the kernel.
> >
> > Signed-off-by: Joe Stringer 
> > ---
> >  net/core/filter.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 591c698bc517..30c6b2d3ef16 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4838,7 +4838,7 @@ struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
> >   sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport,
> >  dst4, tuple->ipv4.dport,
> >  dif, sdif, &udp_table, skb);
> > -#if IS_ENABLED(CONFIG_IPV6)
> > +#if IS_REACHABLE(CONFIG_IPV6)
> >   } else {
> >   struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
> >   struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
> >
>
> Applied as a quick fix, thanks Joe, but ideally this should also work when 
> ipv6
> is compiled as a module. There's the ipv6_bpf_stub, which does that job for 
> other
> helpers that would call into v6 code out of the builtin filter.c, so I think 
> we
> should follow the same approach here as well. See commit d74bad4e74ee ("bpf:
> Hooks for sys_connect").

Thanks for the pointers, I'll look into that.

To confirm my understanding, is it possible to unload the IPv6 module?
I don't see any code that uninitializes "ipv6_bpf_stub". Seems like a
simple conditional check on that variable should be enough to gate its
usage from packet paths where sk_lookup could be invoked (given that
the system could receive any packets, including IPv6 when the module
is not loaded).

Cheers,
Joe
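
For readers unfamiliar with the stub approach Daniel refers to: builtin code
keeps a pointer that stays NULL until the module's init publishes the ops, and
since the pointer is never cleared afterwards, a NULL check on the packet path
suffices. A standalone C model of the pattern (this is an illustration, not
kernel code; all names here are invented):

#include <stdio.h>

struct ipv6_stub_ops {
	int (*udp6_lookup)(int port);
};

/* NULL until the "module" loads; never reset afterwards. */
static const struct ipv6_stub_ops *ipv6_stub;

static int real_udp6_lookup(int port) { return port == 53; }

static const struct ipv6_stub_ops ops = { .udp6_lookup = real_udp6_lookup };

static void module_init_ipv6(void) { ipv6_stub = &ops; }

static int sk_lookup_udp6(int port)
{
	if (ipv6_stub)			/* the likely(ipv6_bpf_stub) guard */
		return ipv6_stub->udp6_lookup(port);
	return 0;			/* IPv6 not loaded: no socket found */
}

int main(void)
{
	printf("before load: %d\n", sk_lookup_udp6(53));
	module_init_ipv6();
	printf("after load:  %d\n", sk_lookup_udp6(53));
	return 0;
}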


[PATCH bpf-next] net: core: Fix build with CONFIG_IPV6=m

2018-10-02 Thread Joe Stringer
Stephen Rothwell reports the following link failure with IPv6 as module:

  x86_64-linux-gnu-ld: net/core/filter.o: in function `sk_lookup':
  (.text+0x19219): undefined reference to `__udp6_lib_lookup'

Fix the build by only enabling the IPv6 socket lookup if IPv6 support is
compiled into the kernel.

Signed-off-by: Joe Stringer 
---
 net/core/filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 591c698bc517..30c6b2d3ef16 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4838,7 +4838,7 @@ struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport,
   dst4, tuple->ipv4.dport,
   dif, sdif, &udp_table, skb);
-#if IS_ENABLED(CONFIG_IPV6)
+#if IS_REACHABLE(CONFIG_IPV6)
} else {
struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
-- 
2.17.1



[PATCHv4 bpf-next 09/13] selftests/bpf: Generalize dummy program types

2018-10-02 Thread Joe Stringer
Don't hardcode the dummy program types to SOCKET_FILTER type, as this
prevents testing bpf_tail_call in conjunction with other program types.
Instead, use the program type specified in the test case.

Signed-off-by: Joe Stringer 
---
v3: New patch.
v4: No change.
---
 tools/testing/selftests/bpf/test_verifier.c | 31 +++--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 6e0b3f148cdb..163fd1c0062c 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -12652,18 +12652,18 @@ static int create_map(uint32_t type, uint32_t size_key,
return fd;
 }
 
-static int create_prog_dummy1(void)
+static int create_prog_dummy1(enum bpf_prog_type prog_type)
 {
struct bpf_insn prog[] = {
BPF_MOV64_IMM(BPF_REG_0, 42),
BPF_EXIT_INSN(),
};
 
-   return bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog,
+   return bpf_load_program(prog_type, prog,
ARRAY_SIZE(prog), "GPL", 0, NULL, 0);
 }
 
-static int create_prog_dummy2(int mfd, int idx)
+static int create_prog_dummy2(enum bpf_prog_type prog_type, int mfd, int idx)
 {
struct bpf_insn prog[] = {
BPF_MOV64_IMM(BPF_REG_3, idx),
@@ -12674,11 +12674,12 @@ static int create_prog_dummy2(int mfd, int idx)
BPF_EXIT_INSN(),
};
 
-   return bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog,
+   return bpf_load_program(prog_type, prog,
ARRAY_SIZE(prog), "GPL", 0, NULL, 0);
 }
 
-static int create_prog_array(uint32_t max_elem, int p1key)
+static int create_prog_array(enum bpf_prog_type prog_type, uint32_t max_elem,
+int p1key)
 {
int p2key = 1;
int mfd, p1fd, p2fd;
@@ -12690,8 +12691,8 @@ static int create_prog_array(uint32_t max_elem, int p1key)
return -1;
}
 
-   p1fd = create_prog_dummy1();
-   p2fd = create_prog_dummy2(mfd, p2key);
+   p1fd = create_prog_dummy1(prog_type);
+   p2fd = create_prog_dummy2(prog_type, mfd, p2key);
if (p1fd < 0 || p2fd < 0)
goto out;
if (bpf_map_update_elem(mfd, &p1key, &p1fd, BPF_ANY) < 0)
@@ -12748,8 +12749,8 @@ static int create_cgroup_storage(bool percpu)
 
 static char bpf_vlog[UINT_MAX >> 8];
 
-static void do_test_fixup(struct bpf_test *test, struct bpf_insn *prog,
- int *map_fds)
+static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
+ struct bpf_insn *prog, int *map_fds)
 {
int *fixup_map1 = test->fixup_map1;
int *fixup_map2 = test->fixup_map2;
@@ -12805,7 +12806,7 @@ static void do_test_fixup(struct bpf_test *test, struct bpf_insn *prog,
}
 
if (*fixup_prog1) {
-   map_fds[4] = create_prog_array(4, 0);
+   map_fds[4] = create_prog_array(prog_type, 4, 0);
do {
prog[*fixup_prog1].imm = map_fds[4];
fixup_prog1++;
@@ -12813,7 +12814,7 @@ static void do_test_fixup(struct bpf_test *test, struct bpf_insn *prog,
}
 
if (*fixup_prog2) {
-   map_fds[5] = create_prog_array(8, 7);
+   map_fds[5] = create_prog_array(prog_type, 8, 7);
do {
prog[*fixup_prog2].imm = map_fds[5];
fixup_prog2++;
@@ -12859,11 +12860,13 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
for (i = 0; i < MAX_NR_MAPS; i++)
map_fds[i] = -1;
 
-   do_test_fixup(test, prog, map_fds);
+   if (!prog_type)
+   prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+   do_test_fixup(test, prog_type, prog, map_fds);
prog_len = probe_filter_length(prog);
 
-   fd_prog = bpf_verify_program(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER,
-				     prog, prog_len, test->flags & F_LOAD_WITH_STRICT_ALIGNMENT,
+   fd_prog = bpf_verify_program(prog_type, prog, prog_len,
+test->flags & F_LOAD_WITH_STRICT_ALIGNMENT,
 "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 1);
 
expected_ret = unpriv && test->result_unpriv != UNDEF ?
-- 
2.17.1



[PATCHv4 bpf-next 11/13] libbpf: Support loading individual progs

2018-10-02 Thread Joe Stringer
Allow the individual program load to be invoked. This will help with
testing, where a single ELF may contain several sections, some of which
denote subprograms that are expected to fail verification, along with
some which are expected to pass verification. By allowing programs to be
iterated and individually loaded, each program can be independently
checked against its expected verification result.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 tools/lib/bpf/libbpf.c | 4 ++--
 tools/lib/bpf/libbpf.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 425d5ca45c97..9e68fd9fcfca 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -228,7 +228,7 @@ struct bpf_object {
 };
 #define obj_elf_valid(o)   ((o)->efile.elf)
 
-static void bpf_program__unload(struct bpf_program *prog)
+void bpf_program__unload(struct bpf_program *prog)
 {
int i;
 
@@ -1375,7 +1375,7 @@ load_program(enum bpf_prog_type type, enum 
bpf_attach_type expected_attach_type,
return ret;
 }
 
-static int
+int
 bpf_program__load(struct bpf_program *prog,
  char *license, u32 kern_version)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 511c1294dcbf..2ed24d3f80b3 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -128,10 +128,13 @@ void bpf_program__set_ifindex(struct bpf_program *prog, 
__u32 ifindex);
 
 const char *bpf_program__title(struct bpf_program *prog, bool needs_copy);
 
+int bpf_program__load(struct bpf_program *prog, char *license,
+ u32 kern_version);
 int bpf_program__fd(struct bpf_program *prog);
 int bpf_program__pin_instance(struct bpf_program *prog, const char *path,
  int instance);
 int bpf_program__pin(struct bpf_program *prog, const char *path);
+void bpf_program__unload(struct bpf_program *prog);
 
 struct bpf_insn;
 
-- 
2.17.1
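
A sketch of how a test harness might drive the newly exported calls: open an
ELF, walk its programs with the existing bpf_object__for_each_program()
iterator, and load each one individually against a per-section expectation.
The "fail" prefix convention, error handling, and include path are
illustrative assumptions:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <bpf/libbpf.h>	/* tools/lib/bpf; include path may differ in-tree */

static int check_all_progs(const char *file)
{
	struct bpf_object *obj;
	struct bpf_program *prog;

	obj = bpf_object__open(file);
	if (libbpf_get_error(obj))
		return -1;

	bpf_object__for_each_program(prog, obj) {
		const char *title = bpf_program__title(prog, false);
		bool expect_fail = strncmp(title, "fail", 4) == 0;
		int err = bpf_program__load(prog, "GPL", 0);

		/* Expected-to-fail sections must be rejected; others must load. */
		printf("%s: %s\n", title,
		       !!err == expect_fail ? "OK" : "MISMATCH");
		bpf_program__unload(prog);
	}
	bpf_object__close(obj);
	return 0;
}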



[PATCHv4 bpf-next 08/13] bpf: Add helper to retrieve socket in BPF

2018-10-02 Thread Joe Stringer
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp(), which allow BPF programs to find out if there is a
socket listening on this host, and returns a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.

By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:

  struct bpf_sock_tuple tuple;
  struct bpf_sock *sk;

  populate_tuple(ctx, &tuple); // Extract the 5-tuple from the packet
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (!sk) {
    // Couldn't find a socket listening for this traffic. Drop.
    return TC_ACT_SHOT;
  }
  bpf_sk_release(sk);
  return TC_ACT_OK;

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
v2: Rework 'struct bpf_sock_tuple' to allow passing a packet pointer.
Limit netns_id field to 32 bits.
Fix compile error with CONFIG_IPV6 enabled.
Allow direct packet access from helper.
v3: Fix release of caller_net when netns is not specified.
Use skb->sk to find caller net when skb->dev is unavailable.
Remove flags argument to sk_release().
Define the semantics of the new helpers more clearly.
v4: Add ack from Alexei.
---
 include/uapi/linux/bpf.h  |  93 -
 kernel/bpf/verifier.c |   8 +-
 net/core/filter.c | 151 ++
 tools/include/uapi/linux/bpf.h|  93 -
 tools/testing/selftests/bpf/bpf_helpers.h |  12 ++
 5 files changed, 354 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2070d819e04..f9187b41dff6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2144,6 +2144,77 @@ union bpf_attr {
  * request in the skb.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags)
+ * Description
+ * Look for TCP socket matching *tuple*, optionally in a child
+ * network namespace *netns*. The return value must be checked,
+ * and if non-NULL, released via **bpf_sk_release**\ ().
+ *
+ * The *ctx* should point to the context of the program, such as
+ * the skb or socket (depending on the hook in use). This is used
+ * to determine the base network namespace for the lookup.
+ *
+ * *tuple_size* must be one of:
+ *
+ * **sizeof**\ (*tuple*\ **->ipv4**)
+ * Look for an IPv4 socket.
+ * **sizeof**\ (*tuple*\ **->ipv6**)
+ * Look for an IPv6 socket.
+ *
+ * If the *netns* is zero, then the socket lookup table in the
+ * netns associated with the *ctx* will be used. For the TC hooks,
+ * this is in the netns of the device in the skb. For socket hooks,
+ * this is in the netns of the socket. If *netns* is non-zero, then
+ * it specifies the ID of the netns relative to the netns
+ * associated with the *ctx*.
+ *
+ * All values for *flags* are reserved for future usage, and must
+ * be left at zero.
+ *
+ * This helper is available only if the kernel was compiled with
+ * **CONFIG_NET** configuration option.
+ * Return
+ * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ *
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags)
+ * Description
+ * Look for UDP socket matching *tuple*, optionally in a child
+ * network namespace *netns*. The return value must be checked,
+ * and if non-NULL, released via **bpf_sk_release**\ ().
+ *
+ * The *ctx* should point to the context of the program, such as
+ * the skb or socket (depending on the hook in use). This is used
+ * to determine the base network namespace for the lookup.
+ *
+ * *tuple_size* must be one of:
+ *
+ * **sizeof**\ (*tuple*\ **->ipv4**)
+ * Look for an IPv4 socket.
+ * **sizeof**\ (*tuple*\ **->ipv6**)
+ * Look for an IPv6 socket.
+ *
+ * If the *netns* is zero, then the socket lookup table in the
+ * netns associated with the *ctx* will be used. For the TC hooks,
+ * this is in the netns of the device in the skb. For socket hooks,
+ * this is in the netns of

[PATCHv4 bpf-next 10/13] selftests/bpf: Add tests for reference tracking

2018-10-02 Thread Joe Stringer
reference tracking: leak potential reference
reference tracking: leak potential reference on stack
reference tracking: leak potential reference on stack 2
reference tracking: zero potential reference
reference tracking: copy and zero potential references
reference tracking: release reference without check
reference tracking: release reference
reference tracking: release reference twice
reference tracking: release reference twice inside branch
reference tracking: alloc, check, free in one subbranch
reference tracking: alloc, check, free in both subbranches
reference tracking in call: free reference in subprog
reference tracking in call: free reference in subprog and outside
reference tracking in call: alloc & leak reference in subprog
reference tracking in call: alloc in subprog, release outside
reference tracking in call: sk_ptr leak into caller stack
reference tracking in call: sk_ptr spill into caller stack
reference tracking: allow LD_ABS
reference tracking: forbid LD_ABS while holding reference
reference tracking: allow LD_IND
reference tracking: forbid LD_IND while holding reference
reference tracking: check reference or tail call
reference tracking: release reference then tail call
reference tracking: leak possible reference over tail call
reference tracking: leak checked reference over tail call
reference tracking: mangle and release sock_or_null
reference tracking: mangle and release sock
reference tracking: access member
reference tracking: write to member
reference tracking: invalid 64-bit access of member
reference tracking: access after release
reference tracking: direct access for lookup
unpriv: spill/fill of different pointers stx - ctx and sock
unpriv: spill/fill of different pointers stx - leak sock
unpriv: spill/fill of different pointers stx - sock and ctx (read)
unpriv: spill/fill of different pointers stx - sock and ctx (write)

Signed-off-by: Joe Stringer 

---
v3: Rebase against bpf_sk_release() flags argument removal.
Removed Alexei's ack since there are many new tests:
* "reference tracking: allow LD_ABS",
* "reference tracking: forbid LD_ABS while holding reference",
* "reference tracking: allow LD_IND",
* "reference tracking: forbid LD_IND while holding reference",
* "reference tracking: check reference or tail call",
* "reference tracking: release reference then tail call",
* "reference tracking: leak possible reference over tail call",
* "reference tracking: leak checked reference over tail call",
* "reference tracking: mangle and release sock_or_null",
* "reference tracking: mangle and release sock",
* "reference tracking: access member",
* "reference tracking: write to member",
* "reference tracking: invalid 64-bit access of member",
* "reference tracking: access after release",
* "reference tracking: direct access for lookup",
v4: New tests:
* unpriv: spill/fill of different pointers stx - ctx and sock
* unpriv: spill/fill of different pointers stx - leak sock
* unpriv: spill/fill of different pointers stx - sock and ctx (read)
* unpriv: spill/fill of different pointers stx - sock and ctx (write)
---
 tools/testing/selftests/bpf/test_verifier.c | 759 
 1 file changed, 759 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 163fd1c0062c..bc9cd8537467 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3,6 +3,7 @@
  *
  * Copyright (c) 2014 PLUMgrid, http://plumgrid.com
  * Copyright (c) 2017 Facebook
+ * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -178,6 +179,24 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self)
self->retval = (uint32_t)res;
 }
 
+/* BPF_SK_LOOKUP contains 13 instructions, if you need to fix up maps */
+#define BPF_SK_LOOKUP  \
+   /* struct bpf_sock_tuple tuple = {} */  \
+   BPF_MOV64_IMM(BPF_REG_2, 0),\
+   BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),  \
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -16),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -24),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -32),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -40),\
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -48),\
+   /* sk = sk_lookup_tcp(ctx, &tuple, sizeof tuple, 0, 0) */   \
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10

[PATCHv4 bpf-next 05/13] bpf: Add PTR_TO_SOCKET verifier type

2018-10-02 Thread Joe Stringer
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
v2: Reuse reg_type_mismatch() in more places.
Reduce the number of passes at convert_ctx_access().
v3: Fix build with !CONFIG_NET.
v4: Swap order of checks in sock_filter_is_valid_access().
Add Alexei's ack.
---
 include/linux/bpf.h  |  34 ++
 include/linux/bpf_verifier.h |   2 +
 kernel/bpf/verifier.c| 120 +++
 net/core/filter.c|  30 +
 4 files changed, 160 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 018299a595c8..027697b6a22f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -154,6 +154,7 @@ enum bpf_arg_type {
 
ARG_PTR_TO_CTX, /* pointer to context */
ARG_ANYTHING,   /* any (initialized) argument is ok */
+   ARG_PTR_TO_SOCKET,  /* pointer to bpf_sock */
 };
 
 /* type of values returned from helper functions */
@@ -162,6 +163,7 @@ enum bpf_return_type {
RET_VOID,   /* function doesn't return anything */
	RET_PTR_TO_MAP_VALUE,	/* returns a pointer to map elem value */
	RET_PTR_TO_MAP_VALUE_OR_NULL,	/* returns a pointer to map elem value or NULL */
+	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -213,6 +215,8 @@ enum bpf_reg_type {
PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
PTR_TO_FLOW_KEYS,/* reg points to bpf_flow_keys */
+   PTR_TO_SOCKET,   /* reg points to struct bpf_sock */
+   PTR_TO_SOCKET_OR_NULL,   /* reg points to struct bpf_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -343,6 +347,11 @@ const struct bpf_func_proto 
*bpf_get_trace_printk_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
unsigned long off, unsigned long len);
+typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
+   const struct bpf_insn *src,
+   struct bpf_insn *dst,
+   struct bpf_prog *prog,
+   u32 *target_size);
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
@@ -836,4 +845,29 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+#if defined(CONFIG_NET)
+bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+ struct bpf_insn_access_aux *info);
+u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
+   const struct bpf_insn *si,
+   struct bpf_insn *insn_buf,
+   struct bpf_prog *prog,
+   u32 *target_size);
+#else
+static inline bool bpf_sock_is_valid_access(int off, int size,
+   enum bpf_access_type type,
+   struct bpf_insn_access_aux *info)
+{
+   return false;
+}
+static inline u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog,
+ u32 *target_size)
+{
+   return 0;
+}
+#endif
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index d0e7f97e8b60..a411363098a5 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -58,6 +58,8 @@ struct bpf_reg_state {
 * offset, so they can share range knowledge.
 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
 * came from, when one is tested for != NULL.
+* For PTR_TO_SOCKET this is used to share which pointers retain the
+* same reference to the socket, to determine proper reference freeing.
 */
u32 id;
/* For scalar types (SCALAR_VALUE), this represents our knowledge of
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 98b218bd46e8..f86386c9affd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[]
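
As a companion to the verifier changes, a sketch of what a program gains from
PTR_TO_SOCKET: after the mandatory NULL check the verifier retypes the pointer
and permits member reads of 'struct bpf_sock', and every acquired reference
must be released. The section name and policy here are illustrative, with the
helper wrappers assumed from the selftests:

#include <linux/bpf.h>
#include <linux/in.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"

SEC("classifier")
int sock_fields_example(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;
	int verdict = TC_ACT_SHOT;

	/* Returns PTR_TO_SOCKET_OR_NULL: must be checked before use. */
	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
	if (!sk)
		return TC_ACT_OK;
	/* Here sk is PTR_TO_SOCKET, so member reads are allowed. */
	if (sk->family == 2 /* AF_INET */ && sk->protocol == IPPROTO_TCP)
		verdict = TC_ACT_OK;
	/* The reference must be released on every path that holds it. */
	bpf_sk_release(sk);
	return verdict;
}

char _license[] SEC("license") = "GPL";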

[PATCHv4 bpf-next 02/13] bpf: Simplify ptr_min_max_vals adjustment

2018-10-02 Thread Joe Stringer
An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c   | 22 ++---
 tools/testing/selftests/bpf/test_verifier.c | 14 ++---
 2 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9c82d8f58085..abf567200574 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2669,20 +2669,18 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
return -EACCES;
}
 
-   if (ptr_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-		verbose(env, "R%d pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL prohibited, null-check it first\n",
-   dst);
-   return -EACCES;
-   }
-   if (ptr_reg->type == CONST_PTR_TO_MAP) {
-		verbose(env, "R%d pointer arithmetic on CONST_PTR_TO_MAP prohibited\n",
-   dst);
+   switch (ptr_reg->type) {
+   case PTR_TO_MAP_VALUE_OR_NULL:
+		verbose(env, "R%d pointer arithmetic on %s prohibited, null-check it first\n",
+   dst, reg_type_str[ptr_reg->type]);
return -EACCES;
-   }
-   if (ptr_reg->type == PTR_TO_PACKET_END) {
-		verbose(env, "R%d pointer arithmetic on PTR_TO_PACKET_END prohibited\n",
-   dst);
+   case CONST_PTR_TO_MAP:
+   case PTR_TO_PACKET_END:
+   verbose(env, "R%d pointer arithmetic on %s prohibited\n",
+   dst, reg_type_str[ptr_reg->type]);
return -EACCES;
+   default:
+   break;
}
 
/* In case of 'scalar += pointer', dst_reg inherits pointer type and id.
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index c7d25f23baf9..a90be44f61e0 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3638,7 +3638,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
@@ -4896,7 +4896,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4917,7 +4917,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4938,7 +4938,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -7253,7 +7253,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map_in_map = { 3 },
-		.errstr = "R1 pointer arithmetic on CONST_PTR_TO_MAP prohibited",
+   .errstr = "R1 pointer arithmetic on map_ptr prohibited",
.result = REJECT,
},
{
@@ -8927,7 +8927,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
@@ -8946,7 +8946,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
-- 
2.17.1



[PATCHv4 bpf-next 13/13] Documentation: Describe bpf reference tracking

2018-10-02 Thread Joe Stringer
Document the new pointer types in the verifier and how the pointer ID
tracking works to ensure that references which are taken are later
released.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 Documentation/networking/filter.txt | 64 +
 1 file changed, 64 insertions(+)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index e6b4ebb2b243..4443ce958862 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1125,6 +1125,14 @@ pointer type.  The types of pointers describe their base, as follows:
 PTR_TO_STACKFrame pointer.
 PTR_TO_PACKET   skb->data.
 PTR_TO_PACKET_END   skb->data + headlen; arithmetic forbidden.
+PTR_TO_SOCKET   Pointer to struct bpf_sock, implicitly refcounted.
+PTR_TO_SOCKET_OR_NULL
+Either a pointer to a socket, or NULL; socket lookup
+returns this type, which becomes a PTR_TO_SOCKET when
+checked != NULL. PTR_TO_SOCKET is reference-counted,
+so programs must release the reference through the
+socket release function before the end of the program.
+Arithmetic on these pointers is forbidden.
 However, a pointer may be offset from this base (as a result of pointer
 arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
 offset'.  The former is used when an exactly-known value (e.g. an immediate
@@ -1171,6 +1179,13 @@ over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
 pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
 bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
 that pointer are safe.
+The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
+to all copies of the pointer returned from a socket lookup. This has similar
+behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
+it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
+represents a reference to the corresponding 'struct sock'. To ensure that the
+reference is not leaked, it is imperative to NULL-check the reference and,
+in the non-NULL case, pass the valid reference to the socket release function.
 
 Direct packet access
 
@@ -1444,6 +1459,55 @@ Error:
   8: (7a) *(u64 *)(r0 +0) = 1
   R0 invalid mem access 'imm'
 
+Program that performs a socket lookup then sets the pointer to NULL without
+checking it:
+value:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+  BPF_MOV64_IMM(BPF_REG_0, 0),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup_tcp#65
+  8: (b7) r0 = 0
+  9: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
+Program that performs a socket lookup but does not NULL-check the returned
+value:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup_tcp#65
+  8: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
 Testing
 ---
 
-- 
2.17.1
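
For contrast with the two failing examples in the patch above, here is a
sketch of a balanced program the verifier accepts, in the same insn-macro
notation. This is illustrative only and not part of the patch; it assumes
the sk_lookup/sk_release helpers this series introduces:

  BPF_MOV64_IMM(BPF_REG_2, 0),
  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_MOV64_IMM(BPF_REG_3, 4),
  BPF_MOV64_IMM(BPF_REG_4, 0),
  BPF_MOV64_IMM(BPF_REG_5, 0),
  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), /* if (sk == NULL) skip release */
  BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),   /* r1 = sk */
  BPF_EMIT_CALL(BPF_FUNC_sk_release),    /* reference released exactly once */
  BPF_MOV64_IMM(BPF_REG_0, 0),
  BPF_EXIT_INSN(),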



[PATCHv4 bpf-next 12/13] selftests/bpf: Add C tests for reference tracking

2018-10-02 Thread Joe Stringer
Add some tests that demonstrate and test the balanced lookup/free
nature of socket lookup. Section names that start with "fail" represent
programs that are expected to fail verification; all others should
succeed.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
v3: Rebase against flags arg change of bpf_sk_release()
New tests:
* "fail_use_after_free"
* "fail_modify_sk_pointer"
* "fail_modify_sk_or_null_pointer"
v4: No change.
---
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  |  38 
 .../selftests/bpf/test_sk_lookup_kern.c   | 180 ++
 3 files changed, 219 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index f802de526f57..1381ab81099c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -36,7 +36,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o
+   test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_sk_lookup_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index 63a671803ed6..e8becca9c521 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1698,6 +1698,43 @@ static void test_task_fd_query_tp(void)
   "sys_enter_read");
 }
 
+static void test_reference_tracking(void)
+{
+   const char *file = "./test_sk_lookup_kern.o";
+   struct bpf_object *obj;
+   struct bpf_program *prog;
+   __u32 duration;
+   int err = 0;
+
+   obj = bpf_object__open(file);
+   if (IS_ERR(obj)) {
+   error_cnt++;
+   return;
+   }
+
+   bpf_object__for_each_program(prog, obj) {
+   const char *title;
+
+   /* Ignore .text sections */
+   title = bpf_program__title(prog, false);
+   if (strstr(title, ".text") != NULL)
+   continue;
+
+   bpf_program__set_type(prog, BPF_PROG_TYPE_SCHED_CLS);
+
+   /* Expect verifier failure if test name has 'fail' */
+   if (strstr(title, "fail") != NULL) {
+   libbpf_set_print(NULL, NULL, NULL);
+   err = !bpf_program__load(prog, "GPL", 0);
+   libbpf_set_print(printf, printf, NULL);
+   } else {
+   err = bpf_program__load(prog, "GPL", 0);
+   }
+   CHECK(err, title, "\n");
+   }
+   bpf_object__close(obj);
+}
+
 int main(void)
 {
jit_enabled = is_jit_enabled();
@@ -1719,6 +1756,7 @@ int main(void)
test_get_stack_raw_tp();
test_task_fd_query_rawtp();
test_task_fd_query_tp();
+   test_reference_tracking();
 
printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_sk_lookup_kern.c 
b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
new file mode 100644
index ..b745bdc08c2b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
@@ -0,0 +1,180 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
+
+/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
+static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
+   void *data_end, __u16 eth_proto,
+   bool *ipv4)
+{
+   struct bpf_sock_tuple *result;
+   __u8 proto = 0;
+   __u64 ihl_len;
+
+   if (eth_proto == bpf_htons(ETH_P_IP)) {
+   struct iphdr *iph = (struct iphdr *)(data + nh_off);
+
+   if (iph + 1 > data_end)
+   return NULL;
+   ihl_len = iph->ihl * 4;
+   proto = iph->protocol;
+   *ipv4 = true;
+   result = (
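
The file above is truncated by the archive. As a hedged illustration of
the section-naming convention that the test_progs loader keys off, a
hypothetical minimal pair (not taken from the truncated file; assumes the
includes and helper declarations used there, e.g. linux/pkt_cls.h for the
TC_ACT_* verdicts):

SEC("sk_lookup_success_minimal")
int bpf_sk_lookup_ok(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;
	int verdict = TC_ACT_SHOT;

	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
	if (sk) {
		verdict = TC_ACT_OK;
		bpf_sk_release(sk);	/* balanced: released exactly once */
	}
	return verdict;
}

SEC("fail_sk_leak")
int bpf_sk_leak(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};

	/* Leaks the reference on the non-NULL path, so the "fail" prefix
	 * tells the harness to expect the verifier to reject this.
	 */
	bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
	return TC_ACT_OK;
}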

[PATCHv4 bpf-next 07/13] bpf: Add reference tracking to verifier

2018-10-02 Thread Joe Stringer
Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.

To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state'; when the function or program exits, the
verifier checks that all of the acquired references have been freed.
When the pointer is passed to a function that frees the reference, it
is removed from the 'bpf_func_state' and all existing copies of the
pointer in registers are marked invalid.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
v2: Replace ptr_id defensive coding when releasing reference state with an
internal error (-EFAULT)
Add Ack by Alexei.
v3: No change.
v4: Add PTR_TO_SOCKET to is_ctx_reg().
---
 include/linux/bpf_verifier.h |  24 ++-
 kernel/bpf/verifier.c| 306 ---
 2 files changed, 308 insertions(+), 22 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index a411363098a5..7b6fd2ab3263 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -104,6 +104,17 @@ struct bpf_stack_state {
u8 slot_type[BPF_REG_SIZE];
 };
 
+struct bpf_reference_state {
+   /* Track each reference created with a unique id, even if the same
+* instruction creates the reference multiple times (eg, via CALL).
+*/
+   int id;
+   /* Instruction where the allocation of this reference occurred. This
+* is used purely to inform the user of a reference leak.
+*/
+   int insn_idx;
+};
+
 /* state of the program:
  * type of all registers and stack info
  */
@@ -121,7 +132,9 @@ struct bpf_func_state {
 */
u32 subprogno;
 
-   /* should be second to last. See copy_func_state() */
+   /* The following fields should be last. See copy_func_state() */
+   int acquired_refs;
+   struct bpf_reference_state *refs;
int allocated_stack;
struct bpf_stack_state *stack;
 };
@@ -217,11 +230,16 @@ __printf(2, 0) void bpf_verifier_vlog(struct 
bpf_verifier_log *log,
 __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
   const char *fmt, ...);
 
-static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env)
 {
struct bpf_verifier_state *cur = env->cur_state;
 
-   return cur->frame[cur->curframe]->regs;
+   return cur->frame[cur->curframe];
+}
+
+static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
+{
+   return cur_func(env)->regs;
 }
 
 int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 11e982381061..cd0d8bc00bd1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1,5 +1,6 @@
 /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
  * Copyright (c) 2016 Facebook
+ * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -140,6 +141,18 @@ static const struct bpf_verifier_ops * const 
bpf_verifier_ops[] = {
  *
  * After the call R0 is set to return type of the function and registers R1-R5
  * are set to NOT_INIT to indicate that they are no longer readable.
+ *
+ * The following reference types represent a potential reference to a kernel
+ * resource which, after first being allocated, must be checked and freed by
+ * the BPF program:
+ * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET
+ *
+ * When the verifier sees a helper call return a reference type, it allocates a
+ * pointer id for the reference and stores it in the current function state.
+ * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into
+ * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type
+ * passes through a NULL-check conditional. For the branch wherein the state is
+ * changed to CONST_IMM, the verifier releases the reference.
  */
 
 /* verifier_state + insn_idx are pushed to stack when branch is encountered */
@@ -189,6 +202,7 @@ struct bpf_call_arg_meta {
int access_size;
s64 msize_smax_value;
u64 msize_umax_value;
+   int ptr_id;
 };
 
 static DEFINE_MUTEX(bpf_verifier_lock);
@@ -251,7 +265,42 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
-   return type == PTR_TO_MAP_VALUE_OR_NULL;
+   return type == PTR_TO_MAP_VALUE_OR_NULL ||
+  type == PTR_TO_SOCKET_OR_NULL;
+}
+
+static bool type_is_refcounted(enum bpf_reg_type type)
+{
+   return type == PTR_TO_SOCKET;
+}
+
+st
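
The patch is truncated above. As a simplified, compilable userspace
model (not the kernel code) of the bookkeeping the commit message
describes: acquiring allocates an id, releasing must find and remove
that id, and anything left at "program exit" is reported as a leak.
Per-frame state copying and most error handling are omitted.

#include <stdio.h>
#include <stdlib.h>

struct bpf_reference_state { int id; int insn_idx; };

static struct bpf_reference_state *refs;
static int acquired_refs;
static int next_id = 1;

/* Modeled on acquire_reference_state(): allocate an id for a new ref. */
static int acquire_ref(int insn_idx)
{
	struct bpf_reference_state *tmp;

	tmp = realloc(refs, (acquired_refs + 1) * sizeof(*refs));
	if (!tmp)
		exit(1);
	refs = tmp;
	refs[acquired_refs].id = next_id;
	refs[acquired_refs].insn_idx = insn_idx;
	acquired_refs++;
	return next_id++;
}

/* Modeled on release_reference_state(): an unknown id is rejected. */
static int release_ref(int ptr_id)
{
	int i;

	for (i = 0; i < acquired_refs; i++) {
		if (refs[i].id == ptr_id) {
			refs[i] = refs[--acquired_refs];
			return 0;
		}
	}
	return -1;	/* the real verifier treats this as an internal bug */
}

int main(void)
{
	int id = acquire_ref(7);	/* e.g. a sk_lookup call at insn 7 */

	if (release_ref(id))
		return 1;
	if (acquired_refs)		/* checked when the program exits */
		printf("Unreleased reference id=%d, alloc_insn=%d\n",
		       refs[0].id, refs[0].insn_idx);
	return 0;
}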

[PATCHv4 bpf-next 01/13] bpf: Add iterator for spilled registers

2018-10-02 Thread Joe Stringer
Add an iterator for spilled registers. It concentrates the details of
how to get the current frame's spilled registers into a single macro,
while clarifying the intention of the code which is calling the macro.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
v2-v3: No change.
v4: Prefix globally defined macros with "bpf_".
---
 include/linux/bpf_verifier.h | 11 +++
 kernel/bpf/verifier.c| 16 +++-
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index b42b60a83e19..d0e7f97e8b60 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -131,6 +131,17 @@ struct bpf_verifier_state {
u32 curframe;
 };
 
+#define bpf_get_spilled_reg(slot, frame)   \
+   (((slot < frame->allocated_stack / BPF_REG_SIZE) && \
+ (frame->stack[slot].slot_type[0] == STACK_SPILL)) \
+? &frame->stack[slot].spilled_ptr : NULL)
+
+/* Iterate over 'frame', setting 'reg' to either NULL or a spilled register. */
+#define bpf_for_each_spilled_reg(iter, frame, reg) \
+   for (iter = 0, reg = bpf_get_spilled_reg(iter, frame);  \
+iter < frame->allocated_stack / BPF_REG_SIZE;  \
+iter++, reg = bpf_get_spilled_reg(iter, frame))
+
 /* linked list of verifier states used to prune search */
 struct bpf_verifier_state_list {
struct bpf_verifier_state state;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8cc83a970d1..9c82d8f58085 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2252,10 +2252,9 @@ static void __clear_all_pkt_pointers(struct 
bpf_verifier_env *env,
if (reg_is_pkt_pointer_any(®s[i]))
mark_reg_unknown(env, regs, i);
 
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   bpf_for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg_is_pkt_pointer_any(reg))
__mark_reg_unknown(reg);
}
@@ -3395,10 +3394,9 @@ static void find_good_pkt_pointers(struct 
bpf_verifier_state *vstate,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   bpf_for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg->type == type && reg->id == dst_reg->id)
reg->range = max(reg->range, new_range);
}
@@ -3643,7 +3641,7 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
  bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
-   struct bpf_reg_state *regs = state->regs;
+   struct bpf_reg_state *reg, *regs = state->regs;
u32 id = regs[regno].id;
int i, j;
 
@@ -3652,8 +3650,8 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   bpf_for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
mark_map_reg(&state->stack[i].spilled_ptr, 0, id, is_null);
}
-- 
2.17.1
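
A compilable toy model of the iterator (not kernel code) may help show
the contract described above: slots that do not hold a spilled register
yield NULL, and callers simply 'continue' past them, exactly as in the
hunks in this patch.

#include <stdio.h>

#define BPF_REG_SIZE 8
#define STACK_SPILL  1

struct reg { int type; };
struct slot { int slot_type; struct reg spilled_ptr; };
struct frame { int allocated_stack; struct slot stack[4]; };

#define bpf_get_spilled_reg(slot_i, f) \
	(((slot_i) < (f)->allocated_stack / BPF_REG_SIZE && \
	  (f)->stack[slot_i].slot_type == STACK_SPILL) ? \
	 &(f)->stack[slot_i].spilled_ptr : NULL)

#define bpf_for_each_spilled_reg(iter, f, r) \
	for (iter = 0, r = bpf_get_spilled_reg(iter, f); \
	     iter < (f)->allocated_stack / BPF_REG_SIZE; \
	     iter++, r = bpf_get_spilled_reg(iter, f))

int main(void)
{
	struct frame f = { .allocated_stack = 4 * BPF_REG_SIZE,
			   .stack = { [1].slot_type = STACK_SPILL } };
	struct reg *r;
	int i;

	bpf_for_each_spilled_reg(i, &f, r) {
		if (!r)
			continue;	/* slots 0, 2 and 3 are not spills */
		printf("slot %d holds a spilled register\n", i);
	}
	return 0;
}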



[PATCHv4 bpf-next 03/13] bpf: Reuse canonical string formatter for ctx errs

2018-10-02 Thread Joe Stringer
The array "reg_type_str" provides canonical formatting of register
types, however a couple of places would previously check whether a
register represented the context and write the name "context" directly.
An upcoming commit will add another pointer type to these statements, so
to provide more accurate error messages in the verifier, update these
error messages to use "reg_type_str" instead.

Signed-off-by: Joe Stringer 
---
v4: New patch.
---
 kernel/bpf/verifier.c   |  7 +++
 tools/testing/selftests/bpf/test_verifier.c | 10 +-
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index abf567200574..8b4e70eeced2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1763,8 +1763,7 @@ static int check_xadd(struct bpf_verifier_env *env, int 
insn_idx, struct bpf_ins
if (is_ctx_reg(env, insn->dst_reg) ||
is_pkt_reg(env, insn->dst_reg)) {
verbose(env, "BPF_XADD stores into R%d %s is not allowed\n",
-   insn->dst_reg, is_ctx_reg(env, insn->dst_reg) ?
-   "context" : "packet");
+   insn->dst_reg, reg_type_str[cur_regs(env)[insn->dst_reg].type]);
return -EACCES;
}
 
@@ -4871,8 +4870,8 @@ static int do_check(struct bpf_verifier_env *env)
return err;
 
if (is_ctx_reg(env, insn->dst_reg)) {
-   verbose(env, "BPF_ST stores into R%d context is 
not allowed\n",
-   insn->dst_reg);
+   verbose(env, "BPF_ST stores into R%d %s is not 
allowed\n",
+   insn->dst_reg, 
reg_type_str[insn->dst_reg]);
return -EACCES;
}
 
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index a90be44f61e0..6e0b3f148cdb 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3276,7 +3276,7 @@ static struct bpf_test tests[] = {
BPF_ST_MEM(BPF_DW, BPF_REG_1, offsetof(struct __sk_buff, mark), 0),
BPF_EXIT_INSN(),
},
-   .errstr = "BPF_ST stores into R1 context is not allowed",
+   .errstr = "BPF_ST stores into R1 inv is not allowed",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
@@ -3288,7 +3288,7 @@ static struct bpf_test tests[] = {
 BPF_REG_0, offsetof(struct __sk_buff, mark), 0),
BPF_EXIT_INSN(),
},
-   .errstr = "BPF_XADD stores into R1 context is not allowed",
+   .errstr = "BPF_XADD stores into R1 inv is not allowed",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
@@ -5266,7 +5266,7 @@ static struct bpf_test tests[] = {
.errstr_unpriv = "R2 leaks addr into mem",
.result_unpriv = REJECT,
.result = REJECT,
-   .errstr = "BPF_XADD stores into R1 context is not allowed",
+   .errstr = "BPF_XADD stores into R1 inv is not allowed",
},
{
"leak pointer into ctx 2",
@@ -5281,7 +5281,7 @@ static struct bpf_test tests[] = {
.errstr_unpriv = "R10 leaks addr into mem",
.result_unpriv = REJECT,
.result = REJECT,
-   .errstr = "BPF_XADD stores into R1 context is not allowed",
+   .errstr = "BPF_XADD stores into R1 inv is not allowed",
},
{
"leak pointer into ctx 3",
@@ -12230,7 +12230,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.result = REJECT,
-   .errstr = "BPF_XADD stores into R2 packet",
+   .errstr = "BPF_XADD stores into R2 ctx",
.prog_type = BPF_PROG_TYPE_XDP,
},
{
-- 
2.17.1
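
The pattern being propagated here, a single canonical table indexed by
the register's type, can be sketched standalone (illustrative, not the
kernel code); note that the table index must be a register *type*, never
a register number:

#include <stdio.h>

enum bpf_reg_type { NOT_INIT, SCALAR_VALUE, PTR_TO_CTX, PTR_TO_PACKET };

static const char * const reg_type_str[] = {
	[NOT_INIT]      = "?",
	[SCALAR_VALUE]  = "inv",
	[PTR_TO_CTX]    = "ctx",
	[PTR_TO_PACKET] = "pkt",
};

int main(void)
{
	enum bpf_reg_type dst_type = PTR_TO_CTX;	/* R1's type */
	int dst_reg = 1;

	printf("BPF_ST stores into R%d %s is not allowed\n",
	       dst_reg, reg_type_str[dst_type]);
	return 0;
}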



[PATCHv4 bpf-next 06/13] bpf: Macrofy stack state copy

2018-10-02 Thread Joe Stringer
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros that the
later commit can reuse.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 106 --
 1 file changed, 60 insertions(+), 46 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f86386c9affd..11e982381061 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -388,60 +388,74 @@ static void print_verifier_state(struct bpf_verifier_env 
*env,
verbose(env, "\n");
 }
 
-static int copy_stack_state(struct bpf_func_state *dst,
-   const struct bpf_func_state *src)
-{
-   if (!src->stack)
-   return 0;
-   if (WARN_ON_ONCE(dst->allocated_stack < src->allocated_stack)) {
-   /* internal bug, make state invalid to reject the program */
-   memset(dst, 0, sizeof(*dst));
-   return -EFAULT;
-   }
-   memcpy(dst->stack, src->stack,
-  sizeof(*src->stack) * (src->allocated_stack / BPF_REG_SIZE));
-   return 0;
-}
+#define COPY_STATE_FN(NAME, COUNT, FIELD, SIZE)   \
+static int copy_##NAME##_state(struct bpf_func_state *dst, \
+  const struct bpf_func_state *src)\
+{  \
+   if (!src->FIELD)\
+   return 0;   \
+   if (WARN_ON_ONCE(dst->COUNT < src->COUNT)) {\
+   /* internal bug, make state invalid to reject the program */ \
+   memset(dst, 0, sizeof(*dst));   \
+   return -EFAULT; \
+   }   \
+   memcpy(dst->FIELD, src->FIELD,  \
+  sizeof(*src->FIELD) * (src->COUNT / SIZE));  \
+   return 0;   \
+}
+/* copy_stack_state() */
+COPY_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef COPY_STATE_FN
+
+#define REALLOC_STATE_FN(NAME, COUNT, FIELD, SIZE) \
+static int realloc_##NAME##_state(struct bpf_func_state *state, int size, \
+ bool copy_old)\
+{  \
+   u32 old_size = state->COUNT;\
+   struct bpf_##NAME##_state *new_##FIELD; \
+   int slot = size / SIZE; \
+   \
+   if (size <= old_size || !size) {\
+   if (copy_old)   \
+   return 0;   \
+   state->COUNT = slot * SIZE; \
+   if (!size && old_size) {\
+   kfree(state->FIELD);\
+   state->FIELD = NULL;\
+   }   \
+   return 0;   \
+   }   \
+   new_##FIELD = kmalloc_array(slot, sizeof(struct bpf_##NAME##_state), \
+   GFP_KERNEL);\
+   if (!new_##FIELD)   \
+   return -ENOMEM; \
+   if (copy_old) { \
+   if (state->FIELD)   \
+   memcpy(new_##FIELD, state->FIELD,   \
+  sizeof(*new_##FIELD) * (old_size / SIZE)); \
+   memset(new_##FIELD + old_size / SIZE, 0,\
+  sizeof(*new_##FIELD) * (size - old_size) / SIZE); \
+   }   \
+   state->COUNT = slot * SIZE; \
+   kfree(state->FIELD);\
+   state->FIELD = new_##FIELD; \
+   return 0;   \
+}
+/* realloc_stack_state() */
+REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef REALLOC_STATE_FN
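
To see why the macro form pays off, the reference-tracking patch later
in this series instantiates the same templates a second time for the
acquired-references array, roughly as follows (a sketch; the exact
placement of the #undefs differs in the final code):

/* copy_reference_state() */
COPY_STATE_FN(reference, acquired_refs, refs, 1)
/* realloc_reference_state() */
REALLOC_STATE_FN(reference, acquired_refs, refs, 1)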
 

[PATCHv4 bpf-next 04/13] bpf: Generalize ptr_or_null regs check

2018-10-02 Thread Joe Stringer
This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8b4e70eeced2..98b218bd46e8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -249,6 +249,11 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
   type == PTR_TO_PACKET_META;
 }
 
+static bool reg_type_may_be_null(enum bpf_reg_type type)
+{
+   return type == PTR_TO_MAP_VALUE_OR_NULL;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
[NOT_INIT]  = "?",
@@ -3599,12 +3604,10 @@ static void reg_combine_min_max(struct bpf_reg_state 
*true_src,
}
 }
 
-static void mark_map_reg(struct bpf_reg_state *regs, u32 regno, u32 id,
-bool is_null)
+static void mark_ptr_or_null_reg(struct bpf_reg_state *reg, u32 id,
+bool is_null)
 {
-   struct bpf_reg_state *reg = ®s[regno];
-
-   if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
+   if (reg_type_may_be_null(reg->type) && reg->id == id) {
/* Old offset (both fixed and variable parts) should
 * have been known-zero, because we don't allow pointer
 * arithmetic on pointers that might be NULL.
@@ -3617,11 +3620,13 @@ static void mark_map_reg(struct bpf_reg_state *regs, 
u32 regno, u32 id,
}
if (is_null) {
reg->type = SCALAR_VALUE;
-   } else if (reg->map_ptr->inner_map_meta) {
-   reg->type = CONST_PTR_TO_MAP;
-   reg->map_ptr = reg->map_ptr->inner_map_meta;
-   } else {
-   reg->type = PTR_TO_MAP_VALUE;
+   } else if (reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
+   if (reg->map_ptr->inner_map_meta) {
+   reg->type = CONST_PTR_TO_MAP;
+   reg->map_ptr = reg->map_ptr->inner_map_meta;
+   } else {
+   reg->type = PTR_TO_MAP_VALUE;
+   }
}
/* We don't need id from this point onwards anymore, thus we
 * should better reset it, so that state pruning has chances
@@ -3634,8 +3639,8 @@ static void mark_map_reg(struct bpf_reg_state *regs, u32 
regno, u32 id,
 /* The logic is similar to find_good_pkt_pointers(), both could eventually
  * be folded together at some point.
  */
-static void mark_map_regs(struct bpf_verifier_state *vstate, u32 regno,
- bool is_null)
+static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
+ bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
struct bpf_reg_state *reg, *regs = state->regs;
@@ -3643,14 +3648,14 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
int i, j;
 
for (i = 0; i < MAX_BPF_REG; i++)
-   mark_map_reg(regs, i, id, is_null);
+   mark_ptr_or_null_reg(®s[i], id, is_null);
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
bpf_for_each_spilled_reg(i, state, reg) {
if (!reg)
continue;
-   mark_map_reg(&state->stack[i].spilled_ptr, 0, id, is_null);
+   mark_ptr_or_null_reg(reg, id, is_null);
}
}
 }
@@ -3852,12 +3857,14 @@ static int check_cond_jmp_op(struct bpf_verifier_env 
*env,
/* detect if R == 0 where R is returned from bpf_map_lookup_elem() */
if (BPF_SRC(insn->code) == BPF_K &&
insn->imm == 0 && (opcode == BPF_JEQ || opcode == BPF_JNE) &&
-   dst_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-   /* Mark all identical map registers in each branch as either
+   reg_type_may_be_null(dst_reg->type)) {
+   /* Mark all identical registers in each branch as either
 * safe or unknown depending R == 0 or R != 0 conditional.
 */
-   mark_map_regs(this_branch, insn->dst_reg, opcode == BPF_JNE);
-   mark_map_regs(other_branch, insn->dst_reg, opcode == BPF_JEQ);
+   mark_ptr_or_null_regs(this_branch, insn->dst_reg,
+ opcode == BPF_JNE);
+   mark
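
The patch is truncated above. A compilable toy model (not kernel code)
of the transition that mark_ptr_or_null_reg() generalizes, including the
socket case that a later patch in this series plugs in; the map-in-map
(inner_map_meta) special case from the hunk above is omitted:

#include <stdio.h>

enum reg_type { SCALAR_VALUE, PTR_TO_MAP_VALUE, PTR_TO_MAP_VALUE_OR_NULL,
		PTR_TO_SOCKET, PTR_TO_SOCKET_OR_NULL };

/* On the NULL branch every _OR_NULL type collapses to a scalar; on the
 * non-NULL branch it becomes its checked form.
 */
static enum reg_type mark_ptr_or_null(enum reg_type t, int is_null)
{
	if (is_null)
		return SCALAR_VALUE;
	if (t == PTR_TO_MAP_VALUE_OR_NULL)
		return PTR_TO_MAP_VALUE;
	if (t == PTR_TO_SOCKET_OR_NULL)
		return PTR_TO_SOCKET;
	return t;
}

int main(void)
{
	printf("%d\n", mark_ptr_or_null(PTR_TO_SOCKET_OR_NULL, 0) == PTR_TO_SOCKET);
	printf("%d\n", mark_ptr_or_null(PTR_TO_SOCKET_OR_NULL, 1) == SCALAR_VALUE);
	return 0;
}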

[PATCHv4 bpf-next 00/13] Add socket lookup support

2018-10-02 Thread Joe Stringer
This series proposes a new helper for the BPF API which allows BPF programs to
perform lookups for sockets in a network namespace. This would allow programs
to determine early on in processing whether the stack is expecting to receive
the packet, and perform some action (eg drop, forward somewhere) based on this
information.

The series is structured roughly into:
* Misc refactor
* Add the socket pointer type
* Add reference tracking to ensure that socket references are freed
* Extend the BPF API to add sk_lookup_xxx() / sk_release() functions
* Add tests/documentation

The helper proposed in this series includes a parameter for a tuple which must
be filled in by the caller to determine the socket to look up. The simplest
case would be filling with the contents of the packet, ie mapping the packet's
5-tuple into the parameter. In common cases, it may alternatively be useful to
reverse the direction of the tuple and perform a lookup, to find the socket
that initiates this connection; and if the BPF program ever performs a form of
IP address translation, it may further be useful to be able to look up
arbitrary tuples that are not based upon the packet, but instead based on state
held in BPF maps or hardcoded in the BPF program.

Currently, access to the socket's fields is limited to those which are
otherwise already accessible, and is restricted to read-only access.

Changes since v3:
* New patch: "bpf: Reuse canonical string formatter for ctx errs"
* Add PTR_TO_SOCKET to is_ctx_reg().
* Add a few new checks to prevent mixing of socket/non-socket pointers.
* Swap order of checks in sock_filter_is_valid_access().
* Prefix register spill macros with "bpf_".
* Add acks from previous round
* Rebase

Changes since v2:
* New patch: "selftests/bpf: Generalize dummy program types".
  This enables adding verifier tests for socket lookup with tail calls.
* Define the semantics of the new helpers more clearly in uAPI header.
* Fix release of caller_net when netns is not specified.
* Use skb->sk to find caller net when skb->dev is unavailable.
* Fix build with !CONFIG_NET.
* Replace ptr_id defensive coding when releasing reference state with an
  internal error (-EFAULT).
* Remove flags argument to sk_release().
* Add several new assembly tests suggested by Daniel.
* Add a few new C tests.
* Fix typo in verifier error message.

Changes since v1:
* Limit netns_id field to 32 bits
* Reuse reg_type_mismatch() in more places
* Reduce the number of passes at convert_ctx_access()
* Replace ptr_id defensive coding when releasing reference state with an
  internal error (-EFAULT)
* Rework 'struct bpf_sock_tuple' to allow passing a packet pointer
* Allow direct packet access from helper
* Fix compile error with CONFIG_IPV6 enabled
* Improve commit messages

Changes since RFC:
* Split up sk_lookup() into sk_lookup_tcp(), sk_lookup_udp().
* Only take references on the socket when necessary.
  * Make sk_release() only free the socket reference in this case.
* Fix some runtime reference leaks:
  * Disallow BPF_LD_[ABS|IND] instructions while holding a reference.
  * Disallow bpf_tail_call() while holding a reference.
* Prevent the same instruction being used for reference and other
  pointer type.
* Simplify locating copies of a reference during helper calls by caching
  the pointer id from the caller.
* Fix kbuild compilation warnings with particular configs.
* Improve code comments describing the new verifier pieces.
* Tested by Nitin

This tree is also available at:
https://github.com/joestringer/linux/commits/submit/sk-lookup-v4

Joe Stringer (13):
  bpf: Add iterator for spilled registers
  bpf: Simplify ptr_min_max_vals adjustment
  bpf: Reuse canonical string formatter for ctx errs
  bpf: Generalize ptr_or_null regs check
  bpf: Add PTR_TO_SOCKET verifier type
  bpf: Macrofy stack state copy
  bpf: Add reference tracking to verifier
  bpf: Add helper to retrieve socket in BPF
  selftests/bpf: Generalize dummy program types
  selftests/bpf: Add tests for reference tracking
  libbpf: Support loading individual progs
  selftests/bpf: Add C tests for reference tracking
  Documentation: Describe bpf reference tracking

 Documentation/networking/filter.txt   |  64 ++
 include/linux/bpf.h   |  34 +
 include/linux/bpf_verifier.h  |  37 +-
 include/uapi/linux/bpf.h  |  93 +-
 kernel/bpf/verifier.c | 604 ++---
 net/core/filter.c | 181 +++-
 tools/include/uapi/linux/bpf.h|  93 +-
 tools/lib/bpf/libbpf.c|   4 +-
 tools/lib/bpf/libbpf.h|   3 +
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  12 +
 tools/testing/selftests/bpf/test_progs.c  |  38 +
 .../selftests/bpf/test_sk_lookup_kern.c   | 180 
 tools/testing/selftests/
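
As a concrete illustration of the proposed API, a hedged sketch in BPF C
(helper signatures as defined by this series; includes as in the
selftests; netns_id 0 and flags 0 are assumed defaults) that maps the
packet's addresses into the tuple, looks up the socket, and always
balances the reference:

SEC("classifier")
int drop_unexpected(struct __sk_buff *skb)
{
	void *data_end = (void *)(long)skb->data_end;
	void *data = (void *)(long)skb->data;
	struct iphdr *iph = data + sizeof(struct ethhdr);
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_TCP)
		return TC_ACT_OK;

	tuple.ipv4.saddr = iph->saddr;	/* swap saddr/daddr to look up the */
	tuple.ipv4.daddr = iph->daddr;	/* reverse direction instead */
	/* L4 port extraction is elided; see get_tuple() in the selftests */

	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
	if (!sk)
		return TC_ACT_SHOT;	/* stack is not expecting this packet */
	bpf_sk_release(sk);		/* references must always be released */
	return TC_ACT_OK;
}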

Re: [PATCHv3 bpf-next 04/12] bpf: Add PTR_TO_SOCKET verifier type

2018-10-02 Thread Joe Stringer
On Fri, 28 Sep 2018 at 06:38, Daniel Borkmann  wrote:
>
> On 09/28/2018 01:26 AM, Joe Stringer wrote:
> > Teach the verifier a little bit about a new type of pointer, a
> > PTR_TO_SOCKET. This pointer type is accessed from BPF through the
> > 'struct bpf_sock' structure.
> >
> > Signed-off-by: Joe Stringer 
> [...]
> > +/* Return true if it's OK to have the same insn return a different type. */
> > +static bool reg_type_mismatch_ok(enum bpf_reg_type type)
> > +{
> > + switch (type) {
> > + case PTR_TO_CTX:
> > + case PTR_TO_SOCKET:
> > + case PTR_TO_SOCKET_OR_NULL:
> > + return false;
> > + default:
> > + return true;
> > + }
> > +}
> > +
> > +/* If an instruction was previously used with particular pointer types, 
> > then we
> > + * need to be careful to avoid cases such as the below, where it may be ok
> > + * for one branch accessing the pointer, but not ok for the other branch:
> > + *
> > + * R1 = sock_ptr
> > + * goto X;
> > + * ...
> > + * R1 = some_other_valid_ptr;
> > + * goto X;
> > + * ...
> > + * R2 = *(u32 *)(R1 + 0);
> > + */
> > +static bool reg_type_mismatch(enum bpf_reg_type src, enum bpf_reg_type 
> > prev)
> > +{
> > + return src != prev && (!reg_type_mismatch_ok(src) ||
> > +!reg_type_mismatch_ok(prev));
> > +}
> > +
> >  static int do_check(struct bpf_verifier_env *env)
> >  {
> >   struct bpf_verifier_state *state;
> > @@ -4812,9 +4894,7 @@ static int do_check(struct bpf_verifier_env *env)
> >*/
> >   *prev_src_type = src_reg_type;
> >
> > - } else if (src_reg_type != *prev_src_type &&
> > -(src_reg_type == PTR_TO_CTX ||
> > - *prev_src_type == PTR_TO_CTX)) {
> > + } else if (reg_type_mismatch(src_reg_type, 
> > *prev_src_type)) {
> >   /* ABuser program is trying to use the same 
> > insn
> >* dst_reg = *(u32*) (src_reg + off)
> >* with different pointer types:
> > @@ -4859,9 +4939,7 @@ static int do_check(struct bpf_verifier_env *env)
> >
> >   if (*prev_dst_type == NOT_INIT) {
> >   *prev_dst_type = dst_reg_type;
> > - } else if (dst_reg_type != *prev_dst_type &&
> > -(dst_reg_type == PTR_TO_CTX ||
> > - *prev_dst_type == PTR_TO_CTX)) {
> > + } else if (reg_type_mismatch(dst_reg_type, 
> > *prev_dst_type)) {
> >   verbose(env, "same insn cannot be used with 
> > different pointers\n");
> >   return -EINVAL;
>
> Can also be as follow-up later on, but it would be crucial to also have
> test_verifier tests on this logic here with mixing these pointer types
> from different branches (right now we only cover ctx there).

Thanks for the feedback. I've applied all of your suggestions.

Regarding these newer tests, I have added a few and will post that
with my next revision. Fortunately with the reference tracking it's
actually quite difficult to mix up the pointer types between socket
and another type, because if the type of the register is ambiguous
then you either end up leaking a reference or attempting to release
using a pointer to a non-socket. I've added tests for both of those
cases, along with attempts to read and write at offsets inside
ambiguous pointers, which trigger most of these paths.
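
A hedged sketch (modeled on the "fail_use_after_free" test named in the
changelog, not the actual test source) of the use-after-release shape
that this type tracking rejects:

SEC("fail_use_after_free")
int bpf_sk_lookup_uaf(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;
	__u32 family = 0;

	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
	if (sk) {
		bpf_sk_release(sk);
		family = sk->family;	/* rejected: sk is invalid here */
	}
	return family;
}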


[PATCHv3 bpf-next 00/12] Add socket lookup support

2018-09-27 Thread Joe Stringer
This series proposes a new helper for the BPF API which allows BPF programs to
perform lookups for sockets in a network namespace. This would allow programs
to determine early on in processing whether the stack is expecting to receive
the packet, and perform some action (eg drop, forward somewhere) based on this
information.

The series is structured roughly into:
* Misc refactor
* Add the socket pointer type
* Add reference tracking to ensure that socket references are freed
* Extend the BPF API to add sk_lookup_xxx() / sk_release() functions
* Add tests/documentation

The helper proposed in this series includes a parameter for a tuple which must
be filled in by the caller to determine the socket to look up. The simplest
case would be filling with the contents of the packet, ie mapping the packet's
5-tuple into the parameter. In common cases, it may alternatively be useful to
reverse the direction of the tuple and perform a lookup, to find the socket
that initiates this connection; and if the BPF program ever performs a form of
IP address translation, it may further be useful to be able to look up
arbitrary tuples that are not based upon the packet, but instead based on state
held in BPF maps or hardcoded in the BPF program.

Currently, access to the socket's fields is limited to those which are
otherwise already accessible, and is restricted to read-only access.

Changes since v2:
* New patch: "selftests/bpf: Generalize dummy program types".
  This enables adding verifier tests for socket lookup with tail calls.
* Define the semantics of the new helpers more clearly in uAPI header.
* Fix release of caller_net when netns is not specified.
* Use skb->sk to find caller net when skb->dev is unavailable.
* Fix build with !CONFIG_NET.
* Replace ptr_id defensive coding when releasing reference state with an
  internal error (-EFAULT).
* Remove flags argument to sk_release().
* Add several new assembly tests suggested by Daniel.
* Add a few new C tests.
* Fix typo in verifier error message.

Changes since v1:
* Limit netns_id field to 32 bits
* Reuse reg_type_mismatch() in more places
* Reduce the number of passes at convert_ctx_access()
* Replace ptr_id defensive coding when releasing reference state with an
  internal error (-EFAULT)
* Rework 'struct bpf_sock_tuple' to allow passing a packet pointer
* Allow direct packet access from helper
* Fix compile error with CONFIG_IPV6 enabled
* Improve commit messages

Changes since RFC:
* Split up sk_lookup() into sk_lookup_tcp(), sk_lookup_udp().
* Only take references on the socket when necessary.
  * Make sk_release() only free the socket reference in this case.
* Fix some runtime reference leaks:
  * Disallow BPF_LD_[ABS|IND] instructions while holding a reference.
  * Disallow bpf_tail_call() while holding a reference.
* Prevent the same instruction being used for reference and other
  pointer type.
* Simplify locating copies of a reference during helper calls by caching
  the pointer id from the caller.
* Fix kbuild compilation warnings with particular configs.
* Improve code comments describing the new verifier pieces.
* Testing courtesy of Nitin

This tree is also available at:
https://github.com/joestringer/linux/commits/submit/sk-lookup-v3

Joe Stringer (12):
  bpf: Add iterator for spilled registers
  bpf: Simplify ptr_min_max_vals adjustment
  bpf: Generalize ptr_or_null regs check
  bpf: Add PTR_TO_SOCKET verifier type
  bpf: Macrofy stack state copy
  bpf: Add reference tracking to verifier
  bpf: Add helper to retrieve socket in BPF
  selftests/bpf: Generalize dummy program types
  selftests/bpf: Add tests for reference tracking
  libbpf: Support loading individual progs
  selftests/bpf: Add C tests for reference tracking
  Documentation: Describe bpf reference tracking

 Documentation/networking/filter.txt   |  64 ++
 include/linux/bpf.h   |  34 +
 include/linux/bpf_verifier.h  |  37 +-
 include/uapi/linux/bpf.h  |  93 ++-
 kernel/bpf/verifier.c | 594 +---
 net/core/filter.c | 181 -
 tools/include/uapi/linux/bpf.h|  93 ++-
 tools/lib/bpf/libbpf.c|   4 +-
 tools/lib/bpf/libbpf.h|   3 +
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  12 +
 tools/testing/selftests/bpf/test_progs.c  |  38 +
 .../selftests/bpf/test_sk_lookup_kern.c   | 180 +
 tools/testing/selftests/bpf/test_verifier.c   | 670 +-
 14 files changed, 1858 insertions(+), 147 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

-- 
2.17.1



[PATCHv3 bpf-next 04/12] bpf: Add PTR_TO_SOCKET verifier type

2018-09-27 Thread Joe Stringer
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.

Signed-off-by: Joe Stringer 
---
v2: Reuse reg_type_mismatch() in more places
Reduce the number of passes at convert_ctx_access()

v3: Fix build with !CONFIG_NET
---
 include/linux/bpf.h  |  34 ++
 include/linux/bpf_verifier.h |   2 +
 kernel/bpf/verifier.c| 120 +++
 net/core/filter.c|  30 +
 4 files changed, 160 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 018299a595c8..027697b6a22f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -154,6 +154,7 @@ enum bpf_arg_type {
 
ARG_PTR_TO_CTX, /* pointer to context */
ARG_ANYTHING,   /* any (initialized) argument is ok */
+   ARG_PTR_TO_SOCKET,  /* pointer to bpf_sock */
 };
 
 /* type of values returned from helper functions */
@@ -162,6 +163,7 @@ enum bpf_return_type {
RET_VOID,   /* function doesn't return anything */
RET_PTR_TO_MAP_VALUE,   /* returns a pointer to map elem value 
*/
RET_PTR_TO_MAP_VALUE_OR_NULL,   /* returns a pointer to map elem value 
or NULL */
+   RET_PTR_TO_SOCKET_OR_NULL,  /* returns a pointer to a socket or 
NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF 
programs
@@ -213,6 +215,8 @@ enum bpf_reg_type {
PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
PTR_TO_FLOW_KEYS,/* reg points to bpf_flow_keys */
+   PTR_TO_SOCKET,   /* reg points to struct bpf_sock */
+   PTR_TO_SOCKET_OR_NULL,   /* reg points to struct bpf_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -343,6 +347,11 @@ const struct bpf_func_proto 
*bpf_get_trace_printk_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
unsigned long off, unsigned long len);
+typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
+   const struct bpf_insn *src,
+   struct bpf_insn *dst,
+   struct bpf_prog *prog,
+   u32 *target_size);
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
@@ -836,4 +845,29 @@ extern const struct bpf_func_proto 
bpf_get_local_storage_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+#if defined(CONFIG_NET)
+bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+ struct bpf_insn_access_aux *info);
+u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
+   const struct bpf_insn *si,
+   struct bpf_insn *insn_buf,
+   struct bpf_prog *prog,
+   u32 *target_size);
+#else
+static inline bool bpf_sock_is_valid_access(int off, int size,
+   enum bpf_access_type type,
+   struct bpf_insn_access_aux *info)
+{
+   return false;
+}
+static inline u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog,
+ u32 *target_size)
+{
+   return 0;
+}
+#endif
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index af262b97f586..23a2b17bfd75 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -58,6 +58,8 @@ struct bpf_reg_state {
 * offset, so they can share range knowledge.
 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
 * came from, when one is tested for != NULL.
+* For PTR_TO_SOCKET this is used to share which pointers retain the
+* same reference to the socket, to determine proper reference freeing.
 */
u32 id;
/* For scalar types (SCALAR_VALUE), this represents our knowledge of
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bbb0a812ee81..d4abbf0d5727 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const 
bpf_verifier_ops[] = {
  * (like pointer plus pointer becomes SCALAR_VALUE type)
  *
  * When verifier sees load or store instruct
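
The patch is truncated above. For orientation, a hedged sketch (based on
a later patch in this series; details may differ) of how helper
prototypes put the new argument/return types to use:

static const struct bpf_func_proto bpf_sk_lookup_tcp_proto = {
	.func		= bpf_sk_lookup_tcp,
	.gpl_only	= false,
	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,	/* may be NULL */
	.arg1_type	= ARG_PTR_TO_CTX,
	.arg2_type	= ARG_PTR_TO_MEM,		/* the tuple */
	.arg3_type	= ARG_CONST_SIZE,
	.arg4_type	= ARG_ANYTHING,
	.arg5_type	= ARG_ANYTHING,
};

static const struct bpf_func_proto bpf_sk_release_proto = {
	.func		= bpf_sk_release,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_SOCKET,	/* must hold the reference */
};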

[PATCHv3 bpf-next 08/12] selftests/bpf: Generalize dummy program types

2018-09-27 Thread Joe Stringer
Don't hardcode the dummy program types to SOCKET_FILTER type, as this
prevents testing bpf_tail_call in conjunction with other program types.
Instead, use the program type specified in the test case.

Signed-off-by: Joe Stringer 
---
 tools/testing/selftests/bpf/test_verifier.c | 31 +++--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index a90be44f61e0..020b1467e565 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -12652,18 +12652,18 @@ static int create_map(uint32_t type, uint32_t 
size_key,
return fd;
 }
 
-static int create_prog_dummy1(void)
+static int create_prog_dummy1(enum bpf_prog_type prog_type)
 {
struct bpf_insn prog[] = {
BPF_MOV64_IMM(BPF_REG_0, 42),
BPF_EXIT_INSN(),
};
 
-   return bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog,
+   return bpf_load_program(prog_type, prog,
ARRAY_SIZE(prog), "GPL", 0, NULL, 0);
 }
 
-static int create_prog_dummy2(int mfd, int idx)
+static int create_prog_dummy2(enum bpf_prog_type prog_type, int mfd, int idx)
 {
struct bpf_insn prog[] = {
BPF_MOV64_IMM(BPF_REG_3, idx),
@@ -12674,11 +12674,12 @@ static int create_prog_dummy2(int mfd, int idx)
BPF_EXIT_INSN(),
};
 
-   return bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog,
+   return bpf_load_program(prog_type, prog,
ARRAY_SIZE(prog), "GPL", 0, NULL, 0);
 }
 
-static int create_prog_array(uint32_t max_elem, int p1key)
+static int create_prog_array(enum bpf_prog_type prog_type, uint32_t max_elem,
+int p1key)
 {
int p2key = 1;
int mfd, p1fd, p2fd;
@@ -12690,8 +12691,8 @@ static int create_prog_array(uint32_t max_elem, int 
p1key)
return -1;
}
 
-   p1fd = create_prog_dummy1();
-   p2fd = create_prog_dummy2(mfd, p2key);
+   p1fd = create_prog_dummy1(prog_type);
+   p2fd = create_prog_dummy2(prog_type, mfd, p2key);
if (p1fd < 0 || p2fd < 0)
goto out;
if (bpf_map_update_elem(mfd, &p1key, &p1fd, BPF_ANY) < 0)
@@ -12748,8 +12749,8 @@ static int create_cgroup_storage(bool percpu)
 
 static char bpf_vlog[UINT_MAX >> 8];
 
-static void do_test_fixup(struct bpf_test *test, struct bpf_insn *prog,
- int *map_fds)
+static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
+ struct bpf_insn *prog, int *map_fds)
 {
int *fixup_map1 = test->fixup_map1;
int *fixup_map2 = test->fixup_map2;
@@ -12805,7 +12806,7 @@ static void do_test_fixup(struct bpf_test *test, struct 
bpf_insn *prog,
}
 
if (*fixup_prog1) {
-   map_fds[4] = create_prog_array(4, 0);
+   map_fds[4] = create_prog_array(prog_type, 4, 0);
do {
prog[*fixup_prog1].imm = map_fds[4];
fixup_prog1++;
@@ -12813,7 +12814,7 @@ static void do_test_fixup(struct bpf_test *test, struct 
bpf_insn *prog,
}
 
if (*fixup_prog2) {
-   map_fds[5] = create_prog_array(8, 7);
+   map_fds[5] = create_prog_array(prog_type, 8, 7);
do {
prog[*fixup_prog2].imm = map_fds[5];
fixup_prog2++;
@@ -12859,11 +12860,13 @@ static void do_test_single(struct bpf_test *test, 
bool unpriv,
for (i = 0; i < MAX_NR_MAPS; i++)
map_fds[i] = -1;
 
-   do_test_fixup(test, prog, map_fds);
+   if (!prog_type)
+   prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+   do_test_fixup(test, prog_type, prog, map_fds);
prog_len = probe_filter_length(prog);
 
-   fd_prog = bpf_verify_program(prog_type ? : BPF_PROG_TYPE_SOCKET_FILTER,
-prog, prog_len, test->flags & F_LOAD_WITH_STRICT_ALIGNMENT,
+   fd_prog = bpf_verify_program(prog_type, prog, prog_len,
+test->flags & F_LOAD_WITH_STRICT_ALIGNMENT,
 "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 1);
 
expected_ret = unpriv && test->result_unpriv != UNDEF ?
-- 
2.17.1
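
A hypothetical test case (not from the series; the error string comes
from the reference-tracking patch, and details may differ) showing what
this generalization enables, a SCHED_CLS program that tail calls while
holding a socket reference:

{
	"hypothetical: tail_call while holding a reference",
	.insns = {
	BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
	BPF_MOV64_IMM(BPF_REG_2, 0),
	BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
	BPF_MOV64_IMM(BPF_REG_3, 4),
	BPF_MOV64_IMM(BPF_REG_4, 0),
	BPF_MOV64_IMM(BPF_REG_5, 0),
	BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),	/* (possible) reference */
	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
	BPF_LD_MAP_FD(BPF_REG_2, 0),
	BPF_MOV64_IMM(BPF_REG_3, 2),
	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_tail_call),
	BPF_MOV64_IMM(BPF_REG_0, 0),
	BPF_EXIT_INSN(),
	},
	.fixup_prog1 = { 11 },
	.errstr = "tail_call would lead to reference leak",
	.result = REJECT,
	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},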



[PATCHv3 bpf-next 02/12] bpf: Simplify ptr_min_max_vals adjustment

2018-09-27 Thread Joe Stringer
An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c   | 22 ++---
 tools/testing/selftests/bpf/test_verifier.c | 14 ++---
 2 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 18347de310ad..87b75efc1dc1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2669,20 +2669,18 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
return -EACCES;
}
 
-   if (ptr_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-   verbose(env, "R%d pointer arithmetic on 
PTR_TO_MAP_VALUE_OR_NULL prohibited, null-check it first\n",
-   dst);
-   return -EACCES;
-   }
-   if (ptr_reg->type == CONST_PTR_TO_MAP) {
-   verbose(env, "R%d pointer arithmetic on CONST_PTR_TO_MAP 
prohibited\n",
-   dst);
+   switch (ptr_reg->type) {
+   case PTR_TO_MAP_VALUE_OR_NULL:
+   verbose(env, "R%d pointer arithmetic on %s prohibited, 
null-check it first\n",
+   dst, reg_type_str[ptr_reg->type]);
return -EACCES;
-   }
-   if (ptr_reg->type == PTR_TO_PACKET_END) {
-   verbose(env, "R%d pointer arithmetic on PTR_TO_PACKET_END 
prohibited\n",
-   dst);
+   case CONST_PTR_TO_MAP:
+   case PTR_TO_PACKET_END:
+   verbose(env, "R%d pointer arithmetic on %s prohibited\n",
+   dst, reg_type_str[ptr_reg->type]);
return -EACCES;
+   default:
+   break;
}
 
/* In case of 'scalar += pointer', dst_reg inherits pointer type and id.
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index c7d25f23baf9..a90be44f61e0 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3638,7 +3638,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
@@ -4896,7 +4896,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4917,7 +4917,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -4938,7 +4938,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 4 },
-   .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL",
+   .errstr = "R4 pointer arithmetic on map_value_or_null",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS
},
@@ -7253,7 +7253,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map_in_map = { 3 },
-   .errstr = "R1 pointer arithmetic on CONST_PTR_TO_MAP 
prohibited",
+   .errstr = "R1 pointer arithmetic on map_ptr prohibited",
.result = REJECT,
},
{
@@ -8927,7 +8927,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
@@ -8946,7 +8946,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END",
+   .errstr = "R3 pointer arithmetic on pkt_end",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
},
-- 
2.17.1



[PATCHv3 bpf-next 05/12] bpf: Macrofy stack state copy

2018-09-27 Thread Joe Stringer
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros that the
later commit can reuse.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 106 --
 1 file changed, 60 insertions(+), 46 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d4abbf0d5727..cf8704d137fa 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -388,60 +388,74 @@ static void print_verifier_state(struct bpf_verifier_env 
*env,
verbose(env, "\n");
 }
 
-static int copy_stack_state(struct bpf_func_state *dst,
-   const struct bpf_func_state *src)
-{
-   if (!src->stack)
-   return 0;
-   if (WARN_ON_ONCE(dst->allocated_stack < src->allocated_stack)) {
-   /* internal bug, make state invalid to reject the program */
-   memset(dst, 0, sizeof(*dst));
-   return -EFAULT;
-   }
-   memcpy(dst->stack, src->stack,
-  sizeof(*src->stack) * (src->allocated_stack / BPF_REG_SIZE));
-   return 0;
-}
+#define COPY_STATE_FN(NAME, COUNT, FIELD, SIZE)   \
+static int copy_##NAME##_state(struct bpf_func_state *dst, \
+  const struct bpf_func_state *src)\
+{  \
+   if (!src->FIELD)\
+   return 0;   \
+   if (WARN_ON_ONCE(dst->COUNT < src->COUNT)) {\
+   /* internal bug, make state invalid to reject the program */ \
+   memset(dst, 0, sizeof(*dst));   \
+   return -EFAULT; \
+   }   \
+   memcpy(dst->FIELD, src->FIELD,  \
+  sizeof(*src->FIELD) * (src->COUNT / SIZE));  \
+   return 0;   \
+}
+/* copy_stack_state() */
+COPY_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef COPY_STATE_FN
+
+#define REALLOC_STATE_FN(NAME, COUNT, FIELD, SIZE) \
+static int realloc_##NAME##_state(struct bpf_func_state *state, int size, \
+ bool copy_old)\
+{  \
+   u32 old_size = state->COUNT;\
+   struct bpf_##NAME##_state *new_##FIELD; \
+   int slot = size / SIZE; \
+   \
+   if (size <= old_size || !size) {\
+   if (copy_old)   \
+   return 0;   \
+   state->COUNT = slot * SIZE; \
+   if (!size && old_size) {\
+   kfree(state->FIELD);\
+   state->FIELD = NULL;\
+   }   \
+   return 0;   \
+   }   \
+   new_##FIELD = kmalloc_array(slot, sizeof(struct bpf_##NAME##_state), \
+   GFP_KERNEL);\
+   if (!new_##FIELD)   \
+   return -ENOMEM; \
+   if (copy_old) { \
+   if (state->FIELD)   \
+   memcpy(new_##FIELD, state->FIELD,   \
+  sizeof(*new_##FIELD) * (old_size / SIZE)); \
+   memset(new_##FIELD + old_size / SIZE, 0,\
+  sizeof(*new_##FIELD) * (size - old_size) / SIZE); \
+   }   \
+   state->COUNT = slot * SIZE; \
+   kfree(state->FIELD);\
+   state->FIELD = new_##FIELD; \
+   return 0;   \
+}
+/* realloc_stack_state() */
+REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
+#undef REALLOC_STATE_FN
 

[PATCHv3 bpf-next 01/12] bpf: Add iterator for spilled registers

2018-09-27 Thread Joe Stringer
Add an iterator for spilled registers. It concentrates the details of
how to get the current frame's spilled registers into a single macro,
while clarifying the intention of the code which is calling the macro.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf_verifier.h | 11 +++
 kernel/bpf/verifier.c| 16 +++-
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index b42b60a83e19..af262b97f586 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -131,6 +131,17 @@ struct bpf_verifier_state {
u32 curframe;
 };
 
+#define __get_spilled_reg(slot, frame) \
+   (((slot < frame->allocated_stack / BPF_REG_SIZE) && \
+ (frame->stack[slot].slot_type[0] == STACK_SPILL)) \
+? &frame->stack[slot].spilled_ptr : NULL)
+
+/* Iterate over 'frame', setting 'reg' to either NULL or a spilled register. */
+#define for_each_spilled_reg(iter, frame, reg) \
+   for (iter = 0, reg = __get_spilled_reg(iter, frame);\
+iter < frame->allocated_stack / BPF_REG_SIZE;  \
+iter++, reg = __get_spilled_reg(iter, frame))
+
 /* linked list of verifier states used to prune search */
 struct bpf_verifier_state_list {
struct bpf_verifier_state state;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8cc83a970d1..18347de310ad 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2252,10 +2252,9 @@ static void __clear_all_pkt_pointers(struct 
bpf_verifier_env *env,
if (reg_is_pkt_pointer_any(®s[i]))
mark_reg_unknown(env, regs, i);
 
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg_is_pkt_pointer_any(reg))
__mark_reg_unknown(reg);
}
@@ -3395,10 +3394,9 @@ static void find_good_pkt_pointers(struct 
bpf_verifier_state *vstate,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
-   reg = &state->stack[i].spilled_ptr;
if (reg->type == type && reg->id == dst_reg->id)
reg->range = max(reg->range, new_range);
}
@@ -3643,7 +3641,7 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
  bool is_null)
 {
struct bpf_func_state *state = vstate->frame[vstate->curframe];
-   struct bpf_reg_state *regs = state->regs;
+   struct bpf_reg_state *reg, *regs = state->regs;
u32 id = regs[regno].id;
int i, j;
 
@@ -3652,8 +3650,8 @@ static void mark_map_regs(struct bpf_verifier_state 
*vstate, u32 regno,
 
for (j = 0; j <= vstate->curframe; j++) {
state = vstate->frame[j];
-   for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) {
-   if (state->stack[i].slot_type[0] != STACK_SPILL)
+   for_each_spilled_reg(i, state, reg) {
+   if (!reg)
continue;
mark_map_reg(&state->stack[i].spilled_ptr, 0, id, is_null);
}
-- 
2.17.1



[PATCHv3 bpf-next 11/12] selftests/bpf: Add C tests for reference tracking

2018-09-27 Thread Joe Stringer
Add some tests that demonstrate and test the balanced lookup/free
nature of socket lookup. Section names that start with "fail" represent
programs that are expected to fail verification; all others should
succeed.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
v3: Rebase against flags arg change of bpf_sk_release()
New tests:
* "fail_use_after_free"
* "fail_modify_sk_pointer"
* "fail_modify_sk_or_null_pointer"
---
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  |  38 
 .../selftests/bpf/test_sk_lookup_kern.c   | 180 ++
 3 files changed, 219 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sk_lookup_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index f802de526f57..1381ab81099c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -36,7 +36,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o
+   test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_sk_lookup_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 63a671803ed6..e8becca9c521 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1698,6 +1698,43 @@ static void test_task_fd_query_tp(void)
   "sys_enter_read");
 }
 
+static void test_reference_tracking()
+{
+   const char *file = "./test_sk_lookup_kern.o";
+   struct bpf_object *obj;
+   struct bpf_program *prog;
+   __u32 duration;
+   int err = 0;
+
+   obj = bpf_object__open(file);
+   if (IS_ERR(obj)) {
+   error_cnt++;
+   return;
+   }
+
+   bpf_object__for_each_program(prog, obj) {
+   const char *title;
+
+   /* Ignore .text sections */
+   title = bpf_program__title(prog, false);
+   if (strstr(title, ".text") != NULL)
+   continue;
+
+   bpf_program__set_type(prog, BPF_PROG_TYPE_SCHED_CLS);
+
+   /* Expect verifier failure if test name has 'fail' */
+   if (strstr(title, "fail") != NULL) {
+   libbpf_set_print(NULL, NULL, NULL);
+   err = !bpf_program__load(prog, "GPL", 0);
+   libbpf_set_print(printf, printf, NULL);
+   } else {
+   err = bpf_program__load(prog, "GPL", 0);
+   }
+   CHECK(err, title, "\n");
+   }
+   bpf_object__close(obj);
+}
+
 int main(void)
 {
jit_enabled = is_jit_enabled();
@@ -1719,6 +1756,7 @@ int main(void)
test_get_stack_raw_tp();
test_task_fd_query_rawtp();
test_task_fd_query_tp();
+   test_reference_tracking();
 
printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
new file mode 100644
index ..b745bdc08c2b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sk_lookup_kern.c
@@ -0,0 +1,180 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/pkt_cls.h>
+#include <linux/tcp.h>
+#include <sys/socket.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
+
+/* Fill 'tuple' with L3 info, and attempt to find L4. On fail, return NULL. */
+static struct bpf_sock_tuple *get_tuple(void *data, __u64 nh_off,
+   void *data_end, __u16 eth_proto,
+   bool *ipv4)
+{
+   struct bpf_sock_tuple *result;
+   __u8 proto = 0;
+   __u64 ihl_len;
+
+   if (eth_proto == bpf_htons(ETH_P_IP)) {
+   struct iphdr *iph = (struct iphdr *)(data + nh_off);
+
+   if (iph + 1 > data_end)
+   return NULL;
+   ihl_len = iph->ihl * 4;
+   proto = iph->protocol;
+   *ipv4 = true;
+   result = (struct bpf_sock_tuple *)&iph
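
The cast in the final line above relies on the tuple layout matching the
on-wire ordering: the IPv4 source and destination addresses sit
back-to-back in struct iphdr, and the L4 source/destination ports follow
immediately after the IP header, so &iph->saddr can stand in for the
tuple whenever the verifier can prove those bytes lie inside the packet.
For reference, the tuple layout as added in patch 07 of this series
(reproduced here as a sketch, not re-quoted from the diff):

  struct bpf_sock_tuple {
          union {
                  struct {
                          __be32 saddr;
                          __be32 daddr;
                          __be16 sport;
                          __be16 dport;
                  } ipv4;
                  struct {
                          __be32 saddr[4];
                          __be32 daddr[4];
                          __be16 sport;
                          __be16 dport;
                  } ipv6;
          };
  };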

[PATCHv3 bpf-next 12/12] Documentation: Describe bpf reference tracking

2018-09-27 Thread Joe Stringer
Document the new pointer types in the verifier and how the pointer ID
tracking works to ensure that references which are taken are later
released.

Signed-off-by: Joe Stringer 
Acked-by: Alexei Starovoitov 
---
 Documentation/networking/filter.txt | 64 +
 1 file changed, 64 insertions(+)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index e6b4ebb2b243..4443ce958862 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1125,6 +1125,14 @@ pointer type.  The types of pointers describe their base, as follows:
 PTR_TO_STACKFrame pointer.
 PTR_TO_PACKET   skb->data.
 PTR_TO_PACKET_END   skb->data + headlen; arithmetic forbidden.
+PTR_TO_SOCKET   Pointer to struct bpf_sock, implicitly refcounted.
+PTR_TO_SOCKET_OR_NULL
+Either a pointer to a socket, or NULL; socket lookup
+returns this type, which becomes a PTR_TO_SOCKET when
+checked != NULL. PTR_TO_SOCKET is reference-counted,
+so programs must release the reference through the
+socket release function before the end of the program.
+Arithmetic on these pointers is forbidden.
 However, a pointer may be offset from this base (as a result of pointer
 arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
 offset'.  The former is used when an exactly-known value (e.g. an immediate
@@ -1171,6 +1179,13 @@ over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
 that pointer are safe.
+The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
+to all copies of the pointer returned from a socket lookup. This has similar
+behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
+it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
+represents a reference to the corresponding 'struct sock'. To ensure that the
+reference is not leaked, it is imperative to NULL-check the reference and, in
+the non-NULL case, pass the valid reference to the socket release function.
 
 Direct packet access
 
@@ -1444,6 +1459,55 @@ Error:
   8: (7a) *(u64 *)(r0 +0) = 1
   R0 invalid mem access 'imm'
 
+Program that performs a socket lookup then sets the pointer to NULL without
+checking it:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+  BPF_MOV64_IMM(BPF_REG_0, 0),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup_tcp#65
+  8: (b7) r0 = 0
+  9: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
+Program that performs a socket lookup but does not NULL-check the returned
+value:
+  BPF_MOV64_IMM(BPF_REG_2, 0),
+  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_MOV64_IMM(BPF_REG_3, 4),
+  BPF_MOV64_IMM(BPF_REG_4, 0),
+  BPF_MOV64_IMM(BPF_REG_5, 0),
+  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+  BPF_EXIT_INSN(),
+Error:
+  0: (b7) r2 = 0
+  1: (63) *(u32 *)(r10 -8) = r2
+  2: (bf) r2 = r10
+  3: (07) r2 += -8
+  4: (b7) r3 = 4
+  5: (b7) r4 = 0
+  6: (b7) r5 = 0
+  7: (85) call bpf_sk_lookup_tcp#65
+  8: (95) exit
+  Unreleased reference id=1, alloc_insn=7
+
 Testing
 ---
 
-- 
2.17.1
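
For contrast with the two failing examples, a well-formed program takes
the reference, NULL-checks it, and releases it on every path before
exiting. A minimal C sketch (assuming the lookup/release helpers from
this series; 'tuple' population is elided):

  sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
  if (!sk)
          return TC_ACT_SHOT;     /* nothing was acquired, nothing to release */
  /* ... use sk ... */
  bpf_sk_release(sk);             /* give the reference back before exit */
  return TC_ACT_OK;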



[PATCHv3 bpf-next 07/12] bpf: Add helper to retrieve socket in BPF

2018-09-27 Thread Joe Stringer
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp(), which allow BPF programs to find out if there is a
socket listening on this host, and return a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.

By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:

  struct bpf_sock_tuple tuple;
  struct bpf_sock *sk;

  populate_tuple(ctx, &tuple); // Extract the 5-tuple from the packet
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (!sk) {
// Couldn't find a socket listening for this traffic. Drop.
return TC_ACT_SHOT;
  }
  bpf_sk_release(sk);
  return TC_ACT_OK;
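
populate_tuple() is left abstract above; for IPv4 it might look like the
following sketch (field names per 'struct bpf_sock_tuple' as added in this
patch; 'iph' and 'tcph' are assumed to be verifier-checked header
pointers):

  tuple.ipv4.saddr = iph->saddr;
  tuple.ipv4.daddr = iph->daddr;
  tuple.ipv4.sport = tcph->source;
  tuple.ipv4.dport = tcph->dest;
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4), netns, 0);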

Signed-off-by: Joe Stringer 
---
v2: Rework 'struct bpf_sock_tuple' to allow passing a packet pointer
Limit netns_id field to 32 bits
Fix compile error with CONFIG_IPV6 enabled
Allow direct packet access from helper

v3: Fix release of caller_net when netns is not specified.
Use skb->sk to find caller net when skb->dev is unavailable.
Remove flags argument to sk_release()
Define the semantics of the new helpers more clearly.
---
 include/uapi/linux/bpf.h  |  93 -
 kernel/bpf/verifier.c |   8 +-
 net/core/filter.c | 151 ++
 tools/include/uapi/linux/bpf.h|  93 -
 tools/testing/selftests/bpf/bpf_helpers.h |  12 ++
 5 files changed, 354 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2070d819e04..f9187b41dff6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2144,6 +2144,77 @@ union bpf_attr {
  * request in the skb.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags)
+ * Description
+ * Look for TCP socket matching *tuple*, optionally in a child
+ * network namespace *netns*. The return value must be checked,
+ * and if non-NULL, released via **bpf_sk_release**\ ().
+ *
+ * The *ctx* should point to the context of the program, such as
+ * the skb or socket (depending on the hook in use). This is used
+ * to determine the base network namespace for the lookup.
+ *
+ * *tuple_size* must be one of:
+ *
+ * **sizeof**\ (*tuple*\ **->ipv4**)
+ * Look for an IPv4 socket.
+ * **sizeof**\ (*tuple*\ **->ipv6**)
+ * Look for an IPv6 socket.
+ *
+ * If the *netns* is zero, then the socket lookup table in the
+ * netns associated with the *ctx* will be used. For the TC hooks,
+ * this is the netns of the device in the skb. For socket hooks,
+ * this is the netns of the socket. If *netns* is non-zero, then
+ * it specifies the ID of the netns relative to the netns
+ * associated with the *ctx*.
+ *
+ * All values for *flags* are reserved for future usage, and must
+ * be left at zero.
+ *
+ * This helper is available only if the kernel was compiled with
+ * **CONFIG_NET** configuration option.
+ * Return
+ * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ *
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags)
+ * Description
+ * Look for UDP socket matching *tuple*, optionally in a child
+ * network namespace *netns*. The return value must be checked,
+ * and if non-NULL, released via **bpf_sk_release**\ ().
+ *
+ * The *ctx* should point to the context of the program, such as
+ * the skb or socket (depending on the hook in use). This is used
+ * to determine the base network namespace for the lookup.
+ *
+ * *tuple_size* must be one of:
+ *
+ * **sizeof**\ (*tuple*\ **->ipv4**)
+ * Look for an IPv4 socket.
+ * **sizeof**\ (*tuple*\ **->ipv6**)
+ * Look for an IPv6 socket.
+ *
+ * If the *netns* is zero, then the socket lookup table in the
+ * netns associated with the *ctx* will be used. For the TC hooks,
+ * this is the netns of the device in the skb. For socket hooks,
+ * this is the netns of the socket. If *netns* is non-zero, then
+ * it specifies 
