Re: [iovisor-dev] bpf_redirect_map not working after tail call

2018-06-04 Thread Jesper Dangaard Brouer via iovisor-dev


On Fri, 1 Jun 2018 14:15:58 +0200
Sebastiano Miano via iovisor-dev  wrote:

> Dear all,
> 
> We have noticed that the bpf_redirect_map returns an error when it is
> called after a tail call.
> The xdp_redirect_map program (under sample/bpf) works fine, but if we
> modify it as shown in the following diff, it doesn't work anymore.
> I have debugged it with the xdp_monitor application and the error
> returned is EFAULT.
> Is this a known issue? Am I doing something wrong?

Argh, this is likely an issue/bug due to the check xdp_map_invalid(),
that was introduced in commit 7c3001313396 ("bpf: fix ri->map_owner
pointer on bpf_prog_realloc").

To Daniel, I don't know how to solve this, could you give some advice?



static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
				   unsigned long aux)
{
	return (unsigned long)xdp_prog->aux != aux;
}

static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
			       struct bpf_prog *xdp_prog)
{
	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
	unsigned long map_owner = ri->map_owner;
	struct bpf_map *map = ri->map;
	u32 index = ri->ifindex;
	void *fwd = NULL;
	int err;

	[...]
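	/* Note (my assumption, not verified): ri->map_owner records the aux
	 * pointer of the program that called bpf_redirect_map(), which
	 * after a tail call is the tail-called program.  The driver,
	 * however, calls xdp_do_redirect_map() with the program it has
	 * attached (the tail-call entry point), so the two aux pointers
	 * differ and the check below fails with -EFAULT.
	 */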
	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
		err = -EFAULT;
		map = NULL;
		goto err;
	}
	[...]


> P.S. I have tested the program with the latest bpf-next kernel.
> 
> 
> 
> diff --git a/samples/bpf/xdp_redirect_map_kern.c
> b/samples/bpf/xdp_redirect_map_kern.c
> index 740a529..bf1275a 100644
> --- a/samples/bpf/xdp_redirect_map_kern.c
> +++ b/samples/bpf/xdp_redirect_map_kern.c
> @@ -36,6 +36,13 @@ struct bpf_map_def SEC("maps") rxcnt = {
>   .max_entries = 1,
>  };
> 
> +struct bpf_map_def SEC("maps") prog_table = {
> + .type   = BPF_MAP_TYPE_PROG_ARRAY,
> + .key_size = sizeof(int),
> + .value_size = sizeof(int),
> + .max_entries = 32,
> +};
> +
>  static void swap_src_dst_mac(void *data)
>  {
>   unsigned short *p = data;
> @@ -89,4 +96,15 @@ int xdp_redirect_dummy_prog(struct xdp_md *ctx)
>   return XDP_PASS;
>  }
> 
> +/* Entry point */
> +SEC("xdp_redirect_entry_point")
> +int xdp_redirect_entry_point_prog(struct xdp_md *ctx)
> +{
> + //char fmt[] = "xdp_redirect_entry_point\n";
> + //bpf_trace_printk(fmt, sizeof(fmt));
> + bpf_tail_call(ctx, &prog_table, 0);
> + // Tail call failed
> + return XDP_DROP;
> +}
> +
>  char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/xdp_redirect_map_user.c
> b/samples/bpf/xdp_redirect_map_user.c
> index 4445e76..b2d2059 100644
> --- a/samples/bpf/xdp_redirect_map_user.c
> +++ b/samples/bpf/xdp_redirect_map_user.c
> @@ -120,7 +120,13 @@ int main(int argc, char **argv)
>   return 1;
>   }
> 
> - if (bpf_set_link_xdp_fd(ifindex_in, prog_fd[0], xdp_flags) < 0) {
> + ret = bpf_map_update_elem(map_fd[2], &key, &prog_fd[0], 0);
> + if (ret) {
> + perror("bpf_update_elem");
> + goto out;
> + }
> +
> + if (bpf_set_link_xdp_fd(ifindex_in, prog_fd[2], xdp_flags) < 0) {
>   printf("ERROR: link set xdp fd failed on %d\n", ifindex_in);
>   return 1;
>   }
> ___
> iovisor-dev mailing list
> iovisor-dev@lists.iovisor.org
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] bpftool binary size

2018-04-18 Thread Jesper Dangaard Brouer via iovisor-dev
(Cc. iovisor-dev list)

On Wed, 18 Apr 2018 21:32:45 +0200
Jesper Dangaard Brouer  wrote:

> On Wed, 18 Apr 2018 20:13:39 +0100
> Quentin Monnet  wrote:
> 
> > Hi Jesper,
> > 
> > Out of curiosity, how much did you say your bpftool executable weigh? Was
> > it 2 MB? I recompiled it on my laptop and it's 173 kB, 113 kB if stripped.
> > I don't believe it embeds libopcode on my system (right now Ubuntu 18.04).
> > What is your setup, if you don't mind me asking?  
> 
> On my Fedora 26, it is 2.2MB (2207976) and stripped 2MB (2046784).
> 
> For some reason the bpftool binary gets statically linked with
> libopcodes, which is /usr/lib64/libopcodes.a 1.76MB.

So, I figured out why this happens.

What does bpftool use BFD for?

I need the BFD lib and its include file: /usr/include/bfd.h.
Thus, on Fedora I need to install the binutils-devel package.

The file /usr/lib64/libopcodes.so contains an ld script that enforces
static linking:

$ cat /usr/lib64/libopcodes.so
/* GNU ld script */

/* Ensure this .so library will not be used by a link for a different format
   on a multi-architecture system.  */
OUTPUT_FORMAT(elf64-x86-64)

INPUT ( /usr/lib64/libopcodes.a -lbfd )



The info/description of the RPM package is interesting: it explicitly
explains why this static linking is enforced... and developers are
"strongly encouraged" to use libelf instead of BFD.

Is using libelf instead an option?


$ rpm -qi binutils-devel
Name: binutils-devel
Version : 2.27
Release : 28.fc26
Architecture: x86_64
Install Date: Wed 18 Apr 2018 09:34:52 PM CEST
Group   : System Environment/Libraries
Size: 4651893
License : GPLv3+
Signature   : RSA/SHA256, Thu 02 Nov 2017 10:37:53 AM CET, Key ID 
812a6b4b64dab85d
Source RPM  : binutils-2.27-28.fc26.src.rpm
Build Date  : Fri 13 Oct 2017 03:48:11 PM CEST
Build Host  : buildvm-16.phx2.fedoraproject.org
Relocations : (not relocatable)
Packager: Fedora Project
Vendor  : Fedora Project
URL : http://sources.redhat.com/binutils
Summary : BFD and opcodes static and dynamic libraries and header files
Description :
This package contains BFD and opcodes static and dynamic libraries.

The dynamic libraries are in this package, rather than a seperate
base package because they are actually linker scripts that force
the use of the static libraries.  This is because the API of the
BFD library is too unstable to be used dynamically.

The static libraries are here because they are now needed by the
dynamic libraries.

Developers starting new projects are strongly encouraged to consider
using libelf instead of BFD.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-09 Thread Jesper Dangaard Brouer via iovisor-dev
On Fri, 6 Apr 2018 12:36:18 +0200
Daniel Borkmann  wrote:

> On 04/05/2018 10:51 PM, Jesper Dangaard Brouer wrote:
> > On Thu, 5 Apr 2018 12:37:19 +0200
> > Daniel Borkmann  wrote:
> >   
> >> On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote:  
> >>> Hi Suricata people,
> >>>
> >>> When Eric Leblond (and I helped) integrated XDP in Suricata, we ran
> >>> into the issue, that at Suricata load/start time, we cannot determine
> >>> if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on
> >>> this HW (e.g require driver XDP_REDIRECT support and bpf cpumap).
> >>>
> >>> We would have liked a way to report that suricata.yaml config was
> >>> invalid for this hardware/setup.  Now, it just loads, and packets gets
> >>> silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints).
> >>>
> >>> My question to suricata developers: (Q1) Do you already have code that
> >>> query the kernel or drivers for features?
> >>>
> >>> At the IOvisor call (2 weeks ago), we discussed two options of exposing
> >>> XDP features avail in a given driver.
> >>>
> >>> Option#1: Extend existing ethtool -k/-K "offload and other features"
> >>> with some XDP features, that userspace can query. (Do you already query
> >>> offloads, regarding Q1)
> >>>
> >>> Option#2: Invent a new 'ip link set xdp' netlink msg with a query option. 
> >>>
> >>
> >> I don't really mind if you go via ethtool, as long as we handle this
> >> generically from there and e.g. call the dev's ndo_bpf handler such that
> >> we keep all the information in one place. This can be a new ndo_bpf command
> >> e.g. XDP_QUERY_FEATURES or such.  
> > 
> > Just to be clear: notice as Victor points out[2], they are programmable
> > going though the IOCTL (SIOCETHTOOL) and not using cmdline tools.  
> 
> Sure, that was perfectly clear. (But at the same time if you extend the
> ioctl, it's obvious to also add support to actual ethtool cmdline tool.)
> 
> > [2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326
> > 
> > If you want everything to go through the drivers ndo_bpf call anyway
> > (which userspace API is netlink based) then at what point to you  
> 
> Not really, that's the front end. ndo_bpf itself is a plain netdev op
> and has no real tie to netlink.
> 
> > want drivers to call their own ndo_bpf, when activated though their
> > ethtool_ops ? (Sorry, but I don't follow the flow you are proposing)
> > 
> > Notice, I'm not directly against using the drivers ndo_bpf call.  I can
> > see it does provide kernel more flexibility than the ethtool IOCTL.  
> 
> What I was saying is that even if you go via ethtool ioctl api, where
> you end up in dev_ethtool() and have some new ETHTOOL_* query command,
> then instead of adding a new ethtool_ops callback, we can and should
> reuse ndo_bpf from there.

Okay, you want to catch this in dev_ethtool(); that connected the dots
for me, thanks.  BUT it does feel a little fake and confusing to do it
this way, as driver developers seeing the info via ethtool would expect
to have to implement an ethtool_ops callback.

My original idea behind going through the ethtool APIs was that every
driver is already using them, and they are familiar to driver
developers.  There are also existing userspace tools that sysadmins
already know, which we can leverage.

Thinking about this over the weekend, I have changed my mind a bit,
based on your feedback. I do buy into your idea of forcing things
through the driver's ndo_bpf call.  I'm no longer convinced this should
go through ethtool, but as you describe, we can do so if we find that
it is the easiest way to export the info and leverage an existing
userspace tool.



> [...]
> > Here, I want to discuss how drivers expose/tell userspace that they
> > support a given feature: Specifically a bit for: XDP_REDIRECT action
> > support.
> >   
> >> Same for meta data,  
> > 
> > Well, not really.  It would be a "nice-to-have", but not strictly
> > needed as a feature bit.  XDP meta-data is controlled via a helper.
> > And the BPF-prog can detect/see runtime, that the helper bpf_xdp_adjust_meta
> > returns -ENOTSUPP (and need to check the ret value anyhow).  Thus,
> > there is that not much gained by exposing this to be detected setup
> > time, as all drivers should eventually support this, and we can detect
> > it runtime.
> > 
> > 

Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-06 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 5 Apr 2018 14:47:16 -0700
Jakub Kicinski  wrote:

> On Thu, 5 Apr 2018 22:51:33 +0200, Jesper Dangaard Brouer wrote:
> > > What about nfp in terms of XDP
> > > offload capabilities, should they be included as well or is probing to 
> > > load
> > > the program and see if it loads/JITs as we do today just fine (e.g. you'd
> > > otherwise end up with extra flags on a per BPF helper basis)?
> > 
> > No, flags per BPF helper basis. As I've described above, helper belong
> > to the BPF core, not the driver.  Here I want to know what the specific
> > driver support.  
> 
> I think Daniel meant for nfp offload.  The offload restrictions are
> quite involved, are we going to be able to express those?

Let's keep things separate.

I'm requesting something really simple.  I want the driver to tell me
what XDP actions it supports.  We/I can implement an XDP_QUERY_ACTIONS
via ndo_bpf, problem solved.  It is small, specific and simple.
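
To make the shape of that concrete, here is a purely hypothetical
sketch (none of these names exist in the kernel; only the XDP_* action
values are real) of what a driver could report through such a query:

#include <linux/bpf.h>		/* XDP_ABORTED .. XDP_REDIRECT */
#include <linux/types.h>

/* Hypothetical query result: one bit per supported XDP action */
struct xdp_query_actions {
	__u32 action_mask;
};

/* Example: what a driver with full XDP_REDIRECT support might report */
static void example_fill_actions(struct xdp_query_actions *q)
{
	q->action_mask = (1U << XDP_ABORTED) | (1U << XDP_DROP) |
			 (1U << XDP_PASS)    | (1U << XDP_TX)   |
			 (1U << XDP_REDIRECT);
}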

For my other use-case of enabling XDP-xmit on a device, I can
implement another ndo_bpf extension. The current approach is to load
a dummy XDP prog via ndo_bpf anyway (which is awkward). Again, a
specific change that lets us move one step further.


For your nfp offload use-case, you/we have to find a completely
different solution.  You have hit a design choice made by BPF:
BPF is part of the core kernel, and helpers cannot be loaded as
kernel modules, because we cannot remove or add helpers after the
verifier has certified the program.  And your nfp offload driver
basically comes as a kernel module.
 (Details: you basically already solved your issue by modifying the
core verifier to call back into bpf_prog_offload_verify_insn()).
The point being, this is very different from what I'm requesting.  Thus,
for offloading you already have a solution to my setup-time detection
problem, as your program gets rejected at setup/load time by the verifier.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [Oisf-devel] Best userspace programming API for XDP features query to kernel?

2018-04-05 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 5 Apr 2018 09:47:37 +0200
Victor Julien  wrote:

> > Option#2: Invent a new 'ip link set xdp' netlink msg with a query option.  
> 
> Do you have an example of how this is queried?

The code for querying should not be too difficult.

It would likely be similar to how we currently "set"/attach an XDP
program, via its BPF file descriptor, to an ifindex.  Eric Leblond
chose to hide this in the kernel library "libbpf"; see the code:

 function bpf_set_link_xdp_fd()
 https://github.com/torvalds/linux/blob/master/tools/lib/bpf/bpf.c#L456-L575

Given Suricata already depends on libbpf for eBPF and XDP support, it
might make sense to add an API call to "get" XDP link info, e.g.
bpf_get_link_xdp_features(int ifindex)?
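
To sketch how Suricata could consume such a call (note: neither the
function nor the feature bit below exist in libbpf, they only
illustrate the proposed API):

/* Hypothetical API sketch: bpf_get_link_xdp_features() and the
 * XDP_FEATURE_REDIRECT bit are proposals, not existing libbpf symbols. */
#define XDP_FEATURE_REDIRECT	(1U << 0)	/* hypothetical bit */

int bpf_get_link_xdp_features(int ifindex);	/* proposed libbpf call */

static int xdp_redirect_supported(int ifindex)
{
	int features = bpf_get_link_xdp_features(ifindex);

	if (features < 0)
		return features;	/* query not supported / error */
	return !!(features & XDP_FEATURE_REDIRECT);
}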

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-05 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 5 Apr 2018 12:37:19 +0200
Daniel Borkmann  wrote:

> On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
> > Hi Suricata people,
> > 
> > When Eric Leblond (and I helped) integrated XDP in Suricata, we ran
> > into the issue, that at Suricata load/start time, we cannot determine
> > if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on
> > this HW (e.g require driver XDP_REDIRECT support and bpf cpumap).
> > 
> > We would have liked a way to report that suricata.yaml config was
> > invalid for this hardware/setup.  Now, it just loads, and packets gets
> > silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints).
> > 
> > My question to suricata developers: (Q1) Do you already have code that
> > query the kernel or drivers for features?
> > 
> > At the IOvisor call (2 weeks ago), we discussed two options of exposing
> > XDP features avail in a given driver.
> > 
> > Option#1: Extend existing ethtool -k/-K "offload and other features"
> > with some XDP features, that userspace can query. (Do you already query
> > offloads, regarding Q1)
> > 
> > Option#2: Invent a new 'ip link set xdp' netlink msg with a query option.  
> 
> I don't really mind if you go via ethtool, as long as we handle this
> generically from there and e.g. call the dev's ndo_bpf handler such that
> we keep all the information in one place. This can be a new ndo_bpf command
> e.g. XDP_QUERY_FEATURES or such.

Just to be clear: notice, as Victor points out[2], they program against
the IOCTL (SIOCETHTOOL) directly and do not use the cmdline tools.

[2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326

If you want everything to go through the driver's ndo_bpf call anyway
(whose userspace API is netlink based), then at what point do you
want drivers to call their own ndo_bpf when activated through their
ethtool_ops? (Sorry, but I don't follow the flow you are proposing.)

Notice, I'm not directly against using the driver's ndo_bpf call.  I can
see it gives the kernel more flexibility than the ethtool IOCTL.


> More specifically, how would such feature mask look like? How fine grained
> would this be? When you add a new minor feature to, say, cpumap that not
> all drivers support yet, we'd need a new flag each time, no?

No, CPUMAP is not a driver-level feature, and thus does not require a
feature flag exposed by the driver.  CPUMAP depends on the driver
feature XDP_REDIRECT.

It is important that we separate driver-level features from BPF/XDP core
features.  I feel that we constantly talk past each other when we mix
that up.

It is true that Suricata _also_ needs to detect if the running kernel
supports the map type called BPF_MAP_TYPE_CPUMAP.  *BUT* that is a
completely separate mechanism.  It is a core kernel bpf feature, and I
have accepted that this can only be done via probing the kernel (simply
use the bpf syscall and try to create this map type).
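
As a concrete illustration of that probing approach, here is a sketch,
assuming the bpf_create_map() wrapper from tools/lib/bpf and treating a
failed create as "not supported":

#include <stdbool.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>		/* bpf_create_map() from tools/lib/bpf */

/* Probe: try to create a BPF_MAP_TYPE_CPUMAP map; if the kernel does
 * not know this map type, the syscall fails and we report "no". */
static bool kernel_supports_cpumap(void)
{
	int fd = bpf_create_map(BPF_MAP_TYPE_CPUMAP, sizeof(__u32),
				sizeof(__u32), 8, 0);

	if (fd < 0)
		return false;
	close(fd);
	return true;
}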

Here, I want to discuss how drivers expose/tell userspace that they
support a given feature: Specifically a bit for: XDP_REDIRECT action
support.


> Same for meta data,

Well, not really.  It would be a "nice-to-have", but not strictly
needed as a feature bit.  XDP meta-data is controlled via a helper,
and the BPF prog can detect/see at runtime that the helper
bpf_xdp_adjust_meta returns -ENOTSUPP (and it needs to check the return
value anyhow).  Thus, there is not much gained by exposing this for
setup-time detection, as all drivers should eventually support this,
and we can detect it at runtime.

The missing XDP_REDIRECT action feature bit is different, as the
BPF prog cannot detect at runtime that this is an unsupported action.
Plus, at setup time we cannot query the driver for supported XDP actions.


> then potentially for the redirect memory return work,

I'm not sure the redirect memory return types belong here.  First of
all, it is per RX-ring.  Second, some other config method will likely
be the config interface; like with AF_XDP zero-copy it is a new NDO.
So, it is more loosely coupled.

> or the af_xdp bits, 

No need for a bit for AF_XDP in copy-mode (current RFC), as this only
depends on the driver supporting the XDP_REDIRECT action.

For AF_XDP in zero-copy mode, then yes, we need a way for userspace to
"see" if this mode is supported by the driver.  But it might not need a
feature bit here... as the bind() call (which knows the ifindex) could
fail when it tries to enable ZC mode.  Still, it would make userspace's
life easier to add ZC as a driver feature bit.


> the xdp_rxq_info would have needed it, etc. 

Same comment as for the mem-type: not necessarily, as it is more
loosely coupled to XDP.

> What about nfp in terms of XDP
> offload capabilities, should they be included as well or is 

[iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-04 Thread Jesper Dangaard Brouer via iovisor-dev
Hi Suricata people,

When Eric Leblond (and I helped) integrated XDP in Suricata, we ran
into the issue that, at Suricata load/start time, we cannot determine
if the chosen XDP config options, like xdp-cpu-redirect[1], are valid on
this HW (they e.g. require driver XDP_REDIRECT support and bpf cpumap).

We would have liked a way to report that the suricata.yaml config was
invalid for this hardware/setup.  Now, it just loads, and packets get
silently dropped by XDP (well, a WARN_ONCE and catchable via tracepoints).

My question to Suricata developers: (Q1) Do you already have code that
queries the kernel or drivers for features?


At the IOvisor call (2 weeks ago), we discussed two options of exposing
XDP features avail in a given driver.

Option#1: Extend the existing ethtool -k/-K "offload and other features"
with some XDP features that userspace can query. (Do you already query
offloads, regarding Q1?)

Option#2: Invent a new 'ip link set xdp' netlink msg with a query option.

(Q2) Do Suricata devs have any preference (or other options/ideas) for
the way the kernel exposes this info to userspace?



[1] 
http://suricata.readthedocs.io/en/latest/capture-hardware/ebpf-xdp.html#the-xdp-cpu-redirect-case
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Notification when an eBPF map is modified

2018-02-17 Thread Jesper Dangaard Brouer via iovisor-dev


On Sat, 17 Feb 2018 13:49:22 + Teng Qin via iovisor-dev 
 wrote:

> > We were looking for a mechanism transparent to the eBPF program, though.
> > A possible rational is to have an hot-standby copy of the program
> > (including the state) in some other location, but I don't want my
> > dataplane to be aware of that.
> > Thanks,
> >
> > fulvio  
> 
> 
> You could also (use another BPF program or ftrace) to trace the
> bpf_map_update_elem Tracepoint. But in that case you get all update calls
> and would need to filter for the one you are interested on your own:)

That is a good idea.

Try it out via perf-record to see if it contains what you need:

 $ perf record -e bpf:bpf_map_update_elem -a

 $ perf script
 xdp_redirect_ma  2273 [011] 261187.968223: bpf:bpf_map_update_elem: map type= 
ufd=4 key=[00 00 00 00] val=[07 00 00 00]


Looking at the above output and tracepoint kernel code, we should
extend that with a map_id to easily identify/filter what map you are
interested in.

See patch below signature (not even compile tested).

Example for attaching to tracepoints see:
 samples/bpf/xdp_monitor_*.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

tracepoint: add map id to bpf tracepoints

From: Jesper Dangaard Brouer 


---
 include/trace/events/bpf.h |   12 +++++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/bpf.h b/include/trace/events/bpf.h
index 150185647e6b..e6479ba45261 100644
--- a/include/trace/events/bpf.h
+++ b/include/trace/events/bpf.h
@@ -140,7 +140,7 @@ TRACE_EVENT(bpf_map_create,
 		__entry->flags   = map->map_flags;
 		__entry->ufd     = ufd;
 	),
-
+// TODO also add map_id here
 	TP_printk("map type=%s ufd=%d key=%u val=%u max=%u flags=%x",
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd, __entry->size_key, __entry->size_value,
@@ -199,15 +199,18 @@ DECLARE_EVENT_CLASS(bpf_obj_map,
 		__field(u32, type)
 		__field(int, ufd)
 		__string(path, pname->name)
+		__field(u32, map_id)
 	),
 
 	TP_fast_assign(
 		__assign_str(path, pname->name);
 		__entry->type = map->map_type;
 		__entry->ufd  = ufd;
+		__entry->map_id = map->id;
 	),
 
-	TP_printk("map type=%s ufd=%d path=%s",
+	TP_printk("map id=%u type=%s ufd=%d path=%s",
+		  __entry->map_id,
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd, __get_str(path))
 );
@@ -244,6 +247,7 @@ DECLARE_EVENT_CLASS(bpf_map_keyval,
 		__dynamic_array(u8, val, map->value_size)
 		__field(bool, val_trunc)
 		__field(int, ufd)
+		__field(u32, map_id)
 	),
 
 	TP_fast_assign(
@@ -255,9 +259,11 @@ DECLARE_EVENT_CLASS(bpf_map_keyval,
 		__entry->val_len   = min(map->value_size, 16U);
 		__entry->val_trunc = map->value_size != __entry->val_len;
 		__entry->ufd       = ufd;
+		__entry->map_id    = map->id;
 	),
 
-	TP_printk("map type=%s ufd=%d key=[%s%s] val=[%s%s]",
+	TP_printk("map id=%d type=%s ufd=%d key=[%s%s] val=[%s%s]",
+		  __entry->map_id,
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd,
 		  __print_hex(__get_dynamic_array(key), __entry->key_len),
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP router exam - bpf create map failed error.

2018-01-28 Thread Jesper Dangaard Brouer via iovisor-dev
On Sun, 28 Jan 2018 17:29:56 +0300
"Mr. Haritsu via iovisor-dev"  wrote:

> Hi Guys,
> 
> I'm beginner of XDP and BPF. I'm working also XDP exams and prototype-kernel.
> xdp_ddos01_blacklist exam it worked and xdp dropped packets.
> 
> But xdp_router_ipv4 isn't working.
> 
> root@xdptester:~/samples/bpf# ./xdp_router_ipv4 6 7
> failed to create a map: 1 Operation not permitted
> 
> Does anyone have this problem?

It is frustrating that people keep hitting this, and cannot figure out
what the issue is...

This is because the kernel returns EPERM and not ENOMEM ... if the
kernel had returned an "out of memory" error, I bet you could have
figured out that you need to increase the "locked memory" limit
(RLIMIT_MEMLOCK) via the setrlimit() function call.

https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html

http://man7.org/linux/man-pages/man3/setrlimit.3p.html
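
For reference, a minimal sketch of raising the limit, following the
same pattern the samples/bpf loaders use:

#include <sys/resource.h>

/* Raise RLIMIT_MEMLOCK before creating BPF maps; without this,
 * bpf_create_map() can fail with EPERM once the default locked-memory
 * limit is exhausted. */
static int raise_memlock_rlimit(void)
{
	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

	return setrlimit(RLIMIT_MEMLOCK, &r);
}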

Can I ask: What school is giving exams/assignments in XDP?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Is there information about eBPF + OVS ?

2017-12-27 Thread Jesper Dangaard Brouer via iovisor-dev
On Wed, 27 Dec 2017 16:58:06 +0900 Heung Sik Choi via iovisor-dev 
 wrote:

> Hi, I need help and send you an mail.
> 
> I recently found that there was an OVS using eBPF.
> (http://openvswitch.org/support/ovscon2016/7/1120-tu.pdf)
> 
> However, I can not find any information on how to install it.
> 
> If you are using eBPF + OVS, please let me know if you have any insights
> about it.

I'll redirect that question to William Tu (Cc'ed).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [PATCH, bpf-llvm] Fix bug on silently truncating 64-bit immediate

2017-10-12 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 12 Oct 2017 06:21:30 -0400
Jiong Wang  wrote:

> We came across an llvm bug when compiling some testcases that 64-bit
> immediates are silently truncated into 32-bit and then packed into
> BPF_JMP | BPF_K encoding.  This caused comparison with wrong value.

I think you sent this to the wrong mailing list... this looks like a
patch against the LLVM source code.

Shouldn't you send to: llvm-...@lists.llvm.org ?
 
> This bug looks to be introduced by r308080.  

This looks like a very recent change:
  https://llvm.org/viewvc/llvm-project?view=revision&revision=308080
  Sat Jul 15 05:41:42 2017 UTC (2 months, 4 weeks ago)

(As you are sending this to a user mailing list: xdp-newb...@vger.kernel.org)

I want to know if this has made it into an LLVM release, and which release?

As the git repo[1] replica of the LLVM SVN repo does not git-tag the
releases, I could not answer this question myself via the command:

$ git describe --contains 7c423e0690

[1] http://llvm.org/git/llvm.git

> The Select_Ri pattern is
> supposed to be lowered into J*_Ri while the latter only support 32-bit
> immediate encoding, therefore Select_Ri should have similar immediate
> predicate check as what J*_Ri are doing.
> 
> Reported-by: Jakub Kicinski 
> Signed-off-by: Jiong Wang 
> ---
>  lib/Target/BPF/BPFISelLowering.cpp |  8 ++--
>  lib/Target/BPF/BPFInstrInfo.td |  2 +-
>  test/CodeGen/BPF/select_ri.ll  | 35 +++
>  3 files changed, 42 insertions(+), 3 deletions(-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] eBPF prog getting rejected with devmap (and cpumap)

2017-09-21 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 21 Sep 2017 22:06:27 +0200
Daniel Borkmann  wrote:

> On 09/21/2017 10:02 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
> > Hi Daniel,
> >
> > My very simple program is getting rejected on bpf loading time.  When
> > using another map as input for a decision, for map type BPF_MAP_TYPE_DEVMAP.
> >
> > SEC("xdp_redirect_map_rr")
> > int xdp_prog_redirect_map_rr(struct xdp_md *ctx)
> > {
> > void *data_end = (void *)(long)ctx->data_end;
> > void *data = (void *)(long)ctx->data;
> > struct ethhdr *eth = data;
> > int vport = 0;
> > u32 key = 0;
> > long *value;
> >
> > // count packet in global counter
> > value = bpf_map_lookup_elem(&rxcnt, &key);
> > if (value)
> > *value += 1;
> >
> > /* Strange, verifier reject this (with LLVM version 3.9.1) */
> > vport = *value % 2;  
> 
> Here it's dereferenced where above it could have been NULL
> from the lookup, so verifier is correct. You'd need to place
> it into the if clause above or just do 'if (!value) return XDP_ABORTED'
> or the like.

Ah, I see!  Thanks for the quick answer and solution :-)
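
For the archive, the relevant lines with the fix Daniel describes would
look roughly like this (a sketch, not compile tested):

	// count packet in global counter
	value = bpf_map_lookup_elem(&rxcnt, &key);
	if (!value)
		return XDP_ABORTED;	/* lookup may return NULL */
	*value += 1;

	vport = *value % 2;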

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] eBPF prog getting rejected with devmap (and cpumap)

2017-09-21 Thread Jesper Dangaard Brouer via iovisor-dev
Hi Daniel,

My very simple program is getting rejected at bpf load time, when
using another map as input for a decision, for map type BPF_MAP_TYPE_DEVMAP.

SEC("xdp_redirect_map_rr")
int xdp_prog_redirect_map_rr(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	int vport = 0;
	u32 key = 0;
	long *value;

	// count packet in global counter
	value = bpf_map_lookup_elem(&rxcnt, &key);
	if (value)
		*value += 1;

	/* Strange, verifier reject this (with LLVM version 3.9.1) */
	vport = *value % 2;

	if (vport >= 10)
		return XDP_ABORTED;

	return bpf_redirect_map(&tx_port, vport, 0);
}

bpf_load_program() err=13
0: (b7) r6 = 0
1: (63) *(u32 *)(r10 -4) = r6
2: (bf) r2 = r10
3: (07) r2 += -4
4: (18) r1 = 0x880832682700
6: (85) call bpf_map_lookup_elem#1
7: (79) r2 = *(u64 *)(r0 +0)
R0 invalid mem access 'map_value_or_null'

Are we missing something in the verifier code for BPF_MAP_TYPE_DEVMAP?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Running Prototype-kernel package

2017-06-01 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 1 Jun 2017 18:59:34 +0200
Jesper Dangaard Brouer  wrote:

> I guess, I'll update the documentation a bit after your feedback,

Done, updated the documentation in the prototype-kernel github repo.

See three top commits in this link:
 https://github.com/netoptimizer/prototype-kernel/commits/7db5539438fa7

The prototype-kernel doc is rendered here:
 https://prototype-kernel.readthedocs.io/en/latest/index.html
 https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/index.html
 
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Running Prototype-kernel package

2017-06-01 Thread Jesper Dangaard Brouer via iovisor-dev

On Thu, 1 Jun 2017 17:45:03 +0300 Adel Fuchs  wrote:

> Thank you so much for your help. I need to run an eBPF programs that
> filters packets according to IP address or port. I'm trying to run the
> prototype-kernel package but encounter problems.

The prototype-kernel github repo[1] was actually originally meant for
developing real kernel modules.  I guess I should/could have placed the
eBPF/XDP examples[3] in another github repo.

[1] https://github.com/netoptimizer/prototype-kernel
[2] 
http://netoptimizer.blogspot.dk/2014/11/announce-github-repo-prototype-kernel.html
[3] 
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf

> I went over the "getting started" tutorial, but I couldn'f find a
> tutorial which explains how to actually run the samples. So this is
> what I did:
> 
> clang -O2 -Wall -target bpf -c xdp_ddos01_blacklist_cmdline.c -o
> xdp_ddos01_blacklist_cmdline

Don't do this... as you found out, all the include paths are screwed up
this way.  There is a Makefile in the directory[3] kernel/samples/bpf,
so simply chdir into this dir and run 'make', just like you would do
in the kernel tree[4].

[4] https://github.com/torvalds/linux/blob/master/samples/bpf/README.rst

[...] 
> Could you please help me run these samples? Could you please guide me
> to the right sample that I need and tell me how to run it?
> Note that I was able to run a simple XDP program on this computer, so
> I think that it is configured all right...

I've also been touring with a tutorial talk, see latest here[5]

[5] 
http://people.netfilter.org/hawk/presentations/LLC2017/XDP_DDoS_protecting_LLC2017.pdf

I guess, I'll update the documentation a bit after your feedback,
thanks.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Error with printk and bpf_trace_printk

2017-06-01 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 1 Jun 2017 14:10:01 +0300
Adel Fuchs  wrote:

> Hi Jesper,
> I tried adding your solution, bpf_debug,  and I'm now able to run the
> program with no errors but the trace_pipe file stays empty.
> I just added this to my program:
> 
> #ifdef DEBUG
> /* Only use this for debug output. Notice output from bpf_trace_printk()
> * end-up in /sys/kernel/debug/tracing/trace_pipe
> */
> #define bpf_debug(fmt, ...) \
> ({ \
> char ____fmt[] = fmt; \
> bpf_trace_printk(____fmt, sizeof(____fmt), \
> ##__VA_ARGS__); \
> })
> #else
> #define bpf_debug(fmt, ...) { } while (0)
> #endif
> 
> 
> And added a printing command:
> bpf_debug("hi");
> 
> Do you know what's the problem?

I assume you are doing:
 sudo cat /sys/kernel/debug/tracing/trace_pipe

The problem could be that the kernel needs to be compiled with the right
trace config options... does anybody on the list(s) know?
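
Another thing worth double-checking, directly from the macro quoted
above: DEBUG must be defined at compile time, otherwise bpf_debug()
expands to nothing and trace_pipe stays empty. A minimal usage sketch
(assuming the bpf_debug macro above and the usual samples/bpf headers):

#define DEBUG 1		/* must be defined before the #ifdef DEBUG block */

SEC("xdp_debug_example")
int xdp_debug_example_prog(struct xdp_md *ctx)
{
	bpf_debug("xdp: got a packet\n");
	return XDP_PASS;
}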

(p.s. removed netdev list cc)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



> On Tue, May 30, 2017 at 3:24 PM, Jesper Dangaard Brouer
>  wrote:
> >
> > Notice, there are two mailing lists (Cc'ed) that you should likely ask
> > these kind of questions on (instead of netdev), depending on if this is
> > mostly related to bpf (iovisor-dev@lists.iovisor.org) or somehow
> > related to XDP (xdp-newb...@vger.kernel.org).
> >
> > See my answer inlined below:
> >
> > On Sun, 28 May 2017 17:48:20 +0300 Adel Fuchs  wrote:
> >  
> >> I have a working eBPF program, and I'm trying to add outputs to it.
> >> I'm not able to use both printk and bpf_trace_printk functions. I get
> >> this error:
> >>
> >> ELF contains non-map related relo data in entry 0 pointing to section
> >> 8! Compiler bug?!
> >>
> >> Prog section 'ingress' rejected: Invalid argument (22)!
> >>  - Type: 3
> >>  - Instructions: 16 (0 over limit)
> >>  - License:  GPL
> >>
> >> Verifier analysis:
> >>
> >> 0: (bf) r6 = r1
> >> 1: (18) r1 = 0x0
> >> 3: (85) call bpf_unspec#0
> >> unknown func bpf_unspec#0
> >>
> >> Error fetching program/map!
> >> Failed to retrieve (e)BPF data!
> >>
> >> Are there certain "includes" that I need to add?
> >> In addition, I'm not sure I'm using the function correctly. I just
> >> wrote: printk("hi")  
> >
> > You obviously cannot call printk directly from and eBPF program.
> > I wonder how you got this compiling...
> >
> > As you hinted yourself, you should be using: bpf_trace_printk().
> > But it is actually tricky to use... and not much help is around to
> > figure this out.
> >
> > First of all the output end-up in this file: 
> > /sys/kernel/debug/tracing/trace_pipe
> > Remember to read the output use 'cat' like:
> >
> >  sudo cat /sys/kernel/debug/tracing/trace_pipe
> >
> > And only the first process to read the output gets the output...
> >
> >
> > I deduct you are using the TC/iproute2 examples:
> >  
> > https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/tree/examples/bpf
> >
> > Next gotcha is that, you need to provide the char* string in a very
> > special way to make this compile correctly.  The iproute2 provide a
> > helper define called "printt()" in include/bpf_api.h for this:
> >
> > #ifndef printt
> > # define printt(fmt, ...)   \
> > ({  \
> > char ____fmt[] = fmt;   \
> > trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__);  \
> > })
> > #endif
> >
> > Or see my solution here:
> > [1] 
> > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_ddos01_blacklist_kern.c#L86:L99
> >
> >
> > Another gotcha I've experienced is that if you format the string
> > incorrectly, or use a modifier like %X, which bpf_trace_printk() does
> > not seem to understand, then you "hear-nothing"...  Also experienced if
> > using more than 3 arguments, then it fails or also go silent. Be
> > careful when using this somewhat "flaky" debug facility.
> >
> > Do remember these bpf_trace_printk() should only be used for debugging,
> > as it is very slow...
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer  



___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Error with printk and bpf_trace_printk

2017-05-30 Thread Jesper Dangaard Brouer via iovisor-dev

Notice, there are two mailing lists (Cc'ed) that you should likely ask
these kind of questions on (instead of netdev), depending on if this is
mostly related to bpf (iovisor-dev@lists.iovisor.org) or somehow
related to XDP (xdp-newb...@vger.kernel.org).

See my answer inlined below:

On Sun, 28 May 2017 17:48:20 +0300 Adel Fuchs  wrote:

> I have a working eBPF program, and I'm trying to add outputs to it.
> I'm not able to use both printk and bpf_trace_printk functions. I get
> this error:
> 
> ELF contains non-map related relo data in entry 0 pointing to section
> 8! Compiler bug?!
> 
> Prog section 'ingress' rejected: Invalid argument (22)!
>  - Type: 3
>  - Instructions: 16 (0 over limit)
>  - License:  GPL
> 
> Verifier analysis:
> 
> 0: (bf) r6 = r1
> 1: (18) r1 = 0x0
> 3: (85) call bpf_unspec#0
> unknown func bpf_unspec#0
> 
> Error fetching program/map!
> Failed to retrieve (e)BPF data!
> 
> Are there certain "includes" that I need to add?
> In addition, I'm not sure I'm using the function correctly. I just
> wrote: printk("hi")

You obviously cannot call printk directly from an eBPF program.
I wonder how you got this compiling...

As you hinted yourself, you should be using: bpf_trace_printk().
But it is actually tricky to use... and not much help is around to
figure this out.

First of all, the output ends up in this file:
/sys/kernel/debug/tracing/trace_pipe
Remember, to read the output use 'cat' like:

 sudo cat /sys/kernel/debug/tracing/trace_pipe

And only the first process to read the output gets the output...


I deduce you are using the TC/iproute2 examples:
 
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/tree/examples/bpf

The next gotcha is that you need to provide the char* string in a very
special way to make this compile correctly.  iproute2 provides a
helper define called "printt()" in include/bpf_api.h for this:

#ifndef printt
# define printt(fmt, ...)					\
	({							\
		char ____fmt[] = fmt;				\
		trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__);	\
	})
#endif

Or see my solution here:
[1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_ddos01_blacklist_kern.c#L86:L99


Another gotcha I've experienced is that if you format the string
incorrectly, or use a modifier like %X, which bpf_trace_printk() does
not seem to understand, then you "hear nothing"...  I've also experienced
that using more than 3 arguments fails or also goes silent. Be
careful when using this somewhat "flaky" debug facility.

Do remember that bpf_trace_printk() should only be used for debugging,
as it is very slow...
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary

2017-03-07 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 7 Mar 2017 21:54:42 +0100
Jesper Dangaard Brouer  wrote:

> > I also suggest to replace it with 'llvm-objdump -h file.o'
> > since it prints the same info as 'readelf -SW'
> > but less verbose.  
> 
> Good idea to instead describe use of:
>   'llvm-objdump -h file.o'

Fixed:
 https://github.com/netoptimizer/prototype-kernel/commit/c5cf59ff9a9b

https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary

2017-03-07 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 7 Mar 2017 07:57:00 -0800
Alexei Starovoitov via iovisor-dev  wrote:

> On Tue, Mar 7, 2017 at 3:07 AM, Jesper Dangaard Brouer
>  wrote:
> > On Mon, 6 Mar 2017 16:14:06 -0800
> > Alexei Starovoitov via iovisor-dev  wrote:
> >  
> >> On Mon, Mar 6, 2017 at 10:29 AM, Jesper Dangaard Brouer
> >>  wrote:  
> >> > On Mon, 6 Mar 2017 09:11:46 -0800
> >> > Alexei Starovoitov via iovisor-dev  wrote:
> >> >  
> >> >> On Mon, Mar 6, 2017 at 3:53 AM, Jesper Dangaard Brouer
> >> >>  wrote:  
> >> >> > Hi All,
> >> >> >
> >> >> > I've added a section to my eBPF documentation, about how to read the
> >> >> > eBPF generated ELF binary, and deduct the size of the compiled program
> >> >> > (mostly for kernel/samples/bpf).
> >> >> >
> >> >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary
> >> >> > https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba
> >> >> >
> >> >> > Can someone validate what I've saying is true? (commit inlined below
> >> >> > sign, to make is as easy as possible for people to correct me).
> >> >> >
> >> >> > And anything else users can use the readelf output for?
> >> >> >
> >> >> > --
> >> >> > Best regards,
> >> >> >   Jesper Dangaard Brouer
> >> >> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >> >> >   LinkedIn: http://www.linkedin.com/in/brouer
> >> >> >
> >> >> > commit 079352102cb0ba12141ecd28c216ec5ac5290192
> >> >> > Author: Jesper Dangaard Brouer 
> >> >> > Date:   Fri Feb 24 12:10:45 2017 +0100
> >> >> >
> >> >> > doc: eBPF describe howto read the eBPF generated ELF binary
> >> >> >
> >> >> > Signed-off-by: Jesper Dangaard Brouer 
> >> >> >
> >> >> > diff --git a/kernel/Documentation/bpf/troubleshooting.rst 
> >> >> > b/kernel/Documentation/bpf/troubleshooting.rst
> >> >> > index b6f3b6fe9501..39ebffae4142 100644
> >> >> > --- a/kernel/Documentation/bpf/troubleshooting.rst
> >> >> > +++ b/kernel/Documentation/bpf/troubleshooting.rst
> >> >> > @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_.
> >> >> >  The ``bpf_create_map`` call will return errno EPERM (Operation not
> >> >> >  permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded.
> >> >> >
> >> >> > -
> >> >> >  .. _setrlimit(2): 
> >> >> > http://man7.org/linux/man-pages/man2/setrlimit.2.html
> >> >> >
> >> >> > +ELF binary
> >> >> > +==
> >> >> > +
> >> >> > +The binary containing the eBPF program, which got generated by the
> >> >> > +LLVM compiler, is an ELF binary.  For samples/bpf/ this is the file
> >> >> > +named xxx_kern.o.
> >> >> > +
> >> >> > +To answer questions like how big is my eBPF program, it is possible 
> >> >> > to
> >> >> > +use a tool like ``readelf``. ::
> >> >> > +
> >> >> > + $ readelf -SW xdp_ddos01_blacklist_kern.o
> >> >> > + There are 8 section headers, starting at offset 0x398:
> >> >> > +
> >> >> > + Section Headers:
> >> >> > +  [Nr] Name   Type   Address  OffSize   ES 
> >> >> > Flg Lk Inf Al
> >> >> > +  [ 0]NULL    00 00 00   
> >> >> >0   0  0
> >> >> > +  [ 1] .strtabSTRTAB  000320 72 00   
> >> >> >0   0  1
> >> >> > +  [ 2] .text  PROGBITS    40 00 00  
> >> >> > AX  0   0  4
> >> >> > +  [ 3] xdp_prog   PROGBITS    40 0001b8 00  
> >> >> > AX  0   0  8
> >> >> > +  [ 4] .relxdp_prog   REL 000300 20 10   
> >> >> >7   3  8
> >> >> > +  [ 5] maps   PROGBITS    0001f8 28 00  
> >> >> > WA  0   0  4
> >> >> > +  [ 6] licensePROGBITS    000220 04 00  
> >> >> > WA  0   0  1
> >> >> > +  [ 7] .symtabSYMTAB  000228 d8 18   
> >> >> >1   5  8
> >> >> > + Key to Flags:
> >> >> > +  W (write), A (alloc), X (execute), M (merge), S (strings)
> >> >> > +  I (info), L (link order), G (group), T (TLS), E (exclude), x 
> >> >> > (unknown)
> >> >> > +  O (extra OS processing required) o (OS specific), p (processor 
> >> >> > specific)
> >> >> > +
> >> >> > +From the output, we can see the programmer choose to name the XDP
> >> >> > +program section "xdp_prog".  From this line ([ 3]) the size column
> >> >> > +shows the size 0001b8 in hex, which is easily converted on the 
> >> >> > cmdline
> >> >> > +to 440 bytes::
> >> >> > +
> >> >> > + $ echo $((0x0001b8))
> >> >> > + 440  
> >> >>
> >> >> hmm. i think instead of clarifying bpf elf output is doing the opposite.
> >> >> at least i'm completely lost in the above text.
> >> >> What all of the above suppose to mean?  
> >> >
> >> > That what I'm implicit asking you ;-)
> >> >  
> >> >> why is it useful to do readelf? and look at hex ?  
> >> >
> >> > What other tools exists to help me understand the contents of the LLVM
> >> > compiled binary _kern.o ?  
> >>
> >> The best is to use
> >> llvm-objdump -S prog_kern.o  
> >
> > Hmmm, what version of LLVM have the -S option?
> >
> > $ llvm

Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary

2017-03-07 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 7 Mar 2017 12:07:40 +0100
Jesper Dangaard Brouer via iovisor-dev  wrote:

> On Mon, 6 Mar 2017 16:14:06 -0800
> Alexei Starovoitov via iovisor-dev  wrote:
> 
> > On Mon, Mar 6, 2017 at 10:29 AM, Jesper Dangaard Brouer
> >  wrote:  
> > > On Mon, 6 Mar 2017 09:11:46 -0800
> > > Alexei Starovoitov via iovisor-dev  wrote:
> > >
> > >> On Mon, Mar 6, 2017 at 3:53 AM, Jesper Dangaard Brouer
> > >>  wrote:
> > >> > Hi All,
> > >> >
> > >> > I've added a section to my eBPF documentation, about how to read the
> > >> > eBPF generated ELF binary, and deduct the size of the compiled program
> > >> > (mostly for kernel/samples/bpf).
> > >> >
> > >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary
> > >> > https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba
> > >> >
> > >> > Can someone validate what I've saying is true? (commit inlined below
> > >> > sign, to make is as easy as possible for people to correct me).
> > >> >
> > >> > And anything else users can use the readelf output for?
> > >> >
> > >> > --
> > >> > Best regards,
> > >> >   Jesper Dangaard Brouer
> > >> >   MSc.CS, Principal Kernel Engineer at Red Hat
> > >> >   LinkedIn: http://www.linkedin.com/in/brouer
> > >> >
> > >> > commit 079352102cb0ba12141ecd28c216ec5ac5290192
> > >> > Author: Jesper Dangaard Brouer 
> > >> > Date:   Fri Feb 24 12:10:45 2017 +0100
> > >> >
> > >> > doc: eBPF describe howto read the eBPF generated ELF binary
> > >> >
> > >> > Signed-off-by: Jesper Dangaard Brouer 
> > >> >
> > >> > diff --git a/kernel/Documentation/bpf/troubleshooting.rst 
> > >> > b/kernel/Documentation/bpf/troubleshooting.rst
> > >> > index b6f3b6fe9501..39ebffae4142 100644
> > >> > --- a/kernel/Documentation/bpf/troubleshooting.rst
> > >> > +++ b/kernel/Documentation/bpf/troubleshooting.rst
> > >> > @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_.
> > >> >  The ``bpf_create_map`` call will return errno EPERM (Operation not
> > >> >  permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded.
> > >> >
> > >> > -
> > >> >  .. _setrlimit(2): 
> > >> > http://man7.org/linux/man-pages/man2/setrlimit.2.html
> > >> >
> > >> > +ELF binary
> > >> > +==
> > >> > +
> > >> > +The binary containing the eBPF program, which got generated by the
> > >> > +LLVM compiler, is an ELF binary.  For samples/bpf/ this is the file
> > >> > +named xxx_kern.o.
> > >> > +
> > >> > +To answer questions like how big is my eBPF program, it is possible to
> > >> > +use a tool like ``readelf``. ::
> > >> > +
> > >> > + $ readelf -SW xdp_ddos01_blacklist_kern.o
> > >> > + There are 8 section headers, starting at offset 0x398:
> > >> > +
> > >> > + Section Headers:
> > >> > +  [Nr] Name   Type   Address  OffSize   ES 
> > >> > Flg Lk Inf Al
> > >> > +  [ 0]NULL    00 00 00
> > >> >   0   0  0
> > >> > +  [ 1] .strtabSTRTAB  000320 72 00
> > >> >   0   0  1
> > >> > +  [ 2] .text  PROGBITS    40 00 00  
> > >> > AX  0   0  4
> > >> > +  [ 3] xdp_prog   PROGBITS    40 0001b8 00  
> > >> > AX  0   0  8
> > >> > +  [ 4] .relxdp_prog   REL 000300 20 10
> > >> >   7   3  8
> > >> > +  [ 5] maps   PROGBITS    0001f8 28 00  
> > >> > WA  0   0  4
> > >> > +  [ 6] licensePROGBITS    000220 04 00  
> > >> > WA  0   0  1
> > >> > +  [ 7] .symtabSYMTAB  000228 d8 18
> > >> >   1   5  8
> > >> > + Key to Flags:
> > >> > +  W (write), A (alloc), X (execute), M (merge), S (strings)
> > >> > +  I (info), L (lin

Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary

2017-03-07 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 07 Mar 2017 12:40:03 +0100
Daniel Borkmann  wrote:

> On 03/07/2017 12:07 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
> > On Mon, 6 Mar 2017 16:14:06 -0800
> > Alexei Starovoitov via iovisor-dev  wrote:  
> [...]
> >> that's practically impossible to know in advance, since hardening and
> >> start address randomization will play a role.
> >> Or use sysctl net.core.bpf_jit_enable=2
> >> at load time which gives raw x86 hex.  
> >
> > The sysctl adjusting sounds interesting:
> >
> >   sysctl net.core.bpf_jit_enable=2
> >
> > What is below "proglen=335" the JIT'ed asm-code size?  
> 
> It's in bpf_jit_dump(): proglen is the len of opcode sequence generated
> and flen is the number of bpf insns. You can use tools/net/bpf_jit_disasm.c
> to disassemble that output. bpf_jit_disasm -o will dump the related opcodes
> as well.

Thanks for input, added:
 https://github.com/netoptimizer/prototype-kernel/commit/0b31532f42cd8

> > flen=55 proglen=335 pass=4 image=a0006820 from=xdp_ddos01_blac 
> > pid=1
> > JIT code: : 55 48 89 e5 48 81 ec 28 02 00 00 48 89 9d d8 fd
> > JIT code: 0010: ff ff 4c 89 ad e0 fd ff ff 4c 89 b5 e8 fd ff ff
> > JIT code: 0020: 4c 89 bd f0 fd ff ff 31 c0 48 89 85 f8 fd ff ff
> > JIT code: 0030: bb 02 00 00 00 48 8b 77 08 48 8b 7f 00 48 89 fa
> > JIT code: 0040: 48 83 c2 0e 48 39 f2 0f 87 e1 00 00 00 48 0f b6
> > JIT code: 0050: 4f 0c 48 0f b6 57 0d 48 c1 e2 08 48 09 ca 48 89
> > JIT code: 0060: d1 48 81 e1 ff 00 00 00 41 b8 06 00 00 00 49 39
> > JIT code: 0070: c8 0f 87 b7 00 00 00 48 81 fa 88 a8 00 00 74 0e
> > JIT code: 0080: b9 0e 00 00 00 48 81 fa 81 00 00 00 75 1a 48 89
> > JIT code: 0090: fa 48 83 c2 12 48 39 f2 0f 87 90 00 00 00 b9 12
> > JIT code: 00a0: 00 00 00 48 0f b7 57 10 bb 02 00 00 00 48 81 e2
> > JIT code: 00b0: ff ff 00 00 48 83 fa 08 75 49 48 01 cf 31 db 48
> > JIT code: 00c0: 89 fa 48 83 c2 14 48 39 f2 77 38 8b 7f 0c 89 7d
> > JIT code: 00d0: fc 48 89 ee 48 83 c6 fc 48 bf 00 9c 24 5f 07 88
> > JIT code: 00e0: ff ff e8 29 cd 13 e1 bb 02 00 00 00 48 83 f8 00
> > JIT code: 00f0: 74 11 48 8b 78 00 48 83 c7 01 48 89 78 00 bb 01
> > JIT code: 0100: 00 00 00 89 5d f8 48 89 ee 48 83 c6 f8 48 bf c0
> > JIT code: 0110: 76 12 13 04 88 ff ff e8 f4 cc 13 e1 48 83 f8 00
> > JIT code: 0120: 74 0c 48 8b 78 00 48 83 c7 01 48 89 78 00 48 89
> > JIT code: 0130: d8 48 8b 9d d8 fd ff ff 4c 8b ad e0 fd ff ff 4c
> > JIT code: 0140: 8b b5 e8 fd ff ff 4c 8b bd f0 fd ff ff c9 c3
> >
> > $ echo $((0x140))
> > 320
> > $ echo $((0x140 + 15))
> > 335
> >
> > Using same example code, thus BPF instructions-size was 440 bytes, so
> > the JIT'ed code size does get smaller.
> >  
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary

2017-03-07 Thread Jesper Dangaard Brouer via iovisor-dev
On Mon, 6 Mar 2017 16:14:06 -0800
Alexei Starovoitov via iovisor-dev  wrote:

> On Mon, Mar 6, 2017 at 10:29 AM, Jesper Dangaard Brouer
>  wrote:
> > On Mon, 6 Mar 2017 09:11:46 -0800
> > Alexei Starovoitov via iovisor-dev  wrote:
> >  
> >> On Mon, Mar 6, 2017 at 3:53 AM, Jesper Dangaard Brouer
> >>  wrote:  
> >> > Hi All,
> >> >
> >> > I've added a section to my eBPF documentation, about how to read the
> >> > eBPF generated ELF binary, and deduct the size of the compiled program
> >> > (mostly for kernel/samples/bpf).
> >> >
> >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary
> >> > https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba
> >> >
> >> > Can someone validate what I've saying is true? (commit inlined below
> >> > sign, to make is as easy as possible for people to correct me).
> >> >
> >> > And anything else users can use the readelf output for?
> >> >
> >> > --
> >> > Best regards,
> >> >   Jesper Dangaard Brouer
> >> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >> >   LinkedIn: http://www.linkedin.com/in/brouer
> >> >
> >> > commit 079352102cb0ba12141ecd28c216ec5ac5290192
> >> > Author: Jesper Dangaard Brouer 
> >> > Date:   Fri Feb 24 12:10:45 2017 +0100
> >> >
> >> > doc: eBPF describe howto read the eBPF generated ELF binary
> >> >
> >> > Signed-off-by: Jesper Dangaard Brouer 
> >> >
> >> > diff --git a/kernel/Documentation/bpf/troubleshooting.rst 
> >> > b/kernel/Documentation/bpf/troubleshooting.rst
> >> > index b6f3b6fe9501..39ebffae4142 100644
> >> > --- a/kernel/Documentation/bpf/troubleshooting.rst
> >> > +++ b/kernel/Documentation/bpf/troubleshooting.rst
> >> > @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_.
> >> >  The ``bpf_create_map`` call will return errno EPERM (Operation not
> >> >  permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded.
> >> >
> >> > -
> >> >  .. _setrlimit(2): http://man7.org/linux/man-pages/man2/setrlimit.2.html
> >> >
> >> > +ELF binary
> >> > +==
> >> > +
> >> > +The binary containing the eBPF program, which got generated by the
> >> > +LLVM compiler, is an ELF binary.  For samples/bpf/ this is the file
> >> > +named xxx_kern.o.
> >> > +
> >> > +To answer questions like how big is my eBPF program, it is possible to
> >> > +use a tool like ``readelf``. ::
> >> > +
> >> > + $ readelf -SW xdp_ddos01_blacklist_kern.o
> >> > + There are 8 section headers, starting at offset 0x398:
> >> > +
> >> > + Section Headers:
> >> > +  [Nr] Name   Type   Address  OffSize   ES Flg 
> >> > Lk Inf Al
> >> > +  [ 0]NULL    00 00 00  
> >> > 0   0  0
> >> > +  [ 1] .strtabSTRTAB  000320 72 00  
> >> > 0   0  1
> >> > +  [ 2] .text  PROGBITS    40 00 00  AX  
> >> > 0   0  4
> >> > +  [ 3] xdp_prog   PROGBITS    40 0001b8 00  AX  
> >> > 0   0  8
> >> > +  [ 4] .relxdp_prog   REL 000300 20 10  
> >> > 7   3  8
> >> > +  [ 5] maps   PROGBITS    0001f8 28 00  WA  
> >> > 0   0  4
> >> > +  [ 6] licensePROGBITS    000220 04 00  WA  
> >> > 0   0  1
> >> > +  [ 7] .symtabSYMTAB  000228 d8 18  
> >> > 1   5  8
> >> > + Key to Flags:
> >> > +  W (write), A (alloc), X (execute), M (merge), S (strings)
> >> > +  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
> >> > +  O (extra OS processing required) o (OS specific), p (processor 
> >> > specific)
> >> > +
> >> > +From the output, we can see the programmer choose to name the XDP
> >> > +program section "xdp_prog".  From this line ([ 3]) the size column
> >> > +shows the size 0001b8 in hex, which is easily converted on the cmdline
> >> > +to 440 bytes::
> >> > +
> >> > + $ echo $((0x0001b8))
> >> > + 440  
> >>
> >> hmm. i think instead of clarifying bpf elf output is doing the opposite.
> >> at least i'm completely lost in the above text.
> >> What all of the above suppose to mean?  
> >
> > That what I'm implicit asking you ;-)
> >  
> >> why is it useful to do readelf? and look at hex ?  
> >
> > What other tools exists to help me understand the contents of the LLVM
> > compiled binary _kern.o ?  
> 
> The best is to use
> llvm-objdump -S prog_kern.o

Hmmm, what version of LLVM has the -S option?

$ llvm-objdump -S xdp_ddos01_blacklist_kern.o
llvm-objdump: Unknown command line argument '-S'.  Try: 'llvm-objdump -help'
llvm-objdump: Did you mean '-D'?

$ llvm-objdump --version | egrep 'version|bpf'
  LLVM version 3.8.1
bpf- BPF (host endian)
bpfeb  - BPF (big endian)
bpfel  - BPF (little endian)

I would really like a command-line tool to inspect the eBPF ELF program
with, one that is available in a distro version, in this case Fedora 25.


> it will show section names, asm code and original C code
> if compiled with -g
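
For reference, a minimal sketch of that workflow, assuming an LLVM newer
than the 3.8.1 shown above and a hypothetical source file xdp_prog_kern.c
(error handling and Makefile integration omitted):

 $ clang -O2 -g -target bpf -c xdp_prog_kern.c -o xdp_prog_kern.o
 $ llvm-objdump -S xdp_prog_kern.o

The -g is what gives llvm-objdump the debug info it needs to interleave
the original C source lines with the eBPF asm.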

Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary

2017-03-06 Thread Jesper Dangaard Brouer via iovisor-dev
On Mon, 6 Mar 2017 09:11:46 -0800
Alexei Starovoitov via iovisor-dev  wrote:

> On Mon, Mar 6, 2017 at 3:53 AM, Jesper Dangaard Brouer
>  wrote:
> > Hi All,
> >
> > I've added a section to my eBPF documentation, about how to read the
> > eBPF generated ELF binary, and deduct the size of the compiled program
> > (mostly for kernel/samples/bpf).
> >
> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary
> > https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba
> >
> > Can someone validate what I've saying is true? (commit inlined below
> > sign, to make is as easy as possible for people to correct me).
> >
> > And anything else users can use the readelf output for?
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >
> > commit 079352102cb0ba12141ecd28c216ec5ac5290192
> > Author: Jesper Dangaard Brouer 
> > Date:   Fri Feb 24 12:10:45 2017 +0100
> >
> > doc: eBPF describe howto read the eBPF generated ELF binary
> >
> > Signed-off-by: Jesper Dangaard Brouer 
> >
> > diff --git a/kernel/Documentation/bpf/troubleshooting.rst 
> > b/kernel/Documentation/bpf/troubleshooting.rst
> > index b6f3b6fe9501..39ebffae4142 100644
> > --- a/kernel/Documentation/bpf/troubleshooting.rst
> > +++ b/kernel/Documentation/bpf/troubleshooting.rst
> > @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_.
> >  The ``bpf_create_map`` call will return errno EPERM (Operation not
> >  permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded.
> >
> > -
> >  .. _setrlimit(2): http://man7.org/linux/man-pages/man2/setrlimit.2.html
> >
> > +ELF binary
> > +==
> > +
> > +The binary containing the eBPF program, which got generated by the
> > +LLVM compiler, is an ELF binary.  For samples/bpf/ this is the file
> > +named xxx_kern.o.
> > +
> > +To answer questions like how big is my eBPF program, it is possible to
> > +use a tool like ``readelf``. ::
> > +
> > + $ readelf -SW xdp_ddos01_blacklist_kern.o
> > + There are 8 section headers, starting at offset 0x398:
> > +
> > + Section Headers:
> > +  [Nr] Name   Type   Address  OffSize   ES Flg Lk 
> > Inf Al
> > +  [ 0]NULL    00 00 00  0  
> >  0  0
> > +  [ 1] .strtabSTRTAB  000320 72 00  0  
> >  0  1
> > +  [ 2] .text  PROGBITS    40 00 00  AX  0  
> >  0  4
> > +  [ 3] xdp_prog   PROGBITS    40 0001b8 00  AX  0  
> >  0  8
> > +  [ 4] .relxdp_prog   REL 000300 20 10  7  
> >  3  8
> > +  [ 5] maps   PROGBITS    0001f8 28 00  WA  0  
> >  0  4
> > +  [ 6] licensePROGBITS    000220 04 00  WA  0  
> >  0  1
> > +  [ 7] .symtabSYMTAB  000228 d8 18  1  
> >  5  8
> > + Key to Flags:
> > +  W (write), A (alloc), X (execute), M (merge), S (strings)
> > +  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
> > +  O (extra OS processing required) o (OS specific), p (processor specific)
> > +
> > +From the output, we can see the programmer choose to name the XDP
> > +program section "xdp_prog".  From this line ([ 3]) the size column
> > +shows the size 0001b8 in hex, which is easily converted on the cmdline
> > +to 440 bytes::
> > +
> > + $ echo $((0x0001b8))
> > + 440  
> 
> hmm. i think instead of clarifying bpf elf output is doing the opposite.
> at least i'm completely lost in the above text.
> What all of the above suppose to mean?

That's what I'm implicitly asking you ;-)

> why is it useful to do readelf? and look at hex ?

What other tools exist to help me understand the contents of the
LLVM-compiled binary _kern.o?

> I've never done it myself.

Is there an easier way to answer my question:
Q: how big is my eBPF program?


I would actually (also) like to know the size of the JIT'ed asm code?
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] Describing howto read the eBPF generated ELF binary

2017-03-06 Thread Jesper Dangaard Brouer via iovisor-dev
(repost with subscribed email)
Hi All,

I've added a section to my eBPF documentation, about how to read the
eBPF-generated ELF binary, and deduce the size of the compiled program
(mostly for kernel/samples/bpf).

https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary
https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba

Can someone validate that what I'm saying is true? (commit inlined below
my signature, to make it as easy as possible for people to correct me).

And anything else users can use the readelf output for?

- - 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

commit 079352102cb0ba12141ecd28c216ec5ac5290192
Author: Jesper Dangaard Brouer 
Date:   Fri Feb 24 12:10:45 2017 +0100

doc: eBPF describe howto read the eBPF generated ELF binary

Signed-off-by: Jesper Dangaard Brouer 

diff --git a/kernel/Documentation/bpf/troubleshooting.rst 
b/kernel/Documentation/bpf/troubleshooting.rst
index b6f3b6fe9501..39ebffae4142 100644
--- a/kernel/Documentation/bpf/troubleshooting.rst
+++ b/kernel/Documentation/bpf/troubleshooting.rst
@@ -15,6 +15,41 @@ see system call `setrlimit(2)`_.
 The ``bpf_create_map`` call will return errno EPERM (Operation not
 permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded.
 
-
 .. _setrlimit(2): http://man7.org/linux/man-pages/man2/setrlimit.2.html
 
+ELF binary
+==
+
+The binary containing the eBPF program, which got generated by the
+LLVM compiler, is an ELF binary.  For samples/bpf/ this is the file
+named xxx_kern.o.
+
+To answer questions like how big is my eBPF program, it is possible to
+use a tool like ``readelf``. ::
+
+ $ readelf -SW xdp_ddos01_blacklist_kern.o
+ There are 8 section headers, starting at offset 0x398:
+
+ Section Headers:
+  [Nr] Name   Type   Address  OffSize   ES Flg Lk Inf 
Al
+  [ 0]NULL    00 00 00  0   0  0
+  [ 1] .strtabSTRTAB  000320 72 00  0   0  
1
+  [ 2] .text  PROGBITS    40 00 00  AX  0   0  
4
+  [ 3] xdp_prog   PROGBITS    40 0001b8 00  AX  0   0  
8
+  [ 4] .relxdp_prog   REL 000300 20 10  7   3  
8
+  [ 5] maps   PROGBITS    0001f8 28 00  WA  0   0  
4
+  [ 6] licensePROGBITS    000220 04 00  WA  0   0  
1
+  [ 7] .symtabSYMTAB  000228 d8 18  1   5  
8
+ Key to Flags:
+  W (write), A (alloc), X (execute), M (merge), S (strings)
+  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
+  O (extra OS processing required) o (OS specific), p (processor specific)
+
+From the output, we can see the programmer chose to name the XDP
+program section "xdp_prog".  From this line ([ 3]) the size column
+shows the size 0001b8 in hex, which is easily converted on the cmdline
+to 440 bytes::
+
+ $ echo $((0x0001b8))
+ 440
+

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] Describing howto read the eBPF generated ELF binary

2017-03-06 Thread Jesper Dangaard Brouer via iovisor-dev
Hi All,

I've added a section to my eBPF documentation, about how to read the
eBPF-generated ELF binary, and deduce the size of the compiled program
(mostly for kernel/samples/bpf).

https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary
https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba

Can someone validate that what I'm saying is true? (commit inlined below
my signature, to make it as easy as possible for people to correct me).

And anything else users can use the readelf output for?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

commit 079352102cb0ba12141ecd28c216ec5ac5290192
Author: Jesper Dangaard Brouer 
Date:   Fri Feb 24 12:10:45 2017 +0100

doc: eBPF describe howto read the eBPF generated ELF binary

Signed-off-by: Jesper Dangaard Brouer 

diff --git a/kernel/Documentation/bpf/troubleshooting.rst 
b/kernel/Documentation/bpf/troubleshooting.rst
index b6f3b6fe9501..39ebffae4142 100644
--- a/kernel/Documentation/bpf/troubleshooting.rst
+++ b/kernel/Documentation/bpf/troubleshooting.rst
@@ -15,6 +15,41 @@ see system call `setrlimit(2)`_.
 The ``bpf_create_map`` call will return errno EPERM (Operation not
 permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded.
 
-
 .. _setrlimit(2): http://man7.org/linux/man-pages/man2/setrlimit.2.html
 
+ELF binary
+==
+
+The binary containing the eBPF program, which got generated by the
+LLVM compiler, is an ELF binary.  For samples/bpf/ this is the file
+named xxx_kern.o.
+
+To answer questions like how big is my eBPF program, it is possible to
+use a tool like ``readelf``. ::
+
+ $ readelf -SW xdp_ddos01_blacklist_kern.o
+ There are 8 section headers, starting at offset 0x398:
+
+ Section Headers:
+  [Nr] Name          Type      Address   Off     Size    ES  Flg  Lk Inf Al
+  [ 0]               NULL                00      00      00        0   0  0
+  [ 1] .strtab       STRTAB              000320  72      00        0   0  1
+  [ 2] .text         PROGBITS            40      00      00  AX    0   0  4
+  [ 3] xdp_prog      PROGBITS            40      0001b8  00  AX    0   0  8
+  [ 4] .relxdp_prog  REL                 000300  20      10        7   3  8
+  [ 5] maps          PROGBITS            0001f8  28      00  WA    0   0  4
+  [ 6] license       PROGBITS            000220  04      00  WA    0   0  1
+  [ 7] .symtab       SYMTAB              000228  d8      18        1   5  8
+ Key to Flags:
+  W (write), A (alloc), X (execute), M (merge), S (strings)
+  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
+  O (extra OS processing required) o (OS specific), p (processor specific)
+
+From the output, we can see the programmer chose to name the XDP
+program section "xdp_prog".  From this line ([ 3]) the size column
+shows the size 0001b8 in hex, which is easily converted on the cmdline
+to 440 bytes::
+
+ $ echo $((0x0001b8))
+ 440
+
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Documentation on eBPF map types?

2017-02-03 Thread Jesper Dangaard Brouer via iovisor-dev
On Fri, 3 Feb 2017 07:43:33 -0800
Alexei Starovoitov  wrote:

> On Fri, Feb 3, 2017 at 5:52 AM, Jesper Dangaard Brouer
>  wrote:
> > On Thu, 2 Feb 2017 20:27:15 -0800
> > Alexei Starovoitov  wrote:
> >  
> >> On Thu, Feb 02, 2017 at 06:00:09PM +0100, Jesper Dangaard Brouer wrote:  
> > [...]  
> >> Comments below:  
> > [...]
> >  
> >> > +Interacting with maps
> >> > +=
> >> > +
> >> > +Interacting with an eBPF maps from **userspace**, happens through the
> >> > +`bpf`_ syscall and a file descriptor.  The kernel
> >> > +`tools/lib/bpf/bpf.h`_ define some ``bpf_map_*()`` helper functions
> >> > +for wrapping the `bpf_cmd`_ relating to manipulating the map elements.
> >> > +
> >> > +.. code-block:: c
> >> > +
> >> > +  enum bpf_cmd {
> >> > +   [...]
> >> > +   BPF_MAP_LOOKUP_ELEM,
> >> > +   BPF_MAP_UPDATE_ELEM,
> >> > +   BPF_MAP_DELETE_ELEM,
> >> > +   BPF_MAP_GET_NEXT_KEY,
> >> > +   [...]
> >> > +  };
> >> > +  /* Corresponding helper functions */
> >> > +  int bpf_map_lookup_elem(int fd, void *key, void *value);
> >> > +  int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags);
> >> > +  int bpf_map_delete_elem(int fd, void *key);
> >> > +  int bpf_map_get_next_key(int fd, void *key, void *next_key);
> >> > +
> >> > +Notice from userspace, there is no call to atomically increment or
> >> > +decrement the value 'in-place'. The bpf_map_update_elem() call will
> >> > +overwrite the existing value.  
> >>
> >> will overwrite in _atomic_ way. Meaning that the whole value will
> >> be replaced atomically regardless whether update() is called from
> >> kernel or from user space.  
> >
> > Looking at the memcpy code in array_map_update_elem(), I don't see
> > anything preventing two processes from data-update racing with
> > each-other, especially when the value_size is large.  
> 
> yes. my comment above applies to hash map only.

Okay, understood.  I did see locks in the hash map code, but I've not
documented that map type yet.
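
To illustrate the point being discussed, a minimal hedged sketch of the
userspace read-modify-write, using the helper names from
tools/lib/bpf/bpf.h; the map layout and the function name are hypothetical:

  #include <linux/bpf.h>   /* BPF_ANY */
  #include "bpf.h"         /* tools/lib/bpf/bpf.h helpers */

  /* Each process works on a private copy of the value, so two processes
   * doing this concurrently can lose increments; only the whole-value
   * overwrite done by bpf_map_update_elem() is atomic. */
  static void bump_counter(int map_fd, __u32 key)
  {
          long long value = 0;

          bpf_map_lookup_elem(map_fd, &key, &value);          /* copy out     */
          value++;                                            /* private copy */
          bpf_map_update_elem(map_fd, &key, &value, BPF_ANY); /* copy back    */
  }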

I've completely rewritten the section: "Interacting with maps"
See commit:
 https://github.com/netoptimizer/prototype-kernel/commit/9b6eba65171ce8d83f6
Or html:
 
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#interacting-with-maps

Also inlined here:

[PATCH] doc: ebpf rewrite section Interacting with maps

From: Jesper Dangaard Brouer 

Signed-off-by: Jesper Dangaard Brouer 
---
 kernel/Documentation/bpf/ebpf_maps.rst |  145 +---
 1 file changed, 112 insertions(+), 33 deletions(-)

diff --git a/kernel/Documentation/bpf/ebpf_maps.rst 
b/kernel/Documentation/bpf/ebpf_maps.rst
index 590e1ac72eb3..b549e4716501 100644
--- a/kernel/Documentation/bpf/ebpf_maps.rst
+++ b/kernel/Documentation/bpf/ebpf_maps.rst
@@ -112,13 +112,60 @@ files for defining the maps, but it uses another layout.  
See man-page
 Interacting with maps
 =
 
+Interacting with eBPF maps happens through some **lookup/update/delete**
+primitives.
+
+When writing eBPF programs, use the load helpers and libraries from
+samples/bpf/ and tools/lib/bpf/.  A common function name API has been
+created that hides the details of how kernel vs. userspace access
+these primitives (which is quite different).
+
+The common function names (parameters and return values differs):
+
+.. code-block:: c
+
+  void bpf_map_lookup_elem(map, void *key, ...);
+  void bpf_map_update_elem(map, void *key, ..., __u64 flags);
+  void bpf_map_delete_elem(map, void *key);
+
+The ``flags`` argument in ``bpf_map_update_elem()`` allows defining
+semantics on whether the element exists:
+
+.. code-block:: c
+
+  /* File: include/uapi/linux/bpf.h */
+  /* flags for BPF_MAP_UPDATE_ELEM command */
+  #define BPF_ANY  0 /* create new element or update existing */
+  #define BPF_NOEXIST  1 /* create new element only if it didn't exist */
+  #define BPF_EXIST2 /* only update existing element */
+
+Userspace
+-
+The userspace API map helpers are defined in `tools/lib/bpf/bpf.h`_
+and looks like this:
+
+.. code-block:: c
+
+  /* Userspace helpers */
+  int bpf_map_lookup_elem(int fd, void *key, void *value);
+  int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags);
+  int bpf_map_delete_elem(int fd, void *key);
+  /* Only userspace: */
+  int bpf_map_get_next_key(int fd, void *key, void *next_key);
+
+
 Interacting with an eBPF map from **userspace**, happens through the
-`bpf`_ syscall and a file descriptor.  The kernel
-`tools/lib/bpf/bpf.h`_ defines some ``bpf_map_*()`` helper functions
-for wrapping the `bpf_cmd`_ related to manipulating the map elements.
+`bpf`_ syscall and a file descriptor.  Notice how the map handle ``int
+fd`` is a file descriptor.  On success, zero is returned; on
+failure, -1 is returned and errno is set.
+
+Wrappers for the bpf syscall are implemented in `tools/lib/bpf/bpf.c`_,
+and ends up calling functions in `kernel/bpf/syscall.c`_, like
+`map_lookup_elem`_.
 
 .. code-block:: c
 
+  /* Corresponding sysca
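
To round off the userspace side, a minimal hedged sketch that iterates a
map with the helpers listed above; bpf_map_get_next_key() is the
userspace-only primitive, and the key/value layout here is hypothetical:

  #include <stdio.h>
  #include <linux/bpf.h>
  #include "bpf.h"   /* tools/lib/bpf/bpf.h helpers */

  /* Walk all keys in a map: start from a key that should not exist, so
   * bpf_map_get_next_key() returns the first real key, then step along. */
  static void dump_map(int map_fd)
  {
          __u32 key = -1, next_key;
          long long value;

          while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
                  if (bpf_map_lookup_elem(map_fd, &next_key, &value) == 0)
                          printf("key=%u value=%lld\n", next_key, value);
                  key = next_key;
          }
  }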

Re: [iovisor-dev] Documentation on eBPF map types?

2017-02-03 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 2 Feb 2017 20:27:15 -0800
Alexei Starovoitov  wrote:

> On Thu, Feb 02, 2017 at 06:00:09PM +0100, Jesper Dangaard Brouer wrote:
[...]
> Comments below:
[...] 

> > +Interacting with maps
> > +=
> > +
> > +Interacting with an eBPF maps from **userspace**, happens through the
> > +`bpf`_ syscall and a file descriptor.  The kernel
> > +`tools/lib/bpf/bpf.h`_ define some ``bpf_map_*()`` helper functions
> > +for wrapping the `bpf_cmd`_ relating to manipulating the map elements.
> > +
> > +.. code-block:: c
> > +
> > +  enum bpf_cmd {
> > +   [...]
> > +   BPF_MAP_LOOKUP_ELEM,
> > +   BPF_MAP_UPDATE_ELEM,
> > +   BPF_MAP_DELETE_ELEM,
> > +   BPF_MAP_GET_NEXT_KEY,
> > +   [...]
> > +  };
> > +  /* Corresponding helper functions */
> > +  int bpf_map_lookup_elem(int fd, void *key, void *value);
> > +  int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags);
> > +  int bpf_map_delete_elem(int fd, void *key);
> > +  int bpf_map_get_next_key(int fd, void *key, void *next_key);
> > +
> > +Notice from userspace, there is no call to atomically increment or
> > +decrement the value 'in-place'. The bpf_map_update_elem() call will
> > +overwrite the existing value.  
> 
> will overwrite in _atomic_ way. Meaning that the whole value will
> be replaced atomically regardless whether update() is called from
> kernel or from user space.

Looking at the memcpy code in array_map_update_elem(), I don't see
anything preventing two processes from data-update racing with
each other, especially when the value_size is large.


> Also please mention first that kernel can do atomic increment,
> before saying that user space cannot. Otherwise it's quite confusing.

True, noted, I'll rearrange.

> > The flags argument allows
> > +bpf_map_update_elem() define semantics on weather the element exist:
> > +
> > +.. code-block:: c
> > +
> > +  /* File: include/uapi/linux/bpf.h */
> > +  /* flags for BPF_MAP_UPDATE_ELEM command */
> > +  #define BPF_ANY  0 /* create new element or update existing */
> > +  #define BPF_NOEXIST  1 /* create new element if it didn't exist */
> > +  #define BPF_EXIST2 /* update existing element */
> > +
> > +The eBPF-program running "kernel-side" have almost the same primitives
> > +(lookup/update/delete) for interacting with the map, but it interact
> > +more directly with the map data structures. For example the call
> > +``bpf_map_lookup_elem()`` returns a direct pointer to the 'value'
> > +memory-element inside the kernel (while userspace gets a copy).  This
> > +allows the eBPF-program to atomically increment or decrement the value
> > +'in-place', by using appropiate compiler primitives like
> > +``__sync_fetch_and_add()``, which is understood by LLVM when
> > +generating eBPF instructions.  
> 
> ahh. so you do it here. may be move this paragraph on top ?

yes.
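
For reference, the kernel-side pattern that paragraph describes, as a
minimal sketch in samples/bpf style; the map name, sizes and section name
are hypothetical:

  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") my_map = {
          .type        = BPF_MAP_TYPE_ARRAY,
          .key_size    = sizeof(__u32),
          .value_size  = sizeof(long),
          .max_entries = 256,
  };

  /* Lookup returns a direct pointer into the map value, so the program
   * can increment in-place; LLVM turns __sync_fetch_and_add() into the
   * BPF XADD instruction. */
  SEC("xdp_prog")
  int xdp_count(struct xdp_md *ctx)
  {
          __u32 key = 0;
          long *value;

          value = bpf_map_lookup_elem(&my_map, &key);
          if (value)
                  __sync_fetch_and_add(value, 1);
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";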

> > +
> > +On the kernel side, implementing a map type requires defining some
> > +function (pointers) via `struct bpf_map_ops`_.  And eBPF programs have
> > +access to ``map_lookup_elem``, ``map_update_elem`` and
> > +``map_delete_elem``, which gets invoked from eBPF via bpf-helpers in  
> 
> this paragraph reads a bit out of place.
> I'm not sure what 'kernel implements the map' means.
> You mean 'access the map' ?

True, it is not very clear.  I'm talking about how a new map type is
implemented, and what functions it needs to implement. It is confusing;
I'll try to come up with something better.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Documentation on eBPF map types?

2017-02-03 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 2 Feb 2017 20:27:15 -0800
Alexei Starovoitov  wrote:

> On Thu, Feb 02, 2017 at 06:00:09PM +0100, Jesper Dangaard Brouer wrote:
> > 
> > [PATCH] doc: how interacting with eBPF maps works
> > 
> > Documented by reading the code.  
> 
> looks great!
> please submit the patch to net-next and cc Jonathan ?

Good idea, but I'll take this round of review first, before pushing it
to the kernel tree. 

I also want a build-env section.

Also considering splitting the "Types of maps"[1] section into a
separate file, as this avoids generating an overly long HTML page.

[1] 
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#types-of-maps

> Comments below:
> 
> > I hope someone more knowledgeable will review this and
> > correct me where I misunderstood things.
> > 
> > Signed-off-by: Jesper Dangaard Brouer 
> > ---
> >  kernel/Documentation/bpf/ebpf_maps.rst |  126 
> > ++--
> >  1 file changed, 120 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/Documentation/bpf/ebpf_maps.rst 
> > b/kernel/Documentation/bpf/ebpf_maps.rst
> > index 562edd566e0b..55068c7f3dab 100644
> > --- a/kernel/Documentation/bpf/ebpf_maps.rst
> > +++ b/kernel/Documentation/bpf/ebpf_maps.rst
> > @@ -23,13 +23,128 @@ and accessed by multiple programs (from man-page 
> > `bpf(2)`_):
> >   up to the user process and eBPF program to decide what they store
> >   inside maps.
> >  
> > +Creating a map
> > +==
> > +
> > +A maps is created based on a request from userspace, via the `bpf`_
> > +syscall (`bpf_cmd`_ BPF_MAP_CREATE), and returns a new file descriptor
> > +that refers to the map. These are the setup arguments when creating a
> > +map.
> > +
> > +.. code-block:: c
> > +
> > +  struct { /* anonymous struct used by BPF_MAP_CREATE command */
> > + __u32   map_type;   /* one of enum bpf_map_type */
> > + __u32   key_size;   /* size of key in bytes */
> > + __u32   value_size; /* size of value in bytes */
> > + __u32   max_entries;/* max number of entries in a map */
> > + __u32   map_flags;  /* prealloc or not */
> > +  };
> > +
> > +For programs under samples/bpf/ the ``load_bpf_file()`` call (from
> > +`samples/bpf/bpf_load`_) takes care of parsing elf file compiled by
^^^  

> > +LLVM, pickup 'maps' section and creates maps via BPF syscall.  This is
  ^^ ^

> > +done by defining a ``struct bpf_map_def`` with an elf section
> > +__attribute__ ``SEC("maps")``, in the xxx_kern.c file.  The maps file
> > +descriptor is available in the userspace xxx_user.c file, via global
> > +array variable ``map_fd[]``, and the array map index correspons to the
> > +order the maps sections were defined in elf file of xxx_kern.c file.
> > +
> > +.. code-block:: c
> > +
> > +  struct bpf_map_def {
> > +   unsigned int type;
> > +   unsigned int key_size;
> > +   unsigned int value_size;
> > +   unsigned int max_entries;
> > +   unsigned int map_flags;
> > +  };  
> 
> As Daniel said please mention that this is C program <-> elf loader
> convention. It's not a kernel abi.

I feel I did mention the ELF part above, but I guess I should try to make
it clearer, since you missed it.


> perf map defitions are different from tc and from samples/bpf/

I didn't know about TC maps.  Can you or Daniel please point me to some
TC eBPF code that sets up a map?  Then I can wrap my head around how it
does it... I think it would be valuable to have a section here
describing how TC manages maps.

See what I figured out myself in this commit:
 https://github.com/netoptimizer/prototype-kernel/commit/3640a48df52c

And the new TC section:
 
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#qdisc-traffic-control-convention

> iovisor/bcc is not using elf at all.

True (I do know that bcc doesn't use elf).
 
> > +
> > +  struct bpf_map_def SEC("maps") my_map = {
> > +   .type= BPF_MAP_TYPE_XXX,
> > +   .key_size= sizeof(u32),
> > +   .value_size  = sizeof(u64),
> > +   .max_entries = 42,
> > +   .map_flags   = 0
> > +  };
> > +
> > +.. _samples/bpf/bpf_load:
> > +   
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/bpf_load.c
> > +
> > +
> > +Interacting with maps
> > +=

I'll use another email and commit change to address this "Interacting
with maps" section...


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Documentation on eBPF map types?

2017-02-02 Thread Jesper Dangaard Brouer via iovisor-dev
On Sat, 28 Jan 2017 10:14:58 +1100
Brendan Gregg  wrote:

> I did some in the bcc ref guide, but it's incomplete, and the bcc versions:
> https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md

Hi Brendan,

I also included BCC in my documentation, please correct me:
 http://prototype-kernel.readthedocs.io/en/latest/bpf/bcc_tool_chain.html
 https://github.com/netoptimizer/prototype-kernel/commit/b61a7ecfe08d2cc0

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Documentation on eBPF map types?

2017-02-02 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 2 Feb 2017 11:56:19 +0100
Jesper Dangaard Brouer  wrote:

> On Tue, 31 Jan 2017 20:54:10 -0800
> Alexei Starovoitov  wrote:
> 
> > On Tue, Jan 31, 2017 at 9:54 AM, Jesper Dangaard Brouer via
> > iovisor-dev  wrote:  
> > >
> > > On Sat, 28 Jan 2017 10:14:58 +1100 Brendan Gregg 
> > >  wrote:
> > >
> > >> I did some in the bcc ref guide, but it's incomplete, and the bcc 
> > >> versions:
> > >> https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md
> > >
> > > Thanks! - this seem rather BCC specific syntax, and I'm looking for
> > > documentation close for the kernel (samples/bpf/).
> > >
> > > The best documentation I found was the man-page for the syscall bpf(2):
> > >  http://man7.org/linux/man-pages/man2/bpf.2.html
> > >
> > > In lack of a better place, I've started documenting eBPF here:
> > >  https://prototype-kernel.readthedocs.io/en/latest/bpf/index.html
> > >
> > > This doc is compatible with the kernels doc format, and I hope we can
> > > push this into the kernel tree, if it turns out to be valuable?
> > > (https://www.kernel.org/doc/html/latest/)
> > 
> > yeah. definitely would be great to add map descriptions to the kernel docs.
> > So far most of it is in commit logs.
> > git log kernel/bpf/arraymap.c|tail -33
> > git log kernel/bpf/hashtab.c|tail -33
> > will give an overview of key hash and array map principles.  
> 
> Thanks, I'm using that to write some doc.
>  http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html
> Gotten to BPF_MAP_TYPE_ARRAY
>  
> http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#bpf-map-type-array
>  
> 
> Can you explain the difference between the kernel and userspace side of
> the call bpf_map_lookup_elem() ?
> 
> Kernel side:
>   long *value = bpf_map_lookup_elem(&my_map, &key);
> 
> Userspace side:
>   long long value;
>   bpf_map_lookup_elem(map_fd[0], &key, &value)
> 
> Looks like userspace gets a copy of the memory...
> If so, how can userspace then increment the value safely?

I documented this myself, please correct me. 
 
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#creating-a-map
 
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#interacting-with-maps

Commit:
 https://github.com/netoptimizer/prototype-kernel/commit/ffbcdc453f3c

Inlined below:
 -

[PATCH] doc: how interacting with eBPF maps works

Documented by reading the code.

I hope someone more knowledgeable will review this and
correct me where I misunderstood things.

Signed-off-by: Jesper Dangaard Brouer 
---
 kernel/Documentation/bpf/ebpf_maps.rst |  126 ++--
 1 file changed, 120 insertions(+), 6 deletions(-)

diff --git a/kernel/Documentation/bpf/ebpf_maps.rst 
b/kernel/Documentation/bpf/ebpf_maps.rst
index 562edd566e0b..55068c7f3dab 100644
--- a/kernel/Documentation/bpf/ebpf_maps.rst
+++ b/kernel/Documentation/bpf/ebpf_maps.rst
@@ -23,13 +23,128 @@ and accessed by multiple programs (from man-page 
`bpf(2)`_):
  up to the user process and eBPF program to decide what they store
  inside maps.
 
+Creating a map
+==
+
+A map is created based on a request from userspace, via the `bpf`_
+syscall (`bpf_cmd`_ BPF_MAP_CREATE), and returns a new file descriptor
+that refers to the map. These are the setup arguments when creating a
+map.
+
+.. code-block:: c
+
+  struct { /* anonymous struct used by BPF_MAP_CREATE command */
+ __u32   map_type;   /* one of enum bpf_map_type */
+ __u32   key_size;   /* size of key in bytes */
+ __u32   value_size; /* size of value in bytes */
+ __u32   max_entries;/* max number of entries in a map */
+ __u32   map_flags;  /* prealloc or not */
+  };
+
+For programs under samples/bpf/ the ``load_bpf_file()`` call (from
+`samples/bpf/bpf_load`_) takes care of parsing elf file compiled by
+LLVM, pickup 'maps' section and creates maps via BPF syscall.  This is
+done by defining a ``struct bpf_map_def`` with an elf section
+__attribute__ ``SEC("maps")``, in the xxx_kern.c file.  The maps file
+descriptor is available in the userspace xxx_user.c file, via global
+array variable ``map_fd[]``, and the array map index corresponds to the
+order the maps sections were defined in the elf file of the xxx_kern.c file.
+
+.. code-block:: c
+
+  struct bpf_map_def {
+   unsigned int type;
+   unsigned int key_size;
+   unsigned int value_size;
+   unsigned int max_entries;
+   unsigned int map_flags;
+  };
+
+  struct bpf_map_def SEC("maps") my_map = {
+   .type= BPF_MAP_TYPE_XXX,
+ 
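
To make the syscall side concrete, a minimal hedged sketch of what the
loader ends up doing for such a map definition; names and sizes are
hypothetical and error handling is omitted:

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* Create an array map directly via the bpf(2) syscall, which is what
   * load_bpf_file() does behind the scenes for each SEC("maps") entry.
   * Unused bpf_attr fields must be zero. */
  static int create_array_map(void)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.map_type    = BPF_MAP_TYPE_ARRAY;
          attr.key_size    = sizeof(__u32);
          attr.value_size  = sizeof(__u64);
          attr.max_entries = 256;
          attr.map_flags   = 0;

          return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
  }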

Re: [iovisor-dev] Documentation on eBPF map types?

2017-02-02 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 31 Jan 2017 20:54:10 -0800
Alexei Starovoitov  wrote:

> On Tue, Jan 31, 2017 at 9:54 AM, Jesper Dangaard Brouer via
> iovisor-dev  wrote:
> >
> > On Sat, 28 Jan 2017 10:14:58 +1100 Brendan Gregg 
> >  wrote:
> >  
> >> I did some in the bcc ref guide, but it's incomplete, and the bcc versions:
> >> https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md  
> >
> > Thanks! - this seem rather BCC specific syntax, and I'm looking for
> > documentation close for the kernel (samples/bpf/).
> >
> > The best documentation I found was the man-page for the syscall bpf(2):
> >  http://man7.org/linux/man-pages/man2/bpf.2.html
> >
> > In lack of a better place, I've started documenting eBPF here:
> >  https://prototype-kernel.readthedocs.io/en/latest/bpf/index.html
> >
> > This doc is compatible with the kernels doc format, and I hope we can
> > push this into the kernel tree, if it turns out to be valuable?
> > (https://www.kernel.org/doc/html/latest/)  
> 
> yeah. definitely would be great to add map descriptions to the kernel docs.
> So far most of it is in commit logs.
> git log kernel/bpf/arraymap.c|tail -33
> git log kernel/bpf/hashtab.c|tail -33
> will give an overview of key hash and array map principles.

Thanks, I'm using that to write some doc.
 http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html
Gotten to BPF_MAP_TYPE_ARRAY
 
http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#bpf-map-type-array
 

Can you explain the difference between the kernel and userspace side of
the call bpf_map_lookup_elem() ?

Kernel side:
  long *value = bpf_map_lookup_elem(&my_map, &key);

Userspace side:
  long long value;
  bpf_map_lookup_elem(map_fd[0], &key, &value)

Looks like userspace gets a copy of the memory...
If so, how can userspace then increment the value safely?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Documentation on eBPF map types?

2017-01-31 Thread Jesper Dangaard Brouer via iovisor-dev

On Sat, 28 Jan 2017 10:14:58 +1100 Brendan Gregg  
wrote:

> I did some in the bcc ref guide, but it's incomplete, and the bcc versions:
> https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md

Thanks! - this seems rather BCC-specific syntax, and I'm looking for
documentation closer to the kernel (samples/bpf/).

The best documentation I found was the man-page for the syscall bpf(2):
 http://man7.org/linux/man-pages/man2/bpf.2.html

For lack of a better place, I've started documenting eBPF here:
 https://prototype-kernel.readthedocs.io/en/latest/bpf/index.html

This doc is compatible with the kernel's doc format, and I hope we can
push this into the kernel tree, if it turns out to be valuable?
(https://www.kernel.org/doc/html/latest/)

--Jesper


> On Fri, Jan 27, 2017 at 9:54 PM, Jesper Dangaard Brouer via iovisor-dev <
> iovisor-dev@lists.iovisor.org> wrote:
> 
> > Hi IOvisor/eBPF people,
> >
> > Do we have some documentation on eBPF maps?
> >
> > Like that map types are available, and what they are useful for?
> >
> > Notice, just list listing[1] the enum is not so useful:
> >
> >  enum bpf_map_type {
> > BPF_MAP_TYPE_UNSPEC,
> > BPF_MAP_TYPE_HASH,
> > BPF_MAP_TYPE_ARRAY,
> > BPF_MAP_TYPE_PROG_ARRAY,
> > BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> > BPF_MAP_TYPE_PERCPU_HASH,
> > BPF_MAP_TYPE_PERCPU_ARRAY,
> > BPF_MAP_TYPE_STACK_TRACE,
> > BPF_MAP_TYPE_CGROUP_ARRAY,
> > BPF_MAP_TYPE_LRU_HASH,
> > BPF_MAP_TYPE_LRU_PERCPU_HASH,
> >  };
> >
> > [1] http://lxr.free-electrons.com/source/tools/include/uapi/
> > linux/bpf.h?v=4.9#L78
> >
> > I also lack some documentation about how I interact with these maps...


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] Documentation on eBPF map types?

2017-01-27 Thread Jesper Dangaard Brouer via iovisor-dev
Hi IOvisor/eBPF people,

Do we have some documentation on eBPF maps?

Like what map types are available, and what they are useful for?

Notice, just listing[1] the enum is not so useful:

 enum bpf_map_type {
BPF_MAP_TYPE_UNSPEC,
BPF_MAP_TYPE_HASH,
BPF_MAP_TYPE_ARRAY,
BPF_MAP_TYPE_PROG_ARRAY,
BPF_MAP_TYPE_PERF_EVENT_ARRAY,
BPF_MAP_TYPE_PERCPU_HASH,
BPF_MAP_TYPE_PERCPU_ARRAY,
BPF_MAP_TYPE_STACK_TRACE,
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
 };

[1] 
http://lxr.free-electrons.com/source/tools/include/uapi/linux/bpf.h?v=4.9#L78

I also lack some documentation about how I interact with these maps...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Howto debug eBPF programs and bpf_trace_printk

2017-01-24 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 24 Jan 2017 10:52:39 +0100
Jesper Dangaard Brouer  wrote:

> Hi IOvisor-group,
> 
> I'm playing with kernels samples/bpf, and want so advice on debugging
> eBPF programs.
> 
> First question: I noticed bpf_trace_printk(), but nothing shows up when
> I'm using it... what did I miss/forget?

(So, let me answer my own question.)

The trace "printk" output goes into:
 /sys/kernel/debug/tracing/trace_pipe

And samples/bpf has a function read_trace_pipe() that reads from this.
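
A minimal hedged sketch of the pattern, samples/bpf style; the section
name, format string and includes are just examples:

  #include <linux/ptrace.h>
  #include <linux/types.h>
  #include <linux/version.h>
  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"

  /* Kernel-side eBPF C: the format string must live on the stack, and
   * its size is passed along to the helper. */
  SEC("kprobe/sys_write")
  int bpf_prog(struct pt_regs *ctx)
  {
          char fmt[] = "sys_write() called\n";

          bpf_trace_printk(fmt, sizeof(fmt));
          return 0;
  }

  char _license[] SEC("license") = "GPL";
  __u32 _version SEC("version") = LINUX_VERSION_CODE;

Then read the output with:

  $ sudo cat /sys/kernel/debug/tracing/trace_pipe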


> Second question: When doing some mistake the eBPF validator does not
> like, I get a dump of eBPF asm codes, which I cannot correlate to line
> number in the pseudo-C code.  Is there some debug option I can enable
> to get some more help?  (Saw William had some additional ";" output)


Third question:
Sometimes I get an error while loading the program like:

 $ sudo ./tracex1 
 bpf_load_program() err=22

How do I look up what err=N means?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] Howto debug eBPF programs and bpf_trace_printk

2017-01-24 Thread Jesper Dangaard Brouer via iovisor-dev
Hi IOvisor-group,

I'm playing with the kernel's samples/bpf, and want some advice on debugging
eBPF programs.

First question: I noticed bpf_trace_printk(), but nothing shows up when
I'm using it... what did I miss/forget?

Second question: When I make some mistake the eBPF validator does not
like, I get a dump of eBPF asm code, which I cannot correlate to line
numbers in the pseudo-C code.  Is there some debug option I can enable
to get some more help?  (Saw William had some additional ";" output)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Explaining RX-stages for XDP

2016-09-28 Thread Jesper Dangaard Brouer via iovisor-dev

On Tue, 27 Sep 2016 19:12:44 -0700 Alexei Starovoitov 
 wrote:

> On Tue, Sep 27, 2016 at 11:32:37AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > Let me try in a calm way (not like [1]) to explain how I imagine that
> > the XDP processing RX-stage should be implemented. As I've pointed out
> > before[2], I'm proposing splitting up the driver into RX-stages.  This
> > is a mental-model change, I hope you can follow my "inception" attempt.
> > 
> > The basic concept behind this idea is, if the RX-ring contains
> > multiple "ready" packets, then the kernel was too slow, processing
> > incoming packets. Thus, switch into more efficient mode, which is a
> > "packet-vector" mode.
> > 
> > Today, our XDP micro-benchmarks looks amazing, and they are!  But once
> > real-life intermixed traffic is used, then we loose the XDP I-cache
> > benefit.  XDP is meant for DoS protection, and an attacker can easily
> > construct intermixed traffic.  Why not fix this architecturally?
> > 
> > Most importantly concept: If XDP return XDP_PASS, do NOT pass the
> > packet up the network stack immediately (that would flush I-cache).
> > Instead store the packet for the next RX-stage.  Basically splitting
> > the packet-vector into two packet-vectors, one for network-stack and
> > one for XDP.  Thus, intermixed XDP vs. netstack not longer have effect
> > on XDP performance.
> > 
> > The reason for also creating an XDP packet-vector, is to move the
> > XDP_TX transmit code out of the XDP processing stage (and future
> > features).  This maximize I-cache availability to the eBPF program,
> > and make eBPF performance more uniform across drivers.
> > 
> > 
> > Inception:
> >  * Instead of individual packets, see it as a RX packet-vector.
> >  * XDP should be seen as a stage *before* the network stack gets called.
> > 
> > If your mind can handle it: I'm NOT proposing a RX-vector of 64-packets.
> > I actually want N-packet per vector (8-16).  As the NIC HW RX process
> > runs concurrently, and by the time it takes to process N-packets, more
> > packets have had a chance to arrive in the RX-ring queue.  
> 
> Sounds like what Edward was proposing earlier with building
> link list of skbs and passing further into stack?
> Or the idea is different ?

The idea is quite different.  It has nothing to do with Edward's
proposal[3].  The RX packet-vector is simply an array, either of pointers
or index numbers (into the RX-ring).  The needed changes are completely
contained inside the driver.
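
A rough, hedged sketch of that idea follows; this is not any driver's
actual code, and all names are made up:

  #define VEC_SIZE 16

  struct pkt;                                             /* opaque descriptor  */
  enum xdp_verdict { VERDICT_DROP, VERDICT_TX, VERDICT_PASS };

  extern int rx_ring_poll(struct pkt **vec, int max);     /* fill from RX-ring  */
  extern enum xdp_verdict run_xdp(struct pkt *pkt);       /* stage 1: eBPF prog */
  extern void xdp_tx_flush(struct pkt **vec, int n);      /* stage 2: bulk TX   */
  extern void netstack_receive(struct pkt **vec, int n);  /* stage 3: bulk pass */
  extern void recycle_page(struct pkt *pkt);

  /* Pull a small vector of packets from the RX-ring, run the XDP stage
   * over the whole vector first (keeping the eBPF program I-cache hot),
   * and only then hand the XDP_PASS packets to the network stack in one
   * bulk call. */
  static void rx_stages(void)
  {
          struct pkt *rx[VEC_SIZE], *tx[VEC_SIZE], *pass[VEC_SIZE];
          int i, n, n_tx = 0, n_pass = 0;

          n = rx_ring_poll(rx, VEC_SIZE);
          for (i = 0; i < n; i++) {                        /* XDP-only stage */
                  switch (run_xdp(rx[i])) {
                  case VERDICT_TX:   tx[n_tx++] = rx[i];     break;
                  case VERDICT_PASS: pass[n_pass++] = rx[i]; break;
                  default:           recycle_page(rx[i]);    break;
                  }
          }
          xdp_tx_flush(tx, n_tx);                          /* later stage */
          netstack_receive(pass, n_pass);                  /* later stage */
  }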


> As far as intermixed XDP vs stack traffic, I think for DoS case the
> traffic patterns are binary. Either all of it is good or under attack
> most of the traffic is bad, so makes sense to optimize for these two.
> 50/50 case I think is artificial and not worth optimizing for.

Sorry, but I feel you have completely misunderstood the concept of my
idea.  It does not matter what traffic pattern you believe or don't
believe in, it is irrelevant.  The fact is that intermixed traffic is
possible with the current solution.  The core of my idea is to remove
the possibility for this intermixed traffic to occur, simply by seeing
XDP as a RX-stage before the stack.


> For all good traffic whether xdp is there or not shouldn't matter
> for this N-vector optimization. Whether it's a batch of 8, 16 or 64,
> either via link-list or array, it should probably be a generic
> mechanism independent of any xdp stuff.

I also feel you have misunderstood the N-vector "optimization".
But, yes, this introduction of RX-stages is independent of XDP.

The RX-stages is a generic change to the drivers programming model.


[...]
> I think existing mlx4+xdp is already optimized for 'mostly attack' traffic
> and performs pretty well, since imo 'all drop' benchmark is accurate.

The "all drop" benchmark is as artificial as it gets.  I think Eric
agrees.

My idea is confining the XDP_DROP part to a RX-stage _before_ the
netstack. Then, the "all drop" benchmark number will get a little more
trustworthy.


> Optimizing xdp for 'mostly good' traffic is indeed a challange.
> We'd need all the tricks to make it as good as normal skb-based traffic.
>
> I haven't seen any tests yet comparing xdp with 'return XDP_PASS' program
> vs no xdp at all running netperf tcp/udp in user space. It shouldn't
> be too far off.

Well, I did post numbers to the list with a 'return XDP_PASS' program[4]:
 https://mid.mail-archive.com/netdev@vger.kernel.org/msg122350.html

Wake up and smell the coffee; please revise your assumptions:
 * It showed that the performance reduction is 25.98%!!!
   (AB comparison dropping packets in iptables raw)

Conclusion: These measurements confirm that we need a page recycle
facility for the drivers before switching to order-0 allocations.

I did the same kind of experiment with mlx5, where I changed the memory
model to order-0 pages, and then implemented page_pool on top. (The
numbers below are from before I implemented the DMA part of page_pool,
which does work now.)

page_pool work
 - 

[iovisor-dev] Explaining RX-stages for XDP

2016-09-27 Thread Jesper Dangaard Brouer via iovisor-dev

Let me try in a calm way (not like [1]) to explain how I imagine that
the XDP processing RX-stage should be implemented. As I've pointed out
before[2], I'm proposing splitting up the driver into RX-stages.  This
is a mental-model change, I hope you can follow my "inception" attempt.

The basic concept behind this idea is, if the RX-ring contains
multiple "ready" packets, then the kernel was too slow, processing
incoming packets. Thus, switch into more efficient mode, which is a
"packet-vector" mode.

Today, our XDP micro-benchmarks look amazing, and they are!  But once
real-life intermixed traffic is used, then we lose the XDP I-cache
benefit.  XDP is meant for DoS protection, and an attacker can easily
construct intermixed traffic.  Why not fix this architecturally?

Most important concept: If XDP returns XDP_PASS, do NOT pass the
packet up the network stack immediately (that would flush I-cache).
Instead, store the packet for the next RX-stage.  Basically this splits
the packet-vector into two packet-vectors, one for the network stack and
one for XDP.  Thus, intermixed XDP vs. netstack traffic no longer has an
effect on XDP performance.

The reason for also creating an XDP packet-vector, is to move the
XDP_TX transmit code out of the XDP processing stage (and future
features).  This maximize I-cache availability to the eBPF program,
and make eBPF performance more uniform across drivers.


Inception:
 * Instead of individual packets, see it as a RX packet-vector.
 * XDP should be seen as a stage *before* the network stack gets called.

If your mind can handle it: I'm NOT proposing a RX-vector of 64-packets.
I actually want N packets per vector (8-16).  As the NIC HW RX process
runs concurrently, in the time it takes to process N packets, more
packets have had a chance to arrive in the RX-ring queue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://mid.mail-archive.com/netdev@vger.kernel.org/msg127043.html

[2] http://lists.openwall.net/netdev/2016/01/15/51  

[3] http://lists.openwall.net/netdev/2016/04/19/89
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP (eXpress Data Path) documentation

2016-09-22 Thread Jesper Dangaard Brouer via iovisor-dev

On Tue, 20 Sep 2016 11:08:44 +0200 Jesper Dangaard Brouer  
wrote:

> As promised, I've started documenting the XDP eXpress Data Path):
> 
>  [1] 
> https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
> 
> IMHO the documentation have reached a stage where it is useful for the
> XDP project, BUT I request collaboration on improving the documentation
> from all. (Native English speakers are encouraged to send grammar fixes ;-))

I want to publicly thank Edward Cree for being the first contributor
to the XDP documentation with formulation and grammar fixes.

Pulled and pushed:
 https://github.com/netoptimizer/prototype-kernel/commit/fb6a3de95

Thanks!
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP (eXpress Data Path) documentation

2016-09-22 Thread Jesper Dangaard Brouer via iovisor-dev
On Wed, 21 Sep 2016 17:03:24 -0700
Tom Herbert  wrote:

> On Tue, Sep 20, 2016 at 2:08 AM, Jesper Dangaard Brouer
>  wrote:
> > Hi all,
> >
> > As promised, I've started documenting the XDP eXpress Data Path):
> >
> >  [1] 
> > https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
> >
> > IMHO the documentation have reached a stage where it is useful for the
> > XDP project, BUT I request collaboration on improving the documentation
> > from all. (Native English speakers are encouraged to send grammar fixes ;-))
> >  
> Hi Jesper,
> 
> Thanks for taking the initiative on the this, The document reads more
> like a design doc than description right now, that's probably okay
> since we could use a design doc.

Yes, I fully agree.

I want to state very clearly, this document is not an attempt to hijack
the XDP project and control the "spec".  This is an attempt to collaborate.
We discuss things on the mailing list, each with our own vision of the
project, and most times we reach an agreement. But nobody documents this
agreement.

Months later, we make implementation choices that go against these
agreements, because we simply forgot.  If someone remembers, we have to
reiterate the same arguments again (as just happened with the
XDP_ABORTED action, my mistake).  And can anybody remember the
consensus around VLANs? It never got implemented that way...

I had to start the doc project somewhere, so I dumped my own vision
into the docs, and what I could remember from upstream discussions.
I need collaboration from others to adjust and "fix" my vision of the
project, into something that becomes a common ground we all can agree
on.

If some part of the docs provoke you, good, then you have a reason to
correct and fix it.  I'll do my best to keep a very open mind about
any changes.  This should be a very "live" document.  


> Under "Important to understand" there are some disclaimers that XDP
> does not implement qdiscs or BQL and fairness otherwise. This is true
> for it's own traffic, but it does not (or at least should not) affect
> these mechanisms or normal stack traffic running simultaneously. I
> think we've made assumptions about fairness between XDP and non-XDP
> queues, we probably want to clarify fairness (and also validate
> whatever assumptions we've made with testing).

I love people pointing out mistakes in the documentation.  I want to
update this ASAP when people point them out.  I'll even do the work of
integrating and committing these changes, for people too lazy to git
clone the repo themselves.

For your section, Tom, I fully agree, but I don't know how to formulate
and adjust this in the text.

For people that want to edit the docs, notice the link "Edit on GitHub"
which takes you directly to the file you need to edit...



> > You wouldn't believe it: But this pretty looking documentation actually
> > follows the new Kernel documentation format.  It is actually just
> > ".rst" text files stored in my github repository under kernel/Documentation 
> > [2]
> >
> >  [2] 
> > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/Documentation
> >
> > Thus, just git clone my repository and started editing and send me
> > patches (or github pull requests). Like:
> >
> >  $ git clone https://github.com/netoptimizer/prototype-kernel
> >  $ cd prototype-kernel/kernel/Documentation/
> >  $ make html
> >  $ firefox _build/html/index.html &
> >
> > This new documentation format combines the best of two worlds, pretty
> > online browser documentation with almost plain text files, and changes
> > being tracked via git commits [3] (and auto git hooks to generate the
> > readthedocs.org page). You got to love it! :-)
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   Author of http://www.iptv-analyzer.org
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >
> > [3] https://github.com/netoptimizer/prototype-kernel/commits/master  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP (eXpress Data Path) documentation

2016-09-21 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 20 Sep 2016 19:47:07 -0700
Alexei Starovoitov  wrote:

> On Tue, Sep 20, 2016 at 11:08:44AM +0200, Jesper Dangaard Brouer via 
> iovisor-dev wrote:
> > Hi all,
> > 
> > As promised, I've started documenting the XDP eXpress Data Path):
> > 
> >  [1] 
> > https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
> > 
> > IMHO the documentation have reached a stage where it is useful for the
> > XDP project, BUT I request collaboration on improving the documentation
> > from all. (Native English speakers are encouraged to send grammar fixes ;-))
> > 
> > You wouldn't believe it: But this pretty looking documentation actually
> > follows the new Kernel documentation format.  It is actually just
> > ".rst" text files stored in my github repository under kernel/Documentation 
> > [2]
> > 
> >  [2] 
> > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/Documentation
> >   
> 
> Thanks so much for doing it. This is great start!
> Some minor editing is needed here and there.
> To make it into official doc do you mind preparing a patch for Jon's doc tree 
> ?
> If you think the doc is too volatile and not suitable for kernel.org,
> another alternative is to host it on https://github.com/iovisor
> since it's LF collaborative project it won't disappear suddenly.
> You can be a maintainer of that repo if you like.

I do see this as kernel documentation that eventually should end up in
Jon's doc tree.  Right now it is too volatile.  Once XDP has
"stabilized" some more, I plan to push/submit the *relevant* pieces
into the kernel.  E.g. I plan to have a "proposals" section, which is
not meant as upstream doc, as it is an intermediate specification step
before implementing; afterwards it should move to another doc section.
Likewise some of the use-case documents might be "rejected" before
reaching upstream doc.

The reason I've not created a separate repository for XDP doc only, is
because I also plan to document other parts of the kernel in this
repo[3], not just XDP.  Like my page_pool work.  The documentation is
not really official documentation before it reaches a kernel git tree,
together with the code it documents.

I hope this approach can help us document while developing, and
turn email discussions into specifications.  For example, I forgot the
argument for XDP_ABORTED not warning, which you had to reiterate, but now
it is documented[4][5].

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[3] https://github.com/netoptimizer/prototype-kernel/
[4] https://github.com/netoptimizer/prototype-kernel/commit/a4e60e2d7a894
[5] 
https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/userspace_api.html#troubleshooting-and-monitoring
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP user interface confusions

2016-09-21 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 15 Sep 2016 11:58:35 -0700
Brenden Blanco  wrote:

> On Thu, Sep 15, 2016 at 08:14:02PM +0200, Jesper Dangaard Brouer wrote:
> > Hi Brenden,
> > 
> > I don't quite understand the semantics of the XDP userspace interface.
> > 
> > We allow XDP programs to be (unconditionally) exchanged by another
> > program, this avoids taking the link down+up and avoids reallocating
> > RX ring resources (which is great).
> > 
> > We have two XDP samples programs in samples/bpf/ xdp1 and xdp2.  Now I
> > want to first load xdp1 and then to avoid the linkdown I load xdp2,
> > and then afterwards remove/stop program xdp1.
> > 
> > This does NOT work, because (in samples/bpf/xdp1_user.c) when xdp1
> > exits it unconditionally removes the running XDP program (loaded by xdp2)
> > via set_link_xdp_fd(ifindex, -1).  The xdp2 user program is still
> > running, and is unaware of its xdp/bpf program have been unloaded.
> > 
> > I find this userspace interface confusing. What this your intention?
> > Perhaps you can explain what the intended semantics or specification is?  
> 
> In practice, we've used a single agent process to manage bpf programs on
> behalf of the user applications. This agent process uses common linux
> functionalities to add semantics, while not really relying on the bpf
> handles themselves to take care of that. For instance, the process may
> put some lockfiles and what-not in /var/run/$PID, and maybe returns the
> list of running programs through a http: or unix: interface.
> 
> So, from a user<->kernel API, the requirements are minimal...the agent
> process just overwrites the loaded bpf program when the application
> changes, or a new application comes online. There is nobody to 'notify'
> when a handle changes.
> 
> When translating this into the kernel api that you see now, none of this
> exists, because IMHO the kernel api should be unopinionated and generic.
> The result is something that appears very "fire-and-forget", which
> results in something simple yet safe at the same time; the refcounting
> is done transparently by the kernel.
> 
> So, in practice, there is no xdp1 or xdp2, just xdp-agent at different
> points in time. Or, better yet, no agent, just the programs running in
> the kernel, with the handles of the programs residing solely in the
> device, which are perhaps pinned to /sys/fs/bpf for semantic management
> purposes. I didn't feel like it was appropriate to conflate different
> bpf features in the kernel samples, so we don't see (and probably never
> will) a sample which combines these features into a whole. That is best
> left to userspace tools. It so happens that this is one of the projects
> I am currently active on at $DAYJOB, and we fully intend to share the
> details of that when it's in a suitable state.

For the record, I'm not happy with this response.

IMHO the XDP userspace API should have been thought through more. Where
is the specification and documentation for this interface?

There is no official XDP userspace daemon, thus your proposed file-lock
scheme falls apart.  We can easily imagine, several different
open-source projects using XDP, they will not coordinate what /var/run
lock-file to use.  An admin installs several of these to evaluate
them.  Later he gets really confused, because both of them run by
mistake (and it is random who "wins" due to systemd parallel boot).

The programs themselves have no-way to catch this error situation, and
report it.  This is the main mistake of this API approach.  The admin
also have a hard time debugging this, as our query interface is a bool,
so he just see that some XDP program _is_ running.

Sorry, but we have to do better than this.  We have to maintain these
APIs for the next +10 years.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] XDP (eXpress Data Path) documentation

2016-09-20 Thread Jesper Dangaard Brouer via iovisor-dev
Hi all,

As promised, I've started documenting XDP (the eXpress Data Path):

 [1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html

IMHO the documentation has reached a stage where it is useful for the
XDP project, BUT I request collaboration from everyone on improving the
documentation. (Native English speakers are encouraged to send grammar fixes ;-))

You wouldn't believe it, but this pretty-looking documentation actually
follows the new kernel documentation format.  It is actually just
".rst" text files stored in my github repository under kernel/Documentation [2]

 [2] 
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/Documentation

Thus, just git clone my repository, start editing, and send me
patches (or github pull requests). Like:

 $ git clone https://github.com/netoptimizer/prototype-kernel
 $ cd prototype-kernel/kernel/Documentation/
 $ make html
 $ firefox _build/html/index.html &

This new documentation format combines the best of two worlds: pretty
online browser documentation with almost-plain text files, and changes
being tracked via git commits [3] (plus automatic git hooks to generate the
readthedocs.org page). You've got to love it! :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[3] https://github.com/netoptimizer/prototype-kernel/commits/master
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] XDP user interface confusions

2016-09-15 Thread Jesper Dangaard Brouer via iovisor-dev
Hi Brenden,

I don't quite understand the semantics of the XDP userspace interface.

We allow XDP programs to be (unconditionally) exchanged by another
program, this avoids taking the link down+up and avoids reallocating
RX ring resources (which is great).

We have two XDP sample programs in samples/bpf/: xdp1 and xdp2.  Now I
want to first load xdp1, and then, to avoid the link down, I load xdp2,
and then afterwards remove/stop program xdp1.

This does NOT work, because (in samples/bpf/xdp1_user.c) when xdp1
exits it unconditionally removes the running XDP program (loaded by xdp2)
via set_link_xdp_fd(ifindex, -1).  The xdp2 user program is still
running, and is unaware that its xdp/bpf program has been unloaded.

I find this userspace interface confusing. Was this your intention?
Perhaps you can explain what the intended semantics or specification is?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-13 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 13 Sep 2016 08:58:30 -0700
Eric Dumazet  wrote:

> We also care about icache pressure, and GRO/TSO already provides
> bundling where it is applicable, without adding insane complexity in
> the stacks.

Sorry, I cannot resist. The GRO code is really bad regarding icache
pressure/usage, due to how everything is function pointers calling
function pointers, even when the general case is calling the function
defined just next to it in the same C-file (which would normally allow
inlining).  I can easily get 10% more performance for UDP use-cases by
simply disabling the GRO code, and I measure a significant drop in
icache-misses.
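
(The GRO on/off comparison is easy to reproduce; the interface name
below is just an example:

 $ ethtool -K mlx5p1 gro off

and then re-run the UDP benchmark.)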

Edward's solution should lower icache pressure.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle

2016-09-13 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 13 Sep 2016 13:16:29 +0300
Tariq Toukan  wrote:

> On 07/09/2016 9:45 PM, Jesper Dangaard Brouer wrote:
> > On Wed,  7 Sep 2016 15:42:24 +0300 Saeed Mahameed  
> > wrote:
> >  
> >> From: Tariq Toukan 
> >>
> >> Instead of reallocating and mapping pages for RX data-path,
> >> recycle already used pages in a per ring cache.
> >>
> >> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> >>
> >> Single stride, 64 bytes:
> >> * 4,739,057 - baseline
> >> * 4,749,550 - order0 no cache
> >> * 4,786,899 - order0 with cache
> >> 1% gain
> >>
> >> Larger packets, no page cross, 1024 bytes:
> >> * 3,982,361 - baseline
> >> * 3,845,682 - order0 no cache
> >> * 4,127,852 - order0 with cache
> >> 3.7% gain
> >>
> >> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> >> * 3,731,189 - baseline
> >> * 3,579,414 - order0 no cache
> >> * 3,931,708 - order0 with cache
> >> 5.4% gain
> >>
> >> Signed-off-by: Tariq Toukan 
> >> Signed-off-by: Saeed Mahameed 
> >> ---
> >>   drivers/net/ethernet/mellanox/mlx5/core/en.h   | 16 ++
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 15 ++
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 57 
> >> --
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++
> >>   4 files changed, 99 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> >> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> index 075cdfc..afbdf70 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
> >>u8  tired;
> >>   };
> >>   
> >> +/* a single cache unit is capable to serve one napi call (for 
> >> non-striding rq)
> >> + * or a MPWQE (for striding rq).
> >> + */
> >> +#define MLX5E_CACHE_UNIT  (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
> >> +   MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
> >> +#define MLX5E_CACHE_SIZE  (2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
> >> +struct mlx5e_page_cache {
> >> +  u32 head;
> >> +  u32 tail;
> >> +  struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
> >> +};
> >> +  
> > [...]  
> >>   
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
> >> b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> index c1cb510..8e02af3 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct 
> >> mlx5e_rq *rq, u16 ix)
> >>mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> >>   }
> >>   
> >> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> >> +struct mlx5e_dma_info *dma_info)
> >> +{
> >> +  struct mlx5e_page_cache *cache = &rq->page_cache;
> >> +  u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
> >> +
> >> +  if (tail_next == cache->head) {
> >> +  rq->stats.cache_full++;
> >> +  return false;
> >> +  }
> >> +
> >> +  cache->page_cache[cache->tail] = *dma_info;
> >> +  cache->tail = tail_next;
> >> +  return true;
> >> +}
> >> +
> >> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> >> +struct mlx5e_dma_info *dma_info)
> >> +{
> >> +  struct mlx5e_page_cache *cache = &rq->page_cache;
> >> +
> >> +  if (unlikely(cache->head == cache->tail)) {
> >> +  rq->stats.cache_empty++;
> >> +  return false;
> >> +  }
> >> +
> >> +  if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> >> +  rq->stats.cache_busy++;
> >> +  return false;
> >> +  }  
> > Hmmm... doesn't this cause "blocking" of the page_cache recycle
> > facility until the page at the head of the queue gets (page) refcnt
> > decremented?  Real use-case could fairly easily block/cause this...  
> Hi Jesper,
> 
> That's right. We are aware of this issue.
> We considered ways of solving this, but decided to keep current 
> implementation for now.
> One way of solving this is to look deeper in the cache.
> Cons:
> - this will consume time, and the chance of finding an available page is 
> not that high: if the page in head of queue is busy then there's a good 
> chance that all the others are too (because of FIFO).
> in other words, you already checked all pages and anyway you're going to 
> allocate a new one (higher penalty for same decision).
> - this will make holes in the array causing complex accounting when 
> looking for an available page (this can easily be fixed by swapping 
> between the page in head and the available one).
> 
> Another way is sharing pages between different RQs.
> - For now we're not doing this for simplicity and to keep 
> synchronization away.
> 
> What do you think?
> 
> Anyway, we're looking forward to use your page-pool API which solves 
> these issues.

Ye

Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer via iovisor-dev
On Mon, 12 Sep 2016 12:56:28 -0700
Alexei Starovoitov  wrote:

> On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 8 Sep 2016 23:30:50 -0700
> > Alexei Starovoitov  wrote:
> >   
> > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:  
> > [...]  
> > > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > > Every time you call the stack code, then you flush your icache.  When
> > > > returning to the driver code, you will have to reload all the icache
> > > > associated with the XDP_TX, this is a costly operation.
> > >   
> > [...]  
> > > To make further progress in this discussion can we talk about
> > > the use case you have in mind instead? Then solution will
> > > be much clear, I hope.  
> > 
> > The DDoS use-case _is_ affected by this "hidden" bulking design.
> > 
> > Lets say, I want to implement a DDoS facility. Instead of just
> > dropping the malicious packets, I want to see the bad packets.  I
> > implement this by rewriting the destination-MAC to be my monitor
> > machine and then XDP_TX the packet.  
> 
> not following the use case. you want to implement a DDoS generator?
> Or just forward all bad packets from affected host to another host
> in the same rack? so two servers will be spammed with traffic and
> even more load on the tor? I really don't see how this is useful
> for anything but stress testing.

As I wrote below, the purpose of the monitor machine is to diagnose
false positives.  If you worry about the added load, I would either
forward out another interface (which is not supported yet) or simply do
sampling of the packets being forwarded to the monitor host.

> > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> > of the traffic is delivered to the stack. (See note 1)  
> 
> hmm. DoS prevention use case is when 99% of the traffic is dropped.

As I wrote below, until the DDoS attack starts, all packets are
delivered to the stack.

> > Once the DDoS attack starts, then the traffic pattern changes, and XDP
> > should (hopefully only) catch the malicious traffic (monitor machine can
> > help diagnose false positive).  Now, due to interleaving the DDoS
> > traffic with the clean traffic, then efficiency of XDP_TX is reduced due to
> > more icache misses...
> > 
> > 
> > 
> > Note(1): Notice I have already demonstrated that loading a XDP/eBPF
> > program with 100% delivery to the stack, actually slows down the
> > normal stack.  This is due to hitting a bottleneck in the page
> > allocator.  I'm working removing that bottleneck with page_pool, and
> > that solution is orthogonal to this problem.  
> 
> sure. no one arguing against improving page allocator.
> 
> >  It is actually an excellent argument, for why you would want to run a
> > DDoS XDP filter only on a restricted number of RX queues.  
> 
> no. it's the opposite. If the host is under DoS there is no way
> the host can tell in advance which rx queue will be seeing bad
> packets.

Sorry, this note was not related to the DoS use-case.  You
misunderstood it.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov  wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
[...]
> > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > Every time you call the stack code, then you flush your icache.  When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX, this is a costly operation.  
> 
[...]
> To make further progress in this discussion can we talk about
> the use case you have in mind instead? Then solution will
> be much clear, I hope.

The DDoS use-case _is_ affected by this "hidden" bulking design.

Let's say I want to implement a DDoS facility. Instead of just
dropping the malicious packets, I want to see the bad packets.  I
implement this by rewriting the destination-MAC to be my monitor
machine and then XDP_TX the packet.
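
To make that concrete, a rough (untested) sketch of such a program; the
monitor MAC and the classify_bad() helper are made-up placeholders for
whatever heuristic is actually used, not existing code:

/* Sketch only: bounce suspected-bad packets to a monitor host.
 * classify_bad() and the monitor MAC are made-up placeholders.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include "bpf_helpers.h"

static inline int classify_bad(void *data, void *data_end)
{
	return 0; /* placeholder: real DDoS matching logic goes here */
}

SEC("xdp_ddos_monitor")
int xdp_ddos_monitor_prog(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	unsigned char monitor_mac[ETH_ALEN] = /* placeholder MAC */
		{ 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };

	if (data + sizeof(*eth) > data_end)
		return XDP_PASS;

	if (!classify_bad(data, data_end))
		return XDP_PASS;	/* clean traffic goes to the stack */

	/* Bad packet: send it to the monitor host instead of dropping it */
	__builtin_memcpy(eth->h_dest, monitor_mac, ETH_ALEN);
	return XDP_TX;
}

char _license[] SEC("license") = "GPL";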

In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
of the traffic is delivered to the stack. (See note 1)

Once the DDoS attack starts, the traffic pattern changes, and XDP
should (hopefully only) catch the malicious traffic (the monitor machine can
help diagnose false positives).  Now, due to interleaving the DDoS
traffic with the clean traffic, the efficiency of XDP_TX is reduced due to
more icache misses...



Note(1): Notice I have already demonstrated that loading an XDP/eBPF
program with 100% delivery to the stack actually slows down the
normal stack.  This is due to hitting a bottleneck in the page
allocator.  I'm working on removing that bottleneck with page_pool, and
that solution is orthogonal to this problem.
 It is actually an excellent argument for why you would want to run a
DDoS XDP filter only on a restricted number of RX queues.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer via iovisor-dev
On Fri, 9 Sep 2016 18:03:09 +0300
Saeed Mahameed  wrote:

> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
>  wrote:
> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:  
> >>
> >> I'm sorry but I have a problem with this patch!  
> >> Looking at this patch, I want to bring up a fundamental architectural
> >> concern with the development direction of XDP transmit.
> >>
> >>
> >> What you are trying to implement, with delaying the doorbell, is
> >> basically TX bulking for TX_XDP.
> >>
> >>  Why not implement a TX bulking interface directly instead?!?
> >>
> >> Yes, the tailptr/doorbell is the most costly operation, but why not
> >> also take advantage of the benefits of bulking for other parts of the
> >> code? (benefit is smaller, by every cycles counts in this area)
> >>
> >> This whole XDP exercise is about avoiding having a transaction cost per
> >> packet, that reads "bulking" or "bundling" of packets, where possible.
> >>
> >>  Lets do bundling/bulking from the start!  
> 
> Jesper, what we did here is also bulking, instead of bulking in a
> temporary list in the driver we list the packets in the HW and once
> done we transmit all at once via the xdp_doorbell indication.
> 
> I agree with you that we can take advantage and improve the icache by
> bulking first in software and then queue all at once in the hw then
> ring one doorbell.
> 
> but I also agree with Alexei that this will introduce an extra
> pointer/list handling in the driver and we need to do the comparison
> between both approaches before we decide which is better.

I welcome implementing both approaches and benchmarking them against
each other; I'll gladly dedicate time to this!

I'm reacting so loudly because this is a mental model switch that
needs to be applied to the full driver RX path, also for normal stack
delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
there is a 10%-25% perf gain here.

The key point is to stop seeing the lowest driver RX level as something that
works on a per-packet basis.  It might be easier to view this as a kind
of vector processing.  This is about having a vector of packets, where
we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle.  The next level of optimization (for >= Sandy
Bridge CPUs) is to make these vector processing stages small enough to fit
into the CPU's decoded-I-cache section.


It might also be important to mention that, for netstack delivery, I
don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
packets.  Why? Because the NIC HW runs independently and has the
opportunity to deliver more frames into the RX ring queue while the
stack "slowly" processes packets.  You can view this as "bulking" from
the RX ring queue, with a "look-back" before exiting the NAPI poll loop.
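
A rough (untested) sketch of the NAPI poll loop shape I have in mind;
driver_rx_poll_one() and the bundle size are made-up stand-ins for
driver internals, the rest is standard kernel API:

/* Sketch only: bundle RX packets before handing them to the netstack.
 * driver_rx_poll_one() is a hypothetical stand-in for the driver's
 * per-descriptor RX processing.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define RX_BUNDLE_MAX 16	/* made-up bundle size (the 8-16 range) */

struct sk_buff *driver_rx_poll_one(struct napi_struct *napi); /* hypothetical */

static void bundle_flush_to_stack(struct sk_buff **pkts, int n)
{
	int i;

	/* One compact code section that walks the whole packet vector */
	for (i = 0; i < n; i++)
		netif_receive_skb(pkts[i]);
}

static int driver_napi_poll(struct napi_struct *napi, int budget)
{
	struct sk_buff *bundle[RX_BUNDLE_MAX];
	struct sk_buff *skb;
	int work = 0, n = 0;

	while (work < budget && (skb = driver_rx_poll_one(napi))) {
		bundle[n++] = skb;
		work++;
		if (n == RX_BUNDLE_MAX) {
			bundle_flush_to_stack(bundle, n);
			n = 0;	/* "look-back": HW refilled the RX ring meanwhile */
		}
	}
	if (n)
		bundle_flush_to_stack(bundle, n);

	if (work < budget)
		napi_complete(napi);
	return work;
}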


> this must be marked as future work and not have this from the start.

We both know that statement is BS, and the other approach will never be
implemented once this patch is accepted upstream.


> > mlx4 already does bulking and this proposed mlx5 set of patches
> > does bulking as well.

I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
way IMHO.  And now mlx5 is building on the same principle. That is why
I'm yelling STOP.


> >> The reason behind the xmit_more API is that we could not change the
> >> API of all the drivers.  And we found that calling an explicit NDO
> >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
> >> would hit the common single packet use-case.
> >>
> >> It should be really easy to build a bundle of packets that need XDP_TX
> >> action, especially given you only have a single destination "port".
> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.  


[1] http://lists.openwall.net/netdev/2016/04/19/89
[2] http://lists.openwall.net/netdev/2016/01/15/51

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-12 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov  wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > > >  Lets do bundling/bulking from the start!
> > > 
> > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > does bulking as well.
> > > See nothing wrong about it. RX side processes the packets and
> > > when it's done it tells TX to xmit whatever it collected.  
> > 
> > This is doing "hidden" bulking and not really taking advantage of using
> > the icache more effeciently.  
> > 
> > Let me explain the problem I see, little more clear then, so you
> > hopefully see where I'm going.
> > 
> > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > Every time you call the stack code, then you flush your icache.  When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX, this is a costly operation.  
> 
> correct. And why is that a problem?

It is good that you can see and acknowledge the I-cache problem.

XDP is all about performance.  What I hear is that you are arguing
against a model that will yield better performance, which does not make
sense to me.  Let me explain this again, in another way.

This is a mental model switch.  Stop seeing the lowest driver RX level as
something that works on a per-packet basis.  Maybe it is easier to
understand if we instead see this as vector processing?  This is about
having a vector of packets, where we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle (directly measurable with perf, while I-cache
is not directly measurable).
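
(For reference, the instructions-per-cycle part can be watched with a
plain perf counter run while the test is active, e.g.:

 $ perf stat -a -e cycles,instructions sleep 10

and looking at the instructions-per-cycle ratio it reports.)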


Let's assume everything fits into the I-cache (XDP+driver code). The
CPU frontend still has to decode the instructions from the I-cache
into micro-ops.  The next level of optimization is to reuse the
decoded I-cache by running it on all elements in the packet vector.

The Intel "64 and IA-32 Architectures Optimization Reference Manual"
(section 3.4.2.6 "Optimization for Decoded ICache"[1][2]) states that each
hot code block should be less than about 500 instructions.  Thus, the
different "stages" working on the packet vector need to be rather
small and compact.

[1] 
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
[2] 
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf



Notice: The same mental model switch applies to delivering packets to
the regular netstack.  I've brought this up before[3].  Instead of
flushing the driver's I-cache for every packet by calling the stack,
let us instead bundle up N packets in the driver before calling the
stack.  I showed a 10% speedup with a naive implementation of this
approach.  Edward Cree also showed[4] a 10% performance boost, and
went further into the stack, showing a 25% increase.

A goal is also to make optimizing netstack code-size independent of
the driver code-size, by separating the netstack's I-cache usage from
the driver's.

[3] http://lists.openwall.net/netdev/2016/01/15/51
[4] http://lists.openwall.net/netdev/2016/04/19/89
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 8 Sep 2016 20:22:04 -0700
Alexei Starovoitov  wrote:

> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > I'm sorry but I have a problem with this patch!  
> 
> is it because the variable is called 'xdp_doorbell'?
> Frankly I see nothing scary in this patch.
> It extends existing code by adding a flag to ring doorbell or not.
> The end of rx napi is used as an obvious heuristic to flush the pipe.
> Looks pretty generic to me.
> The same code can be used for non-xdp as well once we figure out
> good algorithm for xmit_more in the stack.

What I'm proposing can also be used by the normal stack.
 
> > Looking at this patch, I want to bring up a fundamental architectural
> > concern with the development direction of XDP transmit.
> > 
> > 
> > What you are trying to implement, with delaying the doorbell, is
> > basically TX bulking for TX_XDP.
> > 
> >  Why not implement a TX bulking interface directly instead?!?
> > 
> > Yes, the tailptr/doorbell is the most costly operation, but why not
> > also take advantage of the benefits of bulking for other parts of the
> > code? (benefit is smaller, by every cycles counts in this area)
> > 
> > This whole XDP exercise is about avoiding having a transaction cost per
> > packet, that reads "bulking" or "bundling" of packets, where possible.
> > 
> >  Lets do bundling/bulking from the start!  
> 
> mlx4 already does bulking and this proposed mlx5 set of patches
> does bulking as well.
> See nothing wrong about it. RX side processes the packets and
> when it's done it tells TX to xmit whatever it collected.

This is doing "hidden" bulking and not really taking advantage of using
the icache more effeciently.  

Let me explain the problem I see, little more clear then, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX. 
Every time you call the stack code, then you flush your icache.  When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX, this is a costly operation.

 
> > The reason behind the xmit_more API is that we could not change the
> > API of all the drivers.  And we found that calling an explicit NDO
> > flush came at a cost (only approx 7 ns IIRC), but it still a cost that
> > would hit the common single packet use-case.
> > 
> > It should be really easy to build a bundle of packets that need XDP_TX
> > action, especially given you only have a single destination "port".
> > And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.  
> 
> not sure what are you proposing here?
> Sounds like you want to extend it to multi port in the future?
> Sure. The proposed code is easily extendable.
> 
> Or you want to see something like a link list of packets
> or an array of packets that RX side is preparing and then
> send the whole array/list to TX port?
> I don't think that would be efficient, since it would mean
> unnecessary copy of pointers.

I just explained that it will be more efficient due to better use of the icache.

 
> > In the future, XDP need to support XDP_FWD forwarding of packets/pages
> > out other interfaces.  I also want bulk transmit from day-1 here.  It
> > is slightly more tricky to sort packets for multiple outgoing
> > interfaces efficiently in the pool loop.  
> 
> I don't think so. Multi port is natural extension to this set of patches.
> With multi port the end of RX will tell multiple ports (that were
> used to tx) to ring the bell. Pretty trivial and doesn't involve any
> extra arrays or link lists.

So, have you solved the problem of exclusive access to a TX ring of a
remote/different net_device when sending?

In your design you assume there exist many TX rings available for other
devices to access.  In my design I also want to support devices that
don't have this HW capability and e.g. only have one TX queue.


> > But the mSwitch[1] article actually already solved this destination
> > sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
> > understanding the next steps, for a smarter data structure, when
> > starting to have more TX "ports".  And perhaps align your single
> > XDP_TX destination data structure to this future development.
> > 
> > [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf  
> 
> I don't see how this particular paper applies to the existing kernel code.
> It's great to take ideas from research papers, but real code is different.
> 
> > --Jesper
> > (top post)  
> 
> since when it's ok to top post?

What a kneejerk reaction.  When writing something general we often
reply at the top of the email and then delete the rest (which
makes it hard for latecomers to follow).  I was bcc'ing some people
who needed the context, so it was a service note to you that I
didn't write anything below.

 
> > On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed  
> > wrote:
> >   
> > > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> > > 
> 

Re: [iovisor-dev] [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert  wrote:

> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
>  wrote:
> >
> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert  wrote:
> >  
> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  
> >> wrote:  
> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >> >>
> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed 
> >> >>  wrote:  
> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  
> >> >>> wrote:  
> >>  On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet 
> >> >  wrote:  
> >> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> >> [...]  
> >> 
> >>  Only if a qdisc is present and pressure is high enough.
> >> 
> >>  But in a forwarding setup, we likely receive at a lower rate than the
> >>  NIC can transmit.  
> >> >>
> >> >> Yes, I can confirm this happens in my experiments.
> >> >>  
> >>   
> >> >>>
> >> >>> Jesper has a similar Idea to make the qdisc think it is under
> >> >>> pressure, when the device TX ring is idle most of the time, i think
> >> >>> his idea can come in handy here. I am not fully involved in the
> >> >>> details, maybe he can elaborate more.
> >> >>>
> >> >>> But if it works, it will be transparent to napi, and xmit more will
> >> >>> happen by design.  
> >> >>
> >> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> >> indicating xmit_more should be possible.  This will be a topic at the
> >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> >> >> challenge people to come up with a good solution ;-)
> >> >>  
> >> >
> >> > One thing I've noticed but haven't yet actually analyzed much is if
> >> > I shrink the nic descriptor ring size to only be slightly larger than
> >> > the qdisc layer bulking size I get more bulking and better perf numbers.
> >> > At least on microbenchmarks. The reason being the nic pushes back more
> >> > on the qdisc. So maybe a case for making the ring size in the NIC some
> >> > factor of the expected number of queues feeding the descriptor ring.
> >> >  
> >
> > I've also played with shrink the NIC descriptor ring size, it works,
> > but it is an ugly hack to get NIC pushes backs, and I foresee it will
> > hurt normal use-cases. (There are other reasons for shrinking the ring
> > size like cache usage, but that is unrelated to this).
> >
> >  
> >> BQL is not helping with that?  
> >
> > Exactly. But the BQL _byte_ limit is not what is needed, what we need
> > to know is the _packets_ currently "in-flight".  Which Tom already have
> > a patch for :-)  Once we have that the algorithm is simple.
> >
> > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
> > packets in-flight, the qdisc start it's bulk dequeue building phase,
> > before calling the driver. The allowed max qdisc bulk size should
> > likely be related to pkts-in-flight.
> >  
> Sorry, I'm still missing it. The point of BQL is that we minimize the
> amount of data (and hence number of packets) that needs to be queued
> in the device in order to prevent the link from going idle while there
> are outstanding packets to be sent. The algorithm is based on counting
> bytes not packets because bytes are roughly an equal cost unit of
> work. So if we've queued 100K of bytes on the queue we know how long
> that takes around 80 usecs @10G, but if we count packets then we
> really don't know much about that. 100 packets enqueued could
> represent 6400 bytes or 6400K worth of data so time to transmit is
> anywhere from 5usecs to 5msecs
> 
> Shouldn't qdisc bulk size be based on the BQL limit? What is the
> simple algorithm to apply to in-flight packets?

Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a
transaction cost.  The tailptr/doorbell costs around 70 ns.  (Based on
the data in this patch description, 4.90Mpps -> 7.50Mpps: (1/4.90-1/7.50)*1000 =
70.74 ns.)  The 10G wirespeed budget for small packets is 67.2 ns; with a
fixed overhead per packet of 70 ns we can never reach 10G wirespeed.
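
(For completeness, the 67.2 ns figure comes from minimum-size 64 byte
frames occupying 84 bytes on the wire, preamble and inter-frame gap
included: 10*10^9 / (84*8) = 14.88 Mpps, and 1/(14.88*10^6) = 67.2 ns
per packet.)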

The idea/algo is trying to predict the future.  If we see a given/high
packet rate, which equals a high transaction cost, then let's try not
calling the driver, and instead backlog the packet in the qdisc,
speculatively hoping the current rate continues.  This will in effect
allow bulking and amortize the 70 ns transaction cost over N packets.
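
A rough sketch of the decision I have in mind (not real kernel code;
qdisc_pkts_in_flight() stands in for the per-txq packet counter such a
patch would provide, and the threshold values are made up):

/* Sketch only.  qdisc_pkts_in_flight() is a hypothetical stand-in for a
 * per-txq packets-in-flight counter (BQL today only tracks bytes), and
 * the threshold/limit values are made up for illustration.
 */
#include <linux/kernel.h>
#include <linux/netdevice.h>

#define BULK_BUILD_THRESHOLD	4	/* made-up value */
#define BULK_MAX		8	/* made-up value */

static unsigned int qdisc_pkts_in_flight(const struct netdev_queue *txq)
{
	return 0;	/* stub: the real counter would come from BQL/driver */
}

static unsigned int qdisc_bulk_budget(const struct netdev_queue *txq)
{
	unsigned int inflight = qdisc_pkts_in_flight(txq);

	/* The driver already has packets in flight, so the doorbell cost
	 * is being paid anyway.  Speculatively hold packets back and build
	 * a bulk, amortizing the ~70 ns transaction cost over N packets.
	 */
	if (inflight >= BULK_BUILD_THRESHOLD)
		return min_t(unsigned int, inflight, BULK_MAX);

	/* Low rate: dequeue and send immediately, latency matters more */
	return 1;
}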

Instead of tracking a rate of packets or doorbells per sec, I will let
BQL's packets-in-flight tell me when the driver sees a rate high enough
that the driver's DMA-TX completion considers several packets to be
in-flight.
When that happens, I will bet that I can stop sending packets to the
device, and instead queue them in the qdisc layer.  If I'm unlucky an

Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-08 Thread Jesper Dangaard Brouer via iovisor-dev
On Thu, 8 Sep 2016 12:31:47 +0300
Or Gerlitz  wrote:

> On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer
>  wrote:
> > On Wed, 7 Sep 2016 23:55:42 +0300
> > Or Gerlitz  wrote:
> >  
> >> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  
> >> wrote:  
> >> > From: Rana Shahout 
> >> >
> >> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
> >> >
> >> > When XDP is on we make sure to change channels RQs type to
> >> > MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
> >> > ensure "page per packet".
> >> >
> >> > On XDP set, we fail if HW LRO is set and request from user to turn it
> >> > off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
> >> > annoying, but we prefer not to enforce LRO off from XDP set function.
> >> >
> >> > Full channels reset (close/open) is required only when setting XDP
> >> > on/off.
> >> >
> >> > When XDP set is called just to exchange programs, we will update
> >> > each RQ xdp program on the fly and for synchronization with current
> >> > data path RX activity of that RQ, we temporally disable that RQ and
> >> > ensure RX path is not running, quickly update and re-enable that RQ,
> >> > for that we do:
> >> > - rq.state = disabled
> >> > - napi_synnchronize
> >> > - xchg(rq->xdp_prg)
> >> > - rq.state = enabled
> >> > - napi_schedule // Just in case we've missed an IRQ
> >> >
> >> > Packet rate performance testing was done with pktgen 64B packets and on
> >> > TX side and, TC drop action on RX side compared to XDP fast drop.
> >> >
> >> > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> >> >
> >> > Comparison is done between:
> >> > 1. Baseline, Before this patch with TC drop action
> >> > 2. This patch with TC drop action
> >> > 3. This patch with XDP RX fast drop
> >> >
> >> > Streams    Baseline(TC drop)    TC drop      XDP fast Drop
> >> > ---------------------------------------------------------------
> >> > 1          5.51Mpps             5.14Mpps     13.5Mpps
> >>
> >> This (13.5 M PPS) is less than 50% of the result we presented @ the
> >> XDP summit which was obtained by Rana. Please see if/how much does
> >> this grows if you use more sender threads, but all of them to xmit the
> >> same stream/flows, so we're on one ring. That (XDP with single RX ring
> >> getting packets from N remote TX rings) would be your canonical
> >> base-line for any further numbers.  
> >
> > Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
> > that you should be able to reach 23Mpps on a single CPU.  This is
> > a XDP-drop-simulation with order-0 pages being recycled through my
> > page_pool code, plus avoiding the cache-misses (notice you are using a
> > CPU E5-2680 with DDIO, thus you should only see a L3 cache miss).  
> 
> so this takes up from 13M to 23M, good.

Notice the 23Mpps was a crude hack test to determine the maximum
achievable performance.  This is our performance target; once we get
_close_ to that, we are happy and stop optimizing.

> Could you explain why the move from order-3 to order-0 is hurting the
> performance so much (drop from 32M to 23M), any way we can overcome that?

It is all going to be in the details.

When reaching these numbers, be careful: "wow, 23M to 32M" sounds
like a huge deal... but the performance difference in nanoseconds is
actually not that large; it is only around 12 ns more we have to save.

(1/(23*10^6) - 1/(32*10^6))*10^9 = 12.22 ns

> > The 23Mpps number looks like some HW limitation, as the increase was  
> 
> not HW, I think. As I said, Rana got 32M with striding RQ when she was
> using order-3 (or did we use order-5?)

It was order-5.

We likely need some HW tuning parameter (like with mlx4) if you want to
go past the 23Mpps mark.

 
> > is not proportional to page-allocator overhead I removed (and CPU freq
> > starts to decrease).  I also did scaling tests to more CPUs, which
> > showed it scaled up to 40Mpps (you reported 45M).  And at the Phy RX
> > level I see 60Mpps (50G max is 74Mpps).  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Jesper Dangaard Brouer via iovisor-dev

I'm sorry but I have a problem with this patch!

Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

 Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (The benefit is smaller, but every cycle counts in this area.)

This whole XDP exercise is about avoiding a transaction cost per
packet; that reads "bulking" or "bundling" of packets, where possible.

 Lets do bundling/bulking from the start!

The reason behind the xmit_more API is that we could not change the
API of all the drivers.  And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost that
would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
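
To illustrate, a rough sketch of the kind of explicit bundle I'm asking
for; every name here is an invented placeholder for whatever the
driver's RX loop actually uses:

/* Sketch only: all names are invented placeholders; the point is only
 * that the doorbell (mlx5_cqwq_update_db_record() in mlx5) is rung once,
 * after posting the whole bundle.
 */
#include <linux/netdevice.h>	/* NAPI_POLL_WEIGHT */

struct xdp_tx_bundle {
	void	*pkts[NAPI_POLL_WEIGHT];	/* page/packet handles */
	int	 cnt;
};

void xdp_tx_post_one(void *sq, void *pkt);	/* hypothetical */
void xdp_tx_ring_doorbell(void *sq);		/* hypothetical */

static inline void xdp_tx_bundle_add(struct xdp_tx_bundle *b, void *pkt)
{
	b->pkts[b->cnt++] = pkt;
}

/* Called once, at the end of the RX poll loop */
static void xdp_tx_bundle_flush(void *sq, struct xdp_tx_bundle *b)
{
	int i;

	for (i = 0; i < b->cnt; i++)
		xdp_tx_post_one(sq, b->pkts[i]);

	if (b->cnt)
		xdp_tx_ring_doorbell(sq);	/* single tailptr/doorbell write */
	b->cnt = 0;
}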

In the future, XDP needs to support XDP_FWD forwarding of packets/pages
out other interfaces.  I also want bulk transmit from day-1 here.  It
is slightly more tricky to sort packets for multiple outgoing
interfaces efficiently in the poll loop.

But the mSwitch[1] article actually already solved this destination
sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
understanding the next steps, for a smarter data structure, when
starting to have more TX "ports".  And perhaps align your single
XDP_TX destination data structure to this future development.

[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
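
As a naive strawman for the multi-port case (this is not the mSwitch
algorithm, just per-destination bucketing to show the direction of the
data structure, reusing the bundle sketch above):

/* Naive strawman, not mSwitch: bucket packets per destination port in
 * the RX poll loop, then flush each bucket (one doorbell per port).
 */
#define XDP_FWD_MAX_PORTS	8	/* made-up limit */

struct xdp_fwd_bundles {
	struct xdp_tx_bundle port[XDP_FWD_MAX_PORTS];
};

void xdp_fwd_flush_port(int port, struct xdp_tx_bundle *b);	/* hypothetical */

static inline void xdp_fwd_enqueue(struct xdp_fwd_bundles *fb,
				   int port, void *pkt)
{
	xdp_tx_bundle_add(&fb->port[port], pkt);
}

/* Called once at the end of the RX poll loop */
static void xdp_fwd_flush_all(struct xdp_fwd_bundles *fb)
{
	int p;

	for (p = 0; p < XDP_FWD_MAX_PORTS; p++)
		if (fb->port[p].cnt)
			xdp_fwd_flush_port(p, &fb->port[p]);
}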

--Jesper
(top post)


On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed  wrote:

> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
> 
> XDP forward packet rate:
> 
> Comparing XDP with and w/o xmit more (bulk transmit):
> 
> Streams XDP TX   XDP TX (xmit more)
> ---
> 1   4.90Mpps  7.50Mpps
> 2   9.50Mpps  14.8Mpps
> 4   16.5Mpps  25.1Mpps
> 8   21.5Mpps  27.5Mpps*
> 16  24.1Mpps  27.5Mpps*
> 
> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> we will be working on the analysis and will publish the conclusions
> later.
> 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h|  9 ++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 
> +++--
>  2 files changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index df2c9e0..6846208 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -265,7 +265,8 @@ struct mlx5e_cq {
>  
>  struct mlx5e_rq;
>  typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
> -struct mlx5_cqe64 *cqe);
> +struct mlx5_cqe64 *cqe,
> +bool *xdp_doorbell);
>  typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe 
> *wqe,
> u16 ix);
>  
> @@ -742,8 +743,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
>  
>  void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
>   bool recycle);
> -void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> +void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +  bool *xdp_doorbell);
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +bool *xdp_doorbell);
>  bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
>  int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 
> ix);
>  int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,  
> u16 ix);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 912a0e2..ed93251 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct 
> mlx5e_rq *rq,
>  static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>  

Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-08 Thread Jesper Dangaard Brouer via iovisor-dev
On Wed, 7 Sep 2016 23:55:42 +0300
Or Gerlitz  wrote:

> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  wrote:
> > From: Rana Shahout 
> >
> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
> >
> > When XDP is on we make sure to change channels RQs type to
> > MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
> > ensure "page per packet".
> >
> > On XDP set, we fail if HW LRO is set and request from user to turn it
> > off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
> > annoying, but we prefer not to enforce LRO off from XDP set function.
> >
> > Full channels reset (close/open) is required only when setting XDP
> > on/off.
> >
> > When XDP set is called just to exchange programs, we will update
> > each RQ xdp program on the fly and for synchronization with current
> > data path RX activity of that RQ, we temporally disable that RQ and
> > ensure RX path is not running, quickly update and re-enable that RQ,
> > for that we do:
> > - rq.state = disabled
> > - napi_synnchronize
> > - xchg(rq->xdp_prg)
> > - rq.state = enabled
> > - napi_schedule // Just in case we've missed an IRQ
> >
> > Packet rate performance testing was done with pktgen 64B packets and on
> > TX side and, TC drop action on RX side compared to XDP fast drop.
> >
> > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> >
> > Comparison is done between:
> > 1. Baseline, Before this patch with TC drop action
> > 2. This patch with TC drop action
> > 3. This patch with XDP RX fast drop
> >
> > Streams    Baseline(TC drop)    TC drop      XDP fast Drop
> > ---------------------------------------------------------------
> > 1          5.51Mpps             5.14Mpps     13.5Mpps
> 
> This (13.5 M PPS) is less than 50% of the result we presented @ the
> XDP summit which was obtained by Rana. Please see if/how much does
> this grows if you use more sender threads, but all of them to xmit the
> same stream/flows, so we're on one ring. That (XDP with single RX ring
> getting packets from N remote TX rings) would be your canonical
> base-line for any further numbers.

Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
that you should be able to reach 23Mpps on a single CPU.  This is
an XDP-drop simulation with order-0 pages being recycled through my
page_pool code, plus avoiding the cache-misses (notice you are using a
CPU E5-2680 with DDIO, thus you should only see an L3 cache miss).

The 23Mpps number looks like some HW limitation, as the increase
is not proportional to the page-allocator overhead I removed (and the CPU freq
starts to decrease).  I also did scaling tests to more CPUs, which
showed it scaled up to 40Mpps (you reported 45M).  And at the Phy RX
level I see 60Mpps (50G max is 74Mpps).

Notice this is a significant improvement over the mlx4/CX3-pro HW, as
it only scales up to 20Mpps, but can also do 20Mpps XDP-drop on a
single core.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support

2016-09-08 Thread Jesper Dangaard Brouer via iovisor-dev
On Wed, 7 Sep 2016 20:07:01 +0300
Saeed Mahameed  wrote:

> On Wed, Sep 7, 2016 at 7:54 PM, Tom Herbert  wrote:
> > On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
> >  wrote:  
> >> On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz  wrote:  
> >>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed  
> >>> wrote:
> >>>  
>  Packet rate performance testing was done with pktgen 64B packets and on
>  TX side and, TC drop action on RX side compared to XDP fast drop.
> 
>  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
>  Comparison is done between:
>  1. Baseline, Before this patch with TC drop action
>  2. This patch with TC drop action
>  3. This patch with XDP RX fast drop
> 
>  Streams    Baseline(TC drop)    TC drop      XDP fast Drop
>  ---------------------------------------------------------------
>  1          5.51Mpps             5.14Mpps     13.5Mpps
>  2          11.5Mpps             10.0Mpps     25.1Mpps
>  4          16.3Mpps             17.2Mpps     35.4Mpps
>  8          29.6Mpps             28.2Mpps     45.8Mpps*
>  16         34.0Mpps             30.1Mpps     45.8Mpps*
> >>>
> >>> Rana, Guys, congrat!!
> >>>
> >>> When you say X streams, does each stream mapped by RSS to different RX 
> >>> ring?
> >>> or we're on the same RX ring for all rows of the above table?  
> >>
> >> Yes, I will make this more clear in the actual submission,
> >> Here we are talking about different RSS core rings.
> >>  
> >>>
> >>> In the CX3 work, we had X sender "streams" that all mapped to the same RX 
> >>> ring,
> >>> I don't think we went beyond one RX ring.  
> >>
> >> Here we did, the first row is what you are describing the other rows
> >> are the same test
> >> with increasing the number of the RSS receiving cores, The xmit side is 
> >> sending
> >> as many streams as possible to be as much uniformly spread as possible
> >> across the
> >> different RSS cores on the receiver.
> >>  
> > Hi Saeed,
> >
> > Please report CPU utilization also. The expectation is that
> > performance should scale linearly with increasing number of CPUs (i.e.
> > pps/CPU_utilization should be constant).
> >  
> 
> That was my expectation too.

Be careful with such expectations at these extreme speeds, because we
are starting to hit PCI-express limitations and CPU cache-coherency
limitations (if any atomic/RMW operations still exist per packet).

Consider that in the 64-byte small-packet case, the driver's PCI bandwidth
need/overhead is actually quite large, as every descriptor is also 64
bytes transferred.
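
(Rough numbers: at, say, 45 Mpps with 64 byte completion descriptors,
the descriptors alone amount to 45*10^6 * 64 bytes = 2.9 GBytes/s,
roughly 23 Gbit/s of PCIe bandwidth, on top of the packet data itself
and the PCIe transaction-layer overhead.)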

 
> Anyway we will share more accurate results when we have them, with CPU
> utilization statistics as well.

It is interesting to monitor the CPU utilization, because (if C-states
are enabled) you will likely see the CPU freq be reduced, or the CPU even
enter idle states, in case your software (XDP) gets faster than the HW
(PCI or NIC).  I've seen that happen with mlx4/CX3-pro.
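
(This is easy to spot while testing by watching the CPU frequency, e.g.
with turbostat, or by repeatedly running:

 $ grep "cpu MHz" /proc/cpuinfo

during the test.)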

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Jesper Dangaard Brouer via iovisor-dev

On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert  wrote:

> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend  
> wrote:
> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >>
> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed 
> >>  wrote:  
> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  
> >>> wrote:  
>  On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
> > wrote:  
> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> [...]  
> 
>  Only if a qdisc is present and pressure is high enough.
> 
>  But in a forwarding setup, we likely receive at a lower rate than the
>  NIC can transmit.  
> >>
> >> Yes, I can confirm this happens in my experiments.
> >>  
>   
> >>>
> >>> Jesper has a similar Idea to make the qdisc think it is under
> >>> pressure, when the device TX ring is idle most of the time, i think
> >>> his idea can come in handy here. I am not fully involved in the
> >>> details, maybe he can elaborate more.
> >>>
> >>> But if it works, it will be transparent to napi, and xmit more will
> >>> happen by design.  
> >>
> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> indicating xmit_more should be possible.  This will be a topic at the
> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> >> challenge people to come up with a good solution ;-)
> >>  
> >
> > One thing I've noticed but haven't yet actually analyzed much is if
> > I shrink the nic descriptor ring size to only be slightly larger than
> > the qdisc layer bulking size I get more bulking and better perf numbers.
> > At least on microbenchmarks. The reason being the nic pushes back more
> > on the qdisc. So maybe a case for making the ring size in the NIC some
> > factor of the expected number of queues feeding the descriptor ring.
> >  

I've also played with shrinking the NIC descriptor ring size; it works,
but it is an ugly hack to get the NIC to push back, and I foresee it will
hurt normal use-cases. (There are other reasons for shrinking the ring
size, like cache usage, but that is unrelated to this.)

 
> BQL is not helping with that?

Exactly. But the BQL _byte_ limit is not what is needed; what we need
to know is the number of _packets_ currently "in-flight", which Tom already has
a patch for :-)  Once we have that, the algorithm is simple.

Qdisc dequeue looks at the BQL pkts-in-flight; if the driver has "enough"
packets in-flight, the qdisc starts its bulk-dequeue building phase
before calling the driver. The allowed max qdisc bulk size should
likely be related to pkts-in-flight.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand

2016-09-07 Thread Jesper Dangaard Brouer via iovisor-dev

On Wed,  7 Sep 2016 15:42:25 +0300 Saeed Mahameed  wrote:

> For non-striding RQ configuration before this patch we had a ring
> with pre-allocated SKBs and mapped the SKB->data buffers for
> device.
> 
> For robustness and better RX data buffers management, we allocate a
> page per packet and build_skb around it.
> 
> This patch (which is a prerequisite for XDP) will actually reduce
> performance for normal stack usage, because we are now hitting a bottleneck
> in the page allocator. A later patch of page reuse mechanism will be
> needed to restore or even improve performance in comparison to the old
> RX scheme.

Yes, it is true that there is a performance reduction (for normal
stack, not XDP) caused by hitting a bottleneck in the page allocator.

I actually have a PoC implementation of my page_pool that shows we
regain the performance and then some.  It is based on an earlier version of
this patch, where I hooked it into the mlx5 driver (50Gbit/s version).


Your description might be a bit outdated, as this patch and the patch before
do contain your own driver-local page-cache recycle facility.  And you
also show that you regain quite a lot of the lost performance.

Your driver-local page_cache does have its limitations (see comments on
the other patch), as it depends on a timely refcnt decrease by the users of
the page.  If they hold onto pages (like TCP does), then your page-cache will
not be efficient.

 
> Packet rate performance testing was done with pktgen 64B packets on
> xmit side and TC drop action on RX side.

I assume this is TC _ingress_ dropping, like [1]

[1] 
https://github.com/netoptimizer/network-testing/blob/master/bin/tc_ingress_drop.sh

> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
> Comparison is done between:
>  1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
>  2.Build SKB with RX page cache (This patch)
> 
> Streams    Baseline    Build SKB+page-cache    Improvement
> ------------------------------------------------------------
> 1          4.33Mpps    5.51Mpps                27%
> 2          7.35Mpps    11.5Mpps                52%
> 4          14.0Mpps    16.3Mpps                16%
> 8          22.2Mpps    29.6Mpps                20%
> 16         24.8Mpps    34.0Mpps                17%

The improvements gained from using your page-cache are impressively high.

Thanks for working on this,
 --Jesper
 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |  10 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  31 +++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 215 
> +++---
>  3 files changed, 133 insertions(+), 123 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index afbdf70..a346112 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -65,6 +65,8 @@
>  #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW0x3
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW0x6
>  
> +#define MLX5_RX_HEADROOM NET_SKB_PAD
> +
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE   6  /* >= 6, HW restriction */
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS  8  /* >= 6, HW 
> restriction */
>  #define MLX5_MPWRQ_LOG_WQE_SZ18
> @@ -302,10 +304,14 @@ struct mlx5e_page_cache {
>  struct mlx5e_rq {
>   /* data path */
>   struct mlx5_wq_ll  wq;
> - u32wqe_sz;
> - struct sk_buff   **skb;
> +
> + struct mlx5e_dma_info *dma_info;
>   struct mlx5e_mpw_info *wqe_info;
>   void  *mtt_no_align;
> + struct {
> + u8 page_order;
> + u32wqe_sz;/* wqe data buffer size */
> + } buff;
>   __be32 mkey_be;
>  
>   struct device *pdev;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index c84702c..c9f1dea 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -411,6 +411,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>   void *rqc = param->rqc;
>   void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
>   u32 byte_count;
> + u32 frag_sz;
> + int npages;
>   int wq_sz;
>   int err;
>   int i;
> @@ -445,29 +447,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  
>   rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
>   rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
> - rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> - byte_count = rq->wqe_sz;
> +
> + rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> + byte_count = rq->buff.wqe_sz;
>   rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
>  

Re: [iovisor-dev] [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ

2016-09-07 Thread Jesper Dangaard Brouer via iovisor-dev

On Wed,  7 Sep 2016 15:42:22 +0300 Saeed Mahameed  wrote:

> From: Tariq Toukan 
> 
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.

Thank you for doing this! MM-list people thank you!

For others to understand what this means: this driver was doing
split_page() on high-order pages (for Striding RQ).  This was really
bad, because it fragments the page-allocator and quickly depletes the
available high-order pages.

(I've left rest of patch intact below, if some MM people should be
interested in looking at the changes).

There is even a funny comment in split_page() relevant to this:

/* [...]
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.
 */


> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
> 
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
> 
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction
> 

Well, the reduction does not really matter that much, because your
baseline benchmarks are from a freshly booted system, where you have
not yet fragmented and depleted the high-order pages... ;-)


> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi 
> packet WQE")
> Signed-off-by: Tariq Toukan 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h   |  54 ++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 136 --
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 292 
> -
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 -
>  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
>  5 files changed, 184 insertions(+), 304 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index bf722aa..075cdfc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -62,12 +62,12 @@
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE0xd
>  
>  #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW0x1
> -#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW0x4
> +#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW0x3
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW0x6
>  
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE   6  /* >= 6, HW restriction */
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS  8  /* >= 6, HW 
> restriction */
> -#define MLX5_MPWRQ_LOG_WQE_SZ17
> +#define MLX5_MPWRQ_LOG_WQE_SZ18
>  #define MLX5_MPWRQ_WQE_PAGE_ORDER  (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? 
> \
>   MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
>  #define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
> @@ -293,8 +293,8 @@ struct mlx5e_rq {
>   u32wqe_sz;
>   struct sk_buff   **skb;
>   struct mlx5e_mpw_info *wqe_info;
> + void  *mtt_no_align;
>   __be32 mkey_be;
> - __be32 umr_mkey_be;
>  
>   struct device *pdev;
>   struct net_device *netdev;
> @@ -323,32 +323,15 @@ struct mlx5e_rq {
>  
>  struct mlx5e_umr_dma_info {
>   __be64*mtt;
> - __be64*mtt_no_align;
>   dma_addr_t mtt_addr;
> - struct mlx5e_dma_info *dma_info;
> + struct mlx5e_dma_info  dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
> + struct mlx5e_umr_wqe   wqe;
>  };
>  
>  struct mlx5e_mpw_info {
> - union {
> - struct mlx5e_dma_info dma_info;
> - struct mlx5e_umr_dma_info umr;
> - };
> + struct mlx5e_umr_dma_info umr;
>   u16 consumed_strides;
>   u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
> -
> - void (*dma_pre_sync)(struct device *pdev,
> 

Re: [iovisor-dev] [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle

2016-09-07 Thread Jesper Dangaard Brouer via iovisor-dev

On Wed,  7 Sep 2016 15:42:24 +0300 Saeed Mahameed  wrote:

> From: Tariq Toukan 
> 
> Instead of reallocating and mapping pages for RX data-path,
> recycle already used pages in a per ring cache.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - order0 no cache
> * 4,786,899 - order0 with cache
> 1% gain
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - order0 no cache
> * 4,127,852 - order0 with cache
> 3.7% gain
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - order0 no cache
> * 3,931,708 - order0 with cache
> 5.4% gain
> 
> Signed-off-by: Tariq Toukan 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h   | 16 ++
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 15 ++
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 57 
> --
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++
>  4 files changed, 99 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 075cdfc..afbdf70 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
>   u8  tired;
>  };
>  
> +/* a single cache unit is capable to serve one napi call (for non-striding 
> rq)
> + * or a MPWQE (for striding rq).
> + */
> +#define MLX5E_CACHE_UNIT (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
> +  MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
> +#define MLX5E_CACHE_SIZE (2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
> +struct mlx5e_page_cache {
> + u32 head;
> + u32 tail;
> + struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
> +};
> +
[...]
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index c1cb510..8e02af3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq 
> *rq, u16 ix)
>   mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>  }
>  
> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> +   struct mlx5e_dma_info *dma_info)
> +{
> + struct mlx5e_page_cache *cache = &rq->page_cache;
> + u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
> +
> + if (tail_next == cache->head) {
> + rq->stats.cache_full++;
> + return false;
> + }
> +
> + cache->page_cache[cache->tail] = *dma_info;
> + cache->tail = tail_next;
> + return true;
> +}
> +
> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> +   struct mlx5e_dma_info *dma_info)
> +{
> + struct mlx5e_page_cache *cache = &rq->page_cache;
> +
> + if (unlikely(cache->head == cache->tail)) {
> + rq->stats.cache_empty++;
> + return false;
> + }
> +
> + if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> + rq->stats.cache_busy++;
> + return false;
> + }

Hmmm... doesn't this cause "blocking" of the page_cache recycle
facility until the page at the head of the queue gets its (page) refcnt
decremented?  A real use-case could fairly easily cause this...

> +
> + *dma_info = cache->page_cache[cache->head];
> + cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
> + rq->stats.cache_reuse++;
> +
> + dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
> +DMA_FROM_DEVICE);
> + return true;
> +}
> +
>  static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
> struct mlx5e_dma_info *dma_info)
>  {
> - struct page *page = dev_alloc_page();
> + struct page *page;
> +
> + if (mlx5e_rx_cache_get(rq, dma_info))
> + return 0;
>  
> + page = dev_alloc_page();
>   if (unlikely(!page))
>   return -ENOMEM;

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-07 Thread Jesper Dangaard Brouer via iovisor-dev

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed  
wrote:
> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet  wrote:
> > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet  
> >> wrote:  
> >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
[...]
> >
> > Only if a qdisc is present and pressure is high enough.
> >
> > But in a forwarding setup, we likely receive at a lower rate than the
> > NIC can transmit.

Yes, I can confirm this happens in my experiments.

> >  
> 
> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device TX ring is idle most of the time, i think
> his idea can come in handy here. I am not fully involved in the
> details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating that xmit_more should be possible.  This will be a topic at
the Network Performance Workshop[1] at NetDev 1.2, where I will
hopefully challenge people to come up with a good solution ;-)
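
Purely as a strawman of what such driver feedback could look like (none
of the fields or helpers below exist; all names are invented):

 /* Strawman sketch: a per-TX-queue hint the qdisc dequeue path could
  * consult before handing packets to the driver one at a time.
  */
 static inline bool qdisc_driver_wants_bulk(const struct netdev_queue *txq)
 {
        /* Hypothetical field: the driver sets tx_bulk_hint when its TX
         * ring has plenty of room and the doorbell write dominates the
         * per-packet cost, i.e. xmit_more would actually help.
         */
        return READ_ONCE(txq->tx_bulk_hint);
 }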

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP at netdev1.2 - "XDP port abstraction" proposal

2016-08-23 Thread Jesper Dangaard Brouer via iovisor-dev
On Mon, 22 Aug 2016 14:57:00 -0700
Tom Herbert via iovisor-dev  wrote:

> On Mon, Aug 22, 2016 at 2:44 PM, Thomas Graf  wrote:
> > 2016-08-09 17:47 GMT-04:00 Tom Herbert via iovisor-dev
> > :  
> >> Hi,
> >>
> >> I requested a workshop at netdev (Oct. 5-7 in Tokyo) to present and
> >> showcase XDP. The goal of this is to start providing the public
> >> information about XDP and to start talking about real world use cases
> >> and development.
> >>
> >> I am looking for presenters of various topics in XDP. The content
> >> might be similar to what we had at XDP summit, but the target audience
> >> should be considered to be potentail users and programmers of XDP as
> >> opposed to developers working on the infrastructure.  
> >
[...]
> 
> So far we'd have:
>   - Ceth
>   - Cilium
>   - ILA router/load balancer
> 
> We should also have an overview and introduction to XDP. So I'm
> thinking 30 minutes for intro/overview then 1 hr. for the above (e.g.
> 20 minutes for each).

I have an infrastructure-related subject that I would like to discuss
during the XDP workshop.

- "XDP port abstraction"

Currently XDP only supports packet forwarding (back) out the incoming
interface, which is rather limiting.  Instead of doing something
Linux-specific like using the ifindex, I'm proposing to introduce what
I call an XDP port abstraction.  This is inspired by the mSwitch[1][2]
paper (by Michio Honda).

The basic idea: when loading an XDP program, it registers which
port-abstraction table "instance" it is using (or creates a new
instance).  As a minimum it registers its own port, which also needs to
be provided as meta-data ("incoming-port") to the XDP program.  The
port abstraction is a table lookup, and the XDP program simply returns
an index into this table.  The kernel handles forwarding between ports
(the mSwitch "fabric"), based on the port "type" in the table.  XDP
implements the "logic" behind the forwarding decision, and several XDP
programs can share a bpf-map to synchronize their decisions/state.
(See the mSwitch data-plane split between "fabric" and "logic".)

The port "type" should be though as wide/abstract as possible.  It
should be easy to introduce a new "type" of port.  An obvious "type"
would be net_device, but the forwarding efficiency depend on what
features or NDO-functions the specific net_device supports.  I even
imagine a socket could be a port-type.  AFAIK Tom imagine a hardware
specific port-type.

Paper: [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
Slides: [2] http://www.slideshare.net/micchie/mswitch-paper-talk-at-sosr-2015
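
To make the port-abstraction idea slightly more concrete, here is a
rough sketch of what the table and the kernel-side "fabric" dispatch
could look like.  Nothing below exists; every struct, constant and
helper name is invented purely for illustration:

 /* Sketch only -- illustrating the "fabric"/"logic" split. */
 enum xdp_port_type {
        XDP_PORT_NETDEV,        /* forward via a net_device */
        XDP_PORT_SOCKET,        /* deliver into a socket/application channel */
        XDP_PORT_HW,            /* hardware-specific port */
 };

 struct xdp_port {
        enum xdp_port_type type;
        union {
                struct net_device *dev;
                struct sock *sk;
        };
 };

 struct xdp_port_table {                        /* the shared "instance" */
        u32 num_ports;
        struct xdp_port ports[XDP_MAX_PORTS];   /* XDP_MAX_PORTS: made up */
 };

 /* Kernel-side "fabric": the XDP program (the "logic") only returns an
  * index into the table; the kernel does the type-specific forwarding.
  * The two *_hypothetical() calls stand in for code that does not exist.
  */
 static int xdp_fabric_forward(struct xdp_port_table *tbl, u32 idx,
                               struct xdp_buff *xdp)
 {
        struct xdp_port *port;

        if (idx >= tbl->num_ports)
                return -EINVAL;
        port = &tbl->ports[idx];

        switch (port->type) {
        case XDP_PORT_NETDEV:
                return xdp_xmit_on_netdev_hypothetical(port->dev, xdp);
        case XDP_PORT_SOCKET:
                return xdp_queue_to_socket_hypothetical(port->sk, xdp);
        default:
                return -EOPNOTSUPP;
        }
 }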



A side-note: the mSwitch paper also contains a novel approach of
sorting and grouping incoming packets before forwarding.  This is also
something that is needed, but it is orthogonal to the port-abstraction
table.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-26 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 26 Jul 2016 10:53:05 -0700
John Fastabend  wrote:

> On 16-07-26 09:08 AM, Tom Herbert wrote:
> > On Tue, Jul 26, 2016 at 6:31 AM, Thomas Monjalon
> >  wrote:  
> >> Hi,
> >>
> >> About RX filtering, there is an ongoing effort in DPDK to write an API
> >> which could leverage most of the hardware capabilities of any NICs:
> >> https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
> >> http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
> >> I understand that XDP does not target to support every hardware features,
> >> though it may be an interesting approach to check.
> >>  
> > Thomas,
> > 
> > A major goal of XDP is to leverage and in fact encourage innovation in
> > hardware features. But, we are asking that vendors design the APIs
> > with the community in mind. For instance, if XDP supports crypto
> > offload it should have one API that different companies, we don't want
> > every vendor coming up with their own.  
> 
> The work in those threads is to create a single API for users of DPDK
> to interact with their hardware. The equivalent interface in Linux
> kernel is ntuple filters from ethtool the effort here is to make a
> usable interface to manage this from an application and also expose
> all the hardware features. Ethtool does a fairly poor job on both
> fronts IMO.

Yes, the ethtool + ntuple approach is unfortunately not very easy to use :-(


> If we evolve the mechanism to run per rx queue xdp programs this
> interface could easily be used to forward packets to specific rx
> queues and run targeted xdp programs.
> 
> Integrating this functionality into running XDP programs as ebpf code
> seems a bit challenging to me because there is no software equivalent.
> Once XDP ebpf program is running the pkt has already landed on the rx
> queue. To me the mechanism to bind XDP programs to rx queues and steer
> specific flows (e.g. match a flow label and forward to a queue) needs
> to be part of the runtime environment not part of the main ebpf program
> itself. 

I agree, based on the discussion in this thread.  I will admit that my
initial idea of adding this filter to the eBPF/XDP program was not such
a good one.

> The runtime environment could use the above linked API. I know
> we debated earlier including this in the ebpf program itself but that
> really doesn't seem feasible to me. Whether the steering is expresses
> as an ebpf program or an API like above seems like a reasonable
> discussion. Perhaps a section could be used to describe the per program
> filter for example which would be different from an API approach used
> in the proposal above or the JIT could translate it into the above
> API for devices without instruction based hardware.

It seems like someone actually put some thought into this, in the link
you sent... quite interesting, thanks:
 https://rawgit.com/6WIND/rte_flow/master/rte_flow.html

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-12 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 12 Jul 2016 12:13:01 -0700
John Fastabend  wrote:

> On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
> > On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:  
> >> On Fri, 8 Jul 2016 18:51:07 +0100
> >> Jakub Kicinski  wrote:
> >>  
> >>> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:  
>  The only distinction between VFs and queue groupings on my side is VFs
>  provide RSS where as queue groupings have to be selected explicitly.
>  In a programmable NIC world the distinction might be lost if a "RSS"
>  program can be loaded into the NIC to select queues but for existing
>  hardware the distinction is there.
> >>>
> >>> To do BPF RSS we need a way to select the queue which I think is all
> >>> Jesper wanted.  So we will have to tackle the queue selection at some
> >>> point.  The main obstacle with it for me is to define what queue
> >>> selection means when program is not offloaded to HW...  Implementing
> >>> queue selection on HW side is trivial.  
> >>
> >> Yes, I do see the problem of fallback, when the programs "filter" demux
> >> cannot be offloaded to hardware.
> >>
> >> First I though it was a good idea to keep the "demux-filter" part of
> >> the eBPF program, as software fallback can still apply this filter in
> >> SW, and just mark the packets as not-zero-copy-safe.  But when HW
> >> offloading is not possible, then packets can be delivered every RX
> >> queue, and SW would need to handle that, which hard to keep transparent.
> >>
> >>  
>  If you demux using a eBPF program or via a filter model like
>  flow_director or cls_{u32|flower} I think we can support both. And this
>  just depends on the programmability of the hardware. Note flow_director
>  and cls_{u32|flower} steering to VFs is already in place.
> >>
> >> Maybe we should keep HW demuxing as a separate setup step.
> >>
> >> Today I can almost do what I want: by setting up ntuple filters, and (if
> >> Alexei allows it) assign an application specific XDP eBPF program to a
> >> specific RX queue.
> >>
> >>  ethtool -K eth2 ntuple on
> >>  ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42
> >>
> >> Then the XDP program can be attached to RX queue 42, and
> >> promise/guarantee that it will consume all packet.  And then the
> >> backing page-pool can allow zero-copy RX (and enable scrubbing when
> >> refilling pool).  
> > 
> > so such ntuple rule will send udp4 traffic for specific ip and port
> > into a queue then it will somehow gets zero-copied to vm?
> > . looks like a lot of other pieces about zero-copy and qemu need to be
> > implemented (or at least architected) for this scheme to be conceivable
> > . and when all that happens what vm is going to do with this very specific
> > traffic? vm won't have any tcp or even ping?  
> 
> I have perhaps a different motivation to have queue steering in 'tc
> cls-u32' and eventually xdp. The general idea is I have thousands of
> queues and I can bind applications to the queues. When I know an
> application is bound to a queue I can enable per queue busy polling (to
> be implemented), set specific interrupt rates on the queue
> (implementation will be posted soon), bind the queue to the correct
> cpu, etc.

+1 

binding applications to queues.

This is basically what our customers are requesting. They have one or
two applications that need DPDK speeds.  But they don't like dedicating
an entire NIC per application (like DPDK requires).

The basic idea is actually more fundamental.  It reminds me of Van
Jacobson's netchannels[1], where he talks about "Channelize" (slides
24+).  Creating a full "application" channel allows for a lock-free
single-producer single-consumer (SPSC) queue directly into the
application.

[1] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
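
For readers not familiar with the term: the point of an SPSC channel is
that, with exactly one producer (the RX queue) and one consumer (the
application), the ring needs no locks at all, only ordered head/tail
updates.  A minimal generic sketch, using kernel-style primitives and
not tied to any driver:

 /* Minimal SPSC ring sketch, illustration only.  Size is a power of two. */
 #define SPSC_SIZE 256

 struct spsc_ring {
        void *slot[SPSC_SIZE];
        unsigned int head;      /* written only by the consumer */
        unsigned int tail;      /* written only by the producer */
 };

 static bool spsc_enqueue(struct spsc_ring *r, void *pkt)
 {
        unsigned int tail = r->tail;
        unsigned int next = (tail + 1) & (SPSC_SIZE - 1);

        if (next == READ_ONCE(r->head))
                return false;                   /* ring full */
        r->slot[tail] = pkt;
        smp_store_release(&r->tail, next);      /* publish the slot */
        return true;
 }

 static void *spsc_dequeue(struct spsc_ring *r)
 {
        unsigned int head = r->head;
        void *pkt;

        if (head == smp_load_acquire(&r->tail))
                return NULL;                    /* ring empty */
        pkt = r->slot[head];
        smp_store_release(&r->head, (head + 1) & (SPSC_SIZE - 1));
        return pkt;
 }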


> ntuple works OK for this now but xdp provides more flexibility and
> also lets us add additional policy on the queue other than simply
> queue steering.
> 
> I'm not convinced though that the demux queue selection should be part
> of the XDP program itself just because it has no software analog to me
> it sits in front of the set of XDP programs. But I think I could perhaps
> be convinced it does if there is some reasonable way to do it. I guess
> the single program method would result in an XDP program that read like
> 
>   if (rx_queue == x)
>do_foo
>   if (rx_queue == y)
>do_bar

Yes, that is also why I wanted an XDP program per RX queue.  But the
"channelize" concept is more important.

 
> A hardware jit may be able to sort that out. Or use per queue
> sections.
> 
> > 
> > the network virtualization traffic is typically encapsulated,
> > so if xdp is used to do steer the traffic, the program would need
> > to figure out vm id based on headers, strip tunnel, apply policy
> > before forwarding the packet further. Clearly hw ntuple is not
> > going to suffice.
> >
> > If there is no networking virtualization and V

Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-09 Thread Jesper Dangaard Brouer via iovisor-dev
On Fri, 8 Jul 2016 18:51:07 +0100
Jakub Kicinski  wrote:

> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
> > The only distinction between VFs and queue groupings on my side is VFs
> > provide RSS where as queue groupings have to be selected explicitly.
> > In a programmable NIC world the distinction might be lost if a "RSS"
> > program can be loaded into the NIC to select queues but for existing
> > hardware the distinction is there.  
> 
> To do BPF RSS we need a way to select the queue which I think is all
> Jesper wanted.  So we will have to tackle the queue selection at some
> point.  The main obstacle with it for me is to define what queue
> selection means when program is not offloaded to HW...  Implementing
> queue selection on HW side is trivial.

Yes, I do see the problem of fallback, when the programs "filter" demux
cannot be offloaded to hardware.

At first I thought it was a good idea to keep the "demux-filter" part
of the eBPF program, as a software fallback can still apply this filter
in SW, and just mark the packets as not-zero-copy-safe.  But when HW
offloading is not possible, packets can be delivered to every RX queue,
and SW would need to handle that, which is hard to keep transparent.


> > If you demux using a eBPF program or via a filter model like
> > flow_director or cls_{u32|flower} I think we can support both. And this
> > just depends on the programmability of the hardware. Note flow_director
> > and cls_{u32|flower} steering to VFs is already in place.  

Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want: by setting up ntuple filters, and
(if Alexei allows it) assigning an application-specific XDP eBPF
program to a specific RX queue.

 ethtool -K eth2 ntuple on
 ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42, and
promise/guarantee that it will consume all packets.  And then the
backing page-pool can allow zero-copy RX (and enable scrubbing when
refilling the pool).


> Yes, for steering to VFs we could potentially reuse a lot of existing
> infrastructure.
> 
> > The question I have is should the "filter" part of the eBPF program
> > be a separate program from the XDP program and loaded using specific
> > semantics (e.g. "load_hardware_demux" ndo op) at the risk of building
> > a ever growing set of "ndo" ops. If you are running multiple XDP
> > programs on the same NIC hardware then I think this actually makes
> > sense otherwise how would the hardware and even software find the
> > "demux" logic. In this model there is a "demux" program that selects
> > a queue/VF and a program that runs on the netdev queues.  
> 
> I don't think we should enforce the separation here.  What we may want
> to do before forwarding to the VF can be much more complicated than
> pure demux/filtering (simple eg - pop VLAN/tunnel).  VF representative
> model works well here as fallback - if program could not be offloaded
> it will be run on the host and "trombone" packets via VFR into the VF.

That is an interesting idea.

> If we have a chain of BPF programs we can order them in increasing
> level of complexity/features required and then HW could transparently
> offload the first parts - the easier ones - leaving more complex
> processing on the host.

I'll try to keep out of the discussion of how to structure the BPF
program, as it is outside my "area".
 
> This should probably be paired with some sort of "skip-sw" flag to let
> user space enforce the HW offload on the fast path part.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-08 Thread Jesper Dangaard Brouer via iovisor-dev

On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski  
wrote:

> On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
>
> > > If the goal is to just separate XDP traffic from non-XDP traffic
> > > you could accomplish this with a combination of SR-IOV/macvlan to
> > > separate the device queues into multiple netdevs and then run XDP
> > > on just one of the netdevs. Then use flow director (ethtool) or
> > > 'tc cls_u32/flower' to steer traffic to the netdev. This is how
> > > we support multiple networking stacks on one device by the way it
> > > is called the bifurcated driver. Its not too far of a stretch to
> > > think we could offload some simple XDP programs to program the
> > > splitting of traffic instead of cls_u32/flower/flow_director and
> > > then you would have a stack of XDP programs. One running in
> > > hardware and a set running on the queues in software.
> > 
> >
> > the above sounds like much better approach then Jesper/mine
> > prog_per_ring stuff.
> >
> > If we can split the nic via sriov and have dedicated netdev via VF
> > just for XDP that's way cleaner approach. I guess we won't need to
> > do xdp_rxqmask after all.  
> 
> +1
> 
> I was thinking about using eBPF to direct to NIC queues but concluded
> that doing a redirect to a VF is cleaner.  Especially if the PF driver
> supports VF representatives we could potentially just use
> bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by
> the same stack.

I actually disagree.

I _do_ want to use the "filter" part of eBPF to direct to NIC queues, and
then run a single/specific XDP program on that queue.

Why do I want this?

This is part of solving a very fundamental CS problem (early demux),
when wanting to support zero-copy on RX.  The basic problem is that the
NIC driver needs to map RX pages into the RX ring prior to receiving
packets.  Thus, we need HW support to steer packets, to gain enough
isolation (e.g. between tenant domains) to allow zero-copy.


Depending on the flexibility of the HW filter, the granularity
achievable for isolation (e.g. application-specific) is much more
flexible than splitting up the entire NIC with SR-IOV, VFs or
macvlans.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] XDP seeking input from NIC hardware vendors

2016-07-07 Thread Jesper Dangaard Brouer via iovisor-dev

Would it make sense, from a hardware point of view, to split the XDP
eBPF program into two stages?

 Stage-1: Filter (restricted eBPF / no helper calls)
 Stage-2: Program

Then the HW can choose to offload the stage-1 "filter", and keep the
likely more advanced stage-2 on the kernel side.  Do HW vendors see a
benefit in this approach?


The generic problem I'm trying to solve is parsing.  E.g. the first
step in every XDP program will be to parse the packet data, in order to
determine whether this is a packet the XDP program should process.

Actions from stage-1 "filter" program:
 - DROP (like XDP_DROP, early drop)
 - PASS (like XDP_PASS, normal netstack)
 - MATCH (call stage-2, likely carry-over opaque return code)

The MATCH action should likely carry over an opaque return code that
makes sense for the stage-2 program, e.g. a proto id and/or data
offset.
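
As a purely illustrative sketch (neither the stage split nor these
return-code macros exist anywhere), a stage-1 "filter" could look
roughly like the following restricted XDP program, written in the
samples/bpf style and assuming the usual SEC() helper macro; stage-2
would then be a normal XDP program that receives the opaque code:

 /* Sketch only: stage-1 filter, restricted eBPF with no helper calls.
  * The STAGE1_* return convention is invented for illustration.
  */
 #include <linux/bpf.h>
 #include <linux/if_ether.h>

 #define STAGE1_DROP            0
 #define STAGE1_PASS            1
 #define STAGE1_MATCH(off)      (2 | ((off) << 8))  /* opaque: data offset */

 SEC("xdp_stage1_filter")
 int stage1_filter(struct xdp_md *ctx)
 {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        if (data + sizeof(*eth) > data_end)
                return STAGE1_DROP;     /* malformed: early drop */

        if (eth->h_proto != __constant_htons(ETH_P_IP))
                return STAGE1_PASS;     /* not ours: normal netstack */

        /* MATCH: tell stage-2 where the IP header starts */
        return STAGE1_MATCH(sizeof(*eth));
 }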

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


[iovisor-dev] XDP eBPF extending with an input parameter?

2016-07-06 Thread Jesper Dangaard Brouer via iovisor-dev

How easy is it to extend the XDP/eBPF program with a new input
parameter?

I see the need for providing the input "source-port" to the eBPF
program.  E.g. it is needed when implementing a basic learning bridge
(basically a map of ).  The concept of "ports" is not defined yet;
thus, this change needs to be part of a later patchset.

Maybe this "source-port-id" can instead be stored as part of the eBPF
program structure? (as it is likely known at registration time)
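
A hedged sketch of what the consumer side might look like, in the
samples/bpf style, assuming a new input field (here called
ctx->ingress_port, which does not exist; it is exactly the parameter
being asked for above):

 /* Sketch only: learning-bridge style "learn" step keyed on source MAC.
  * ctx->ingress_port is a hypothetical new XDP input parameter.
  */
 struct bpf_map_def SEC("maps") fdb = {
        .type = BPF_MAP_TYPE_HASH,
        .key_size = ETH_ALEN,           /* source MAC address */
        .value_size = sizeof(__u32),    /* learned port id */
        .max_entries = 4096,
 };

 SEC("xdp_bridge")
 int xdp_bridge_prog(struct xdp_md *ctx)
 {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        __u32 in_port;

        if (data + sizeof(*eth) > data_end)
                return XDP_DROP;

        in_port = ctx->ingress_port;    /* hypothetical field */
        /* learn: this source MAC was seen on in_port */
        bpf_map_update_elem(&fdb, eth->h_source, &in_port, BPF_ANY);

        /* Forwarding would look up eth->h_dest and redirect to the found
         * port -- left out here, as "ports" are not defined yet.
         */
        return XDP_PASS;
 }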


Reminder: XDP patchset I'm referring to is avail here[1]:
 [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=xdp

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Assertion fails at samples/bpf/test_maps

2016-06-22 Thread Jesper Dangaard Brouer via iovisor-dev
On Wed, 22 Jun 2016 08:59:14 -0700
Alexei Starovoitov  wrote:

> On Wed, Jun 22, 2016 at 8:54 AM, Jesper Dangaard Brouer
>  wrote:
> > On Tue, 21 Jun 2016 18:54:53 -0700
> > Alexei Starovoitov  wrote:
> >  
> >> On Tue, Jun 21, 2016 at 11:48 AM, William Tu via iovisor-dev
> >>  wrote:  
> >> > Hi,
> >> >
> >> > I'm running test_maps under net-next/samples/bpf/, commit
> >> > 601009780635. The code logic all make sense but I got the assertion
> >> > errors / coredump for some unknown reason under different compiler
> >> > optimization flags (-O0 and -O2). The test_hashmap_sanity() works fine
> >> > but fails at test_percpu_hashmap_sanity().
> >> >
> >> > First run with normal build (which has -O2)
> >> > [root@vm-dev bpf]#  ./test_maps
> >> > test_maps: samples/bpf/test_maps.c:137: test_percpu_hashmap_sanity:
> >> > Assertion `bpf_lookup_elem(map_fd, &key, value) == -1 &&
> >> > (*__errno_location ()) == 2' failed.  
> > [...]
> >
> > Strange I get:
> >
> > $ sudo ./test_maps
> > failed to create per-cpu arraymap 'Operation not permitted'  
> 
> yes. that's expected. Make sure to do 'ulimit -l unlimited'

# ulimit -l unlimited
# ./test_maps 
test_maps: OK

Then it works, thanks. Maybe we could provide some better feedback from
the test program, as this might be a common/expected behavior.
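
For example (just a sketch of the kind of feedback meant), the test
program could raise the memlock limit itself, or at least print a hint
when map creation fails with EPERM:

 /* Sketch: bump RLIMIT_MEMLOCK from within the test program, so users
  * do not need to remember 'ulimit -l unlimited'.
  */
 #include <stdio.h>
 #include <sys/resource.h>

 static void bump_memlock_rlimit(void)
 {
        struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

        if (setrlimit(RLIMIT_MEMLOCK, &r))
                perror("setrlimit(RLIMIT_MEMLOCK) failed, try 'ulimit -l unlimited'");
 }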

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] Assertion fails at samples/bpf/test_maps

2016-06-22 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 21 Jun 2016 18:54:53 -0700
Alexei Starovoitov  wrote:

> On Tue, Jun 21, 2016 at 11:48 AM, William Tu via iovisor-dev
>  wrote:
> > Hi,
> >
> > I'm running test_maps under net-next/samples/bpf/, commit
> > 601009780635. The code logic all make sense but I got the assertion
> > errors / coredump for some unknown reason under different compiler
> > optimization flags (-O0 and -O2). The test_hashmap_sanity() works fine
> > but fails at test_percpu_hashmap_sanity().
> >
> > First run with normal build (which has -O2)
> > [root@vm-dev bpf]#  ./test_maps
> > test_maps: samples/bpf/test_maps.c:137: test_percpu_hashmap_sanity:
> > Assertion `bpf_lookup_elem(map_fd, &key, value) == -1 &&
> > (*__errno_location ()) == 2' failed.  
[...]

Strange I get:

$ sudo ./test_maps
failed to create per-cpu arraymap 'Operation not permitted'

strace says:

 bpf(BPF_MAP_CREATE, {map_type=0x6 /* BPF_MAP_TYPE_??? */, key_size=4, value_size=8, max_entries=2}, 48) = -1 EPERM (Operation not permitted)

 
> that's even more weird.
> May be some compiler issue.
> I've tested with gcc 5.2 and 6.1
[...]

Compile host: Fedora 22
 * LLVM version 3.7.1 (llc and clang same version) 
 * gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC)

Ran on a Fedora 23 host, running Alexei's kernel tree on "xdp" branch,
samples/bpf from same git tree.

> 
> Daniel, Brenden, Jesper, did you ever see anything like it?

Not working for me actually... but different error, strange.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP summit agenda

2016-06-22 Thread Jesper Dangaard Brouer via iovisor-dev

On Tue, 21 Jun 2016 18:25:14 -0700
Tom Herbert via iovisor-dev  wrote:

> Here is an agenda for the XDP summit. We expect this to be a lot of
> interactive conversation so the agenda can be dynamic as needed!
> 
[...]
> 9:15-10:15 Implementation status
[...]
>   - page pool

Given I'll not participate physically, I'll give my page-pool status
here.

The page-pool idea was presented at MM-summit in April 2016:
 
http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

Since then I've written a more detailed design document (reviewed by
Tom and Alexei).  I've not implemented any code for the page-pool yet.
I've read and discussed the current page allocator's per-CPU caching of
order-0 pages with Mel Gorman.  Mel is open to changes in this area.
We are looking at doing a "native" bulk transfer between the page-pool
and the page allocator, by aligning/sharing some of the same
data-structures (I see this as the page-pool "backend").  I want to
prototype-implement a page-pool frontend, in order to (1) benchmark
whether it is fast enough for the perf target, and (2) figure out the
API drivers need.
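
To give a feel for the API shape being explored, here is a hypothetical
frontend sketch (nothing is implemented; every name and field below is
provisional and may well change):

 /* Hypothetical page_pool frontend API -- sketch for discussion only. */
 struct page_pool;                      /* opaque, one instance per RX ring */

 struct page_pool_params {
        unsigned int    order;          /* page order, typically 0 */
        unsigned int    pool_size;      /* number of recycled pages to keep */
        int             nid;            /* preferred NUMA node */
        struct device  *dev;            /* for DMA mapping, may be NULL */
 };

 struct page_pool *page_pool_create(const struct page_pool_params *params);
 void page_pool_destroy(struct page_pool *pool);

 /* RX alloc path: try the recycle cache first, otherwise fall back to
  * the page allocator -- ideally via a bulk request to the "backend".
  */
 struct page *page_pool_alloc_page(struct page_pool *pool, gfp_t gfp);

 /* Return path: called when the driver/stack is done with the page. */
 void page_pool_put_page(struct page_pool *pool, struct page *page);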

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Re: [iovisor-dev] XDP summit agenda

2016-06-22 Thread Jesper Dangaard Brouer via iovisor-dev
On Tue, 21 Jun 2016 18:25:14 -0700
Tom Herbert via iovisor-dev  wrote:

> Here is an agenda for the XDP summit. We expect this to be a lot of
> interactive conversation so the agenda can be dynamic as needed!
> 
> Thanks,
> Tom
> 
> 8:30 Breakfast
> 9:00 Introductions, agenda, logistics, goals of the summit
> 
> 9:15-10:15 Implementation status
>   - Mellanox, Intel development
>   - page pool
>   - driver status and issues
>   - performance numbers
>   - generalizing skbs
>   - open issues
> 10:15-10:45 Real world uses cases in development
>  - ILA router

Does this ILA router do multi-leg routing?  If not, then I would add
the use case for "normal" routing.

>  - load balancer
>  - others?

Use-case: DDoS filter
 - This was the original use-case from CloudFlare at netdevconf[1]

Use-case: RAW packet capture
 - IDS systems like Suricata need this (af_packet is not fast enough)
 - Only support "steal" mode


>  - Issues developers are hitting
> 10:45-11:00 Break
> 11:15-11:45
>HW offload of BPF/XDP
>   - Netronome
>   - Barefoot switch HW
>   - Encryption, advance features
> 11:45-12:15 XDP/Ceth integration
> 
> 12:15-1:30 Lunch
> 
> 1:30-2:00 Cilium
> 2:00-3:00 APIs
>  - Meta data
>  - Return codes
>  - Portability but still allow using latest and greatest HW features
> 3:00-3:15 Break
> 3:15-4:00 Support for network virtualization
> 4:00-5:00 Ecosystem considerations
>   - Toolchain, debugging
>   - BPF programs repository management
> 5:00-6:00 Open discussion, planning, wrapup
> 
> 6:30- Dinner at Rangoon Ruby, Palo Alto CA


[1] http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
___
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev