Re: [iovisor-dev] bpf_redirect_map not working after tail call
On Fri, 1 Jun 2018 14:15:58 +0200 Sebastiano Miano via iovisor-dev wrote:
> Dear all,
>
> We have noticed that bpf_redirect_map returns an error when it is
> called after a tail call. The xdp_redirect_map program (under
> samples/bpf) works fine, but if we modify it as shown in the following
> diff, it doesn't work anymore. I have debugged it with the xdp_monitor
> application and the error returned is EFAULT.
> Is this a known issue? Am I doing something wrong?

Argh, this is likely an issue/bug due to the check xdp_map_invalid(),
which was introduced in commit 7c3001313396 ("bpf: fix ri->map_owner
pointer on bpf_prog_realloc").

To Daniel: I don't know how to solve this, could you give some advice?

static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
				   unsigned long aux)
{
	return (unsigned long)xdp_prog->aux != aux;
}

static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
			       struct bpf_prog *xdp_prog)
{
	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
	unsigned long map_owner = ri->map_owner;
	struct bpf_map *map = ri->map;
	u32 index = ri->ifindex;
	void *fwd = NULL;
	int err;
	[...]
	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
		err = -EFAULT;
		map = NULL;
		goto err;
	}
	[...]

> P.S. I have tested the program with the latest bpf-next kernel.
> diff --git a/samples/bpf/xdp_redirect_map_kern.c b/samples/bpf/xdp_redirect_map_kern.c
> index 740a529..bf1275a 100644
> --- a/samples/bpf/xdp_redirect_map_kern.c
> +++ b/samples/bpf/xdp_redirect_map_kern.c
> @@ -36,6 +36,13 @@ struct bpf_map_def SEC("maps") rxcnt = {
>  	.max_entries = 1,
>  };
>
> +struct bpf_map_def SEC("maps") prog_table = {
> +	.type = BPF_MAP_TYPE_PROG_ARRAY,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(int),
> +	.max_entries = 32,
> +};
> +
>  static void swap_src_dst_mac(void *data)
>  {
>  	unsigned short *p = data;
> @@ -89,4 +96,15 @@ int xdp_redirect_dummy_prog(struct xdp_md *ctx)
>  	return XDP_PASS;
>  }
>
> +/* Entry point */
> +SEC("xdp_redirect_entry_point")
> +int xdp_redirect_entry_point_prog(struct xdp_md *ctx)
> +{
> +	//char fmt[] = "xdp_redirect_entry_point\n";
> +	//bpf_trace_printk(fmt, sizeof(fmt));
> +	bpf_tail_call(ctx, &prog_table, 0);
> +	// Tail call failed
> +	return XDP_DROP;
> +}
> +
>  char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/xdp_redirect_map_user.c b/samples/bpf/xdp_redirect_map_user.c
> index 4445e76..b2d2059 100644
> --- a/samples/bpf/xdp_redirect_map_user.c
> +++ b/samples/bpf/xdp_redirect_map_user.c
> @@ -120,7 +120,13 @@ int main(int argc, char **argv)
>  		return 1;
>  	}
>
> -	if (bpf_set_link_xdp_fd(ifindex_in, prog_fd[0], xdp_flags) < 0) {
> +	ret = bpf_map_update_elem(map_fd[2], &key, &prog_fd[0], 0);
> +	if (ret) {
> +		perror("bpf_update_elem");
> +		goto out;
> +	}
> +
> +	if (bpf_set_link_xdp_fd(ifindex_in, prog_fd[2], xdp_flags) < 0) {
>  		printf("ERROR: link set xdp fd failed on %d\n", ifindex_in);
>  		return 1;
>  	}
>
> ___
> iovisor-dev mailing list
> iovisor-dev@lists.iovisor.org
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] bpftool binary size
(Cc. iovisor-dev list)

On Wed, 18 Apr 2018 21:32:45 +0200 Jesper Dangaard Brouer wrote:
> On Wed, 18 Apr 2018 20:13:39 +0100 Quentin Monnet wrote:
> > Hi Jesper,
> >
> > Out of curiosity, how much did you say your bpftool executable weigh? Was
> > it 2 MB? I recompiled it on my laptop and it's 173 kB, 113 kB if stripped.
> > I don't believe it embeds libopcode on my system (right now Ubuntu 18.04).
> > What is your setup, if you don't mind me asking?
>
> On my Fedora 26, it is 2.2MB (2207976) and stripped 2MB (2046784).
>
> For some reason the bpftool binary gets statically linked with
> libopcodes, which is /usr/lib64/libopcodes.a 1.76MB.

So, I figured out why this happens.

What does bpftool use BFD for? I need the BFD lib and include file
/usr/include/bfd.h. Thus, on Fedora I need to install the package
binutils-devel.

The file /usr/lib64/libopcodes.so contains an ld-script that enforces
static linking:

$ cat /usr/lib64/libopcodes.so
/* GNU ld script */
/* Ensure this .so library will not be used by a link for a different
   format on a multi-architecture system.  */
OUTPUT_FORMAT(elf64-x86-64)
INPUT ( /usr/lib64/libopcodes.a -lbfd )

The info/description on the RPM file is interesting: it explicitly
explains why this static linking is enforced... and "strongly
encourages" developers to use libelf instead of BFD. Is using libelf
instead an option?
$ rpm -qi binutils-devel
Name        : binutils-devel
Version     : 2.27
Release     : 28.fc26
Architecture: x86_64
Install Date: Wed 18 Apr 2018 09:34:52 PM CEST
Group       : System Environment/Libraries
Size        : 4651893
License     : GPLv3+
Signature   : RSA/SHA256, Thu 02 Nov 2017 10:37:53 AM CET, Key ID 812a6b4b64dab85d
Source RPM  : binutils-2.27-28.fc26.src.rpm
Build Date  : Fri 13 Oct 2017 03:48:11 PM CEST
Build Host  : buildvm-16.phx2.fedoraproject.org
Relocations : (not relocatable)
Packager    : Fedora Project
Vendor      : Fedora Project
URL         : http://sources.redhat.com/binutils
Summary     : BFD and opcodes static and dynamic libraries and header files
Description :
This package contains BFD and opcodes static and dynamic libraries.

The dynamic libraries are in this package, rather than a seperate base
package because they are actually linker scripts that force the use of
the static libraries.  This is because the API of the BFD library is
too unstable to be used dynamically.

The static libraries are here because they are now needed by the
dynamic libraries.

Developers starting new projects are strongly encouraged to consider
using libelf instead of BFD.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?
On Fri, 6 Apr 2018 12:36:18 +0200 Daniel Borkmann wrote: > On 04/05/2018 10:51 PM, Jesper Dangaard Brouer wrote: > > On Thu, 5 Apr 2018 12:37:19 +0200 > > Daniel Borkmann wrote: > > > >> On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote: > >>> Hi Suricata people, > >>> > >>> When Eric Leblond (and I helped) integrated XDP in Suricata, we ran > >>> into the issue, that at Suricata load/start time, we cannot determine > >>> if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on > >>> this HW (e.g require driver XDP_REDIRECT support and bpf cpumap). > >>> > >>> We would have liked a way to report that suricata.yaml config was > >>> invalid for this hardware/setup. Now, it just loads, and packets gets > >>> silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints). > >>> > >>> My question to suricata developers: (Q1) Do you already have code that > >>> query the kernel or drivers for features? > >>> > >>> At the IOvisor call (2 weeks ago), we discussed two options of exposing > >>> XDP features avail in a given driver. > >>> > >>> Option#1: Extend existing ethtool -k/-K "offload and other features" > >>> with some XDP features, that userspace can query. (Do you already query > >>> offloads, regarding Q1) > >>> > >>> Option#2: Invent a new 'ip link set xdp' netlink msg with a query option. > >>> > >> > >> I don't really mind if you go via ethtool, as long as we handle this > >> generically from there and e.g. call the dev's ndo_bpf handler such that > >> we keep all the information in one place. This can be a new ndo_bpf command > >> e.g. XDP_QUERY_FEATURES or such. > > > > Just to be clear: notice as Victor points out[2], they are programmable > > going though the IOCTL (SIOCETHTOOL) and not using cmdline tools. > > Sure, that was perfectly clear. (But at the same time if you extend the > ioctl, it's obvious to also add support to actual ethtool cmdline tool.) 
> > [2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326
> >
> > If you want everything to go through the drivers ndo_bpf call anyway
> > (which userspace API is netlink based) then at what point do you
>
> Not really, that's the front end. ndo_bpf itself is a plain netdev op
> and has no real tie to netlink.
>
> > want drivers to call their own ndo_bpf, when activated through their
> > ethtool_ops? (Sorry, but I don't follow the flow you are proposing)
> >
> > Notice, I'm not directly against using the drivers ndo_bpf call. I can
> > see it does provide the kernel more flexibility than the ethtool IOCTL.
>
> What I was saying is that even if you go via the ethtool ioctl api, where
> you end up in dev_ethtool() and have some new ETHTOOL_* query command,
> then instead of adding a new ethtool_ops callback, we can and should
> reuse ndo_bpf from there.

Okay, you want to catch this in dev_ethtool(); that connected the dots
for me, thanks. BUT it does feel a little fake and confusing to do this,
as driver developers seeing the info via ethtool would expect they need
to implement an ethtool_ops callback.

My original idea behind going through the ethtool APIs was that every
driver is already using this, and it's familiar to the driver
developers. And there are existing userspace tools that sysadmins
already know, that we can leverage.

Thinking about this over the weekend, I have changed my mind a bit,
based on your feedback. I do buy into your idea of forcing things
through the drivers' ndo_bpf call. I'm no longer convinced this should
go through ethtool, but (as you describe) we can, if we find that this
is the easiest way to export and leverage an existing userspace tool.

> [...]
> > Here, I want to discuss how drivers expose/tell userspace that they
> > support a given feature: Specifically a bit for: XDP_REDIRECT action
> > support.
> >
> >> Same for meta data,
>
> > Well, not really. It would be a "nice-to-have", but not strictly
> > needed as a feature bit. XDP meta-data is controlled via a helper.
> > And the BPF-prog can detect/see at runtime that the helper
> > bpf_xdp_adjust_meta returns -ENOTSUPP (and needs to check the ret
> > value anyhow). Thus, there is not that much gained by exposing this
> > to be detected at setup time, as all drivers should eventually
> > support this, and we can detect it at runtime.
Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?
On Thu, 5 Apr 2018 14:47:16 -0700 Jakub Kicinski wrote:
> On Thu, 5 Apr 2018 22:51:33 +0200, Jesper Dangaard Brouer wrote:
> > > What about nfp in terms of XDP offload capabilities, should they be
> > > included as well or is probing to load the program and see if it
> > > loads/JITs as we do today just fine (e.g. you'd otherwise end up
> > > with extra flags on a per BPF helper basis)?
> >
> > No flags per BPF helper basis. As I've described above, helpers belong
> > to the BPF core, not the driver. Here I want to know what the specific
> > driver supports.
>
> I think Daniel meant for nfp offload. The offload restrictions are
> quite involved, are we going to be able to express those?

Let's keep things separate. I'm requesting something really simple. I
want the driver to tell me which XDP actions it supports. We/I can
implement an XDP_QUERY_ACTIONS via ndo_bpf, problem solved. It is small,
specific and simple.

For my other use-case of enabling XDP-xmit on a device, I can implement
another ndo_bpf extension. The current approach is loading a dummy XDP
prog via ndo_bpf anyway (which is awkward). Again, a specific change
that lets us move one step further.

For your nfp offload use-case, you/we have to find a completely
different solution. You have hit a design choice made by BPF, which is
that BPF is part of the core kernel, and helpers cannot be loaded as
kernel modules, as we cannot remove or add helpers after the verifier
has certified the program. And your nfp offload driver basically comes
as a kernel module. (Details: you basically already solved your issue by
modifying the core verifier to do a callback to
bpf_prog_offload_verify_insn()).

Point being, this is very different from what I'm requesting. Thus, for
offloading you already have a solution to my setup-time detection
problem, as your program gets rejected at setup/load time by the
verifier.
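The XDP_QUERY_ACTIONS idea above was never spelled out as code in this thread. A minimal userspace-side sketch of what such a query could return and how a Suricata-style startup check might consume it could look like the following. All names here are illustrative, NOT an existing kernel API; only the XDP action values themselves match include/uapi/linux/bpf.h:

```c
#include <stdint.h>

/* XDP return actions, values as in include/uapi/linux/bpf.h.
 * The _sketch suffix marks this as a local illustration. */
enum xdp_action_sketch {
	XDP_ABORTED = 0,
	XDP_DROP,
	XDP_PASS,
	XDP_TX,
	XDP_REDIRECT,
};

#define BIT(nr) (1U << (nr))

/* What a hypothetical XDP_QUERY_ACTIONS ndo_bpf command might report:
 * a bitmask of the XDP actions the driver's receive path implements.
 * This example driver lacks XDP_REDIRECT support, so that bit is off. */
static uint32_t example_driver_query_actions(void)
{
	return BIT(XDP_ABORTED) | BIT(XDP_DROP) | BIT(XDP_PASS) | BIT(XDP_TX);
}

/* Setup-time check an application could do before trusting
 * e.g. an xdp-cpu-redirect config option. */
static int driver_supports(uint32_t mask, enum xdp_action_sketch act)
{
	return (mask & BIT(act)) != 0;
}
```

With such a mask, a config like xdp-cpu-redirect could be rejected at load time instead of packets being silently dropped later.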
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] [Oisf-devel] Best userspace programming API for XDP features query to kernel?
On Thu, 5 Apr 2018 09:47:37 +0200 Victor Julien wrote:
> > Option#2: Invent a new 'ip link set xdp' netlink msg with a query option.
>
> Do you have an example of how this is queried?

The code for querying should not be too difficult. It would likely be
similar to how we currently "set"/attach an XDP program, via its BPF
file-descriptor, to an ifindex.

Eric Leblond chose to hide this in the kernel library "libbpf", see
code: function bpf_set_link_xdp_fd()
https://github.com/torvalds/linux/blob/master/tools/lib/bpf/bpf.c#L456-L575

Given Suricata already depends on libbpf for eBPF and XDP support, it
might make sense to add an API call to "get" XDP link info, e.g.
bpf_get_link_xdp_features(int ifindex)?

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?
On Thu, 5 Apr 2018 12:37:19 +0200 Daniel Borkmann wrote: > On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote: > > Hi Suricata people, > > > > When Eric Leblond (and I helped) integrated XDP in Suricata, we ran > > into the issue, that at Suricata load/start time, we cannot determine > > if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on > > this HW (e.g require driver XDP_REDIRECT support and bpf cpumap). > > > > We would have liked a way to report that suricata.yaml config was > > invalid for this hardware/setup. Now, it just loads, and packets gets > > silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints). > > > > My question to suricata developers: (Q1) Do you already have code that > > query the kernel or drivers for features? > > > > At the IOvisor call (2 weeks ago), we discussed two options of exposing > > XDP features avail in a given driver. > > > > Option#1: Extend existing ethtool -k/-K "offload and other features" > > with some XDP features, that userspace can query. (Do you already query > > offloads, regarding Q1) > > > > Option#2: Invent a new 'ip link set xdp' netlink msg with a query option. > > I don't really mind if you go via ethtool, as long as we handle this > generically from there and e.g. call the dev's ndo_bpf handler such that > we keep all the information in one place. This can be a new ndo_bpf command > e.g. XDP_QUERY_FEATURES or such. Just to be clear: notice as Victor points out[2], they are programmable going though the IOCTL (SIOCETHTOOL) and not using cmdline tools. [2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326 If you want everything to go through the drivers ndo_bpf call anyway (which userspace API is netlink based) then at what point to you want drivers to call their own ndo_bpf, when activated though their ethtool_ops ? (Sorry, but I don't follow the flow you are proposing) Notice, I'm not directly against using the drivers ndo_bpf call. 
I can see it does provide the kernel more flexibility than the ethtool IOCTL.

> More specifically, how would such feature mask look like? How fine grained
> would this be? When you add a new minor feature to, say, cpumap that not
> all drivers support yet, we'd need a new flag each time, no?

No, CPUMAP is not a driver-level feature, and thus does not require a
features flag exposed by the driver. CPUMAP depends on the driver
feature XDP_REDIRECT. It is important we separate driver-level features
and BPF/XDP core features. I feel that we constantly talk past
each-other when we mix that up.

It is true that Suricata _also_ needs to detect if the running kernel
supports the map type called BPF_MAP_TYPE_CPUMAP. *BUT* that is a
completely separate mechanism. It is a core kernel bpf feature, and I
have accepted that this can only be done via probing the kernel (simply
use the bpf syscall and try to create this map type).

Here, I want to discuss how drivers expose/tell userspace that they
support a given feature: Specifically a bit for: XDP_REDIRECT action
support.

> Same for meta data,

Well, not really. It would be a "nice-to-have", but not strictly needed
as a feature bit. XDP meta-data is controlled via a helper. And the
BPF-prog can detect/see at runtime that the helper bpf_xdp_adjust_meta
returns -ENOTSUPP (and needs to check the ret value anyhow). Thus,
there is not that much gained by exposing this to be detected at setup
time, as all drivers should eventually support this, and we can detect
it at runtime.

The missing XDP_REDIRECT action feature bit is different, as the
BPF-prog cannot detect at runtime that this is an unsupported action.
Plus, at setup time we cannot query the driver for supported XDP
actions.

> then potentially for the redirect memory return work,

I'm not sure the redirect memory return types belong here. First of all
it is per RX-ring. Second, some other config method will likely be the
config interface. Like with AF_XDP-zero-copy it is a new NDO.
So, it is more loosely coupled.

> or the af_xdp bits,

No need for a bit for AF_XDP in copy-mode (current RFC), as this only
depends on the driver supporting the XDP_REDIRECT action.

For AF_XDP in zero-copy mode, then yes, we need a way for userspace to
"see" if this mode is supported by the driver. But it might not need a
feature bit here... as the bind() call (which knows the ifindex) could
fail when it tries to enable this ZC mode. It would make userspace's
life easier to add ZC as a driver feature bit.

> the xdp_rxq_info would have needed it, etc.

Same comment as mem-type: not necessarily, as it is more loosely
coupled to XDP.

> What about nfp in terms of XDP
> offload capabilities, should they be included as well or is
[iovisor-dev] Best userspace programming API for XDP features query to kernel?
Hi Suricata people, When Eric Leblond (and I helped) integrated XDP in Suricata, we ran into the issue, that at Suricata load/start time, we cannot determine if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on this HW (e.g require driver XDP_REDIRECT support and bpf cpumap). We would have liked a way to report that suricata.yaml config was invalid for this hardware/setup. Now, it just loads, and packets gets silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints). My question to suricata developers: (Q1) Do you already have code that query the kernel or drivers for features? At the IOvisor call (2 weeks ago), we discussed two options of exposing XDP features avail in a given driver. Option#1: Extend existing ethtool -k/-K "offload and other features" with some XDP features, that userspace can query. (Do you already query offloads, regarding Q1) Option#2: Invent a new 'ip link set xdp' netlink msg with a query option. (Q2) Do Suricata devs have any preference (or other options/ideas) for the way the kernel expose this info to userspace? [1] http://suricata.readthedocs.io/en/latest/capture-hardware/ebpf-xdp.html#the-xdp-cpu-redirect-case -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] Notification when an eBPF map is modified
On Sat, 17 Feb 2018 13:49:22 + Teng Qin via iovisor-dev wrote:
> > We were looking for a mechanism transparent to the eBPF program, though.
> > A possible rationale is to have a hot-standby copy of the program
> > (including the state) in some other location, but I don't want my
> > dataplane to be aware of that.
> > Thanks,
> >
> > fulvio
>
> You could also (use another BPF program or ftrace) to trace the
> bpf_map_update_elem tracepoint. But in that case you get all update
> calls and would need to filter for the one you are interested in on
> your own :)

That is a good idea. Try it out via perf-record to see if it contains
what you need:

 $ perf record -e bpf:bpf_map_update_elem -a
 $ perf script
 xdp_redirect_ma  2273 [011] 261187.968223: bpf:bpf_map_update_elem: map type= ufd=4 key=[00 00 00 00] val=[07 00 00 00]

Looking at the above output and the tracepoint kernel code, we should
extend that with a map_id to easily identify/filter what map you are
interested in. See patch below my signature (not even compile tested).
Example for attaching to tracepoints, see: samples/bpf/xdp_monitor_*.c

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer


tracepoint: add map id to bpf tracepoints

From: Jesper Dangaard Brouer

---
 include/trace/events/bpf.h |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/bpf.h b/include/trace/events/bpf.h
index 150185647e6b..e6479ba45261 100644
--- a/include/trace/events/bpf.h
+++ b/include/trace/events/bpf.h
@@ -140,7 +140,7 @@ TRACE_EVENT(bpf_map_create,
 		__entry->flags = map->map_flags;
 		__entry->ufd = ufd;
 	),
-
+// TODO also add map_id here
 	TP_printk("map type=%s ufd=%d key=%u val=%u max=%u flags=%x",
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd, __entry->size_key, __entry->size_value,
@@ -199,15 +199,18 @@ DECLARE_EVENT_CLASS(bpf_obj_map,
 		__field(u32, type)
 		__field(int, ufd)
 		__string(path, pname->name)
+		__field(u32, map_id)
 	),

 	TP_fast_assign(
 		__assign_str(path, pname->name);
 		__entry->type = map->map_type;
 		__entry->ufd = ufd;
+		__entry->map_id = map->id;
 	),

-	TP_printk("map type=%s ufd=%d path=%s",
+	TP_printk("map id=%u type=%s ufd=%d path=%s",
+		  __entry->map_id,
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd, __get_str(path))
 );
@@ -244,6 +247,7 @@ DECLARE_EVENT_CLASS(bpf_map_keyval,
 		__dynamic_array(u8, val, map->value_size)
 		__field(bool, val_trunc)
 		__field(int, ufd)
+		__field(u32, map_id)
 	),

 	TP_fast_assign(
@@ -255,9 +259,11 @@ DECLARE_EVENT_CLASS(bpf_map_keyval,
 		__entry->val_len = min(map->value_size, 16U);
 		__entry->val_trunc = map->value_size != __entry->val_len;
 		__entry->ufd = ufd;
+		__entry->map_id = map->id;
 	),

-	TP_printk("map type=%s ufd=%d key=[%s%s] val=[%s%s]",
+	TP_printk("map id=%d type=%s ufd=%d key=[%s%s] val=[%s%s]",
+		  __entry->map_id,
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd,
 		  __print_hex(__get_dynamic_array(key), __entry->key_len),
Re: [iovisor-dev] XDP router exam - bpf create map failed error.
On Sun, 28 Jan 2018 17:29:56 +0300 "Mr. Haritsu via iovisor-dev" wrote:
> Hi Guys,
>
> I'm a beginner with XDP and BPF. I'm also working on the XDP exams and
> prototype-kernel. The xdp_ddos01_blacklist exam worked and xdp dropped
> packets.
>
> But xdp_router_ipv4 isn't working.
>
> root@xdptester:~/samples/bpf# ./xdp_router_ipv4 6 7
> failed to create a map: 1 Operation not permitted
>
> Does anyone have this problem?

It is frustrating that people keep hitting this, and cannot figure out
what the issue is... This is because the kernel returns EPERM and not
ENOMEM... if the kernel had returned an "out of memory" error, I bet
you could have figured out that you need to increase the "locked
memory" limit (RLIMIT_MEMLOCK) via the function call setrlimit().

https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html
http://man7.org/linux/man-pages/man3/setrlimit.3p.html

Can I ask: What school is giving exams/assignments in XDP?

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
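The setrlimit() fix Jesper points to can be sketched in plain C. This minimal version raises the RLIMIT_MEMLOCK soft limit to the hard limit (which works unprivileged); the actual samples typically set both to RLIM_INFINITY, which needs root, the same privilege attaching XDP needs anyway:

```c
#include <sys/resource.h>

/* Raise the locked-memory soft limit to the hard limit, so that
 * bpf(BPF_MAP_CREATE) does not fail with EPERM on kernels that
 * account BPF map memory against RLIMIT_MEMLOCK.
 * Returns 0 on success, -1 on failure (errno is set). */
static int raise_memlock_limit(void)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
		return -1;
	lim.rlim_cur = lim.rlim_max;	/* raise soft limit to hard limit */
	if (setrlimit(RLIMIT_MEMLOCK, &lim) != 0)
		return -1;
	return 0;
}
```

Call this early in main(), before any bpf_create_map()/load calls.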
Re: [iovisor-dev] Is there information about eBPF + OVS ?
On Wed, 27 Dec 2017 16:58:06 +0900 Heung Sik Choi via iovisor-dev wrote:
> Hi, I need help and am sending you a mail.
>
> I recently found that there was an OVS using eBPF.
> (http://openvswitch.org/support/ovscon2016/7/1120-tu.pdf)
>
> However, I can not find any information on how to install it.
>
> If you are using eBPF + OVS, please let me know if you have any
> insights about it.

I'll redirect that question to William Tu (Cc'ed).

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] [PATCH, bpf-llvm] Fix bug on silently truncating 64-bit immediate
On Thu, 12 Oct 2017 06:21:30 -0400 Jiong Wang wrote:
> We came across an llvm bug when compiling some testcases that 64-bit
> immediates are silently truncated into 32-bit and then packed into
> BPF_JMP | BPF_K encoding. This caused comparison with wrong value.

I think you sent this to the wrong mailing list... this looks like a
patch against the LLVM source code. Shouldn't you send it to:
llvm-...@lists.llvm.org ?

> This bug looks to be introduced by r308080.

This looks like a very recent change:
https://llvm.org/viewvc/llvm-project?view=revision&revision=308080
Sat Jul 15 05:41:42 2017 UTC (2 months, 4 weeks ago)

(As you are sending this to a user mailing list: xdp-newb...@vger.kernel.org)
I want to know if this has made it into an LLVM release, and which
release? As the git-repo[1] replica of the LLVM SVN-repo does not
git-tag the releases, I could not answer this question myself via the
command:

 $ git describe --contains 7c423e0690

[1] http://llvm.org/git/llvm.git

> The Select_Ri pattern is
> supposed to be lowered into J*_Ri while the latter only support 32-bit
> immediate encoding, therefore Select_Ri should have similar immediate
> predicate check as what J*_Ri are doing.
>
> Reported-by: Jakub Kicinski
> Signed-off-by: Jiong Wang
> ---
>  lib/Target/BPF/BPFISelLowering.cpp |  8 ++--
>  lib/Target/BPF/BPFInstrInfo.td     |  2 +-
>  test/CodeGen/BPF/select_ri.ll      | 35 +++
>  3 files changed, 42 insertions(+), 3 deletions(-)

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] eBPF prog getting rejected with devmap (and cpumap)
On Thu, 21 Sep 2017 22:06:27 +0200 Daniel Borkmann wrote:
> On 09/21/2017 10:02 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
> > Hi Daniel,
> >
> > My very simple program is getting rejected at bpf load time, when
> > using another map as input for a decision, for map type BPF_MAP_TYPE_DEVMAP.
> >
> > SEC("xdp_redirect_map_rr")
> > int xdp_prog_redirect_map_rr(struct xdp_md *ctx)
> > {
> > 	void *data_end = (void *)(long)ctx->data_end;
> > 	void *data = (void *)(long)ctx->data;
> > 	struct ethhdr *eth = data;
> > 	int vport = 0;
> > 	u32 key = 0;
> > 	long *value;
> >
> > 	// count packet in global counter
> > 	value = bpf_map_lookup_elem(&rxcnt, &key);
> > 	if (value)
> > 		*value += 1;
> >
> > 	/* Strange, verifier reject this (with LLVM version 3.9.1) */
> > 	vport = *value % 2;
>
> Here it's dereferenced where above it could have been NULL from the
> lookup, so the verifier is correct. You'd need to place it into the
> if clause above or just do 'if (!value) return XDP_ABORTED' or the
> like.

Ah, I see! Thanks for the quick answer and solution :-)

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
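Daniel's suggested fix, applied to the snippet quoted above, would look like this (a sketch of the quoted sample program, not code taken from a committed kernel sample):

```c
	// count packet in global counter
	value = bpf_map_lookup_elem(&rxcnt, &key);
	if (!value)
		return XDP_ABORTED; /* lookup may return NULL; bail out early */
	*value += 1;

	/* value is proven non-NULL on this path, so the verifier
	 * now accepts the dereference */
	vport = *value % 2;
```

The point is that every pointer returned by bpf_map_lookup_elem() has verifier type PTR_TO_MAP_VALUE_OR_NULL and must be NULL-checked on every path before it is dereferenced.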
[iovisor-dev] eBPF prog getting rejected with devmap (and cpumap)
Hi Daniel,

My very simple program is getting rejected at bpf load time, when using
another map as input for a decision, for map type BPF_MAP_TYPE_DEVMAP.

SEC("xdp_redirect_map_rr")
int xdp_prog_redirect_map_rr(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	int vport = 0;
	u32 key = 0;
	long *value;

	// count packet in global counter
	value = bpf_map_lookup_elem(&rxcnt, &key);
	if (value)
		*value += 1;

	/* Strange, verifier reject this (with LLVM version 3.9.1) */
	vport = *value % 2;

	if (vport >= 10)
		return XDP_ABORTED;

	return bpf_redirect_map(&tx_port, vport, 0);
}

bpf_load_program() err=13
0: (b7) r6 = 0
1: (63) *(u32 *)(r10 -4) = r6
2: (bf) r2 = r10
3: (07) r2 += -4
4: (18) r1 = 0x880832682700
6: (85) call bpf_map_lookup_elem#1
7: (79) r2 = *(u64 *)(r0 +0)
R0 invalid mem access 'map_value_or_null'

Are we missing something in the verifier code for BPF_MAP_TYPE_DEVMAP?

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] Running Prototype-kernel package
On Thu, 1 Jun 2017 18:59:34 +0200 Jesper Dangaard Brouer wrote: > I guess, I'll update the documentation a bit after your feedback, Done, updated the documentation in the prototype-kernel github repo. See three top commits in this link: https://github.com/netoptimizer/prototype-kernel/commits/7db5539438fa7 The prototype-kernel doc is rendered here: https://prototype-kernel.readthedocs.io/en/latest/index.html https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/index.html -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] Running Prototype-kernel package
On Thu, 1 Jun 2017 17:45:03 +0300 Adel Fuchs wrote:
> Thank you so much for your help. I need to run an eBPF program that
> filters packets according to IP address or port. I'm trying to run the
> prototype-kernel package but encounter problems.

The prototype-kernel github repo[1] is actually originally meant for
developing real kernel modules. I guess I should/could have placed the
eBPF/XDP examples[3] in another github repo.

[1] https://github.com/netoptimizer/prototype-kernel
[2] http://netoptimizer.blogspot.dk/2014/11/announce-github-repo-prototype-kernel.html
[3] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf

> I went over the "getting started" tutorial, but I couldn't find a
> tutorial which explains how to actually run the samples. So this is
> what I did:
>
> clang -O2 -Wall -target bpf -c xdp_ddos01_blacklist_cmdline.c -o
> xdp_ddos01_blacklist_cmdline

Don't do this... as you found out, all the include paths are screwed
this way. There is a Makefile in the directory[3] kernel/samples/bpf,
so simply chdir into this dir and run 'make'. Just like you would do
in the kernel tree[4].

[4] https://github.com/torvalds/linux/blob/master/samples/bpf/README.rst

[...]
> Could you please help me run these samples? Could you please guide me
> to the right sample that I need and tell me how to run it?
> Note that I was able to run a simple XDP program on this computer, so
> I think that it is configured all right...

I've also been touring with a tutorial talk, see latest here[5]

[5] http://people.netfilter.org/hawk/presentations/LLC2017/XDP_DDoS_protecting_LLC2017.pdf

I guess I'll update the documentation a bit after your feedback, thanks.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] Error with printk and bpf_trace_printk
On Thu, 1 Jun 2017 14:10:01 +0300 Adel Fuchs wrote: > Hi Jesper, > I tried adding your solution, bpf_debug, and I'm now able to run the > program with no errors but the trace_pipe file stays empty. > I just added this to my program: > > #ifdef DEBUG > /* Only use this for debug output. Notice output from bpf_trace_printk() > * ends up in /sys/kernel/debug/tracing/trace_pipe > */ > #define bpf_debug(fmt, ...) \ > ({ \ > char ____fmt[] = fmt; \ > bpf_trace_printk(____fmt, sizeof(____fmt), \ > ##__VA_ARGS__); \ > }) > #else > #define bpf_debug(fmt, ...) do { } while (0) > #endif > > > And added a printing command: > bpf_debug("hi"); > > Do you know what's the problem? I assume you are doing: sudo cat /sys/kernel/debug/tracing/trace_pipe The problem could be that the kernel needs to be compiled with the right trace config options... does anybody on the list(s) know? (p.s. removed netdev list cc) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer > On Tue, May 30, 2017 at 3:24 PM, Jesper Dangaard Brouer > wrote: > > > > Notice, there are two mailing lists (Cc'ed) that you should likely ask > > these kinds of questions on (instead of netdev), depending on if this is > > mostly related to bpf (iovisor-dev@lists.iovisor.org) or somehow > > related to XDP (xdp-newb...@vger.kernel.org). > > > > See my answer inlined below: > > > > On Sun, 28 May 2017 17:48:20 +0300 Adel Fuchs wrote: > > > >> I have a working eBPF program, and I'm trying to add outputs to it. > >> I'm not able to use both printk and bpf_trace_printk functions. I get > >> this error: > >> > >> ELF contains non-map related relo data in entry 0 pointing to section > >> 8! Compiler bug?! > >> > >> Prog section 'ingress' rejected: Invalid argument (22)!
> >> - Type: 3 > >> - Instructions: 16 (0 over limit) > >> - License: GPL > >> > >> Verifier analysis: > >> > >> 0: (bf) r6 = r1 > >> 1: (18) r1 = 0x0 > >> 3: (85) call bpf_unspec#0 > >> unknown func bpf_unspec#0 > >> > >> Error fetching program/map! > >> Failed to retrieve (e)BPF data! > >> > >> Are there certain "includes" that I need to add? > >> In addition, I'm not sure I'm using the function correctly. I just > >> wrote: printk("hi") > > > > You obviously cannot call printk directly from an eBPF program. > > I wonder how you got this compiling... > > > > As you hinted yourself, you should be using: bpf_trace_printk(). > > But it is actually tricky to use... and not much help is around to > > figure this out. > > > > First of all, the output ends up in this file: > > /sys/kernel/debug/tracing/trace_pipe > > Remember, to read the output use 'cat' like: > > > > sudo cat /sys/kernel/debug/tracing/trace_pipe > > > > And only the first process to read the output gets the output... > > > > > > I deduce you are using the TC/iproute2 examples: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/tree/examples/bpf > > > > Next gotcha is that you need to provide the char* string in a very > > special way to make this compile correctly. iproute2 provides a > > helper define called "printt()" in include/bpf_api.h for this: > > > > #ifndef printt > > # define printt(fmt, ...) \ > > ({ \ > > char ____fmt[] = fmt; \ > > trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \ > > }) > > #endif > > > > Or see my solution here: > > [1] > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_ddos01_blacklist_kern.c#L86:L99 > > > > > > Another gotcha I've experienced is that if you format the string > > incorrectly, or use a modifier like %X, which bpf_trace_printk() does > > not seem to understand, then you "hear-nothing"... I've also experienced that > > using more than 3 arguments makes it fail or go silent.
Be > > careful when using this somewhat "flaky" debug facility. > > > > Do remember that bpf_trace_printk() should only be used for debugging, > > as it is very slow... > > -- > > Best regards, > > Jesper Dangaard Brouer > > MSc.CS, Principal Kernel Engineer at Red Hat > > LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] Error with printk and bpf_trace_printk
Notice, there are two mailing lists (Cc'ed) that you should likely ask these kinds of questions on (instead of netdev), depending on if this is mostly related to bpf (iovisor-dev@lists.iovisor.org) or somehow related to XDP (xdp-newb...@vger.kernel.org). See my answer inlined below: On Sun, 28 May 2017 17:48:20 +0300 Adel Fuchs wrote: > I have a working eBPF program, and I'm trying to add outputs to it. > I'm not able to use both printk and bpf_trace_printk functions. I get > this error: > > ELF contains non-map related relo data in entry 0 pointing to section > 8! Compiler bug?! > > Prog section 'ingress' rejected: Invalid argument (22)! > - Type: 3 > - Instructions: 16 (0 over limit) > - License: GPL > > Verifier analysis: > > 0: (bf) r6 = r1 > 1: (18) r1 = 0x0 > 3: (85) call bpf_unspec#0 > unknown func bpf_unspec#0 > > Error fetching program/map! > Failed to retrieve (e)BPF data! > > Are there certain "includes" that I need to add? > In addition, I'm not sure I'm using the function correctly. I just > wrote: printk("hi") You obviously cannot call printk directly from an eBPF program. I wonder how you got this compiling... As you hinted yourself, you should be using: bpf_trace_printk(). But it is actually tricky to use... and not much help is around to figure this out. First of all, the output ends up in this file: /sys/kernel/debug/tracing/trace_pipe Remember, to read the output use 'cat' like: sudo cat /sys/kernel/debug/tracing/trace_pipe And only the first process to read the output gets the output... I deduce you are using the TC/iproute2 examples: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/tree/examples/bpf Next gotcha is that you need to provide the char* string in a very special way to make this compile correctly. iproute2 provides a helper define called "printt()" in include/bpf_api.h for this: #ifndef printt # define printt(fmt, ...)
\ ({ \ char ____fmt[] = fmt; \ trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \ }) #endif Or see my solution here: [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_ddos01_blacklist_kern.c#L86:L99 Another gotcha I've experienced is that if you format the string incorrectly, or use a modifier like %X, which bpf_trace_printk() does not seem to understand, then you "hear-nothing"... I've also experienced that using more than 3 arguments makes it fail or go silent. Be careful when using this somewhat "flaky" debug facility. Do remember that bpf_trace_printk() should only be used for debugging, as it is very slow... -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary
On Tue, 7 Mar 2017 21:54:42 +0100 Jesper Dangaard Brouer wrote: > > I also suggest to replace it with 'llvm-objdump -h file.o' > > since it prints the same info as 'readelf -SW' > > but less verbose. > > Good idea to instead describe use of: > 'llvm-objdump -h file.o' Fixed: https://github.com/netoptimizer/prototype-kernel/commit/c5cf59ff9a9b https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary
On Tue, 7 Mar 2017 07:57:00 -0800 Alexei Starovoitov via iovisor-dev wrote: > On Tue, Mar 7, 2017 at 3:07 AM, Jesper Dangaard Brouer > wrote: > > On Mon, 6 Mar 2017 16:14:06 -0800 > > Alexei Starovoitov via iovisor-dev wrote: > > > >> On Mon, Mar 6, 2017 at 10:29 AM, Jesper Dangaard Brouer > >> wrote: > >> > On Mon, 6 Mar 2017 09:11:46 -0800 > >> > Alexei Starovoitov via iovisor-dev wrote: > >> > > >> >> On Mon, Mar 6, 2017 at 3:53 AM, Jesper Dangaard Brouer > >> >> wrote: > >> >> > Hi All, > >> >> > > >> >> > I've added a section to my eBPF documentation, about how to read the > >> >> > eBPF generated ELF binary, and deduct the size of the compiled program > >> >> > (mostly for kernel/samples/bpf). > >> >> > > >> >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary > >> >> > https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba > >> >> > > >> >> > Can someone validate what I've saying is true? (commit inlined below > >> >> > sign, to make is as easy as possible for people to correct me). > >> >> > > >> >> > And anything else users can use the readelf output for? > >> >> > > >> >> > -- > >> >> > Best regards, > >> >> > Jesper Dangaard Brouer > >> >> > MSc.CS, Principal Kernel Engineer at Red Hat > >> >> > LinkedIn: http://www.linkedin.com/in/brouer > >> >> > > >> >> > commit 079352102cb0ba12141ecd28c216ec5ac5290192 > >> >> > Author: Jesper Dangaard Brouer > >> >> > Date: Fri Feb 24 12:10:45 2017 +0100 > >> >> > > >> >> > doc: eBPF describe howto read the eBPF generated ELF binary > >> >> > > >> >> > Signed-off-by: Jesper Dangaard Brouer > >> >> > > >> >> > diff --git a/kernel/Documentation/bpf/troubleshooting.rst > >> >> > b/kernel/Documentation/bpf/troubleshooting.rst > >> >> > index b6f3b6fe9501..39ebffae4142 100644 > >> >> > --- a/kernel/Documentation/bpf/troubleshooting.rst > >> >> > +++ b/kernel/Documentation/bpf/troubleshooting.rst > >> >> > @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_. 
> >> >> > The ``bpf_create_map`` call will return errno EPERM (Operation not > >> >> > permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded. > >> >> > > >> >> > - > >> >> > .. _setrlimit(2): > >> >> > http://man7.org/linux/man-pages/man2/setrlimit.2.html > >> >> > > >> >> > +ELF binary > >> >> > +== > >> >> > + > >> >> > +The binary containing the eBPF program, which got generated by the > >> >> > +LLVM compiler, is an ELF binary. For samples/bpf/ this is the file > >> >> > +named xxx_kern.o. > >> >> > + > >> >> > +To answer questions like how big is my eBPF program, it is possible > >> >> > to > >> >> > +use a tool like ``readelf``. :: > >> >> > + > >> >> > + $ readelf -SW xdp_ddos01_blacklist_kern.o > >> >> > + There are 8 section headers, starting at offset 0x398: > >> >> > + > >> >> > + Section Headers: > >> >> > + [Nr] Name Type Address OffSize ES > >> >> > Flg Lk Inf Al > >> >> > + [ 0]NULL 00 00 00 > >> >> >0 0 0 > >> >> > + [ 1] .strtabSTRTAB 000320 72 00 > >> >> >0 0 1 > >> >> > + [ 2] .text PROGBITS 40 00 00 > >> >> > AX 0 0 4 > >> >> > + [ 3] xdp_prog PROGBITS 40 0001b8 00 > >> >> > AX 0 0 8 > >> >> > + [ 4] .relxdp_prog REL 000300 20 10 > >> >> >7 3 8 > >> >> > + [ 5] maps PROGBITS 0001f8 28 00 > >> >> > WA 0 0 4 > >> >> > + [ 6] licensePROGBITS 000220 04 00 > >> >> > WA 0 0 1 > >> >> > + [ 7] .symtabSYMTAB 000228 d8 18 > >> >> >1 5 8 > >> >> > + Key to Flags: > >> >> > + W (write), A (alloc), X (execute), M (merge), S (strings) > >> >> > + I (info), L (link order), G (group), T (TLS), E (exclude), x > >> >> > (unknown) > >> >> > + O (extra OS processing required) o (OS specific), p (processor > >> >> > specific) > >> >> > + > >> >> > +From the output, we can see the programmer choose to name the XDP > >> >> > +program section "xdp_prog". 
From this line ([ 3]) the size column > >> >> > +shows the size 0001b8 in hex, which is easily converted on the > >> >> > cmdline > >> >> > +to 440 bytes:: > >> >> > + > >> >> > + $ echo $((0x0001b8)) > >> >> > + 440 > >> >> > >> >> hmm. i think instead of clarifying bpf elf output is doing the opposite. > >> >> at least i'm completely lost in the above text. > >> >> What all of the above suppose to mean? > >> > > >> > That what I'm implicit asking you ;-) > >> > > >> >> why is it useful to do readelf? and look at hex ? > >> > > >> > What other tools exists to help me understand the contents of the LLVM > >> > compiled binary _kern.o ? > >> > >> The best is to use > >> llvm-objdump -S prog_kern.o > > > > Hmmm, what version of LLVM have the -S option? > > > > $ llvm
Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary
On Tue, 7 Mar 2017 12:07:40 +0100 Jesper Dangaard Brouer via iovisor-dev wrote: > On Mon, 6 Mar 2017 16:14:06 -0800 > Alexei Starovoitov via iovisor-dev wrote: > > > On Mon, Mar 6, 2017 at 10:29 AM, Jesper Dangaard Brouer > > wrote: > > > On Mon, 6 Mar 2017 09:11:46 -0800 > > > Alexei Starovoitov via iovisor-dev wrote: > > > > > >> On Mon, Mar 6, 2017 at 3:53 AM, Jesper Dangaard Brouer > > >> wrote: > > >> > Hi All, > > >> > > > >> > I've added a section to my eBPF documentation, about how to read the > > >> > eBPF generated ELF binary, and deduct the size of the compiled program > > >> > (mostly for kernel/samples/bpf). > > >> > > > >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary > > >> > https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba > > >> > > > >> > Can someone validate what I've saying is true? (commit inlined below > > >> > sign, to make is as easy as possible for people to correct me). > > >> > > > >> > And anything else users can use the readelf output for? > > >> > > > >> > -- > > >> > Best regards, > > >> > Jesper Dangaard Brouer > > >> > MSc.CS, Principal Kernel Engineer at Red Hat > > >> > LinkedIn: http://www.linkedin.com/in/brouer > > >> > > > >> > commit 079352102cb0ba12141ecd28c216ec5ac5290192 > > >> > Author: Jesper Dangaard Brouer > > >> > Date: Fri Feb 24 12:10:45 2017 +0100 > > >> > > > >> > doc: eBPF describe howto read the eBPF generated ELF binary > > >> > > > >> > Signed-off-by: Jesper Dangaard Brouer > > >> > > > >> > diff --git a/kernel/Documentation/bpf/troubleshooting.rst > > >> > b/kernel/Documentation/bpf/troubleshooting.rst > > >> > index b6f3b6fe9501..39ebffae4142 100644 > > >> > --- a/kernel/Documentation/bpf/troubleshooting.rst > > >> > +++ b/kernel/Documentation/bpf/troubleshooting.rst > > >> > @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_. 
> > >> > The ``bpf_create_map`` call will return errno EPERM (Operation not > > >> > permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded. > > >> > > > >> > - > > >> > .. _setrlimit(2): > > >> > http://man7.org/linux/man-pages/man2/setrlimit.2.html > > >> > > > >> > +ELF binary > > >> > +== > > >> > + > > >> > +The binary containing the eBPF program, which got generated by the > > >> > +LLVM compiler, is an ELF binary. For samples/bpf/ this is the file > > >> > +named xxx_kern.o. > > >> > + > > >> > +To answer questions like how big is my eBPF program, it is possible to > > >> > +use a tool like ``readelf``. :: > > >> > + > > >> > + $ readelf -SW xdp_ddos01_blacklist_kern.o > > >> > + There are 8 section headers, starting at offset 0x398: > > >> > + > > >> > + Section Headers: > > >> > + [Nr] Name Type Address OffSize ES > > >> > Flg Lk Inf Al > > >> > + [ 0]NULL 00 00 00 > > >> > 0 0 0 > > >> > + [ 1] .strtabSTRTAB 000320 72 00 > > >> > 0 0 1 > > >> > + [ 2] .text PROGBITS 40 00 00 > > >> > AX 0 0 4 > > >> > + [ 3] xdp_prog PROGBITS 40 0001b8 00 > > >> > AX 0 0 8 > > >> > + [ 4] .relxdp_prog REL 000300 20 10 > > >> > 7 3 8 > > >> > + [ 5] maps PROGBITS 0001f8 28 00 > > >> > WA 0 0 4 > > >> > + [ 6] licensePROGBITS 000220 04 00 > > >> > WA 0 0 1 > > >> > + [ 7] .symtabSYMTAB 000228 d8 18 > > >> > 1 5 8 > > >> > + Key to Flags: > > >> > + W (write), A (alloc), X (execute), M (merge), S (strings) > > >> > + I (info), L (lin
Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary
On Tue, 07 Mar 2017 12:40:03 +0100 Daniel Borkmann wrote: > On 03/07/2017 12:07 PM, Jesper Dangaard Brouer via iovisor-dev wrote: > > On Mon, 6 Mar 2017 16:14:06 -0800 > > Alexei Starovoitov via iovisor-dev wrote: > [...] > >> that's practically impossible to know in advance, since hardening and > >> start address randomization will play a role. > >> Or use sysctl net.core.bpf_jit_enable=2 > >> at load time which gives raw x86 hex. > > > > The sysctl adjusting sounds interesting: > > > > sysctl net.core.bpf_jit_enable=2 > > > > What is below "proglen=335" the JIT'ed asm-code size? > > It's in bpf_jit_dump(): proglen is the len of opcode sequence generated > and flen is the number of bpf insns. You can use tools/net/bpf_jit_disasm.c > to disassemble that output. bpf_jit_disasm -o will dump the related opcodes > as well. Thanks for input, added: https://github.com/netoptimizer/prototype-kernel/commit/0b31532f42cd8 > > flen=55 proglen=335 pass=4 image=a0006820 from=xdp_ddos01_blac > > pid=1 > > JIT code: : 55 48 89 e5 48 81 ec 28 02 00 00 48 89 9d d8 fd > > JIT code: 0010: ff ff 4c 89 ad e0 fd ff ff 4c 89 b5 e8 fd ff ff > > JIT code: 0020: 4c 89 bd f0 fd ff ff 31 c0 48 89 85 f8 fd ff ff > > JIT code: 0030: bb 02 00 00 00 48 8b 77 08 48 8b 7f 00 48 89 fa > > JIT code: 0040: 48 83 c2 0e 48 39 f2 0f 87 e1 00 00 00 48 0f b6 > > JIT code: 0050: 4f 0c 48 0f b6 57 0d 48 c1 e2 08 48 09 ca 48 89 > > JIT code: 0060: d1 48 81 e1 ff 00 00 00 41 b8 06 00 00 00 49 39 > > JIT code: 0070: c8 0f 87 b7 00 00 00 48 81 fa 88 a8 00 00 74 0e > > JIT code: 0080: b9 0e 00 00 00 48 81 fa 81 00 00 00 75 1a 48 89 > > JIT code: 0090: fa 48 83 c2 12 48 39 f2 0f 87 90 00 00 00 b9 12 > > JIT code: 00a0: 00 00 00 48 0f b7 57 10 bb 02 00 00 00 48 81 e2 > > JIT code: 00b0: ff ff 00 00 48 83 fa 08 75 49 48 01 cf 31 db 48 > > JIT code: 00c0: 89 fa 48 83 c2 14 48 39 f2 77 38 8b 7f 0c 89 7d > > JIT code: 00d0: fc 48 89 ee 48 83 c6 fc 48 bf 00 9c 24 5f 07 88 > > JIT code: 00e0: ff ff e8 29 cd 13 e1 bb 02 
00 00 00 48 83 f8 00 > > JIT code: 00f0: 74 11 48 8b 78 00 48 83 c7 01 48 89 78 00 bb 01 > > JIT code: 0100: 00 00 00 89 5d f8 48 89 ee 48 83 c6 f8 48 bf c0 > > JIT code: 0110: 76 12 13 04 88 ff ff e8 f4 cc 13 e1 48 83 f8 00 > > JIT code: 0120: 74 0c 48 8b 78 00 48 83 c7 01 48 89 78 00 48 89 > > JIT code: 0130: d8 48 8b 9d d8 fd ff ff 4c 8b ad e0 fd ff ff 4c > > JIT code: 0140: 8b b5 e8 fd ff ff 4c 8b bd f0 fd ff ff c9 c3 > > > > $ echo $((0x140)) > > 320 > > $ echo $((0x140 + 15)) > > 335 > > > > Using the same example code: the BPF instruction size was 440 bytes, so > > the JIT'ed code does get smaller. > > > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary
On Mon, 6 Mar 2017 16:14:06 -0800 Alexei Starovoitov via iovisor-dev wrote: > On Mon, Mar 6, 2017 at 10:29 AM, Jesper Dangaard Brouer > wrote: > > On Mon, 6 Mar 2017 09:11:46 -0800 > > Alexei Starovoitov via iovisor-dev wrote: > > > >> On Mon, Mar 6, 2017 at 3:53 AM, Jesper Dangaard Brouer > >> wrote: > >> > Hi All, > >> > > >> > I've added a section to my eBPF documentation, about how to read the > >> > eBPF generated ELF binary, and deduct the size of the compiled program > >> > (mostly for kernel/samples/bpf). > >> > > >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary > >> > https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba > >> > > >> > Can someone validate what I've saying is true? (commit inlined below > >> > sign, to make is as easy as possible for people to correct me). > >> > > >> > And anything else users can use the readelf output for? > >> > > >> > -- > >> > Best regards, > >> > Jesper Dangaard Brouer > >> > MSc.CS, Principal Kernel Engineer at Red Hat > >> > LinkedIn: http://www.linkedin.com/in/brouer > >> > > >> > commit 079352102cb0ba12141ecd28c216ec5ac5290192 > >> > Author: Jesper Dangaard Brouer > >> > Date: Fri Feb 24 12:10:45 2017 +0100 > >> > > >> > doc: eBPF describe howto read the eBPF generated ELF binary > >> > > >> > Signed-off-by: Jesper Dangaard Brouer > >> > > >> > diff --git a/kernel/Documentation/bpf/troubleshooting.rst > >> > b/kernel/Documentation/bpf/troubleshooting.rst > >> > index b6f3b6fe9501..39ebffae4142 100644 > >> > --- a/kernel/Documentation/bpf/troubleshooting.rst > >> > +++ b/kernel/Documentation/bpf/troubleshooting.rst > >> > @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_. > >> > The ``bpf_create_map`` call will return errno EPERM (Operation not > >> > permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded. > >> > > >> > - > >> > .. 
_setrlimit(2): http://man7.org/linux/man-pages/man2/setrlimit.2.html > >> > > >> > +ELF binary > >> > +== > >> > + > >> > +The binary containing the eBPF program, which got generated by the > >> > +LLVM compiler, is an ELF binary. For samples/bpf/ this is the file > >> > +named xxx_kern.o. > >> > + > >> > +To answer questions like how big is my eBPF program, it is possible to > >> > +use a tool like ``readelf``. :: > >> > + > >> > + $ readelf -SW xdp_ddos01_blacklist_kern.o > >> > + There are 8 section headers, starting at offset 0x398: > >> > + > >> > + Section Headers: > >> > + [Nr] Name Type Address OffSize ES Flg > >> > Lk Inf Al > >> > + [ 0]NULL 00 00 00 > >> > 0 0 0 > >> > + [ 1] .strtabSTRTAB 000320 72 00 > >> > 0 0 1 > >> > + [ 2] .text PROGBITS 40 00 00 AX > >> > 0 0 4 > >> > + [ 3] xdp_prog PROGBITS 40 0001b8 00 AX > >> > 0 0 8 > >> > + [ 4] .relxdp_prog REL 000300 20 10 > >> > 7 3 8 > >> > + [ 5] maps PROGBITS 0001f8 28 00 WA > >> > 0 0 4 > >> > + [ 6] licensePROGBITS 000220 04 00 WA > >> > 0 0 1 > >> > + [ 7] .symtabSYMTAB 000228 d8 18 > >> > 1 5 8 > >> > + Key to Flags: > >> > + W (write), A (alloc), X (execute), M (merge), S (strings) > >> > + I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown) > >> > + O (extra OS processing required) o (OS specific), p (processor > >> > specific) > >> > + > >> > +From the output, we can see the programmer choose to name the XDP > >> > +program section "xdp_prog". From this line ([ 3]) the size column > >> > +shows the size 0001b8 in hex, which is easily converted on the cmdline > >> > +to 440 bytes:: > >> > + > >> > + $ echo $((0x0001b8)) > >> > + 440 > >> > >> hmm. i think instead of clarifying bpf elf output is doing the opposite. > >> at least i'm completely lost in the above text. > >> What all of the above suppose to mean? > > > > That what I'm implicit asking you ;-) > > > >> why is it useful to do readelf? and look at hex ? 
> > > > What other tools exist to help me understand the contents of the LLVM > > compiled binary _kern.o ? > > The best is to use > llvm-objdump -S prog_kern.o Hmmm, what version of LLVM has the -S option? $ llvm-objdump -S xdp_ddos01_blacklist_kern.o llvm-objdump: Unknown command line argument '-S'. Try: 'llvm-objdump -help' llvm-objdump: Did you mean '-D'? $ llvm-objdump --version | egrep 'version|bpf' LLVM version 3.8.1 bpf - BPF (host endian) bpfeb - BPF (big endian) bpfel - BPF (little endian) I would really like a command-line tool for inspecting eBPF ELF programs that is available in a distro version, in this case Fedora 25. > it will show section names, asm code and original C code > if compiled with -g
Re: [iovisor-dev] Describing howto read the eBPF generated ELF binary
On Mon, 6 Mar 2017 09:11:46 -0800 Alexei Starovoitov via iovisor-dev wrote: > On Mon, Mar 6, 2017 at 3:53 AM, Jesper Dangaard Brouer > wrote: > > Hi All, > > > > I've added a section to my eBPF documentation, about how to read the > > eBPF generated ELF binary, and deduct the size of the compiled program > > (mostly for kernel/samples/bpf). > > > > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary > > https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba > > > > Can someone validate what I've saying is true? (commit inlined below > > sign, to make is as easy as possible for people to correct me). > > > > And anything else users can use the readelf output for? > > > > -- > > Best regards, > > Jesper Dangaard Brouer > > MSc.CS, Principal Kernel Engineer at Red Hat > > LinkedIn: http://www.linkedin.com/in/brouer > > > > commit 079352102cb0ba12141ecd28c216ec5ac5290192 > > Author: Jesper Dangaard Brouer > > Date: Fri Feb 24 12:10:45 2017 +0100 > > > > doc: eBPF describe howto read the eBPF generated ELF binary > > > > Signed-off-by: Jesper Dangaard Brouer > > > > diff --git a/kernel/Documentation/bpf/troubleshooting.rst > > b/kernel/Documentation/bpf/troubleshooting.rst > > index b6f3b6fe9501..39ebffae4142 100644 > > --- a/kernel/Documentation/bpf/troubleshooting.rst > > +++ b/kernel/Documentation/bpf/troubleshooting.rst > > @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_. > > The ``bpf_create_map`` call will return errno EPERM (Operation not > > permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded. > > > > - > > .. _setrlimit(2): http://man7.org/linux/man-pages/man2/setrlimit.2.html > > > > +ELF binary > > +== > > + > > +The binary containing the eBPF program, which got generated by the > > +LLVM compiler, is an ELF binary. For samples/bpf/ this is the file > > +named xxx_kern.o. > > + > > +To answer questions like how big is my eBPF program, it is possible to > > +use a tool like ``readelf``. 
:: > > + > > + $ readelf -SW xdp_ddos01_blacklist_kern.o > > + There are 8 section headers, starting at offset 0x398: > > + > > + Section Headers: > > + [Nr] Name Type Address OffSize ES Flg Lk > > Inf Al > > + [ 0]NULL 00 00 00 0 > > 0 0 > > + [ 1] .strtabSTRTAB 000320 72 00 0 > > 0 1 > > + [ 2] .text PROGBITS 40 00 00 AX 0 > > 0 4 > > + [ 3] xdp_prog PROGBITS 40 0001b8 00 AX 0 > > 0 8 > > + [ 4] .relxdp_prog REL 000300 20 10 7 > > 3 8 > > + [ 5] maps PROGBITS 0001f8 28 00 WA 0 > > 0 4 > > + [ 6] licensePROGBITS 000220 04 00 WA 0 > > 0 1 > > + [ 7] .symtabSYMTAB 000228 d8 18 1 > > 5 8 > > + Key to Flags: > > + W (write), A (alloc), X (execute), M (merge), S (strings) > > + I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown) > > + O (extra OS processing required) o (OS specific), p (processor specific) > > + > > +From the output, we can see the programmer choose to name the XDP > > +program section "xdp_prog". From this line ([ 3]) the size column > > +shows the size 0001b8 in hex, which is easily converted on the cmdline > > +to 440 bytes:: > > + > > + $ echo $((0x0001b8)) > > + 440 > > hmm. i think instead of clarifying bpf elf output is doing the opposite. > at least i'm completely lost in the above text. > What all of the above suppose to mean? That what I'm implicit asking you ;-) > why is it useful to do readelf? and look at hex ? What other tools exists to help me understand the contents of the LLVM compiled binary _kern.o ? > I've never done it myself. Is there an easier way to answer my question: Q: how big is my eBPF program? I would actually (also) like to know the size of the JIT'ed asm code? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
[iovisor-dev] Describing howto read the eBPF generated ELF binary
(repost with subscribed email) Hi All, I've added a section to my eBPF documentation, about how to read the eBPF generated ELF binary, and deduct the size of the compiled program (mostly for kernel/samples/bpf). https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba Can someone validate what I've saying is true? (commit inlined below signatur, to make is as easy as possible for people to correct me). And anything else users can use the readelf output for? - - Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer commit 079352102cb0ba12141ecd28c216ec5ac5290192 Author: Jesper Dangaard Brouer Date: Fri Feb 24 12:10:45 2017 +0100 doc: eBPF describe howto read the eBPF generated ELF binary Signed-off-by: Jesper Dangaard Brouer diff --git a/kernel/Documentation/bpf/troubleshooting.rst b/kernel/Documentation/bpf/troubleshooting.rst index b6f3b6fe9501..39ebffae4142 100644 --- a/kernel/Documentation/bpf/troubleshooting.rst +++ b/kernel/Documentation/bpf/troubleshooting.rst @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_. The ``bpf_create_map`` call will return errno EPERM (Operation not permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded. - .. _setrlimit(2): http://man7.org/linux/man-pages/man2/setrlimit.2.html +ELF binary +== + +The binary containing the eBPF program, which got generated by the +LLVM compiler, is an ELF binary. For samples/bpf/ this is the file +named xxx_kern.o. + +To answer questions like how big is my eBPF program, it is possible to +use a tool like ``readelf``. 
:: + + $ readelf -SW xdp_ddos01_blacklist_kern.o + There are 8 section headers, starting at offset 0x398: + + Section Headers: + [Nr] Name Type Address OffSize ES Flg Lk Inf Al + [ 0]NULL 00 00 00 0 0 0 + [ 1] .strtabSTRTAB 000320 72 00 0 0 1 + [ 2] .text PROGBITS 40 00 00 AX 0 0 4 + [ 3] xdp_prog PROGBITS 40 0001b8 00 AX 0 0 8 + [ 4] .relxdp_prog REL 000300 20 10 7 3 8 + [ 5] maps PROGBITS 0001f8 28 00 WA 0 0 4 + [ 6] licensePROGBITS 000220 04 00 WA 0 0 1 + [ 7] .symtabSYMTAB 000228 d8 18 1 5 8 + Key to Flags: + W (write), A (alloc), X (execute), M (merge), S (strings) + I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown) + O (extra OS processing required) o (OS specific), p (processor specific) + +From the output, we can see the programmer choose to name the XDP +program section "xdp_prog". From this line ([ 3]) the size column +shows the size 0001b8 in hex, which is easily converted on the cmdline +to 440 bytes:: + + $ echo $((0x0001b8)) + 440 + -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
[iovisor-dev] Describing howto read the eBPF generated ELF binary
Hi All, I've added a section to my eBPF documentation, about how to read the eBPF generated ELF binary, and deduce the size of the compiled program (mostly for kernel/samples/bpf). https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#elf-binary https://github.com/netoptimizer/prototype-kernel/commit/079352102cb0ba Can someone validate that what I'm saying is true? (commit inlined below my signature, to make it as easy as possible for people to correct me). And anything else users can use the readelf output for? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer commit 079352102cb0ba12141ecd28c216ec5ac5290192 Author: Jesper Dangaard Brouer Date: Fri Feb 24 12:10:45 2017 +0100 doc: eBPF describe howto read the eBPF generated ELF binary Signed-off-by: Jesper Dangaard Brouer diff --git a/kernel/Documentation/bpf/troubleshooting.rst b/kernel/Documentation/bpf/troubleshooting.rst index b6f3b6fe9501..39ebffae4142 100644 --- a/kernel/Documentation/bpf/troubleshooting.rst +++ b/kernel/Documentation/bpf/troubleshooting.rst @@ -15,6 +15,41 @@ see system call `setrlimit(2)`_. The ``bpf_create_map`` call will return errno EPERM (Operation not permitted) when the RLIMIT_MEMLOCK memory size limit is exceeded. - .. _setrlimit(2): http://man7.org/linux/man-pages/man2/setrlimit.2.html +ELF binary +========== + +The binary containing the eBPF program, which got generated by the +LLVM compiler, is an ELF binary. For samples/bpf/ this is the file +named xxx_kern.o. + +To answer questions like how big is my eBPF program, it is possible to +use a tool like ``readelf``.
:: + + $ readelf -SW xdp_ddos01_blacklist_kern.o + There are 8 section headers, starting at offset 0x398: + + Section Headers: + [Nr] Name Type Address Off Size ES Flg Lk Inf Al + [ 0] NULL 00 00 00 0 0 0 + [ 1] .strtab STRTAB 000320 72 00 0 0 1 + [ 2] .text PROGBITS 40 00 00 AX 0 0 4 + [ 3] xdp_prog PROGBITS 40 0001b8 00 AX 0 0 8 + [ 4] .relxdp_prog REL 000300 20 10 7 3 8 + [ 5] maps PROGBITS 0001f8 28 00 WA 0 0 4 + [ 6] license PROGBITS 000220 04 00 WA 0 0 1 + [ 7] .symtab SYMTAB 000228 d8 18 1 5 8 + Key to Flags: + W (write), A (alloc), X (execute), M (merge), S (strings) + I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown) + O (extra OS processing required) o (OS specific), p (processor specific) + +From the output, we can see the programmer chose to name the XDP +program section "xdp_prog". From this line ([ 3]) the size column +shows the size 0001b8 in hex, which is easily converted on the cmdline +to 440 bytes:: + + $ echo $((0x0001b8)) + 440 + ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] Documentation on eBPF map types?
On Fri, 3 Feb 2017 07:43:33 -0800 Alexei Starovoitov wrote: > On Fri, Feb 3, 2017 at 5:52 AM, Jesper Dangaard Brouer > wrote: > > On Thu, 2 Feb 2017 20:27:15 -0800 > > Alexei Starovoitov wrote: > > > >> On Thu, Feb 02, 2017 at 06:00:09PM +0100, Jesper Dangaard Brouer wrote: > > [...] > >> Comments below: > > [...] > > > >> > +Interacting with maps > >> > += > >> > + > >> > +Interacting with an eBPF maps from **userspace**, happens through the > >> > +`bpf`_ syscall and a file descriptor. The kernel > >> > +`tools/lib/bpf/bpf.h`_ define some ``bpf_map_*()`` helper functions > >> > +for wrapping the `bpf_cmd`_ relating to manipulating the map elements. > >> > + > >> > +.. code-block:: c > >> > + > >> > + enum bpf_cmd { > >> > + [...] > >> > + BPF_MAP_LOOKUP_ELEM, > >> > + BPF_MAP_UPDATE_ELEM, > >> > + BPF_MAP_DELETE_ELEM, > >> > + BPF_MAP_GET_NEXT_KEY, > >> > + [...] > >> > + }; > >> > + /* Corresponding helper functions */ > >> > + int bpf_map_lookup_elem(int fd, void *key, void *value); > >> > + int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags); > >> > + int bpf_map_delete_elem(int fd, void *key); > >> > + int bpf_map_get_next_key(int fd, void *key, void *next_key); > >> > + > >> > +Notice from userspace, there is no call to atomically increment or > >> > +decrement the value 'in-place'. The bpf_map_update_elem() call will > >> > +overwrite the existing value. > >> > >> will overwrite in _atomic_ way. Meaning that the whole value will > >> be replaced atomically regardless whether update() is called from > >> kernel or from user space. > > > > Looking at the memcpy code in array_map_update_elem(), I don't see > > anything preventing two processes from data-update racing with > > each-other, especially when the value_size is large. > > yes. my comment above applies to hash map only. Okay, understood. I did see locks in the hash map code, but I've not documented that map type yet. 
I've completely rewritten the section: "Interacting with maps" See commit: https://github.com/netoptimizer/prototype-kernel/commit/9b6eba65171ce8d83f6 Or html: http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#interacting-with-maps Also inlined here: [PATCH] doc: ebpf rewrite section Interacting with maps From: Jesper Dangaard Brouer Signed-off-by: Jesper Dangaard Brouer --- kernel/Documentation/bpf/ebpf_maps.rst | 145 +--- 1 file changed, 112 insertions(+), 33 deletions(-) diff --git a/kernel/Documentation/bpf/ebpf_maps.rst b/kernel/Documentation/bpf/ebpf_maps.rst index 590e1ac72eb3..b549e4716501 100644 --- a/kernel/Documentation/bpf/ebpf_maps.rst +++ b/kernel/Documentation/bpf/ebpf_maps.rst @@ -112,13 +112,60 @@ files for defining the maps, but it uses another layout. See man-page Interacting with maps = +Interacting with eBPF maps happens through some **lookup/update/delete** +primitives. + +When writing eBPF programs using the load helpers and libraries from +samples/bpf/ and tools/lib/bpf/, a common function name API has been +created that hides the details of how kernel vs. userspace access +these primitives (which is quite different). + +The common function names (parameters and return values differ): + +.. code-block:: c + + void bpf_map_lookup_elem(map, void *key, ...); + void bpf_map_update_elem(map, void *key, ..., __u64 flags); + void bpf_map_delete_elem(map, void *key); + +The ``flags`` argument in ``bpf_map_update_elem()`` allows defining +semantics for whether the element exists: + +.. code-block:: c + + /* File: include/uapi/linux/bpf.h */ + /* flags for BPF_MAP_UPDATE_ELEM command */ + #define BPF_ANY 0 /* create new element or update existing */ + #define BPF_NOEXIST 1 /* create new element only if it didn't exist */ + #define BPF_EXIST 2 /* only update existing element */ + +Userspace +- +The userspace API map helpers are defined in `tools/lib/bpf/bpf.h`_ +and look like this: + +.. 
code-block:: c + + /* Userspace helpers */ + int bpf_map_lookup_elem(int fd, void *key, void *value); + int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags); + int bpf_map_delete_elem(int fd, void *key); + /* Only userspace: */ + int bpf_map_get_next_key(int fd, void *key, void *next_key); + + Interacting with an eBPF map from **userspace**, happens through the -`bpf`_ syscall and a file descriptor. The kernel -`tools/lib/bpf/bpf.h`_ defines some ``bpf_map_*()`` helper functions -for wrapping the `bpf_cmd`_ related to manipulating the map elements. +`bpf`_ syscall and a file descriptor. Notice how the map handle ``int +fd`` is a file descriptor. On success, zero is returned; on +failure, -1 is returned and errno is set. + +Wrappers for the bpf syscall are implemented in `tools/lib/bpf/bpf.c`_, +and end up calling functions in `kernel/bpf/syscall.c`_, like +`map_lookup_elem`_. .. code-block:: c + /* Corresponding sysca
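The ``BPF_ANY``/``BPF_NOEXIST``/``BPF_EXIST`` flag semantics described in the rewritten section can be sketched as a toy userspace model. This is an illustration only — a hypothetical single-slot pretend "map" mimicking the 0 / negative-errno return convention, not the kernel implementation or the real bpf() syscall wrapper:

```c
#include <errno.h>

/* Flag values as in include/uapi/linux/bpf.h */
#define BPF_ANY     0 /* create new element or update existing */
#define BPF_NOEXIST 1 /* create new element only if it didn't exist */
#define BPF_EXIST   2 /* only update existing element */

struct toy_map {
	int occupied;	/* does the single slot hold an element? */
	long value;
};

/* Toy model of update semantics: 0 on success, negative errno on error. */
static int toy_map_update_elem(struct toy_map *m, long value,
			       unsigned long long flags)
{
	if (flags == BPF_NOEXIST && m->occupied)
		return -EEXIST;		/* element already exists */
	if (flags == BPF_EXIST && !m->occupied)
		return -ENOENT;		/* element does not exist */
	m->occupied = 1;
	m->value = value;		/* the whole value is replaced at once */
	return 0;
}
```

This also illustrates the point raised in the review: an update replaces the whole value, it never merges with the old one.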
Re: [iovisor-dev] Documentation on eBPF map types?
On Thu, 2 Feb 2017 20:27:15 -0800 Alexei Starovoitov wrote: > On Thu, Feb 02, 2017 at 06:00:09PM +0100, Jesper Dangaard Brouer wrote: [...] > Comments below: [...] > > +Interacting with maps > > += > > + > > +Interacting with an eBPF maps from **userspace**, happens through the > > +`bpf`_ syscall and a file descriptor. The kernel > > +`tools/lib/bpf/bpf.h`_ define some ``bpf_map_*()`` helper functions > > +for wrapping the `bpf_cmd`_ relating to manipulating the map elements. > > + > > +.. code-block:: c > > + > > + enum bpf_cmd { > > + [...] > > + BPF_MAP_LOOKUP_ELEM, > > + BPF_MAP_UPDATE_ELEM, > > + BPF_MAP_DELETE_ELEM, > > + BPF_MAP_GET_NEXT_KEY, > > + [...] > > + }; > > + /* Corresponding helper functions */ > > + int bpf_map_lookup_elem(int fd, void *key, void *value); > > + int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags); > > + int bpf_map_delete_elem(int fd, void *key); > > + int bpf_map_get_next_key(int fd, void *key, void *next_key); > > + > > +Notice from userspace, there is no call to atomically increment or > > +decrement the value 'in-place'. The bpf_map_update_elem() call will > > +overwrite the existing value. > > will overwrite in _atomic_ way. Meaning that the whole value will > be replaced atomically regardless whether update() is called from > kernel or from user space. Looking at the memcpy code in array_map_update_elem(), I don't see anything preventing two processes from data-update racing with each-other, especially when the value_size is large. > Also please mention first that kernel can do atomic increment, > before saying that user space cannot. Otherwise it's quite confusing. True, noted, I'll rearrange. > > The flags argument allows > > +bpf_map_update_elem() define semantics on weather the element exist: > > + > > +.. 
code-block:: c > > + > > + /* File: include/uapi/linux/bpf.h */ > > + /* flags for BPF_MAP_UPDATE_ELEM command */ > > + #define BPF_ANY 0 /* create new element or update existing */ > > + #define BPF_NOEXIST 1 /* create new element if it didn't exist */ > > + #define BPF_EXIST2 /* update existing element */ > > + > > +The eBPF-program running "kernel-side" have almost the same primitives > > +(lookup/update/delete) for interacting with the map, but it interact > > +more directly with the map data structures. For example the call > > +``bpf_map_lookup_elem()`` returns a direct pointer to the 'value' > > +memory-element inside the kernel (while userspace gets a copy). This > > +allows the eBPF-program to atomically increment or decrement the value > > +'in-place', by using appropiate compiler primitives like > > +``__sync_fetch_and_add()``, which is understood by LLVM when > > +generating eBPF instructions. > > ahh. so you do it here. may be move this paragraph on top ? yes. > > + > > +On the kernel side, implementing a map type requires defining some > > +function (pointers) via `struct bpf_map_ops`_. And eBPF programs have > > +access to ``map_lookup_elem``, ``map_update_elem`` and > > +``map_delete_elem``, which gets invoked from eBPF via bpf-helpers in > > this paragraph reads a bit out of place. > I'm not sure what 'kernel implements the map' means. > You mean 'access the map' ? True, it is not very clear. I'm talking about how a new map type is implemented, and what functions it need to implement. It is confusing, I'll try to come up with something better. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
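The in-place kernel-side increment discussed above relies on a compiler builtin that LLVM turns into a BPF atomic-add instruction. The same ``__sync_fetch_and_add()`` builtin works in plain userspace C, which makes its semantics easy to try out — it adds atomically and returns the value from *before* the addition:

```c
/* Minimal demonstration of the builtin used for 'in-place' map value
 * increments; plain userspace C, not an eBPF program. */
static long fetch_and_add_one(long *counter)
{
	/* atomically add 1; the return value is the previous contents */
	return __sync_fetch_and_add(counter, 1);
}
```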
Re: [iovisor-dev] Documentation on eBPF map types?
On Thu, 2 Feb 2017 20:27:15 -0800 Alexei Starovoitov wrote: > On Thu, Feb 02, 2017 at 06:00:09PM +0100, Jesper Dangaard Brouer wrote: > > > > [PATCH] doc: how interacting with eBPF maps works > > > > Documented by reading the code. > > looks great! > please submit the patch to net-next and cc Jonathan ? Good idea, but I'll take this round of review first, before pushing it to the kernel tree. I also want a build-env section. Also considering splitting the "Types of maps"[1] section into a separate file, as this avoids a too long HTML page being generated. [1] http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#types-of-maps > Comments below: > > > I hope someone more knowledgeable will review this and > > correct me where I misunderstood things. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > kernel/Documentation/bpf/ebpf_maps.rst | 126 > > ++-- > > 1 file changed, 120 insertions(+), 6 deletions(-) > > > > diff --git a/kernel/Documentation/bpf/ebpf_maps.rst > > b/kernel/Documentation/bpf/ebpf_maps.rst > > index 562edd566e0b..55068c7f3dab 100644 > > --- a/kernel/Documentation/bpf/ebpf_maps.rst > > +++ b/kernel/Documentation/bpf/ebpf_maps.rst > > @@ -23,13 +23,128 @@ and accessed by multiple programs (from man-page > > `bpf(2)`_): > > up to the user process and eBPF program to decide what they store > > inside maps. > > > > +Creating a map > > +== > > + > > +A maps is created based on a request from userspace, via the `bpf`_ > > +syscall (`bpf_cmd`_ BPF_MAP_CREATE), and returns a new file descriptor > > +that refers to the map. These are the setup arguments when creating a > > +map. > > + > > +.. 
code-block:: c > > + > > + struct { /* anonymous struct used by BPF_MAP_CREATE command */ > > + __u32 map_type; /* one of enum bpf_map_type */ > > + __u32 key_size; /* size of key in bytes */ > > + __u32 value_size; /* size of value in bytes */ > > + __u32 max_entries;/* max number of entries in a map */ > > + __u32 map_flags; /* prealloc or not */ > > + }; > > + > > +For programs under samples/bpf/ the ``load_bpf_file()`` call (from > > +`samples/bpf/bpf_load`_) takes care of parsing elf file compiled by ^^^ > > +LLVM, pickup 'maps' section and creates maps via BPF syscall. This is ^^ ^ > > +done by defining a ``struct bpf_map_def`` with an elf section > > +__attribute__ ``SEC("maps")``, in the xxx_kern.c file. The maps file > > +descriptor is available in the userspace xxx_user.c file, via global > > +array variable ``map_fd[]``, and the array map index correspons to the > > +order the maps sections were defined in elf file of xxx_kern.c file. > > + > > +.. code-block:: c > > + > > + struct bpf_map_def { > > + unsigned int type; > > + unsigned int key_size; > > + unsigned int value_size; > > + unsigned int max_entries; > > + unsigned int map_flags; > > + }; > > As Daniel said please mention that this is C program <-> elf loader > convention. It's not a kernel abi. I feel I did mention the ELF part above, but I guess should try to make it more clear, as you missed this. > perf map defitions are different from tc and from samples/bpf/ I didn't know about TC maps. Can you or Daniel please point me to some TC-eBFP-code that setup a map? So, I can wrap my head around how it does it... I think it would be valuable to have a section here describing how TC manage maps. See what I figured out myself in this commit: https://github.com/netoptimizer/prototype-kernel/commit/3640a48df52c And the new TC section: http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#qdisc-traffic-control-convention > iovisor/bcc is not using elf at all. 
True (I do know that bcc doesn't use elf). > > + > > + struct bpf_map_def SEC("maps") my_map = { > > + .type= BPF_MAP_TYPE_XXX, > > + .key_size= sizeof(u32), > > + .value_size = sizeof(u64), > > + .max_entries = 42, > > + .map_flags = 0 > > + }; > > + > > +.. _samples/bpf/bpf_load: > > + > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/bpf_load.c > > + > > + > > +Interacting with maps > > += I'll use another email and commit change to address this "Interacting with maps" section... -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] Documentation on eBPF map types?
On Sat, 28 Jan 2017 10:14:58 +1100 Brendan Gregg wrote: > I did some in the bcc ref guide, but it's incomplete, and the bcc versions: > https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md Hi Brendan, I also included BCC in my documentation, please correct me: http://prototype-kernel.readthedocs.io/en/latest/bpf/bcc_tool_chain.html https://github.com/netoptimizer/prototype-kernel/commit/b61a7ecfe08d2cc0 -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] Documentation on eBPF map types?
On Thu, 2 Feb 2017 11:56:19 +0100 Jesper Dangaard Brouer wrote: > On Tue, 31 Jan 2017 20:54:10 -0800 > Alexei Starovoitov wrote: > > > On Tue, Jan 31, 2017 at 9:54 AM, Jesper Dangaard Brouer via > > iovisor-dev wrote: > > > > > > On Sat, 28 Jan 2017 10:14:58 +1100 Brendan Gregg > > > wrote: > > > > > >> I did some in the bcc ref guide, but it's incomplete, and the bcc > > >> versions: > > >> https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md > > > > > > Thanks! - this seem rather BCC specific syntax, and I'm looking for > > > documentation close for the kernel (samples/bpf/). > > > > > > The best documentation I found was the man-page for the syscall bpf(2): > > > http://man7.org/linux/man-pages/man2/bpf.2.html > > > > > > In lack of a better place, I've started documenting eBPF here: > > > https://prototype-kernel.readthedocs.io/en/latest/bpf/index.html > > > > > > This doc is compatible with the kernels doc format, and I hope we can > > > push this into the kernel tree, if it turns out to be valuable? > > > (https://www.kernel.org/doc/html/latest/) > > > > yeah. definitely would be great to add map descriptions to the kernel docs. > > So far most of it is in commit logs. > > git log kernel/bpf/arraymap.c|tail -33 > > git log kernel/bpf/hashtab.c|tail -33 > > will give an overview of key hash and array map principles. > > Thanks, I'm using that to write some doc. > http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html > Gotten to BPF_MAP_TYPE_ARRAY > > http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#bpf-map-type-array > > > Can you explain the difference between the kernel and userspace side of > the call bpf_map_lookup_elem() ? > > Kernel side: > long *value = bpf_map_lookup_elem(&my_map, &key); > > Userspace side: > long long value; > bpf_map_lookup_elem(map_fd[0], &key, &value) > > Looks like userspace gets a copy of the memory... > If so, how can userspace then increment the value safely? 
I documented this myself, please correct me. http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#creating-a-map http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#interacting-with-maps Commit: https://github.com/netoptimizer/prototype-kernel/commit/ffbcdc453f3c Inlined below: - [PATCH] doc: how interacting with eBPF maps works Documented by reading the code. I hope someone more knowledgeable will review this and correct me where I misunderstood things. Signed-off-by: Jesper Dangaard Brouer --- kernel/Documentation/bpf/ebpf_maps.rst | 126 ++-- 1 file changed, 120 insertions(+), 6 deletions(-) diff --git a/kernel/Documentation/bpf/ebpf_maps.rst b/kernel/Documentation/bpf/ebpf_maps.rst index 562edd566e0b..55068c7f3dab 100644 --- a/kernel/Documentation/bpf/ebpf_maps.rst +++ b/kernel/Documentation/bpf/ebpf_maps.rst @@ -23,13 +23,128 @@ and accessed by multiple programs (from man-page `bpf(2)`_): up to the user process and eBPF program to decide what they store inside maps. +Creating a map +== + +A maps is created based on a request from userspace, via the `bpf`_ +syscall (`bpf_cmd`_ BPF_MAP_CREATE), and returns a new file descriptor +that refers to the map. These are the setup arguments when creating a +map. + +.. code-block:: c + + struct { /* anonymous struct used by BPF_MAP_CREATE command */ + __u32 map_type; /* one of enum bpf_map_type */ + __u32 key_size; /* size of key in bytes */ + __u32 value_size; /* size of value in bytes */ + __u32 max_entries;/* max number of entries in a map */ + __u32 map_flags; /* prealloc or not */ + }; + +For programs under samples/bpf/ the ``load_bpf_file()`` call (from +`samples/bpf/bpf_load`_) takes care of parsing elf file compiled by +LLVM, pickup 'maps' section and creates maps via BPF syscall. This is +done by defining a ``struct bpf_map_def`` with an elf section +__attribute__ ``SEC("maps")``, in the xxx_kern.c file. 
The maps file +descriptor is available in the userspace xxx_user.c file, via global +array variable ``map_fd[]``, and the array map index correspons to the +order the maps sections were defined in elf file of xxx_kern.c file. + +.. code-block:: c + + struct bpf_map_def { + unsigned int type; + unsigned int key_size; + unsigned int value_size; + unsigned int max_entries; + unsigned int map_flags; + }; + + struct bpf_map_def SEC("maps") my_map = { + .type= BPF_MAP_TYPE_XXX, +
Re: [iovisor-dev] Documentation on eBPF map types?
On Tue, 31 Jan 2017 20:54:10 -0800 Alexei Starovoitov wrote: > On Tue, Jan 31, 2017 at 9:54 AM, Jesper Dangaard Brouer via > iovisor-dev wrote: > > > > On Sat, 28 Jan 2017 10:14:58 +1100 Brendan Gregg > > wrote: > > > >> I did some in the bcc ref guide, but it's incomplete, and the bcc versions: > >> https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md > > > > Thanks! - this seem rather BCC specific syntax, and I'm looking for > > documentation close for the kernel (samples/bpf/). > > > > The best documentation I found was the man-page for the syscall bpf(2): > > http://man7.org/linux/man-pages/man2/bpf.2.html > > > > In lack of a better place, I've started documenting eBPF here: > > https://prototype-kernel.readthedocs.io/en/latest/bpf/index.html > > > > This doc is compatible with the kernels doc format, and I hope we can > > push this into the kernel tree, if it turns out to be valuable? > > (https://www.kernel.org/doc/html/latest/) > > yeah. definitely would be great to add map descriptions to the kernel docs. > So far most of it is in commit logs. > git log kernel/bpf/arraymap.c|tail -33 > git log kernel/bpf/hashtab.c|tail -33 > will give an overview of key hash and array map principles. Thanks, I'm using that to write some doc. http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html Gotten to BPF_MAP_TYPE_ARRAY http://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html#bpf-map-type-array Can you explain the difference between the kernel and userspace side of the call bpf_map_lookup_elem() ? Kernel side: long *value = bpf_map_lookup_elem(&my_map, &key); Userspace side: long long value; bpf_map_lookup_elem(map_fd[0], &key, &value) Looks like userspace gets a copy of the memory... If so, how can userspace then increment the value safely? 
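The pointer-vs-copy difference behind this question can be sketched in plain userspace C, using a single long instead of a real map element. The helper names here are made up for the illustration; they are not the real kernel or libbpf functions:

```c
static long map_element;	/* stands in for the 'value' stored in a map */

/* kernel-side style: bpf_map_lookup_elem() returns a direct pointer,
 * so increments through it are visible in the map itself */
static long *kernel_style_lookup(void)
{
	return &map_element;
}

/* userspace style: bpf_map_lookup_elem(fd, key, value) copies the value
 * out, so incrementing the copy does not change the stored element */
static void userspace_style_lookup(long *value)
{
	*value = map_element;
}
```

This is why userspace cannot safely do a read-modify-write increment: another writer can update the element between the copy-out and the write-back.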
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] Documentation on eBPF map types?
On Sat, 28 Jan 2017 10:14:58 +1100 Brendan Gregg wrote: > I did some in the bcc ref guide, but it's incomplete, and the bcc versions: > https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md Thanks! - this seems to be rather BCC-specific syntax, and I'm looking for documentation closer to the kernel (samples/bpf/). The best documentation I found was the man-page for the syscall bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html For lack of a better place, I've started documenting eBPF here: https://prototype-kernel.readthedocs.io/en/latest/bpf/index.html This doc is compatible with the kernel's doc format, and I hope we can push this into the kernel tree, if it turns out to be valuable? (https://www.kernel.org/doc/html/latest/) --Jesper > On Fri, Jan 27, 2017 at 9:54 PM, Jesper Dangaard Brouer via iovisor-dev < > iovisor-dev@lists.iovisor.org> wrote: > > > Hi IOvisor/eBPF people, > > > > Do we have some documentation on eBPF maps? > > > > Like that map types are available, and what they are useful for? > > > > Notice, just list listing[1] the enum is not so useful: > > > > enum bpf_map_type { > > BPF_MAP_TYPE_UNSPEC, > > BPF_MAP_TYPE_HASH, > > BPF_MAP_TYPE_ARRAY, > > BPF_MAP_TYPE_PROG_ARRAY, > > BPF_MAP_TYPE_PERF_EVENT_ARRAY, > > BPF_MAP_TYPE_PERCPU_HASH, > > BPF_MAP_TYPE_PERCPU_ARRAY, > > BPF_MAP_TYPE_STACK_TRACE, > > BPF_MAP_TYPE_CGROUP_ARRAY, > > BPF_MAP_TYPE_LRU_HASH, > > BPF_MAP_TYPE_LRU_PERCPU_HASH, > > }; > > > > [1] http://lxr.free-electrons.com/source/tools/include/uapi/ > > linux/bpf.h?v=4.9#L78 > > > > I also lack some documentation about how I interact with these maps... -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
[iovisor-dev] Documentation on eBPF map types?
Hi IOvisor/eBPF people, Do we have some documentation on eBPF maps? Like what map types are available, and what they are useful for? Notice, just listing[1] the enum is not so useful: enum bpf_map_type { BPF_MAP_TYPE_UNSPEC, BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_PERCPU_ARRAY, BPF_MAP_TYPE_STACK_TRACE, BPF_MAP_TYPE_CGROUP_ARRAY, BPF_MAP_TYPE_LRU_HASH, BPF_MAP_TYPE_LRU_PERCPU_HASH, }; [1] http://lxr.free-electrons.com/source/tools/include/uapi/linux/bpf.h?v=4.9#L78 I also lack some documentation about how I interact with these maps... -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] Howto debug eBPF programs and bpf_trace_printk
On Tue, 24 Jan 2017 10:52:39 +0100 Jesper Dangaard Brouer wrote: > Hi IOvisor-group, > > I'm playing with kernels samples/bpf, and want so advice on debugging > eBPF programs. > > First question: I noticed bpf_trace_printk(), but nothing shows up when > I'm using it... what did I miss/forget? (So, let me answer my own question.) The trace "printk" output goes into: /sys/kernel/debug/tracing/trace_pipe And samples/bpf have a function read_trace_pipe() that reads from this. > Second question: When doing some mistake the eBPF validator does not > like, I get a dump of eBPF asm codes, which I cannot correlate to line > number in the pseudo-C code. Is there some debug option I can enable > to get some more help? (Saw William had some additional ";" output) Third question: Sometimes I get an error while loading the program like: $ sudo ./tracex1 bpf_load_program() err=22 How do I lookup what err=N means? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
[iovisor-dev] Howto debug eBPF programs and bpf_trace_printk
Hi IOvisor-group, I'm playing with the kernel's samples/bpf, and want some advice on debugging eBPF programs. First question: I noticed bpf_trace_printk(), but nothing shows up when I'm using it... what did I miss/forget? Second question: When I make some mistake the eBPF validator does not like, I get a dump of eBPF asm code, which I cannot correlate to line numbers in the pseudo-C code. Is there some debug option I can enable to get some more help? (Saw William had some additional ";" output) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] Explaining RX-stages for XDP
On Tue, 27 Sep 2016 19:12:44 -0700 Alexei Starovoitov wrote: > On Tue, Sep 27, 2016 at 11:32:37AM +0200, Jesper Dangaard Brouer wrote: > > > > Let me try in a calm way (not like [1]) to explain how I imagine that > > the XDP processing RX-stage should be implemented. As I've pointed out > > before[2], I'm proposing splitting up the driver into RX-stages. This > > is a mental-model change, I hope you can follow my "inception" attempt. > > > > The basic concept behind this idea is, if the RX-ring contains > > multiple "ready" packets, then the kernel was too slow, processing > > incoming packets. Thus, switch into more efficient mode, which is a > > "packet-vector" mode. > > > > Today, our XDP micro-benchmarks looks amazing, and they are! But once > > real-life intermixed traffic is used, then we loose the XDP I-cache > > benefit. XDP is meant for DoS protection, and an attacker can easily > > construct intermixed traffic. Why not fix this architecturally? > > > > Most importantly concept: If XDP return XDP_PASS, do NOT pass the > > packet up the network stack immediately (that would flush I-cache). > > Instead store the packet for the next RX-stage. Basically splitting > > the packet-vector into two packet-vectors, one for network-stack and > > one for XDP. Thus, intermixed XDP vs. netstack not longer have effect > > on XDP performance. > > > > The reason for also creating an XDP packet-vector, is to move the > > XDP_TX transmit code out of the XDP processing stage (and future > > features). This maximize I-cache availability to the eBPF program, > > and make eBPF performance more uniform across drivers. > > > > > > Inception: > > * Instead of individual packets, see it as a RX packet-vector. > > * XDP should be seen as a stage *before* the network stack gets called. > > > > If your mind can handle it: I'm NOT proposing a RX-vector of 64-packets. > > I actually want N-packet per vector (8-16). 
As the NIC HW RX process > > runs concurrently, and by the time it takes to process N-packets, more > > packets have had a chance to arrive in the RX-ring queue. > > Sounds like what Edward was proposing earlier with building > link list of skbs and passing further into stack? > Or the idea is different ? The idea is quite different. It has nothing to do with Edward's proposal[3]. The RX packet-vector is simply an array, either of pointers or index numbers (into the RX-ring). The needed changes are completely contained inside the driver. > As far as intermixed XDP vs stack traffic, I think for DoS case the > traffic patterns are binary. Either all of it is good or under attack > most of the traffic is bad, so makes sense to optimize for these two. > 50/50 case I think is artificial and not worth optimizing for. Sorry, but I feel you have completely misunderstood the concept of my idea. It does not matter what traffic pattern you believe or don't believe in, it is irrelevant. The fact is that intermixed traffic is possible with the current solution. The core of my idea is to remove the possibility for this intermixed traffic to occur, simply by seeing XDP as an RX-stage before the stack. > For all good traffic whether xdp is there or not shouldn't matter > for this N-vector optimization. Whether it's a batch of 8, 16 or 64, > either via link-list or array, it should probably be a generic > mechanism independent of any xdp stuff. I also feel you have misunderstood the N-vector "optimization". But, yes, this introduction of RX-stages is independent of XDP. The RX-stages are a generic change to the driver's programming model. [...] > I think existing mlx4+xdp is already optimized for 'mostly attack' traffic > and performs pretty well, since imo 'all drop' benchmark is accurate. The "all drop" benchmark is as artificial as it gets. I think Eric agrees. My idea is confining the XDP_DROP part to an RX-stage _before_ the netstack. 
Then, the "all drop" benchmark number will get a little more trustworthy. > Optimizing xdp for 'mostly good' traffic is indeed a challange. > We'd need all the tricks to make it as good as normal skb-based traffic. > > I haven't seen any tests yet comparing xdp with 'return XDP_PASS' program > vs no xdp at all running netperf tcp/udp in user space. It shouldn't > be too far off. Well, I did post numbers to the list with a 'return XDP_PASS' program[4]: https://mid.mail-archive.com/netdev@vger.kernel.org/msg122350.html Wake up and smell the coffee, please revise your assumptions: * It showed that the performance reduction is 25.98%!!! (AB comparison dropping packets in iptables raw) Conclusion: These measurements confirm that we need a page recycle facility for the drivers before switching to order-0 allocations. I did the same kind of experiment with mlx5, where I changed the memory model to order-0 pages and then implemented page_pool on top. (The numbers below are from before I implemented the DMA part of page_pool, which does work now.) page_pool work -
[iovisor-dev] Explaining RX-stages for XDP
Let me try in a calm way (not like [1]) to explain how I imagine that the XDP processing RX-stage should be implemented. As I've pointed out before[2], I'm proposing splitting up the driver into RX-stages. This is a mental-model change; I hope you can follow my "inception" attempt. The basic concept behind this idea is: if the RX-ring contains multiple "ready" packets, then the kernel was too slow at processing incoming packets. Thus, switch into a more efficient mode, which is a "packet-vector" mode. Today, our XDP micro-benchmarks look amazing, and they are! But once real-life intermixed traffic is used, then we lose the XDP I-cache benefit. XDP is meant for DoS protection, and an attacker can easily construct intermixed traffic. Why not fix this architecturally? Most important concept: If XDP returns XDP_PASS, do NOT pass the packet up the network stack immediately (that would flush I-cache). Instead, store the packet for the next RX-stage. Basically splitting the packet-vector into two packet-vectors, one for the network stack and one for XDP. Thus, intermixed XDP vs. netstack no longer has an effect on XDP performance. The reason for also creating an XDP packet-vector is to move the XDP_TX transmit code out of the XDP processing stage (and future features). This maximizes I-cache availability to the eBPF program, and makes eBPF performance more uniform across drivers. Inception: * Instead of individual packets, see it as an RX packet-vector. * XDP should be seen as a stage *before* the network stack gets called. If your mind can handle it: I'm NOT proposing an RX-vector of 64 packets. I actually want N packets per vector (8-16). As the NIC HW RX process runs concurrently, and by the time it takes to process N packets, more packets have had a chance to arrive in the RX-ring queue. 
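The RX-stage split proposed above can be sketched in userspace C: run XDP over a small vector of packets first, partitioning it into an XDP_TX vector and a netstack vector, instead of handling each packet to completion. All names here are illustrative, not driver code:

```c
#define RX_VEC_SZ 8

enum rx_verdict { RX_DROP, RX_TX, RX_PASS };

struct rx_split {
	int tx[RX_VEC_SZ];	/* ring indexes XDP wants retransmitted */
	int n_tx;
	int stack[RX_VEC_SZ];	/* ring indexes destined for the netstack */
	int n_stack;
};

/* XDP RX-stage: classify the whole vector before the netstack runs.
 * Dropped packets simply vanish; the two output vectors keep the XDP
 * I-cache hot while deferring netstack work to the next stage. */
static void xdp_rx_stage(const enum rx_verdict *verdicts, int n,
			 struct rx_split *out)
{
	out->n_tx = out->n_stack = 0;
	for (int i = 0; i < n; i++) {
		if (verdicts[i] == RX_TX)
			out->tx[out->n_tx++] = i;
		else if (verdicts[i] == RX_PASS)
			out->stack[out->n_stack++] = i;
	}
}
```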
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer [1] https://mid.mail-archive.com/netdev@vger.kernel.org/msg127043.html [2] http://lists.openwall.net/netdev/2016/01/15/51 [3] http://lists.openwall.net/netdev/2016/04/19/89 ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] XDP (eXpress Data Path) documentation
On Tue, 20 Sep 2016 11:08:44 +0200 Jesper Dangaard Brouer wrote: > As promised, I've started documenting the XDP eXpress Data Path): > > [1] > https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html > > IMHO the documentation have reached a stage where it is useful for the > XDP project, BUT I request collaboration on improving the documentation > from all. (Native English speakers are encouraged to send grammar fixes ;-)) I want to publicly thank Edward Cree for being the first contributor to the XDP documentation with formulation and grammar fixes. Pulled and pushed: https://github.com/netoptimizer/prototype-kernel/commit/fb6a3de95 Thanks! -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] XDP (eXpress Data Path) documentation
On Wed, 21 Sep 2016 17:03:24 -0700 Tom Herbert wrote: > On Tue, Sep 20, 2016 at 2:08 AM, Jesper Dangaard Brouer > wrote: > > Hi all, > > > > As promised, I've started documenting the XDP eXpress Data Path): > > > > [1] > > https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html > > > > IMHO the documentation have reached a stage where it is useful for the > > XDP project, BUT I request collaboration on improving the documentation > > from all. (Native English speakers are encouraged to send grammar fixes ;-)) > > > Hi Jesper, > > Thanks for taking the initiative on this. The document reads more > like a design doc than description right now, that's probably okay > since we could use a design doc. Yes, I fully agree. I want to state very clearly: this document is not an attempt to hijack the XDP project and control the "spec". This is an attempt to collaborate. We discuss things on the mailing list, each with our own vision of the project, and most times we reach an agreement. But nobody documents these agreements. Months later, we make implementation choices that go against them, because we simply forgot. If someone remembers, we have to reiterate the same arguments again (like just happened with the XDP_ABORTED action, my mistake). And can anybody remember the consensus around VLANs? It never got implemented that way... I had to start the doc project somewhere, so I dumped my own vision into the docs, plus what I could remember from upstream discussions. I need collaboration from others to adjust and "fix" my vision of the project, into something that becomes a common ground we all can agree on. If some part of the docs provokes you, good, then you have a reason to correct and fix it. I'll do my best to keep a very open mind about any changes. This should be a very "live" document. > Under "Important to understand" there are some disclaimers that XDP > does not implement qdiscs or BQL and fairness otherwise. 
This is true > for its own traffic, but it does not (or at least should not) affect > these mechanisms or normal stack traffic running simultaneously. I > think we've made assumptions about fairness between XDP and non-XDP > queues, we probably want to clarify fairness (and also validate > whatever assumptions we've made with testing). I love people pointing out mistakes in the documentation. I want to update it ASAP when people point them out. I'll even do the work of integrating and committing these changes, for people too lazy to git clone the repo themselves. For your section, Tom, I fully agree, but I don't know how to formulate and adjust this in the text. For people that want to edit the docs, notice the link "Edit on GitHub" which takes you directly to the file you need to edit... > > You wouldn't believe it: But this pretty looking documentation actually > > follows the new Kernel documentation format. It is actually just > > ".rst" text files stored in my github repository under kernel/Documentation > > [2] > > > > [2] > > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/Documentation > > > > Thus, just git clone my repository and started editing and send me > > patches (or github pull requests). Like: > > > > $ git clone https://github.com/netoptimizer/prototype-kernel > > $ cd prototype-kernel/kernel/Documentation/ > > $ make html > > $ firefox _build/html/index.html & > > > > This new documentation format combines the best of two worlds, pretty > > online browser documentation with almost plain text files, and changes > > being tracked via git commits [3] (and auto git hooks to generate the > > readthedocs.org page). You got to love it! 
:-) > > > > -- > > Best regards, > > Jesper Dangaard Brouer > > MSc.CS, Principal Kernel Engineer at Red Hat > > Author of http://www.iptv-analyzer.org > > LinkedIn: http://www.linkedin.com/in/brouer > > > > [3] https://github.com/netoptimizer/prototype-kernel/commits/master -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] XDP (eXpress Data Path) documentation
On Tue, 20 Sep 2016 19:47:07 -0700 Alexei Starovoitov wrote: > On Tue, Sep 20, 2016 at 11:08:44AM +0200, Jesper Dangaard Brouer via > iovisor-dev wrote: > > Hi all, > > > > As promised, I've started documenting the XDP eXpress Data Path): > > > > [1] > > https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html > > > > IMHO the documentation have reached a stage where it is useful for the > > XDP project, BUT I request collaboration on improving the documentation > > from all. (Native English speakers are encouraged to send grammar fixes ;-)) > > > > You wouldn't believe it: But this pretty looking documentation actually > > follows the new Kernel documentation format. It is actually just > > ".rst" text files stored in my github repository under kernel/Documentation > > [2] > > > > [2] > > https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/Documentation > > > > Thanks so much for doing it. This is great start! > Some minor editing is needed here and there. > To make it into official doc do you mind preparing a patch for Jon's doc tree > ? > If you think the doc is too volatile and not suitable for kernel.org, > another alternative is to host it on https://github.com/iovisor > since it's LF collaborative project it won't disappear suddenly. > You can be a maintainer of that repo if you like. I do see this as kernel documentation that eventually should end up in Jon's doc tree. Right now it is too volatile. Once XDP has "stabilized" some more, I plan to push/submit the *relevant* pieces into the kernel. E.g. I plan to have a "proposals" section, which is not meant as upstream doc, since it is an intermediate specification step before implementing; afterwards it should move to another doc section. Likewise, some of the use-case documents might be "rejected" before reaching upstream doc. 
The reason I've not created a separate repository for XDP doc only is that I also plan to document other parts of the kernel in this repo[3], not just XDP. Like my page_pool work. The documentation is not really official documentation before it reaches a kernel git tree, together with the code it documents. I hope this approach can help us document while developing, and turn email discussions into specifications. For example, I forgot the argument for XDP_ABORTED not warning, which you had to re-iterate; now it is documented[4][5]. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer [3] https://github.com/netoptimizer/prototype-kernel/ [4] https://github.com/netoptimizer/prototype-kernel/commit/a4e60e2d7a894 [5] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/userspace_api.html#troubleshooting-and-monitoring ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] XDP user interface confusions
On Thu, 15 Sep 2016 11:58:35 -0700 Brenden Blanco wrote: > On Thu, Sep 15, 2016 at 08:14:02PM +0200, Jesper Dangaard Brouer wrote: > > Hi Brenden, > > > > I don't quite understand the semantics of the XDP userspace interface. > > > > We allow XDP programs to be (unconditionally) exchanged by another > > program, this avoids taking the link down+up and avoids reallocating > > RX ring resources (which is great). > > > > We have two XDP samples programs in samples/bpf/ xdp1 and xdp2. Now I > > want to first load xdp1 and then to avoid the linkdown I load xdp2, > > and then afterwards remove/stop program xdp1. > > > > This does NOT work, because (in samples/bpf/xdp1_user.c) when xdp1 > > exits it unconditionally removes the running XDP program (loaded by xdp2) > > via set_link_xdp_fd(ifindex, -1). The xdp2 user program is still > > running, and is unaware of its xdp/bpf program have been unloaded. > > > > I find this userspace interface confusing. What this your intention? > > Perhaps you can explain what the intended semantics or specification is? > > In practice, we've used a single agent process to manage bpf programs on > behalf of the user applications. This agent process uses common linux > functionalities to add semantics, while not really relying on the bpf > handles themselves to take care of that. For instance, the process may > put some lockfiles and what-not in /var/run/$PID, and maybe returns the > list of running programs through a http: or unix: interface. > > So, from a user<->kernel API, the requirements are minimal...the agent > process just overwrites the loaded bpf program when the application > changes, or a new application comes online. There is nobody to 'notify' > when a handle changes. > > When translating this into the kernel api that you see now, none of this > exists, because IMHO the kernel api should be unopinionated and generic. 
> The result is something that appears very "fire-and-forget", which > results in something simple yet safe at the same time; the refcounting > is done transparently by the kernel. > > So, in practice, there is no xdp1 or xdp2, just xdp-agent at different > points in time. Or, better yet, no agent, just the programs running in > the kernel, with the handles of the programs residing solely in the > device, which are perhaps pinned to /sys/fs/bpf for semantic management > purposes. I didn't feel like it was appropriate to conflate different > bpf features in the kernel samples, so we don't see (and probably never > will) a sample which combines these features into a whole. That is best > left to userspace tools. It so happens that this is one of the projects > I am currently active on at $DAYJOB, and we fully intend to share the > details of that when it's in a suitable state. For the record, I'm not happy with this response. IMHO the XDP userspace API should have been thought through more. Where is the specification and documentation for this interface? There is no official XDP userspace daemon, so your proposed file-lock scheme falls apart. We can easily imagine several different open-source projects using XDP; they will not coordinate which /var/run lock-file to use. An admin installs several of these to evaluate them. Later he gets really confused, because both of them run by mistake (and it is random which one "wins" due to systemd parallel boot). The programs themselves have no way to catch this error situation and report it. This is the main mistake of this API approach. The admin also has a hard time debugging this, as our query interface is a bool, so he just sees that some XDP program _is_ running. Sorry, but we have to do better than this. We have to maintain these APIs for the next 10+ years. 
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
[iovisor-dev] XDP (eXpress Data Path) documentation
Hi all, As promised, I've started documenting XDP (the eXpress Data Path): [1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html IMHO the documentation has reached a stage where it is useful for the XDP project, BUT I request collaboration on improving the documentation from all. (Native English speakers are encouraged to send grammar fixes ;-)) You wouldn't believe it: But this pretty looking documentation actually follows the new Kernel documentation format. It is actually just ".rst" text files stored in my github repository under kernel/Documentation [2] [2] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/Documentation Thus, just git clone my repository, start editing, and send me patches (or github pull requests). Like: $ git clone https://github.com/netoptimizer/prototype-kernel $ cd prototype-kernel/kernel/Documentation/ $ make html $ firefox _build/html/index.html & This new documentation format combines the best of two worlds: pretty online browser documentation with almost plain text files, and changes being tracked via git commits [3] (and auto git hooks to generate the readthedocs.org page). You've got to love it! :-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer [3] https://github.com/netoptimizer/prototype-kernel/commits/master ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
[iovisor-dev] XDP user interface confusions
Hi Brenden, I don't quite understand the semantics of the XDP userspace interface. We allow XDP programs to be (unconditionally) exchanged by another program, this avoids taking the link down+up and avoids reallocating RX ring resources (which is great). We have two XDP sample programs in samples/bpf/: xdp1 and xdp2. Now I want to first load xdp1 and then, to avoid the link-down, I load xdp2, and afterwards remove/stop program xdp1. This does NOT work, because (in samples/bpf/xdp1_user.c) when xdp1 exits it unconditionally removes the running XDP program (loaded by xdp2) via set_link_xdp_fd(ifindex, -1). The xdp2 user program is still running, and is unaware that its xdp/bpf program has been unloaded. I find this userspace interface confusing. Was this your intention? Perhaps you can explain what the intended semantics or specification is? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Tue, 13 Sep 2016 08:58:30 -0700 Eric Dumazet wrote: > We also care about icache pressure, and GRO/TSO already provides > bundling where it is applicable, without adding insane complexity in > the stacks. Sorry, I cannot resist. The GRO code is really bad regarding icache pressure/usage, due to how everything is function pointers calling function pointers, even when the general case is calling the function defined right next to it in the same C-file (which would normally allow inlining). I can easily get 10% more performance for UDP use-cases by simply disabling the GRO code, and I measure a significant drop in icache-misses. Edward's solution should lower icache pressure. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
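To illustrate the pattern being criticized, here is a minimal sketch — hypothetical names, not the actual GRO code — of dispatch through a function-pointer table versus a direct call. The indirect call is what defeats inlining and spreads the hot path across the I-cache:

```c
#include <assert.h>

/* Illustrative only: gro_receive and proto_callbacks are invented
 * stand-ins for the real per-protocol callback tables. */

struct proto_callbacks {
	int (*gro_receive)(int len);	/* indirect: compiler cannot inline */
};

static int tcp_gro_receive(int len)
{
	return len + 1;			/* stand-in for real work */
}

static const struct proto_callbacks tcp_cb = {
	.gro_receive = tcp_gro_receive,
};

/* Indirect dispatch, as in the GRO-style code path. */
static int receive_indirect(int len)
{
	return tcp_cb.gro_receive(len);
}

/* Direct call to the function defined just above: same result,
 * but the compiler is free to inline it. */
static int receive_direct(int len)
{
	return tcp_gro_receive(len);
}
```

Both paths compute the same result; the difference only shows up in generated code and icache-miss counters, which is exactly why the cost is easy to overlook.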
Re: [iovisor-dev] [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle
On Tue, 13 Sep 2016 13:16:29 +0300 Tariq Toukan wrote: > On 07/09/2016 9:45 PM, Jesper Dangaard Brouer wrote: > > On Wed, 7 Sep 2016 15:42:24 +0300 Saeed Mahameed > > wrote: > > > >> From: Tariq Toukan > >> > >> Instead of reallocating and mapping pages for RX data-path, > >> recycle already used pages in a per ring cache. > >> > >> We ran pktgen single-stream benchmarks, with iptables-raw-drop: > >> > >> Single stride, 64 bytes: > >> * 4,739,057 - baseline > >> * 4,749,550 - order0 no cache > >> * 4,786,899 - order0 with cache > >> 1% gain > >> > >> Larger packets, no page cross, 1024 bytes: > >> * 3,982,361 - baseline > >> * 3,845,682 - order0 no cache > >> * 4,127,852 - order0 with cache > >> 3.7% gain > >> > >> Larger packets, every 3rd packet crosses a page, 1500 bytes: > >> * 3,731,189 - baseline > >> * 3,579,414 - order0 no cache > >> * 3,931,708 - order0 with cache > >> 5.4% gain > >> > >> Signed-off-by: Tariq Toukan > >> Signed-off-by: Saeed Mahameed > >> --- > >> drivers/net/ethernet/mellanox/mlx5/core/en.h | 16 ++ > >> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 15 ++ > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 57 > >> -- > >> drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++ > >> 4 files changed, 99 insertions(+), 5 deletions(-) > >> > >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h > >> b/drivers/net/ethernet/mellanox/mlx5/core/en.h > >> index 075cdfc..afbdf70 100644 > >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h > >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h > >> @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */ > >>u8 tired; > >> }; > >> > >> +/* a single cache unit is capable to serve one napi call (for > >> non-striding rq) > >> + * or a MPWQE (for striding rq). > >> + */ > >> +#define MLX5E_CACHE_UNIT (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? 
\ > >> + MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT) > >> +#define MLX5E_CACHE_SIZE (2 * roundup_pow_of_two(MLX5E_CACHE_UNIT)) > >> +struct mlx5e_page_cache { > >> + u32 head; > >> + u32 tail; > >> + struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE]; > >> +}; > >> + > > [...] > >> > >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > >> b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > >> index c1cb510..8e02af3 100644 > >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > >> @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct > >> mlx5e_rq *rq, u16 ix) > >>mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0); > >> } > >> > >> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq, > >> +struct mlx5e_dma_info *dma_info) > >> +{ > >> + struct mlx5e_page_cache *cache = &rq->page_cache; > >> + u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1); > >> + > >> + if (tail_next == cache->head) { > >> + rq->stats.cache_full++; > >> + return false; > >> + } > >> + > >> + cache->page_cache[cache->tail] = *dma_info; > >> + cache->tail = tail_next; > >> + return true; > >> +} > >> + > >> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq, > >> +struct mlx5e_dma_info *dma_info) > >> +{ > >> + struct mlx5e_page_cache *cache = &rq->page_cache; > >> + > >> + if (unlikely(cache->head == cache->tail)) { > >> + rq->stats.cache_empty++; > >> + return false; > >> + } > >> + > >> + if (page_ref_count(cache->page_cache[cache->head].page) != 1) { > >> + rq->stats.cache_busy++; > >> + return false; > >> + } > > Hmmm... doesn't this cause "blocking" of the page_cache recycle > > facility until the page at the head of the queue gets (page) refcnt > > decremented? Real use-case could fairly easily block/cause this... > Hi Jesper, > > That's right. We are aware of this issue. > We considered ways of solving this, but decided to keep current > implementation for now. 
> One way of solving this is to look deeper in the cache. > Cons: > - this will consume time, and the chance of finding an available page is > not that high: if the page in head of queue is busy then there's a good > chance that all the others are too (because of FIFO). > in other words, you already checked all pages and anyway you're going to > allocate a new one (higher penalty for same decision). > - this will make holes in the array causing complex accounting when > looking for an available page (this can easily be fixed by swapping > between the page in head and the available one). > > Another way is sharing pages between different RQs. > - For now we're not doing this for simplicity and to keep > synchronization away. > > What do you think? > > Anyway, we're looking forward to use your page-pool API which solves > these issues. Ye
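For reference, the recycle logic under discussion can be modeled in userspace like this — a sketch mirroring the quoted mlx5e code, with kernel types replaced by stand-ins. The head-busy check shows how a single page with an elevated refcount blocks the whole FIFO:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SIZE 8		/* must be a power of two, like MLX5E_CACHE_SIZE */

struct fake_page {
	int refcount;		/* stand-in for page_ref_count() */
};

struct page_cache {
	unsigned int head;
	unsigned int tail;
	struct fake_page *slots[CACHE_SIZE];
};

/* Mirrors mlx5e_rx_cache_put: fails when the ring is full. */
static bool cache_put(struct page_cache *c, struct fake_page *p)
{
	unsigned int tail_next = (c->tail + 1) & (CACHE_SIZE - 1);

	if (tail_next == c->head)
		return false;	/* cache_full */
	c->slots[c->tail] = p;
	c->tail = tail_next;
	return true;
}

/* Mirrors mlx5e_rx_cache_get: a busy page at the head blocks
 * everything behind it, even pages that are already free. */
static struct fake_page *cache_get(struct page_cache *c)
{
	struct fake_page *p;

	if (c->head == c->tail)
		return NULL;	/* cache_empty */
	p = c->slots[c->head];
	if (p->refcount != 1)
		return NULL;	/* cache_busy: head-of-queue blocking */
	c->head = (c->head + 1) & (CACHE_SIZE - 1);
	return p;
}
```

Putting a busy page in first and a free page second makes the limitation concrete: no page can be recycled until the head page's refcount drops back to 1.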
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Mon, 12 Sep 2016 12:56:28 -0700 Alexei Starovoitov wrote: > On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote: > > On Thu, 8 Sep 2016 23:30:50 -0700 > > Alexei Starovoitov wrote: > > > > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote: > > [...] > > > > Imagine you have packets intermixed towards the stack and XDP_TX. > > > > Every time you call the stack code, then you flush your icache. When > > > > returning to the driver code, you will have to reload all the icache > > > > associated with the XDP_TX, this is a costly operation. > > > > > [...] > > > To make further progress in this discussion can we talk about > > > the use case you have in mind instead? Then solution will > > > be much clear, I hope. > > > > The DDoS use-case _is_ affected by this "hidden" bulking design. > > > > Lets say, I want to implement a DDoS facility. Instead of just > > dropping the malicious packets, I want to see the bad packets. I > > implement this by rewriting the destination-MAC to be my monitor > > machine and then XDP_TX the packet. > > not following the use case. you want to implement a DDoS generator? > Or just forward all bad packets from affected host to another host > in the same rack? so two servers will be spammed with traffic and > even more load on the tor? I really don't see how this is useful > for anything but stress testing. As I wrote below, the purpose of the monitor machine is to diagnose false positives. If you worry about the added load, I would either forward out another interface (which is not supported yet) or simply sample the packets being forwarded to the monitor host. > > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100% > > of the traffic is delivered to the stack. (See note 1) > > hmm. DoS prevention use case is when 99% of the traffic is dropped. As I wrote below, until the DDoS attack starts, all packets are delivered to the stack. 
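The sampling idea could be as simple as a counter-based 1-in-N selector. This is an illustrative sketch with invented names, not tied to any real XDP helper:

```c
#include <assert.h>

/* Hypothetical sketch: only every Nth matched (bad) packet is
 * redirected to the monitor host; the rest are dropped, keeping
 * the load on the monitor link bounded. */

enum action { ACT_DROP, ACT_TX_TO_MONITOR };

struct sampler {
	unsigned int interval;	/* forward 1 out of every 'interval' */
	unsigned int counter;
};

static enum action sample_bad_packet(struct sampler *s)
{
	if (++s->counter >= s->interval) {
		s->counter = 0;
		return ACT_TX_TO_MONITOR;	/* rewrite MAC + XDP_TX */
	}
	return ACT_DROP;
}
```

With interval = 4, one in four malicious packets reaches the monitor, which is usually plenty to diagnose false positives.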
> > Once the DDoS attack starts, then the traffic pattern changes, and XDP > > should (hopefully only) catch the malicious traffic (monitor machine can > > help diagnose false positive). Now, due to interleaving the DDoS > > traffic with the clean traffic, then efficiency of XDP_TX is reduced due to > > more icache misses... > > > > > > > > Note(1): Notice I have already demonstrated that loading a XDP/eBPF > > program with 100% delivery to the stack, actually slows down the > > normal stack. This is due to hitting a bottleneck in the page > > allocator. I'm working removing that bottleneck with page_pool, and > > that solution is orthogonal to this problem. > > sure. no one arguing against improving page allocator. > > > It is actually an excellent argument, for why you would want to run a > > DDoS XDP filter only on a restricted number of RX queues. > > no. it's the opposite. If the host is under DoS there is no way > the host can tell in advance which rx queue will be seeing bad > packets. Sorry, this note was not related to the DoS use-case. You misunderstood it. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Thu, 8 Sep 2016 23:30:50 -0700 Alexei Starovoitov wrote: > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote: [...] > > Imagine you have packets intermixed towards the stack and XDP_TX. > > Every time you call the stack code, then you flush your icache. When > > returning to the driver code, you will have to reload all the icache > > associated with the XDP_TX, this is a costly operation. > [...] > To make further progress in this discussion can we talk about > the use case you have in mind instead? Then solution will > be much clear, I hope. The DDoS use-case _is_ affected by this "hidden" bulking design. Let's say I want to implement a DDoS facility. Instead of just dropping the malicious packets, I want to see the bad packets. I implement this by rewriting the destination-MAC to be my monitor machine and then XDP_TX the packet. In the DDoS use-case, you have loaded your XDP/eBPF program, and 100% of the traffic is delivered to the stack. (See note 1) Once the DDoS attack starts, the traffic pattern changes, and XDP should (hopefully only) catch the malicious traffic (the monitor machine can help diagnose false positives). Now, due to interleaving the DDoS traffic with the clean traffic, the efficiency of XDP_TX is reduced due to more icache misses... Note(1): Notice I have already demonstrated that loading an XDP/eBPF program with 100% delivery to the stack actually slows down the normal stack. This is due to hitting a bottleneck in the page allocator. I'm working on removing that bottleneck with page_pool; that solution is orthogonal to this problem. It is actually an excellent argument for why you would want to run a DDoS XDP filter only on a restricted number of RX queues. 
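As a sketch of the MAC-rewrite trick: a plain C model with made-up names (a real XDP program would operate on the packet data reachable through struct xdp_md, and the monitor MAC below is an arbitrary example address):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal Ethernet header model; field layout matches the on-wire
 * order (dest MAC, source MAC, ethertype). */
struct ethhdr_model {
	uint8_t h_dest[6];
	uint8_t h_source[6];
	uint16_t h_proto;
};

/* Hypothetical monitor-host MAC address (locally administered). */
static const uint8_t monitor_mac[6] = {
	0x02, 0x00, 0x00, 0x00, 0x00, 0x01
};

/* Rewrite only the destination MAC before returning XDP_TX, so the
 * bad packet bounces to the monitor machine instead of being dropped.
 * The source MAC is left intact for the monitor to inspect. */
static void redirect_to_monitor(struct ethhdr_model *eth)
{
	memcpy(eth->h_dest, monitor_mac, sizeof(monitor_mac));
}
```

After this rewrite the driver transmits the frame back out the same port (XDP_TX), and the monitor host sees the suspected-malicious traffic for false-positive analysis.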
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Fri, 9 Sep 2016 18:03:09 +0300 Saeed Mahameed wrote: > On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev > wrote: > > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote: > >> > >> I'm sorry but I have a problem with this patch! > >> Looking at this patch, I want to bring up a fundamental architectural > >> concern with the development direction of XDP transmit. > >> > >> > >> What you are trying to implement, with delaying the doorbell, is > >> basically TX bulking for TX_XDP. > >> > >> Why not implement a TX bulking interface directly instead?!? > >> > >> Yes, the tailptr/doorbell is the most costly operation, but why not > >> also take advantage of the benefits of bulking for other parts of the > >> code? (benefit is smaller, but every cycle counts in this area) > >> > >> This whole XDP exercise is about avoiding having a transaction cost per > >> packet, that reads "bulking" or "bundling" of packets, where possible. > >> > >> Lets do bundling/bulking from the start! > > Jesper, what we did here is also bulking, instead of bulking in a > temporary list in the driver we list the packets in the HW and once > done we transmit all at once via the xdp_doorbell indication. > > I agree with you that we can take advantage and improve the icache by > bulking first in software and then queue all at once in the hw then > ring one doorbell. > > but I also agree with Alexei that this will introduce an extra > pointer/list handling in the driver and we need to do the comparison > between both approaches before we decide which is better. I welcome implementing both approaches and benchmarking them against each other; I'll gladly dedicate time for this! I'm reacting so loudly because this is a mental-model switch that needs to be applied to the full driver RX path. Also for normal stack delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated, there is a 10%-25% perf gain here. 
The key point is to stop seeing the lowest driver RX as something that works on a per-packet basis. It might be easier to view this as a kind of vector processing. This is about having a vector of packets, where we apply some action/operation. This is about using the CPU more efficiently, getting it to do more instructions per cycle. The next level of optimization (for >= Sandy Bridge CPUs) is to make these vector processing stages small enough to fit into the CPU's decoded-I-cache section. It might also be important to mention that for netstack delivery, I don't imagine bulking 64 packets. Instead, I imagine doing 8-16 packets. Why? Because the NIC HW runs independently and has the opportunity to deliver more frames into the RX ring queue while the stack (slowly) processes packets. You can view this as "bulking" from the RX ring queue, with a "look-back" before exiting the NAPI poll loop. > this must be marked as future work and not have this from the start. We both know that statement is BS, and the other approach will never be implemented once this patch is accepted upstream. > > mlx4 already does bulking and this proposed mlx5 set of patches > > does bulking as well. I'm reacting exactly because mlx4 is also doing "bulking" in the wrong way IMHO. And now mlx5 is building on the same principle. That is why I'm yelling STOP. > >> The reason behind the xmit_more API is that we could not change the > >> API of all the drivers. And we found that calling an explicit NDO > >> flush came at a cost (only approx 7 ns IIRC), but it is still a cost that > >> would hit the common single packet use-case. > >> > >> It should be really easy to build a bundle of packets that need XDP_TX > >> action, especially given you only have a single destination "port". > >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. 
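The look-back bundling could look roughly like this — a userspace simulation with invented names, not driver code — where the ring is re-read after each bundle so frames that arrived while the stack ran join the next bundle within the same poll:

```c
#include <assert.h>

#define BUNDLE 16	/* 8-16 packets per bundle, per the text above */

/* Simulation state: stands in for the RX ring and the netstack. */
struct sim_ctx {
	int ring_avail;		/* packets waiting in the RX ring */
	int stack_calls;	/* how often the "stack" was entered */
	int pkts_delivered;
};

/* One stack entry per bundle instead of per packet, so the driver's
 * I-cache is flushed once per N packets instead of N times. */
static void deliver_bundle(struct sim_ctx *ctx, int n)
{
	ctx->stack_calls++;
	ctx->pkts_delivered += n;
}

static void napi_poll_bundled(struct sim_ctx *ctx, int budget)
{
	while (ctx->ring_avail > 0 && budget > 0) {
		int n = ctx->ring_avail;

		if (n > BUNDLE)
			n = BUNDLE;
		if (n > budget)
			n = budget;
		deliver_bundle(ctx, n);
		ctx->ring_avail -= n;
		budget -= n;
		/* "look-back": ring_avail is re-checked at the top of
		 * the loop before exiting the poll. */
	}
}
```

With 40 packets in the ring and a NAPI budget of 64, this makes three stack entries (16 + 16 + 8) instead of forty.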
[1] http://lists.openwall.net/netdev/2016/04/19/89 [2] http://lists.openwall.net/netdev/2016/01/15/51 -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ___ iovisor-dev mailing list iovisor-dev@lists.iovisor.org https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Thu, 8 Sep 2016 23:30:50 -0700 Alexei Starovoitov wrote: > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote: > > > > Let's do bundling/bulking from the start! > > > > > > mlx4 already does bulking and this proposed mlx5 set of patches > > > does bulking as well. > > > See nothing wrong about it. RX side processes the packets and > > > when it's done it tells TX to xmit whatever it collected. > > > > This is doing "hidden" bulking and not really taking advantage of using > > the icache more efficiently. > > > > Let me explain the problem I see a little more clearly then, so you > > hopefully see where I'm going. > > > > Imagine you have packets intermixed towards the stack and XDP_TX. > > Every time you call the stack code, then you flush your icache. When > > returning to the driver code, you will have to reload all the icache > > associated with the XDP_TX, this is a costly operation. > > correct. And why is that a problem? It is good that you can see and acknowledge the I-cache problem. XDP is all about performance. What I hear is that you are arguing against a model that will yield better performance; that does not make sense to me. Let me explain this again, in another way. This is a mental model switch. Stop seeing the lowest driver RX as something that works on a per-packet basis. Maybe it is easier to understand if we instead see this as vector processing? This is about having a vector of packets, where we apply some action/operation. This is about using the CPU more efficiently, getting it to do more instructions per cycle (directly measurable with perf, while I-cache usage is not directly measurable). Let's assume everything fits into the I-cache (XDP+driver code). The CPU front-end still has to decode the instructions from the I-cache into micro-ops. The next level of optimization is to reuse the decoded I-cache by running it on all elements in the packet-vector.
The Intel "64 and IA-32 Architectures Optimization Reference Manual" (section 3.4.2.6 "Optimization for Decoded ICache"[1][2]) states that one should make sure each hot code block is less than about 500 instructions. Thus, the different "stages" working on the packet-vector need to be rather small and compact. [1] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html [2] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf Notice: The same mental model switch applies to delivering packets to the regular netstack. I've brought this up before[3]. Instead of flushing the driver's I-cache for every packet, by calling the stack, let us instead bundle up N packets in the driver before calling the stack. I showed 10% speedup by a naive implementation of this approach. Edward Cree also showed[4] a 10% performance boost, and went further into the stack, showing a 25% increase. A goal is also to make optimizing netstack code-size independent of the driver code-size, by separating the netstack's I-cache usage from the driver's. [3] http://lists.openwall.net/netdev/2016/01/15/51 [4] http://lists.openwall.net/netdev/2016/04/19/89 -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Thu, 8 Sep 2016 20:22:04 -0700 Alexei Starovoitov wrote: > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote: > > > > I'm sorry but I have a problem with this patch! > > is it because the variable is called 'xdp_doorbell'? > Frankly I see nothing scary in this patch. > It extends existing code by adding a flag to ring doorbell or not. > The end of rx napi is used as an obvious heuristic to flush the pipe. > Looks pretty generic to me. > The same code can be used for non-xdp as well once we figure out > good algorithm for xmit_more in the stack. What I'm proposing can also be used by the normal stack. > > Looking at this patch, I want to bring up a fundamental architectural > > concern with the development direction of XDP transmit. > > > > > > What you are trying to implement, with delaying the doorbell, is > > basically TX bulking for TX_XDP. > > > > Why not implement a TX bulking interface directly instead?!? > > > > Yes, the tailptr/doorbell is the most costly operation, but why not > > also take advantage of the benefits of bulking for other parts of the > > code? (the benefit is smaller, but every cycle counts in this area) > > > > This whole XDP exercise is about avoiding having a transaction cost per > > packet, that reads "bulking" or "bundling" of packets, where possible. > > > > Let's do bundling/bulking from the start! > > mlx4 already does bulking and this proposed mlx5 set of patches > does bulking as well. > See nothing wrong about it. RX side processes the packets and > when it's done it tells TX to xmit whatever it collected. This is doing "hidden" bulking and not really taking advantage of using the icache more efficiently. Let me explain the problem I see a little more clearly then, so you hopefully see where I'm going. Imagine you have packets intermixed towards the stack and XDP_TX. Every time you call the stack code, then you flush your icache.
When returning to the driver code, you will have to reload all the icache associated with the XDP_TX, this is a costly operation. > > The reason behind the xmit_more API is that we could not change the > > API of all the drivers. And we found that calling an explicit NDO > > flush came at a cost (only approx 7 ns IIRC), but it is still a cost that > > would hit the common single packet use-case. > > > > It should be really easy to build a bundle of packets that need XDP_TX > > action, especially given you only have a single destination "port". > > And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. > > not sure what are you proposing here? > Sounds like you want to extend it to multi port in the future? > Sure. The proposed code is easily extendable. > > Or you want to see something like a link list of packets > or an array of packets that RX side is preparing and then > send the whole array/list to TX port? > I don't think that would be efficient, since it would mean > unnecessary copy of pointers. I just explained it will be more efficient due to better use of icache. > > In the future, XDP needs to support XDP_FWD forwarding of packets/pages > > out other interfaces. I also want bulk transmit from day-1 here. It > > is slightly more tricky to sort packets for multiple outgoing > > interfaces efficiently in the poll loop. > > I don't think so. Multi port is natural extension to this set of patches. > With multi port the end of RX will tell multiple ports (that were > used to tx) to ring the bell. Pretty trivial and doesn't involve any > extra arrays or link lists. So, have you solved the problem of exclusive access to a TX ring of a remote/different net_device when sending? In your design you assume there exist many TX rings available for other devices to access. In my design I also want to support devices that don't have this HW capability, and e.g. only have one TX queue.
> > But the mSwitch[1] article actually already solved this destination > > sorting. Please read[1] section 3.3 "Switch Fabric Algorithm" for > > understanding the next steps, for a smarter data structure, when > > starting to have more TX "ports". And perhaps align your single > > XDP_TX destination data structure to this future development. > > > > [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf > > I don't see how this particular paper applies to the existing kernel code. > It's great to take ideas from research papers, but real code is different. > > > --Jesper > > (top post) > > since when it's ok to top post? What a kneejerk reaction. When writing something general we often reply to the top of the email, and then often delete the rest (which makes it hard for latecomers to follow). I was bcc'ing some people, who needed the context, so it was a service note to you, that I didn't write anything below. > > On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed > > wrote: > > > > > Previously we rang XDP SQ doorbell on every forwarded XDP packet. > > > >
Re: [iovisor-dev] [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Thu, 8 Sep 2016 09:26:03 -0700 Tom Herbert wrote: > On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer > wrote: > > > > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert wrote: > > > >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend > >> wrote: > >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: > >> >> > >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed > >> >> wrote: > >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet > >> >>> wrote: > >> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > >> > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet > >> > wrote: > >> >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: > >> >> [...] > >> > >> Only if a qdisc is present and pressure is high enough. > >> > >> But in a forwarding setup, we likely receive at a lower rate than the > >> NIC can transmit. > >> >> > >> >> Yes, I can confirm this happens in my experiments. > >> >> > >> > >> >>> > >> >>> Jesper has a similar Idea to make the qdisc think it is under > >> >>> pressure, when the device TX ring is idle most of the time, i think > >> >>> his idea can come in handy here. I am not fully involved in the > >> >>> details, maybe he can elaborate more. > >> >>> > >> >>> But if it works, it will be transparent to napi, and xmit more will > >> >>> happen by design. > >> >> > >> >> Yes. I have some ideas around getting more bulking going from the qdisc > >> >> layer, by having the drivers provide some feedback to the qdisc layer > >> >> indicating xmit_more should be possible. This will be a topic at the > >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully > >> >> challenge people to come up with a good solution ;-) > >> >> > >> > > >> > One thing I've noticed but haven't yet actually analyzed much is if > >> > I shrink the nic descriptor ring size to only be slightly larger than > >> > the qdisc layer bulking size I get more bulking and better perf numbers. > >> > At least on microbenchmarks. 
The reason being the nic pushes back more > >> > on the qdisc. So maybe a case for making the ring size in the NIC some > >> > factor of the expected number of queues feeding the descriptor ring. > >> > > > > > I've also played with shrinking the NIC descriptor ring size, it works, > > but it is an ugly hack to get the NIC to push back, and I foresee it will > > hurt normal use-cases. (There are other reasons for shrinking the ring > > size like cache usage, but that is unrelated to this). > > > > > >> BQL is not helping with that? > > > > Exactly. But the BQL _byte_ limit is not what is needed, what we need > > to know is the _packets_ currently "in-flight". Which Tom already has > > a patch for :-) Once we have that the algorithm is simple. > > > > Qdisc dequeue looks at BQL pkts-in-flight; if the driver has "enough" > > packets in-flight, the qdisc starts its bulk-dequeue building phase, > > before calling the driver. The allowed max qdisc bulk size should > > likely be related to pkts-in-flight. > > > Sorry, I'm still missing it. The point of BQL is that we minimize the > amount of data (and hence number of packets) that needs to be queued > in the device in order to prevent the link from going idle while there > are outstanding packets to be sent. The algorithm is based on counting > bytes not packets because bytes are roughly an equal cost unit of > work. So if we've queued 100K of bytes on the queue we know how long > that takes: around 80 usecs @10G, but if we count packets then we > really don't know much about that. 100 packets enqueued could > represent 6400 bytes or 6400K worth of data so time to transmit is > anywhere from 5usecs to 5msecs > > Shouldn't qdisc bulk size be based on the BQL limit? What is the > simple algorithm to apply to in-flight packets? Maybe the algorithm is not so simple, and we likely also have to take BQL bytes into account. The reason for wanting packets-in-flight is because we are attacking a transaction cost.
The tailptr/doorbell costs around 70 ns. (Based on data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 = 70.74). The 10G wirespeed small-packet budget is 67.2 ns; thus, with a fixed overhead of 70 ns per packet we can never reach 10G wirespeed. The idea/algo is trying to predict the future. If we see a given/high packet rate, which equals a high transaction cost, then let's try not calling the driver, and instead backlog the packet in the qdisc, speculatively hoping the current rate continues. This will in effect allow bulking and amortize the 70 ns transaction cost over N packets. Instead of tracking a rate of packets or doorbells per sec, I will let BQL's packets-in-flight tell me when the driver sees a rate high enough that the driver's DMA-TX completion considers several packets to be in-flight. When that happens, I will bet that I can stop sending packets to the device, and instead queue them in the qdisc layer. If I'm unlucky an
Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
On Thu, 8 Sep 2016 12:31:47 +0300 Or Gerlitz wrote: > On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer > wrote: > > On Wed, 7 Sep 2016 23:55:42 +0300 > > Or Gerlitz wrote: > > > >> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed > >> wrote: > >> > From: Rana Shahout > >> > > >> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver. > >> > > >> > When XDP is on we make sure to change channels RQs type to > >> > MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to > >> > ensure "page per packet". > >> > > >> > On XDP set, we fail if HW LRO is set and request from user to turn it > >> > off. Since on ConnectX4-LX HW LRO is always on by default, this will be > >> > annoying, but we prefer not to enforce LRO off from XDP set function. > >> > > >> > Full channels reset (close/open) is required only when setting XDP > >> > on/off. > >> > > >> > When XDP set is called just to exchange programs, we will update > >> > each RQ xdp program on the fly and for synchronization with current > >> > data path RX activity of that RQ, we temporarily disable that RQ and > >> > ensure RX path is not running, quickly update and re-enable that RQ, > >> > for that we do: > >> > - rq.state = disabled > >> > - napi_synchronize > >> > - xchg(rq->xdp_prg) > >> > - rq.state = enabled > >> > - napi_schedule // Just in case we've missed an IRQ > >> > > >> > Packet rate performance testing was done with pktgen 64B packets on the > >> > TX side, and TC drop action on the RX side compared to XDP fast drop. > >> > > >> > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz > >> > > >> > Comparison is done between: > >> > 1. Baseline, Before this patch with TC drop action > >> > 2. This patch with TC drop action > >> > 3. This patch with XDP RX fast drop > >> > > >> > Streams | Baseline (TC drop) | TC drop | XDP fast drop > >> > -- > >> > 1 | 5.51Mpps | 5.14Mpps | 13.5Mpps > >> > >> This (13.5 M PPS) is less than 50% of the result we presented @ the > >> XDP summit which was obtained by Rana.
Please see if/how much does > >> this grows if you use more sender threads, but all of them to xmit the > >> same stream/flows, so we're on one ring. That (XDP with single RX ring > >> getting packets from N remote TX rings) would be your canonical > >> base-line for any further numbers. > > > > Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show > > that you should be able to reach 23Mpps on a single CPU. This is > > a XDP-drop-simulation with order-0 pages being recycled through my > > page_pool code, plus avoiding the cache-misses (notice you are using a > > CPU E5-2680 with DDIO, thus you should only see a L3 cache miss). > > so this takes up from 13M to 23M, good. Notice the 23Mpps was crude hack test to determine the maximum achievable performance. This is our performance target, once we get _close_ to that then we are happy, and stop optimizing. > Could you explain why the move from order-3 to order-0 is hurting the > performance so much (drop from 32M to 23M), any way we can overcome that? It is all going to be in the details. When reaching these numbers be careful, thinking wow 23M to 32M sounds like a huge deal... but the performance difference in nanosec is actually not that large, it is only around 12ns more we have to save. (1/(23*10^6)-1/(32*10^6))*10^9 = 12.22 > > The 23Mpps number looks like some HW limitation, as the increase was > > not HW, I think. As I said, Rana got 32M with striding RQ when she was > using order-3 (or did we use order-5?) It was order-5. We likely need some HW tuning parameter (like with mlx4) if you want to go past the 23Mpps mark. > > is not proportional to page-allocator overhead I removed (and CPU freq > > starts to decrease). I also did scaling tests to more CPUs, which > > showed it scaled up to 40Mpps (you reported 45M). And at the Phy RX > > level I see 60Mpps (50G max is 74Mpps). 
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
[iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
I'm sorry but I have a problem with this patch! Looking at this patch, I want to bring up a fundamental architectural concern with the development direction of XDP transmit. What you are trying to implement, with delaying the doorbell, is basically TX bulking for TX_XDP. Why not implement a TX bulking interface directly instead?!? Yes, the tailptr/doorbell is the most costly operation, but why not also take advantage of the benefits of bulking for other parts of the code? (the benefit is smaller, but every cycle counts in this area) This whole XDP exercise is about avoiding having a transaction cost per packet, that reads "bulking" or "bundling" of packets, where possible. Let's do bundling/bulking from the start! The reason behind the xmit_more API is that we could not change the API of all the drivers. And we found that calling an explicit NDO flush came at a cost (only approx 7 ns IIRC), but it is still a cost that would hit the common single packet use-case. It should be really easy to build a bundle of packets that need XDP_TX action, especially given you only have a single destination "port". And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record. In the future, XDP needs to support XDP_FWD forwarding of packets/pages out other interfaces. I also want bulk transmit from day-1 here. It is slightly more tricky to sort packets for multiple outgoing interfaces efficiently in the poll loop. But the mSwitch[1] article actually already solved this destination sorting. Please read[1] section 3.3 "Switch Fabric Algorithm" for understanding the next steps, for a smarter data structure, when starting to have more TX "ports". And perhaps align your single XDP_TX destination data structure to this future development. [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf --Jesper (top post) On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed wrote: > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> > Here we introduce a xmit more like mechanism that will queue up more > than one packet into SQ (up to RX napi budget) w/o notifying the hardware. > > Once RX napi budget is consumed and we exit napi RX loop, we will > flush (doorbell) all XDP looped packets in case there are such. > > XDP forward packet rate: > > Comparing XDP with and w/o xmit more (bulk transmit): > > Streams XDP TX XDP TX (xmit more) > --- > 1 4.90Mpps 7.50Mpps > 2 9.50Mpps 14.8Mpps > 4 16.5Mpps 25.1Mpps > 8 21.5Mpps 27.5Mpps* > 16 24.1Mpps 27.5Mpps* > > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams, > we will be working on the analysis and will publish the conclusions > later. > > Signed-off-by: Saeed Mahameed > --- > drivers/net/ethernet/mellanox/mlx5/core/en.h| 9 ++-- > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 > +++-- > 2 files changed, 49 insertions(+), 17 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h > b/drivers/net/ethernet/mellanox/mlx5/core/en.h > index df2c9e0..6846208 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h > @@ -265,7 +265,8 @@ struct mlx5e_cq { > > struct mlx5e_rq; > typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq, > -struct mlx5_cqe64 *cqe); > +struct mlx5_cqe64 *cqe, > +bool *xdp_doorbell); > typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe > *wqe, > u16 ix); > > @@ -742,8 +743,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq); > > void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info, > bool recycle); > -void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe); > -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe); > +void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe, > + bool *xdp_doorbell); > +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe, > +bool *xdp_doorbell); > bool mlx5e_post_rx_wqes(struct mlx5e_rq 
*rq); > int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 > ix); > int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, > u16 ix); > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > index 912a0e2..ed93251 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > @@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct > mlx5e_rq *rq, > static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq, >
Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
On Wed, 7 Sep 2016 23:55:42 +0300 Or Gerlitz wrote: > On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed wrote: > > From: Rana Shahout > > > > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver. > > > > When XDP is on we make sure to change channels RQs type to > > MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to > > ensure "page per packet". > > > > On XDP set, we fail if HW LRO is set and request from user to turn it > > off. Since on ConnectX4-LX HW LRO is always on by default, this will be > > annoying, but we prefer not to enforce LRO off from XDP set function. > > > > Full channels reset (close/open) is required only when setting XDP > > on/off. > > > > When XDP set is called just to exchange programs, we will update > > each RQ xdp program on the fly and for synchronization with current > > data path RX activity of that RQ, we temporarily disable that RQ and > > ensure RX path is not running, quickly update and re-enable that RQ, > > for that we do: > > - rq.state = disabled > > - napi_synchronize > > - xchg(rq->xdp_prg) > > - rq.state = enabled > > - napi_schedule // Just in case we've missed an IRQ > > > > Packet rate performance testing was done with pktgen 64B packets on the > > TX side, and TC drop action on the RX side compared to XDP fast drop. > > > > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz > > > > Comparison is done between: > > 1. Baseline, Before this patch with TC drop action > > 2. This patch with TC drop action > > 3. This patch with XDP RX fast drop > > > > Streams | Baseline (TC drop) | TC drop | XDP fast drop > > -- > > 1 | 5.51Mpps | 5.14Mpps | 13.5Mpps > > This (13.5 M PPS) is less than 50% of the result we presented @ the > XDP summit which was obtained by Rana. Please see if/how much > this grows if you use more sender threads, but all of them to xmit the > same stream/flows, so we're on one ring. That (XDP with single RX ring > getting packets from N remote TX rings) would be your canonical > base-line for any further numbers.
Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show that you should be able to reach 23Mpps on a single CPU. This is an XDP-drop-simulation with order-0 pages being recycled through my page_pool code, plus avoiding the cache-misses (notice you are using a CPU E5-2680 with DDIO, thus you should only see a L3 cache miss). The 23Mpps number looks like some HW limitation, as the increase is not proportional to page-allocator overhead I removed (and CPU freq starts to decrease). I also did scaling tests to more CPUs, which showed it scaled up to 40Mpps (you reported 45M). And at the Phy RX level I see 60Mpps (50G max is 74Mpps). Notice this is a significant improvement over the mlx4/CX3-pro HW, as it only scales up to 20Mpps, but can also do 20Mpps XDP-drop on a single core. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
On Wed, 7 Sep 2016 20:07:01 +0300 Saeed Mahameed wrote: > On Wed, Sep 7, 2016 at 7:54 PM, Tom Herbert wrote: > > On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed > > wrote: > >> On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz wrote: > >>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed > >>> wrote: > >>> > Packet rate performance testing was done with pktgen 64B packets on the > TX side, and TC drop action on the RX side compared to XDP fast drop. > > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz > > Comparison is done between: > 1. Baseline, Before this patch with TC drop action > 2. This patch with TC drop action > 3. This patch with XDP RX fast drop > > Streams | Baseline (TC drop) | TC drop | XDP fast drop > -- > 1 | 5.51Mpps | 5.14Mpps | 13.5Mpps > 2 | 11.5Mpps | 10.0Mpps | 25.1Mpps > 4 | 16.3Mpps | 17.2Mpps | 35.4Mpps > 8 | 29.6Mpps | 28.2Mpps | 45.8Mpps* > 16 | 34.0Mpps | 30.1Mpps | 45.8Mpps* > >>> > >>> Rana, Guys, congrats!! > >>> > >>> When you say X streams, is each stream mapped by RSS to a different RX > >>> ring? > >>> or we're on the same RX ring for all rows of the above table? > >> > >> Yes, I will make this more clear in the actual submission, > >> Here we are talking about different RSS core rings. > >> > >>> > >>> In the CX3 work, we had X sender "streams" that all mapped to the same RX > >>> ring, > >>> I don't think we went beyond one RX ring. > >> > >> Here we did, the first row is what you are describing; the other rows > >> are the same test > >> with increasing the number of the RSS receiving cores, The xmit side is > >> sending > >> as many streams as possible to be as much uniformly spread as possible > >> across the > >> different RSS cores on the receiver. > >> > > Hi Saeed, > > > > Please report CPU utilization also. The expectation is that > > performance should scale linearly with increasing number of CPUs (i.e. > > pps/CPU_utilization should be constant). > > > > That was my expectation too.
Be careful with such expectations at these extreme speeds, because we are starting to hit PCI-express limitations and CPU cache-coherency limitations (if any atomic/RMW operations still exist per packet). Consider that in the small-packet (64 byte) case, the driver's PCI bandwidth need/overhead is actually quite large, as every descriptor is also 64 bytes transferred. > Anyway we will share more accurate results when we have them, with CPU > utilization statistics as well. It is interesting to monitor the CPU utilization, because (if C-states are enabled) you will likely see the CPU freq be reduced or even enter CPU idle states, in case your software (XDP) gets faster than the HW (PCI or NIC). I've seen that happen with mlx4/CX3-pro. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert wrote: > On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend > wrote: > > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: > >> > >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed > >> wrote: > >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet > >>> wrote: > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > > On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet > > wrote: > >> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: > >> [...] > > Only if a qdisc is present and pressure is high enough. > > But in a forwarding setup, we likely receive at a lower rate than the > NIC can transmit. > >> > >> Yes, I can confirm this happens in my experiments. > >> > > >>> > >>> Jesper has a similar idea to make the qdisc think it is under > >>> pressure, when the device TX ring is idle most of the time, i think > >>> his idea can come in handy here. I am not fully involved in the > >>> details, maybe he can elaborate more. > >>> > >>> But if it works, it will be transparent to napi, and xmit more will > >>> happen by design. > >> > >> Yes. I have some ideas around getting more bulking going from the qdisc > >> layer, by having the drivers provide some feedback to the qdisc layer > >> indicating xmit_more should be possible. This will be a topic at the > >> Network Performance Workshop[1] at NetDev 1.2, where I will hopefully > >> challenge people to come up with a good solution ;-) > >> > > > > One thing I've noticed but haven't yet actually analyzed much is if > > I shrink the nic descriptor ring size to only be slightly larger than > > the qdisc layer bulking size I get more bulking and better perf numbers. > > At least on microbenchmarks. The reason being the nic pushes back more > > on the qdisc. So maybe a case for making the ring size in the NIC some > > factor of the expected number of queues feeding the descriptor ring.
> > I've also played with shrinking the NIC descriptor ring size, it works, but it is an ugly hack to get the NIC to push back, and I foresee it will hurt normal use-cases. (There are other reasons for shrinking the ring size like cache usage, but that is unrelated to this). > BQL is not helping with that? Exactly. But the BQL _byte_ limit is not what is needed, what we need to know is the _packets_ currently "in-flight". Which Tom already has a patch for :-) Once we have that the algorithm is simple. Qdisc dequeue looks at BQL pkts-in-flight; if the driver has "enough" packets in-flight, the qdisc starts its bulk-dequeue building phase, before calling the driver. The allowed max qdisc bulk size should likely be related to pkts-in-flight. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
On Wed, 7 Sep 2016 15:42:25 +0300 Saeed Mahameed wrote: > For non-striding RQ configuration before this patch we had a ring > with pre-allocated SKBs and mapped the SKB->data buffers for > device. > > For robustness and better RX data buffers management, we allocate a > page per packet and build_skb around it. > > This patch (which is a prerequisite for XDP) will actually reduce > performance for normal stack usage, because we are now hitting a bottleneck > in the page allocator. A later patch of page reuse mechanism will be > needed to restore or even improve performance in comparison to the old > RX scheme. Yes, it is true that there is a performance reduction (for normal stack, not XDP) caused by hitting a bottleneck in the page allocator. I actually have a PoC implementation of my page_pool, that shows we regain the performance and then some. Based on an earlier version of this patch, where I hooked it into the mlx5 driver (50Gbit/s version). Your description might be a bit outdated, as this patch and the patch before do contain your own driver-local page-cache recycle facility. And you also show that you regain quite a lot of the lost performance. Your driver-local page_cache does have its limitations (see comments on other patch), as it depends on timely refcnt decreases by the users of the page. If they hold onto pages (like TCP) then your page-cache will not be efficient. > Packet rate performance testing was done with pktgen 64B packets on > xmit side and TC drop action on RX side.
I assume this is TC _ingress_ dropping, like [1] [1] https://github.com/netoptimizer/network-testing/blob/master/bin/tc_ingress_drop.sh > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz > > Comparison is done between: > 1. Baseline, before 'net/mlx5e: Build RX SKB on demand' > 2. Build SKB with RX page cache (this patch) >
> Streams  Baseline   Build SKB + page-cache  Improvement
> -------  ---------  ----------------------  -----------
> 1        4.33 Mpps  5.51 Mpps               27%
> 2        7.35 Mpps  11.5 Mpps               52%
> 4        14.0 Mpps  16.3 Mpps               16%
> 8        22.2 Mpps  29.6 Mpps               20%
> 16       24.8 Mpps  34.0 Mpps               17%
The improvements gained from using your page-cache are impressively high. Thanks for working on this, --Jesper > Signed-off-by: Saeed Mahameed > --- > drivers/net/ethernet/mellanox/mlx5/core/en.h | 10 +- > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 31 +++- > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 215 > +++--- > 3 files changed, 133 insertions(+), 123 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h > b/drivers/net/ethernet/mellanox/mlx5/core/en.h > index afbdf70..a346112 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h > @@ -65,6 +65,8 @@ > #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3 > #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6 > > +#define MLX5_RX_HEADROOM NET_SKB_PAD > + > #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */ > #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW > restriction */ > #define MLX5_MPWRQ_LOG_WQE_SZ 18 > @@ -302,10 +304,14 @@ struct mlx5e_page_cache { > struct mlx5e_rq { > /* data path */ > struct mlx5_wq_ll wq; > - u32 wqe_sz; > - struct sk_buff **skb; > + > + struct mlx5e_dma_info *dma_info; > struct mlx5e_mpw_info *wqe_info; > void *mtt_no_align; > + struct { > + u8 page_order; > + u32 wqe_sz; /* wqe data buffer size */ > + } buff; > __be32 mkey_be; > > struct device *pdev; > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > index 
c84702c..c9f1dea 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > @@ -411,6 +411,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c, > void *rqc = param->rqc; > void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq); > u32 byte_count; > + u32 frag_sz; > + int npages; > int wq_sz; > int err; > int i; > @@ -445,29 +447,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c, > > rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz); > rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides); > - rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides; > - byte_count = rq->wqe_sz; > + > + rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides; > + byte_count = rq->buff.wqe_sz; > rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key); >
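The buffer sizing this patch introduces (headroom + WQE data + room for the skb_shared_info tail, rounded up to a power-of-two number of pages) can be sketched in user space. The constants below are illustrative stand-ins, not the driver's actual values, and `rx_page_order` is an invented name:

```c
/* Sketch of per-packet RX buffer sizing, mirroring the frag_sz/npages
 * computation in this patch.  All constants are assumed stand-ins:
 * RX_HEADROOM for NET_SKB_PAD, SHINFO_SZ for
 * SKB_DATA_ALIGN(sizeof(struct skb_shared_info)). */
#define PAGE_SZ     4096
#define RX_HEADROOM   64
#define SHINFO_SZ    320

static unsigned int div_round_up(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

static unsigned int order_base_2(unsigned int n)
{
    unsigned int order = 0;

    while ((1u << order) < n)
        order++;
    return order;
}

/* Page order needed for one packet buffer of wqe_sz data bytes. */
static unsigned int rx_page_order(unsigned int wqe_sz)
{
    unsigned int frag_sz = RX_HEADROOM + wqe_sz + SHINFO_SZ;

    return order_base_2(div_round_up(frag_sz, PAGE_SZ));
}
```

With these assumed constants, an MTU-sized 1500-byte buffer still fits in a single order-0 page, which is the whole point of the page-per-packet scheme.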
Re: [iovisor-dev] [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
On Wed, 7 Sep 2016 15:42:22 +0300 Saeed Mahameed wrote: > From: Tariq Toukan > > To improve the memory consumption scheme, we omit the flow that > demands and splits high-order pages in Striding RQ, and stay > with a single Striding RQ flow that uses order-0 pages. Thank you for doing this! The MM-list people thank you! For others to understand what this means: this driver was doing split_page() on high-order pages (for Striding RQ). This was really bad, because it fragments the page allocator and quickly depletes the available high-order pages. (I've left the rest of the patch intact below, in case some MM people are interested in looking at the changes.) There is even a funny comment in split_page() relevant to this: /* [...] * Note: this is probably too low level an operation for use in drivers. * Please consult with lkml before using this in your driver. */ > Moving to fragmented memory allows the use of larger MPWQEs, > which reduces the number of UMR posts and filler CQEs. > > Moving to a single flow allows several optimizations that improve > performance, especially in production servers where we would > anyway fallback to order-0 allocations: > - inline functions that were called via function pointers. > - improve the UMR post process. > > This patch alone is expected to give a slight performance reduction. > However, the new memory scheme gives the possibility to use a page-cache > of a fair size, that doesn't inflate the memory footprint, which will > dramatically fix the reduction and even give a huge gain. 
> > We ran pktgen single-stream benchmarks, with iptables-raw-drop: > > Single stride, 64 bytes: > * 4,739,057 - baseline > * 4,749,550 - this patch > no reduction > > Larger packets, no page cross, 1024 bytes: > * 3,982,361 - baseline > * 3,845,682 - this patch > 3.5% reduction > > Larger packets, every 3rd packet crosses a page, 1500 bytes: > * 3,731,189 - baseline > * 3,579,414 - this patch > 4% reduction > Well, the reduction does not really matter that much, because your baseline benchmarks are from a freshly booted system, where you have not fragmented and depleted the high-order pages yet... ;-) > Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)") > Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi > packet WQE") > Signed-off-by: Tariq Toukan > Signed-off-by: Saeed Mahameed > --- > drivers/net/ethernet/mellanox/mlx5/core/en.h | 54 ++-- > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 136 -- > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 292 > - > drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 - > drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 2 +- > 5 files changed, 184 insertions(+), 304 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h > b/drivers/net/ethernet/mellanox/mlx5/core/en.h > index bf722aa..075cdfc 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h > @@ -62,12 +62,12 @@ > #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE 0xd > > #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW 0x1 > -#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x4 > +#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3 > #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6 > > #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */ > #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW > restriction */ > -#define MLX5_MPWRQ_LOG_WQE_SZ 17 > +#define MLX5_MPWRQ_LOG_WQE_SZ 18 > #define MLX5_MPWRQ_WQE_PAGE_ORDER (MLX5_MPWRQ_LOG_WQE_SZ 
- PAGE_SHIFT > 0 ? > \ > MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0) > #define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER) > @@ -293,8 +293,8 @@ struct mlx5e_rq { > u32wqe_sz; > struct sk_buff **skb; > struct mlx5e_mpw_info *wqe_info; > + void *mtt_no_align; > __be32 mkey_be; > - __be32 umr_mkey_be; > > struct device *pdev; > struct net_device *netdev; > @@ -323,32 +323,15 @@ struct mlx5e_rq { > > struct mlx5e_umr_dma_info { > __be64*mtt; > - __be64*mtt_no_align; > dma_addr_t mtt_addr; > - struct mlx5e_dma_info *dma_info; > + struct mlx5e_dma_info dma_info[MLX5_MPWRQ_PAGES_PER_WQE]; > + struct mlx5e_umr_wqe wqe; > }; > > struct mlx5e_mpw_info { > - union { > - struct mlx5e_dma_info dma_info; > - struct mlx5e_umr_dma_info umr; > - }; > + struct mlx5e_umr_dma_info umr; > u16 consumed_strides; > u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE]; > - > - void (*dma_pre_sync)(struct device *pdev, >
Re: [iovisor-dev] [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle
On Wed, 7 Sep 2016 15:42:24 +0300 Saeed Mahameed wrote: > From: Tariq Toukan > > Instead of reallocating and mapping pages for RX data-path, > recycle already used pages in a per ring cache. > > We ran pktgen single-stream benchmarks, with iptables-raw-drop: > > Single stride, 64 bytes: > * 4,739,057 - baseline > * 4,749,550 - order0 no cache > * 4,786,899 - order0 with cache > 1% gain > > Larger packets, no page cross, 1024 bytes: > * 3,982,361 - baseline > * 3,845,682 - order0 no cache > * 4,127,852 - order0 with cache > 3.7% gain > > Larger packets, every 3rd packet crosses a page, 1500 bytes: > * 3,731,189 - baseline > * 3,579,414 - order0 no cache > * 3,931,708 - order0 with cache > 5.4% gain > > Signed-off-by: Tariq Toukan > Signed-off-by: Saeed Mahameed > --- > drivers/net/ethernet/mellanox/mlx5/core/en.h | 16 ++ > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 15 ++ > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 57 > -- > drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++ > 4 files changed, 99 insertions(+), 5 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h > b/drivers/net/ethernet/mellanox/mlx5/core/en.h > index 075cdfc..afbdf70 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h > @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */ > u8 tired; > }; > > +/* a single cache unit is capable to serve one napi call (for non-striding > rq) > + * or a MPWQE (for striding rq). > + */ > +#define MLX5E_CACHE_UNIT (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \ > + MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT) > +#define MLX5E_CACHE_SIZE (2 * roundup_pow_of_two(MLX5E_CACHE_UNIT)) > +struct mlx5e_page_cache { > + u32 head; > + u32 tail; > + struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE]; > +}; > + [...] 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > index c1cb510..8e02af3 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq > *rq, u16 ix) > mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0); > } > > +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq, > + struct mlx5e_dma_info *dma_info) > +{ > + struct mlx5e_page_cache *cache = &rq->page_cache; > + u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1); > + > + if (tail_next == cache->head) { > + rq->stats.cache_full++; > + return false; > + } > + > + cache->page_cache[cache->tail] = *dma_info; > + cache->tail = tail_next; > + return true; > +} > + > +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq, > + struct mlx5e_dma_info *dma_info) > +{ > + struct mlx5e_page_cache *cache = &rq->page_cache; > + > + if (unlikely(cache->head == cache->tail)) { > + rq->stats.cache_empty++; > + return false; > + } > + > + if (page_ref_count(cache->page_cache[cache->head].page) != 1) { > + rq->stats.cache_busy++; > + return false; > + } Hmmm... doesn't this cause "blocking" of the page_cache recycle facility until the page at the head of the queue gets (page) refcnt decremented? Real use-case could fairly easily block/cause this... 
> + > + *dma_info = cache->page_cache[cache->head]; > + cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1); > + rq->stats.cache_reuse++; > + > + dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE, > +DMA_FROM_DEVICE); > + return true; > +} > + > static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq, > struct mlx5e_dma_info *dma_info) > { > - struct page *page = dev_alloc_page(); > + struct page *page; > + > + if (mlx5e_rx_cache_get(rq, dma_info)) > + return 0; > > + page = dev_alloc_page(); > if (unlikely(!page)) > return -ENOMEM; -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
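The put/get logic in the patch is a simple power-of-two ring, and a user-space model of it makes the head-of-queue blocking concern raised above easy to demonstrate: one busy page (refcount != 1) at the head blocks reuse of every page queued behind it. Here `struct page` is a stand-in with just a refcount, and the DMA mapping/sync is omitted:

```c
#include <stddef.h>

/* Simplified user-space model of the per-ring page cache in this patch. */
#define CACHE_SIZE 8            /* must be a power of two */

struct page { int refcount; };

struct page_cache {
    unsigned int head, tail;
    struct page *slot[CACHE_SIZE];
};

/* Mirrors mlx5e_rx_cache_put(): returns 0 when the cache is full. */
static int cache_put(struct page_cache *c, struct page *p)
{
    unsigned int tail_next = (c->tail + 1) & (CACHE_SIZE - 1);

    if (tail_next == c->head)
        return 0;               /* cache_full */
    c->slot[c->tail] = p;
    c->tail = tail_next;
    return 1;
}

/* Mirrors mlx5e_rx_cache_get(): only the HEAD slot is ever inspected,
 * so a busy head blocks the whole queue behind it. */
static struct page *cache_get(struct page_cache *c)
{
    struct page *p;

    if (c->head == c->tail)
        return NULL;            /* cache_empty */
    p = c->slot[c->head];
    if (p->refcount != 1)
        return NULL;            /* cache_busy */
    c->head = (c->head + 1) & (CACHE_SIZE - 1);
    return p;
}
```

Note how `cache_get()` returns NULL (cache_busy) even when an idle page sits right behind a busy head — exactly the scenario a TCP user holding page references would create.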
Re: [iovisor-dev] [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed wrote: > On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote: > > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet > >> wrote: > >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...] > > > > Only if a qdisc is present and pressure is high enough. > > > > But in a forwarding setup, we likely receive at a lower rate than the > > NIC can transmit. Yes, I can confirm this happens in my experiments. > > > > Jesper has a similar idea to make the qdisc think it is under > pressure, when the device TX ring is idle most of the time, i think > his idea can come in handy here. I am not fully involved in the > details, maybe he can elaborate more. > > But if it works, it will be transparent to napi, and xmit more will > happen by design. Yes. I have some ideas about getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating that xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, where I will hopefully challenge people to come up with a good solution ;-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer [1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop
Re: [iovisor-dev] XDP at netdev1.2 - "XDP port abstraction" proposal
On Mon, 22 Aug 2016 14:57:00 -0700 Tom Herbert via iovisor-dev wrote: > On Mon, Aug 22, 2016 at 2:44 PM, Thomas Graf wrote: > > 2016-08-09 17:47 GMT-04:00 Tom Herbert via iovisor-dev > > : > >> Hi, > >> > >> I requested a workshop at netdev (Oct. 5-7 in Tokyo) to present and > >> showcase XDP. The goal of this is to start providing the public > >> information about XDP and to start talking about real world use cases > >> and development. > >> > >> I am looking for presenters of various topics in XDP. The content > >> might be similar to what we had at the XDP summit, but the target audience > >> should be considered to be potential users and programmers of XDP as > >> opposed to developers working on the infrastructure. > > [...] > > So far we'd have: > - Ceth > - Cilium > - ILA router/load balancer > > We should also have an overview and introduction to XDP. So I'm > thinking 30 minutes for intro/overview then 1 hr. for the above (e.g. > 20 minutes for each). I have an infrastructure-related subject that I would like to discuss during the XDP workshop: - "XDP port abstraction" Currently XDP only supports forwarding packets (back) out the incoming interface, which is rather limiting. Instead of doing something Linux-specific like using the ifindex, I'm proposing to introduce what I call an XDP port abstraction. This is inspired by the mSwitch[1][2] paper (by Michio Honda). The basic idea: when loading an XDP program, it registers which port-abstraction table "instance" it is using (or creates a new instance), and, as a minimum, registers its own port, which also needs to be provided as meta-data ("incoming-port") to the XDP program. The port abstraction is a table lookup, and the XDP program simply returns an index into this table. The kernel handles forwarding between ports (the mSwitch "fabric"), based on the port "type" in the table. XDP implements the "logic" behind the forwarding decision, and several XDP programs can share a bpf-map to synchronize their decisions/state. 
(See the mSwitch data-plane split between "fabric" and "logic".) The port "type" should be thought of as being as wide/abstract as possible, and it should be easy to introduce a new "type" of port. An obvious "type" would be a net_device, but the forwarding efficiency depends on what features or NDO functions the specific net_device supports. I can even imagine a socket being a port type. AFAIK Tom imagines a hardware-specific port type. Paper: [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf Slides: [2] http://www.slideshare.net/micchie/mswitch-paper-talk-at-sosr-2015 A side-note: the mSwitch paper also contains a novel approach of sorting and grouping incoming packets before forwarding. This is also something that is needed, but it is orthogonal to the port-abstraction table. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
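The fabric/logic split proposed above can be sketched as follows. Everything here is hypothetical — the types, names and dispatch are invented for illustration, not from any kernel patch:

```c
/* Hypothetical sketch of the proposed XDP port-abstraction table. */
enum port_type { PORT_NONE, PORT_NETDEV, PORT_SOCKET, PORT_HW };

struct xdp_port {
    enum port_type type;
    void *dest;                  /* e.g. a net_device or socket */
};

#define MAX_PORTS 16

struct port_table {
    struct xdp_port ports[MAX_PORTS];
};

/* "Fabric" step: the XDP "logic" returned an index into the table;
 * the kernel dispatches on the port type stored there.  Returns the
 * type acted on, or PORT_NONE for a bad/empty slot so the packet can
 * be dropped.  Real code would switch() on the type and transmit. */
static enum port_type fabric_forward(const struct port_table *t,
                                     unsigned int idx)
{
    if (idx >= MAX_PORTS)
        return PORT_NONE;
    return t->ports[idx].type;
}
```

The point of the indirection is that the XDP program only ever deals in table indices, while the kernel owns the per-type forwarding mechanics.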
Re: [iovisor-dev] XDP seeking input from NIC hardware vendors
On Tue, 26 Jul 2016 10:53:05 -0700 John Fastabend wrote: > On 16-07-26 09:08 AM, Tom Herbert wrote: > > On Tue, Jul 26, 2016 at 6:31 AM, Thomas Monjalon > > wrote: > >> Hi, > >> > >> About RX filtering, there is an ongoing effort in DPDK to write an API > >> which could leverage most of the hardware capabilities of any NICs: > >> https://rawgit.com/6WIND/rte_flow/master/rte_flow.html > >> http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352 > >> I understand that XDP does not target to support every hardware features, > >> though it may be an interesting approach to check. > >> > > Thomas, > > > > A major goal of XDP is to leverage and in fact encourage innovation in > > hardware features. But, we are asking that vendors design the APIs > > with the community in mind. For instance, if XDP supports crypto > > offload it should have one API that different companies, we don't want > > every vendor coming up with their own. > > The work in those threads is to create a single API for users of DPDK > to interact with their hardware. The equivalent interface in the Linux > kernel is ntuple filters from ethtool; the effort here is to make a > usable interface to manage this from an application and also expose > all the hardware features. Ethtool does a fairly poor job on both > fronts IMO. Yes, the ethtool + ntuple approach is unfortunately not very easy to use :-( > If we evolve the mechanism to run per rx queue xdp programs this > interface could easily be used to forward packets to specific rx > queues and run targeted xdp programs. > > Integrating this functionality into running XDP programs as ebpf code > seems a bit challenging to me because there is no software equivalent. > Once the XDP ebpf program is running the pkt has already landed on the rx > queue. To me the mechanism to bind XDP programs to rx queues and steer > specific flows (e.g. 
match a flow label and forward to a queue) needs > to be part of the runtime environment not part of the main ebpf program > itself. I agree, based on the discussion in this thread. I will admit that my initial idea of adding this filter to the eBPF/XDP program was not such a good idea. > The runtime environment could use the above linked API. I know > we debated earlier including this in the ebpf program itself but that > really doesn't seem feasible to me. Whether the steering is expressed > as an ebpf program or an API like above seems like a reasonable > discussion. Perhaps a section could be used to describe the per program > filter for example which would be different from an API approach used > in the proposal above or the JIT could translate it into the above > API for devices without instruction based hardware. It seems like someone actually put some thought into this in the link you sent... quite interesting, thanks: https://rawgit.com/6WIND/rte_flow/master/rte_flow.html -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] XDP seeking input from NIC hardware vendors
On Tue, 12 Jul 2016 12:13:01 -0700 John Fastabend wrote: > On 16-07-11 07:24 PM, Alexei Starovoitov wrote: > > On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote: > >> On Fri, 8 Jul 2016 18:51:07 +0100 > >> Jakub Kicinski wrote: > >> > >>> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote: > The only distinction between VFs and queue groupings on my side is VFs > provide RSS where as queue groupings have to be selected explicitly. > In a programmable NIC world the distinction might be lost if a "RSS" > program can be loaded into the NIC to select queues but for existing > hardware the distinction is there. > >>> > >>> To do BPF RSS we need a way to select the queue which I think is all > >>> Jesper wanted. So we will have to tackle the queue selection at some > >>> point. The main obstacle with it for me is to define what queue > >>> selection means when program is not offloaded to HW... Implementing > >>> queue selection on HW side is trivial. > >> > >> Yes, I do see the problem of fallback, when the programs "filter" demux > >> cannot be offloaded to hardware. > >> > >> First I though it was a good idea to keep the "demux-filter" part of > >> the eBPF program, as software fallback can still apply this filter in > >> SW, and just mark the packets as not-zero-copy-safe. But when HW > >> offloading is not possible, then packets can be delivered every RX > >> queue, and SW would need to handle that, which hard to keep transparent. > >> > >> > If you demux using a eBPF program or via a filter model like > flow_director or cls_{u32|flower} I think we can support both. And this > just depends on the programmability of the hardware. Note flow_director > and cls_{u32|flower} steering to VFs is already in place. > >> > >> Maybe we should keep HW demuxing as a separate setup step. 
> >> > >> Today I can almost do what I want: by setting up ntuple filters, and (if > >> Alexei allows it) assign an application specific XDP eBPF program to a > >> specific RX queue. > >> > >> ethtool -K eth2 ntuple on > >> ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42 > >> > >> Then the XDP program can be attached to RX queue 42, and > >> promise/guarantee that it will consume all packet. And then the > >> backing page-pool can allow zero-copy RX (and enable scrubbing when > >> refilling pool). > > > > so such ntuple rule will send udp4 traffic for specific ip and port > > into a queue then it will somehow gets zero-copied to vm? > > . looks like a lot of other pieces about zero-copy and qemu need to be > > implemented (or at least architected) for this scheme to be conceivable > > . and when all that happens what vm is going to do with this very specific > > traffic? vm won't have any tcp or even ping? > > I have perhaps a different motivation to have queue steering in 'tc > cls-u32' and eventually xdp. The general idea is I have thousands of > queues and I can bind applications to the queues. When I know an > application is bound to a queue I can enable per queue busy polling (to > be implemented), set specific interrupt rates on the queue > (implementation will be posted soon), bind the queue to the correct > cpu, etc. +1 to binding applications to queues. This is basically what our customers are requesting. They have one or two applications that need DPDK speeds, but they don't like dedicating an entire NIC per application (like DPDK requires). The basic idea is actually more fundamental. It reminds me of Van Jacobson's netchannels[1] when he talks about "Channelize" (slides 24+). Creating a full "application" channel allows for a lock-free single-producer single-consumer (SPSC) queue directly into the application. 
[1] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf > ntuple works OK for this now but xdp provides more flexibility and > also lets us add additional policy on the queue other than simply > queue steering. > > I'm not convinced though that the demux queue selection should be part > of the XDP program itself just because it has no software analog to me > it sits in front of the set of XDP programs. But I think I could perhaps > be convinced it does if there is some reasonable way to do it. I guess > the single program method would result in an XDP program that read like > > if (rx_queue == x) >do_foo > if (rx_queue == y) >do_bar Yes, that is also why I wanted a XDP program per RX queue. But the "channelize" concept is more important. > A hardware jit may be able to sort that out. Or use per queue > sections. > > > > > the network virtualization traffic is typically encapsulated, > > so if xdp is used to do steer the traffic, the program would need > > to figure out vm id based on headers, strip tunnel, apply policy > > before forwarding the packet further. Clearly hw ntuple is not > > going to suffice. > > > > If there is no networking virtualization and V
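The "Channelize" concept mentioned above rests on a lock-free single-producer/single-consumer ring into the application. A minimal illustrative sketch follows — real cross-CPU use would need atomics and memory barriers on the indices, which are omitted here to keep the idea visible:

```c
#include <stddef.h>

/* Minimal SPSC ring sketch; indices are free-running and wrapped at
 * access time, so the full condition is simply prod - cons == size. */
#define RING_SIZE 8              /* must be a power of two */

struct spsc_ring {
    unsigned int prod, cons;     /* would be atomics in real code */
    void *slot[RING_SIZE];
};

static int spsc_enqueue(struct spsc_ring *r, void *pkt)
{
    if (r->prod - r->cons == RING_SIZE)
        return 0;                           /* full */
    r->slot[r->prod & (RING_SIZE - 1)] = pkt;
    r->prod++;                              /* store-release in real code */
    return 1;
}

static void *spsc_dequeue(struct spsc_ring *r)
{
    void *pkt;

    if (r->cons == r->prod)
        return NULL;                        /* empty */
    pkt = r->slot[r->cons & (RING_SIZE - 1)];
    r->cons++;                              /* store-release in real code */
    return pkt;
}
```

Because exactly one producer touches `prod` and one consumer touches `cons`, no lock is needed — which is why per-application queue steering makes this structure attractive.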
Re: [iovisor-dev] XDP seeking input from NIC hardware vendors
On Fri, 8 Jul 2016 18:51:07 +0100 Jakub Kicinski wrote: > On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote: > > The only distinction between VFs and queue groupings on my side is VFs > > provide RSS where as queue groupings have to be selected explicitly. > > In a programmable NIC world the distinction might be lost if a "RSS" > > program can be loaded into the NIC to select queues but for existing > > hardware the distinction is there. > > To do BPF RSS we need a way to select the queue which I think is all > Jesper wanted. So we will have to tackle the queue selection at some > point. The main obstacle with it for me is to define what queue > selection means when program is not offloaded to HW... Implementing > queue selection on HW side is trivial. Yes, I do see the problem of fallback, when the program's "filter" demux cannot be offloaded to hardware. First I thought it was a good idea to keep the "demux-filter" part of the eBPF program, as a software fallback can still apply this filter in SW, and just mark the packets as not-zero-copy-safe. But when HW offloading is not possible, then packets can be delivered to every RX queue, and SW would need to handle that, which is hard to keep transparent. > > If you demux using a eBPF program or via a filter model like > > flow_director or cls_{u32|flower} I think we can support both. And this > > just depends on the programmability of the hardware. Note flow_director > > and cls_{u32|flower} steering to VFs is already in place. Maybe we should keep HW demuxing as a separate setup step. Today I can almost do what I want: by setting up ntuple filters, and (if Alexei allows it) assigning an application-specific XDP eBPF program to a specific RX queue. ethtool -K eth2 ntuple on ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42 Then the XDP program can be attached to RX queue 42, and promise/guarantee that it will consume all packets. 
And then the backing page-pool can allow zero-copy RX (and enable scrubbing when refilling pool). > Yes, for steering to VFs we could potentially reuse a lot of existing > infrastructure. > > > The question I have is should the "filter" part of the eBPF program > > be a separate program from the XDP program and loaded using specific > > semantics (e.g. "load_hardware_demux" ndo op) at the risk of building > > a ever growing set of "ndo" ops. If you are running multiple XDP > > programs on the same NIC hardware then I think this actually makes > > sense otherwise how would the hardware and even software find the > > "demux" logic. In this model there is a "demux" program that selects > > a queue/VF and a program that runs on the netdev queues. > > I don't think we should enforce the separation here. What we may want > to do before forwarding to the VF can be much more complicated than > pure demux/filtering (simple eg - pop VLAN/tunnel). VF representative > model works well here as fallback - if program could not be offloaded > it will be run on the host and "trombone" packets via VFR into the VF. That is an interesting idea. > If we have a chain of BPF programs we can order them in increasing > level of complexity/features required and then HW could transparently > offload the first parts - the easier ones - leaving more complex > processing on the host. I'll try to keep out of the discussion of how to structure the BPF program, as it is outside my "area". > This should probably be paired with some sort of "skip-sw" flag to let > user space enforce the HW offload on the fast path part. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
Re: [iovisor-dev] XDP seeking input from NIC hardware vendors
On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski wrote: > On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote: > > > If the goal is to just separate XDP traffic from non-XDP traffic > > > you could accomplish this with a combination of SR-IOV/macvlan to > > > separate the device queues into multiple netdevs and then run XDP > > > on just one of the netdevs. Then use flow director (ethtool) or > > > 'tc cls_u32/flower' to steer traffic to the netdev. This is how > > > we support multiple networking stacks on one device by the way it > > > is called the bifurcated driver. Its not too far of a stretch to > > > think we could offload some simple XDP programs to program the > > > splitting of traffic instead of cls_u32/flower/flow_director and > > > then you would have a stack of XDP programs. One running in > > > hardware and a set running on the queues in software. > > > > > > the above sounds like much better approach then Jesper/mine > > prog_per_ring stuff. > > > > If we can split the nic via sriov and have dedicated netdev via VF > > just for XDP that's way cleaner approach. I guess we won't need to > > do xdp_rxqmask after all. > > +1 > > I was thinking about using eBPF to direct to NIC queues but concluded > that doing a redirect to a VF is cleaner. Especially if the PF driver > supports VF representatives we could potentially just use > bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by > the same stack. I actually disagree. I _do_ want to use the "filter" part of eBPF to direct packets to NIC queues, and then run a single/specific XDP program on that queue. Why do I want this? It is part of solving a very fundamental CS problem (early demux) when wanting to support zero-copy on RX. The basic problem is that the NIC driver needs to map RX pages into the RX ring prior to receiving packets. Thus, we need HW support to steer packets in order to gain enough isolation (e.g. between tenant domains) to allow zero-copy. Given the flexibility of the HW filter, the granularity achievable for isolation (e.g. application specific) is much more flexible than splitting up the entire NIC with SR-IOV, VFs or macvlans. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
[iovisor-dev] XDP seeking input from NIC hardware vendors
Would it make sense, from a hardware point of view, to split the XDP eBPF program into two stages? Stage-1: Filter (restricted eBPF / no helper calls) Stage-2: Program Then the HW can choose to offload the stage-1 "filter", and keep the likely more advanced stage-2 on the kernel side. Do HW vendors see a benefit in this approach? The generic problem I'm trying to solve is parsing. E.g. the first step in every XDP program will be to parse the packet data, in order to determine whether this is a packet the XDP program should process. Actions from the stage-1 "filter" program: - DROP (like XDP_DROP, early drop) - PASS (like XDP_PASS, normal netstack) - MATCH (call stage-2, likely carrying over an opaque return code) The MATCH action should likely carry over an opaque return code that makes sense for the stage-2 program, e.g. a proto id and/or data offset. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
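The proposed two-stage split can be sketched as follows. The action names mirror the proposal above, but the `stage1_result` plumbing and the example match rule (UDP port 53) are invented purely for illustration:

```c
/* Sketch of the proposed stage-1 "filter" contract. */
enum stage1_action { S1_DROP, S1_PASS, S1_MATCH };

struct stage1_result {
    enum stage1_action action;
    unsigned int cookie;   /* opaque carry-over: e.g. proto id or offset */
};

/* Example stage-1 filter: restricted, no helper calls.  Matches UDP
 * (proto 17) to port 53 and hands stage-2 the payload offset; passes
 * other recognized traffic to the normal netstack; drops the rest.
 * The proto/dport inputs stand in for fields parsed from packet data. */
static struct stage1_result stage1(unsigned int proto, unsigned int dport)
{
    struct stage1_result r = { S1_DROP, 0 };

    if (proto == 17 && dport == 53) {
        r.action = S1_MATCH;
        r.cookie = 14 + 20 + 8;  /* Ethernet + IPv4 + UDP header bytes */
    } else if (proto != 0) {
        r.action = S1_PASS;
    }
    return r;
}
```

Hardware offloading stage-1 then only needs to implement this small match/drop/pass vocabulary, while arbitrary stage-2 logic stays in the kernel.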
[iovisor-dev] XDP eBPF extending with an input parameter?
How easy is it to extend the XDP/eBPF program with a new input
parameter? I see the need for providing the input "source-port" to the
eBPF program, e.g. needed when implementing a basic learning bridge
(basically a map of ).

The concept of "ports" is not defined yet; thus, this change needs to
be part of a later patchset. Maybe this "source-port-id" can instead be
stored as part of the eBPF program structure? (As it is likely known at
registration time.)

Reminder: the XDP patchset I'm referring to is available here[1]:

[1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=xdp
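To show why the program needs the source port as an input, here is a
userspace C model of the learning-bridge forwarding table. In a real
XDP program this would be a BPF_MAP_TYPE_HASH updated with
bpf_map_update_elem(); the flat table, FDB_SIZE, and the fdb_* names
here are all made up purely to illustrate the data flow:

```c
#include <stdint.h>
#include <string.h>

#define FDB_SIZE 64

struct fdb_entry {
    uint8_t  mac[6];
    uint32_t port;
    int      used;
};

static struct fdb_entry fdb[FDB_SIZE];

/* Learn: remember which port a source MAC was last seen on.
 * This is the step that needs the "source-port" input parameter. */
static void fdb_learn(const uint8_t mac[6], uint32_t src_port)
{
    int free_slot = -1;

    for (int i = 0; i < FDB_SIZE; i++) {
        if (fdb[i].used && !memcmp(fdb[i].mac, mac, 6)) {
            fdb[i].port = src_port;     /* station may have moved */
            return;
        }
        if (!fdb[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        memcpy(fdb[free_slot].mac, mac, 6);
        fdb[free_slot].port = src_port;
        fdb[free_slot].used = 1;
    }
}

/* Lookup the destination MAC; -1 means unknown, i.e. flood. */
static int fdb_lookup(const uint8_t mac[6])
{
    for (int i = 0; i < FDB_SIZE; i++)
        if (fdb[i].used && !memcmp(fdb[i].mac, mac, 6))
            return (int)fdb[i].port;
    return -1;
}
```

Without the source-port input, fdb_learn() has nothing to store, which
is the gap the mail points out.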
Re: [iovisor-dev] Assertion fails at samples/bpf/test_maps
On Wed, 22 Jun 2016 08:59:14 -0700 Alexei Starovoitov wrote:

> On Wed, Jun 22, 2016 at 8:54 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 21 Jun 2016 18:54:53 -0700
> > Alexei Starovoitov wrote:
> >
> >> On Tue, Jun 21, 2016 at 11:48 AM, William Tu via iovisor-dev wrote:
> >> > Hi,
> >> >
> >> > I'm running test_maps under net-next/samples/bpf/, commit
> >> > 601009780635. The code logic all make sense but I got the assertion
> >> > errors / coredump for some unknown reason under different compiler
> >> > optimization flags (-O0 and -O2). The test_hashmap_sanity() works fine
> >> > but fails at test_percpu_hashmap_sanity().
> >> >
> >> > First run with normal build (which has -O2)
> >> > [root@vm-dev bpf]# ./test_maps
> >> > test_maps: samples/bpf/test_maps.c:137: test_percpu_hashmap_sanity:
> >> > Assertion `bpf_lookup_elem(map_fd, &key, value) == -1 &&
> >> > (*__errno_location ()) == 2' failed.
> > [...]
> >
> > Strange, I get:
> >
> > $ sudo ./test_maps
> > failed to create per-cpu arraymap 'Operation not permitted'
>
> yes. that's expected. Make sure to do 'ulimit -l unlimited'

 # ulimit -l unlimited
 # ./test_maps
 test_maps: OK

Then it works, thanks. Maybe we could provide some better feedback from
the test program, as this might be common/expected behavior.
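One way the test program could give better feedback is to raise
RLIMIT_MEMLOCK itself and warn when that fails. This is only a sketch
of the idea (the helper name is made up; it is not what test_maps did
at the time), relying on the fact that BPF map memory is accounted
against RLIMIT_MEMLOCK, so map creation hits EPERM under the small
default limit:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the locked-memory limit, the programmatic equivalent of
 * running 'ulimit -l unlimited' in the shell before the test. */
static int raise_memlock_limit(void)
{
    struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

    if (setrlimit(RLIMIT_MEMLOCK, &r) != 0) {
        /* Not fatal, but tell the user why BPF_MAP_CREATE may
         * soon fail with EPERM instead of leaving them guessing. */
        struct rlimit cur;
        if (getrlimit(RLIMIT_MEMLOCK, &cur) == 0)
            fprintf(stderr,
                "warning: RLIMIT_MEMLOCK is %llu bytes; run "
                "'ulimit -l unlimited' if map creation fails\n",
                (unsigned long long)cur.rlim_cur);
        return -1;
    }
    return 0;
}
```

Calling raise_memlock_limit() at the top of main() would turn the
cryptic 'Operation not permitted' into an actionable hint.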
Re: [iovisor-dev] Assertion fails at samples/bpf/test_maps
On Tue, 21 Jun 2016 18:54:53 -0700 Alexei Starovoitov wrote:

> On Tue, Jun 21, 2016 at 11:48 AM, William Tu via iovisor-dev wrote:
> > Hi,
> >
> > I'm running test_maps under net-next/samples/bpf/, commit
> > 601009780635. The code logic all make sense but I got the assertion
> > errors / coredump for some unknown reason under different compiler
> > optimization flags (-O0 and -O2). The test_hashmap_sanity() works fine
> > but fails at test_percpu_hashmap_sanity().
[...]

Strange, I get:

 $ sudo ./test_maps
 failed to create per-cpu arraymap 'Operation not permitted'

strace says:

 bpf(BPF_MAP_CREATE, {map_type=0x6 /* BPF_MAP_TYPE_??? */, key_size=4,
     value_size=8, max_entries=2}, 48) = -1 EPERM (Operation not permitted)

(map_type 0x6 is BPF_MAP_TYPE_PERCPU_ARRAY; this strace is too old to
know its name.)

> that's even more weird.
> May be some compiler issue.
> I've tested with gcc 5.2 and 6.1
[...]

Compile host: Fedora 22
 * LLVM version 3.7.1 (llc and clang same version)
 * gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC)

Ran on a Fedora 23 host, running Alexei's kernel tree on the "xdp"
branch, samples/bpf from the same git tree.

> > Daniel, Brenden, Jesper, did you ever see anything like it?

Not working for me either, actually... but with a different error.
Strange.
Re: [iovisor-dev] XDP summit agenda
On Tue, 21 Jun 2016 18:25:14 -0700 Tom Herbert via iovisor-dev wrote:

> Here is an agenda for the XDP summit. We expect this to be a lot of
> interactive conversation so the agenda can be dynamic as needed!
[...]
> 9:15-10:15 Implementation status
[...]
>  - page pool

Given that I'll not participate physically, I'll give my page-pool
status here.

The page-pool idea was presented at the MM summit in April 2016:
 http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

Since then I've written a more detailed design document (reviewed by
Tom and Alexei). I've not implemented any code for the page pool yet.

I've read up on, and discussed with Mel Gorman, the current page
allocator's per-CPU caching of order-0 pages. Mel is open to changes in
this area. We are looking at doing a "native" bulk interface between
the page pool and the page allocator, by aligning/sharing some of the
same data structures (I see this as the page-pool "backend").

I want to prototype-implement a page-pool frontend, in order to
(1) benchmark whether it is fast enough for the performance target, and
(2) figure out the API drivers need.
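To illustrate the recycling idea behind the page-pool frontend, here is
a minimal userspace C model. All names and the API shape are
hypothetical (the page_pool API that later landed upstream differs);
the point is only that the fast path recycles pages from a per-pool
freelist and the page allocator is touched only on refill:

```c
#include <stdlib.h>
#include <stddef.h>

#define POOL_PAGE_SZ 4096

struct pool_page { struct pool_page *next; };

struct page_pool {
    struct pool_page *free;    /* recycled pages, LIFO for cache warmth */
    unsigned long slow_allocs; /* fallback hits: candidates for bulking */
};

/* Fast path: pop a recycled page; slow path: fall back to the
 * allocator (modeled with malloc) and count the miss. */
static void *pool_alloc(struct page_pool *p)
{
    if (p->free) {
        struct pool_page *pg = p->free;
        p->free = pg->next;
        return pg;
    }
    p->slow_allocs++;
    return malloc(POOL_PAGE_SZ);
}

/* Return a page to the pool instead of freeing it, so the next
 * RX-ring refill avoids the allocator entirely. */
static void pool_put(struct page_pool *p, void *page)
{
    struct pool_page *pg = page;
    pg->next = p->free;
    p->free = pg;
}
```

The "native bulk" work mentioned above would replace the malloc()
fallback with a bulk transfer from the page allocator's per-CPU cache,
amortizing the slow path across many pages.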
Re: [iovisor-dev] XDP summit agenda
On Tue, 21 Jun 2016 18:25:14 -0700 Tom Herbert via iovisor-dev wrote:

> Here is an agenda for the XDP summit. We expect this to be a lot of
> interactive conversation so the agenda can be dynamic as needed!
>
> Thanks,
> Tom
>
> 8:30 Breakfast
> 9:00 Introductions, agenda, logistics, goals of the summit
>
> 9:15-10:15 Implementation status
>  - Mellanox, Intel development
>  - page pool
>  - driver status and issues
>  - performance numbers
>  - generalizing skbs
>  - open issues
> 10:15-10:45 Real world uses cases in development
>  - ILA router

Does this ILA router do multi-leg routing? If not, then I would add the
use case of "normal" routing.

>  - load balancer
>  - others?

Use-case: DDoS filter
 - This was the original use-case from CloudFlare at netdevconf[1]

Use-case: RAW packet capture
 - IDS systems like Suricata need this (af_packet is not fast enough)
 - Only supports "steal" mode

>  - Issues developers are hitting
> 10:45-11:00 Break
> 11:15-11:45 HW offload of BPF/XDP
>  - Netronome
>  - Barefoot switch HW
>  - Encryption, advance features
> 11:45-12:15 XDP/Ceth integration
>
> 12:15-1:30 Lunch
>
> 1:30-2:00 Cilium
> 2:00-3:00 APIs
>  - Meta data
>  - Return codes
>  - Portability but still allow using latest and greatest HW features
> 3:00-3:15 Break
> 3:15-4:00 Support for network virtualization
> 4:00-5:00 Ecosystem considerations
>  - Toolchain, debugging
>  - BPF programs repository management
> 5:00-6:00 Open discussion, planning, wrapup
>
> 6:30- Dinner at Rangoon Ruby, Palo Alto CA

[1] http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html