Re: [PATCH 3/5] qmp: Added the helper stamp check.

2023-03-01 Thread Toke Høiland-Jørgensen
Daniel P. Berrangé  writes:

> On Wed, Mar 01, 2023 at 03:53:47PM +0100, Toke Høiland-Jørgensen wrote:
>> Daniel P. Berrangé  writes:
>> 
>> > On Tue, Feb 28, 2023 at 11:21:56PM +0100, Toke Høiland-Jørgensen wrote:
>> >> Daniel P. Berrangé  writes:
>> >> 
>> >> > On Tue, Feb 28, 2023 at 08:01:51PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> Daniel P. Berrangé  writes:
>> >> >> 
>> >> >> Just to interject a note on this here: the skeleton code is mostly a
>> >> >> convenience feature used to embed BPF programs into the calling binary.
>> >> >> It is perfectly possible to just have the BPF object file itself reside
>> >> >> directly in the file system and just use the regular libbpf APIs to 
>> >> >> load
>> >> >> it. Some things get a bit more cumbersome (mostly setting values of
>> >> >> global variables, if the BPF program uses those).
>> >> >> 
>> >> >> So the JSON example above could just be a regular compiled-from-clang
>> >> >> BPF object file, and the management program can load that, inspect its
>> >> >> contents using the libbpf APIs and pass the file descriptors on to 
>> >> >> Qemu.
>> >> >> It's even possible to embed version information into this so that Qemu
>> >> >> can check if it understands the format and bail out if it doesn't - 
>> >> >> just
>> >> >> stick a version field in the configuration map as the first entry :)
>> >> >
>> >> > If all you have is the BPF object file is it possible to interrogate
>> >> > it to get a list of all the maps, and get FDs associated for them ?
>> >> > I had a look at the libbpf API and wasn't sure about that, it seemed
>> >> > like you had to know the required maps upfront ?  If it is possible
>> >> > to auto-discover everything you need, soley from the BPF object file
>> >> > as input, then just dealing with that in isolation would feel simpler.
>> >> 
>> >> It is. You load the object file, and bpf_object__for_each_map() lets you
>> >> discover which maps it contains, with the different bpf_map__*() APIs
>> >> telling you the properties of that map (and you can modify them too
>> >> before loading the object if needed).
>> >> 
>> >> The only thing that's not in the object file is any initial data you
>> >> want to put into the map(s). But except for read-only maps that can be
>> >> added by userspace after loading the maps, so you could just let Qemu do
>> >> that...
>> >> 
>> >> > It occurrs to me that exposing the BPF program as data rather than
>> >> > via binary will make more practical to integrate this into KubeVirt's
>> >> > architecture. In their deployment setup both QEMU and libvirt are
>> >> > running unprivileged inside a container. For any advanced nmetworking
>> >> > a completely separate component creates the TAP device and passes it
>> >> > into the container running QEMU. I don't think that the separate
>> >> > precisely matched helper binary would be something they can use, but
>> >> > it might be possible to expose a data file providing the BPF program
>> >> > blob and describing its maps.
>> >> 
>> >> Well, "a data file providing the BPF program blob and describing its
>> >> maps" is basically what a BPF .o file is. It just happens to be encoded
>> >> in ELF format :)
>> >> 
>> >> You can embed it into some other data structure and have libbpf load it
>> >> from a blob in memory as well as from the filesystem, though; that is
>> >> basically what the skeleton file does (notice the big character string
>> >> at the end, that's just the original .o file contents).
>> >
>> > Ok, in that case I'm really wondering why any of this helper program
>> > stuff was proposed. I recall the rationale was that it was impossible
>> > for an external program to load the BPF object on behalf of QEMU,
>> > because it would not know how todo that without QEMU specific
>> > knowledge.
>> 
>> I'm not sure either. Was there some bits that initially needed to be set
>> before the program was loaded (read-only maps or something)? Also,
>> upstream does encourage the use of

Re: [PATCH 3/5] qmp: Added the helper stamp check.

2023-03-01 Thread Toke Høiland-Jørgensen
Daniel P. Berrangé  writes:

> On Tue, Feb 28, 2023 at 11:21:56PM +0100, Toke Høiland-Jørgensen wrote:
>> Daniel P. Berrangé  writes:
>> 
>> > On Tue, Feb 28, 2023 at 08:01:51PM +0100, Toke Høiland-Jørgensen wrote:
>> >> Daniel P. Berrangé  writes:
>> >> 
>> >> Just to interject a note on this here: the skeleton code is mostly a
>> >> convenience feature used to embed BPF programs into the calling binary.
>> >> It is perfectly possible to just have the BPF object file itself reside
>> >> directly in the file system and just use the regular libbpf APIs to load
>> >> it. Some things get a bit more cumbersome (mostly setting values of
>> >> global variables, if the BPF program uses those).
>> >> 
>> >> So the JSON example above could just be a regular compiled-from-clang
>> >> BPF object file, and the management program can load that, inspect its
>> >> contents using the libbpf APIs and pass the file descriptors on to Qemu.
>> >> It's even possible to embed version information into this so that Qemu
>> >> can check if it understands the format and bail out if it doesn't - just
>> >> stick a version field in the configuration map as the first entry :)
>> >
>> > If all you have is the BPF object file is it possible to interrogate
>> > it to get a list of all the maps, and get FDs associated for them ?
>> > I had a look at the libbpf API and wasn't sure about that, it seemed
>> > like you had to know the required maps upfront ?  If it is possible
>> > to auto-discover everything you need, soley from the BPF object file
>> > as input, then just dealing with that in isolation would feel simpler.
>> 
>> It is. You load the object file, and bpf_object__for_each_map() lets you
>> discover which maps it contains, with the different bpf_map__*() APIs
>> telling you the properties of that map (and you can modify them too
>> before loading the object if needed).
>> 
>> The only thing that's not in the object file is any initial data you
>> want to put into the map(s). But except for read-only maps that can be
>> added by userspace after loading the maps, so you could just let Qemu do
>> that...
>> 
>> > It occurrs to me that exposing the BPF program as data rather than
>> > via binary will make more practical to integrate this into KubeVirt's
>> > architecture. In their deployment setup both QEMU and libvirt are
>> > running unprivileged inside a container. For any advanced nmetworking
>> > a completely separate component creates the TAP device and passes it
>> > into the container running QEMU. I don't think that the separate
>> > precisely matched helper binary would be something they can use, but
>> > it might be possible to expose a data file providing the BPF program
>> > blob and describing its maps.
>> 
>> Well, "a data file providing the BPF program blob and describing its
>> maps" is basically what a BPF .o file is. It just happens to be encoded
>> in ELF format :)
>> 
>> You can embed it into some other data structure and have libbpf load it
>> from a blob in memory as well as from the filesystem, though; that is
>> basically what the skeleton file does (notice the big character string
>> at the end, that's just the original .o file contents).
>
> Ok, in that case I'm really wondering why any of this helper program
> stuff was proposed. I recall the rationale was that it was impossible
> for an external program to load the BPF object on behalf of QEMU,
> because it would not know how todo that without QEMU specific
> knowledge.

I'm not sure either. Was there some bits that initially needed to be set
before the program was loaded (read-only maps or something)? Also,
upstream does encourage the use of skeletons for embedding into
applications, so it's not an unreasonable thing to start with if you
don't have the kind of deployment constraints that Qemu does in this
case.

> It looks like we can simply expose the BPF object blob to mgmt apps
> directly and get rid of this helper program entirely.

I believe so, yes. You'd still need to be sure that the BPF object file
itself comes from a trusted place, but hopefully it should be enough to
load it from a known filesystem path? (Sorry if this is a stupid
question, I only have a fuzzy idea of how all the pieces fit together
here).

-Toke




Re: [PATCH 3/5] qmp: Added the helper stamp check.

2023-02-28 Thread Toke Høiland-Jørgensen
Daniel P. Berrangé  writes:

> On Tue, Feb 28, 2023 at 08:01:51PM +0100, Toke Høiland-Jørgensen wrote:
>> Daniel P. Berrangé  writes:
>> 
>> Just to interject a note on this here: the skeleton code is mostly a
>> convenience feature used to embed BPF programs into the calling binary.
>> It is perfectly possible to just have the BPF object file itself reside
>> directly in the file system and just use the regular libbpf APIs to load
>> it. Some things get a bit more cumbersome (mostly setting values of
>> global variables, if the BPF program uses those).
>> 
>> So the JSON example above could just be a regular compiled-from-clang
>> BPF object file, and the management program can load that, inspect its
>> contents using the libbpf APIs and pass the file descriptors on to Qemu.
>> It's even possible to embed version information into this so that Qemu
>> can check if it understands the format and bail out if it doesn't - just
>> stick a version field in the configuration map as the first entry :)
>
> If all you have is the BPF object file is it possible to interrogate
> it to get a list of all the maps, and get FDs associated for them ?
> I had a look at the libbpf API and wasn't sure about that, it seemed
> like you had to know the required maps upfront ?  If it is possible
> to auto-discover everything you need, soley from the BPF object file
> as input, then just dealing with that in isolation would feel simpler.

It is. You load the object file, and bpf_object__for_each_map() lets you
discover which maps it contains, with the different bpf_map__*() APIs
telling you the properties of that map (and you can modify them too
before loading the object if needed).

The only thing that's not in the object file is any initial data you
want to put into the map(s). But except for read-only maps that can be
added by userspace after loading the maps, so you could just let Qemu do
that...

> It occurrs to me that exposing the BPF program as data rather than
> via binary will make more practical to integrate this into KubeVirt's
> architecture. In their deployment setup both QEMU and libvirt are
> running unprivileged inside a container. For any advanced nmetworking
> a completely separate component creates the TAP device and passes it
> into the container running QEMU. I don't think that the separate
> precisely matched helper binary would be something they can use, but
> it might be possible to expose a data file providing the BPF program
> blob and describing its maps.

Well, "a data file providing the BPF program blob and describing its
maps" is basically what a BPF .o file is. It just happens to be encoded
in ELF format :)

You can embed it into some other data structure and have libbpf load it
from a blob in memory as well as from the filesystem, though; that is
basically what the skeleton file does (notice the big character string
at the end, that's just the original .o file contents).

-Toke




Re: [PATCH 3/5] qmp: Added the helper stamp check.

2023-02-28 Thread Toke Høiland-Jørgensen
Daniel P. Berrangé  writes:

> On Tue, Feb 28, 2023 at 11:56:27AM +0200, Yuri Benditovich wrote:
>> On Mon, Feb 20, 2023 at 11:50 AM Daniel P. Berrangé 
>> wrote:
>> 
>> > On Sun, Feb 19, 2023 at 06:20:58PM +0200, Andrew Melnychenko wrote:
>> > > Added a function to check the stamp in the helper.
>> > > eBPF helper should have a special symbol that generates during the build.
>> > > QEMU checks the helper and determines that it fits, so the helper
>> > > will produce proper output.
>> >
>> > I think this is quite limiting for in place upgrades.
>> >
>> > Consider this scenario
>> >
>> >  * Host has QEMU 8.1.0 installed
>> >  * VM is running QEMU 8.1.0
>> >  * QEMU 8.1.1 is released with a bug fix in the EBF program
>> >  * Host is upgraded to QEMU 8.1.1
>> >  * User attempts to hotplug a NIC to the running VM
>> >
>> > IIUC this last step is going to fail because we'll be loading
>> > the EBF program from 8.1.1 and so its hash is different from
>> > that expected by the QEMU 8.1.0 that the pre-existing VM is
>> > running.
>> >
>> >   Indeed we did not take in account the in-place upgrade.
>> 
>> 
>> 
>> > If some changes to the EBF program are not going to be back
>> > compatible from the POV of the QEMU process, should we instead
>> > be versioning the EBF program. eg so new QEMU will ship both
>> > the old and new versions of the EBF program.
>> 
>> This does not seem to be an elegant option: QEMU theoretically can include
>> different eBPF programs but it hardly can interface with each one of them.
>> The code of QEMU (access to eBPF maps etc) includes header files which eBPF
>> of the day is being built with them.
>> 
>> I see 2 options to address this issue (of course there are more)
>> 1. Build and install qemu-rss-helper- executable. Libvirt will always
>> have a correct name, so for the running instance it will use
>> qemu-rss-helper-, for the new instance it will use
>> qemu-rss-helper-
>
> We'll get an ever growing number of program variants we need to
> build & distribute with each new QEMU release.
>
>> 2. Build the helper executable and link it inside qemu as a blob. Libvirt
>> will always retrieve the executable to the temporary file name and use it.
>> So the retrieved helper will always be compatible with QEMU. I'm not sure
>> what is the most portable way to do that.
>
> QEMU is considered an untrusted process, so there's no way we're going
> to ask it to give us an ELF binary and then execute that in privileged
> context.
>
>> Does one of these seem suitable?
>
> Neither feels very appealing to me.
>
> I've been trying to understand the eBPF code we're dealing with in a
> little more detail.
>
> IIUC, QEMU, or rather the virtio-net  driver needs to receive one FD
> for the BPF program, and one or more FDs for the BPF maps that the
> program uses. Currently it uses 3 maps, so needs 3 map FDs on top of
> the program FD.
>
> The helper program that is proposed here calls ebpf_rss_load() to
> load the program and get back a struct which gives access to the
> 4 FDs, which are then sent to the mgmt app, which forwards them
> onto QEMU.
>
> The ebpf_rss_load() method is making use of various structs that
> are specific to the RSS program implementation, but does not seems
> to do anything especially interesting.  It calls into rss_bpf__open()
> which eventually gets around to calling rss_bpf__create_skeleton
> which is where the interesting stuff happens.
>
> This rss_bpf__create_skeleton() method is implemented in terms of
> totally generic libbpf APIs, and has the actual blob that is the
> BPF program.
>
> Looking at what this does, I feel it should be trivial for a mgmt
> app to implement equivalent logic to rss_bpf__create_skeleton in a
> generic manner, if we could just expose the program blob and the
> map names to the mgmt app. eg a simple json file
>
>   {
>  "maps": [
> "tap_rss_map_configurations",
>   "tap_rss_map_indirection_table",
>   "tap_rss_map_toeplitz_key",
>  ],
>  "program": "the big blob encoded in base64..."
>   }
>
> if we installed that file are /usr/share/qemu/bpf/net-rss.json
> then when a QEMU process is started, the mgmt app capture the
> data in this JSON file. It now has enough info to create the
> EBPF programs needed and pass the FDs over to QEMU. This would
> be robust against QEMU software upgrades, and not tied to the
> specific EBPF program imlp. We can add or remove maps / change
> their names etc any time, as the details in the JSON descriptor
> can be updated.  This avoids need for any special helper program
> to be provided by QEMU with the problems that is throwing up
> for us.

Just to interject a note on this here: the skeleton code is mostly a
convenience feature used to embed BPF programs into the calling binary.
It is perfectly possible to just have the BPF object file itself reside
directly in the file system and just use the regular libbpf APIs to load
it. Some things get a bit more cumbersome (mostly setting values

Re: [PATCH 3/5] qmp: Added the helper stamp check.

2023-02-27 Thread Toke Høiland-Jørgensen
Andrew Melnichenko  writes:

> Hi all,
>
> On Mon, Feb 20, 2023 at 11:50 AM Daniel P. Berrangé  
> wrote:
>>
>> On Sun, Feb 19, 2023 at 06:20:58PM +0200, Andrew Melnychenko wrote:
>> > Added a function to check the stamp in the helper.
>> > eBPF helper should have a special symbol that generates during the build.
>> > QEMU checks the helper and determines that it fits, so the helper
>> > will produce proper output.
>>
>> I think this is quite limiting for in place upgrades.
>>
>> Consider this scenario
>>
>>  * Host has QEMU 8.1.0 installed
>>  * VM is running QEMU 8.1.0
>>  * QEMU 8.1.1 is released with a bug fix in the EBF program
>>  * Host is upgraded to QEMU 8.1.1
>>  * User attempts to hotplug a NIC to the running VM
>>
>> IIUC this last step is going to fail because we'll be loading
>> the EBF program from 8.1.1 and so its hash is different from
>> that expected by the QEMU 8.1.0 that the pre-existing VM is
>> running.
>>
>> If some changes to the EBF program are not going to be back
>> compatible from the POV of the QEMU process, should we instead
>> be versioning the EBF program. eg so new QEMU will ship both
>> the old and new versions of the EBF program.
>>
>
> I think it's too complicated to maintain backward compatibility with
> eBPF programs in "one package".
> As we can see, the eBPF skeleton may be changed not only by bugfix but
> with changes in libbpf(g.e. libbpf 1.0.1+).

Hmm, what change in libbpf 1.0.1 affects the skeleton format?

> This may lead to issues when with a newer environment you can't
> consistently recreate the "old" skeleton.

This should be detectable, though? As in, if you know that a new version
breaks compatibility you just bump the version number regardless of the
underlying application version?

-Toke




Re: [RFC PATCH 0/5] ebpf: Added ebpf helper for libvirtd.

2021-06-22 Thread Toke Høiland-Jørgensen
Andrew Melnichenko  writes:

> Hi,
> Thank you for your comments.
> I'll play with array type mmap. And later will provide some solution.

Cool - you're welcome! :)

-Toke




Re: [RFC PATCH 0/5] ebpf: Added ebpf helper for libvirtd.

2021-06-22 Thread Toke Høiland-Jørgensen
Daniel P. Berrangé  writes:

> On Tue, Jun 22, 2021 at 10:25:19AM +0200, Toke Høiland-Jørgensen wrote:
>> Jason Wang  writes:
>> 
>> > 在 2021/6/22 上午11:29, Yuri Benditovich 写道:
>> >> On Mon, Jun 21, 2021 at 12:20 PM Jason Wang  wrote:
>> >>>
>> >>> 在 2021/6/19 上午4:03, Andrew Melnichenko 写道:
>> >>>> Hi Jason,
>> >>>> I've checked "kernel.unprivileged_bpf_disabled=0" on Fedora,  Ubuntu,
>> >>>> and Debian - no need permissions to update BPF maps.
>> >>>
>> >>> How about RHEL :) ?
>> >> If I'm not mistaken, the RHEL releases do not use modern kernels yet
>> >> (for BPF we need 5.8+).
>> >> So this will be (probably) relevant for RHEL 9. Please correct me if I'm 
>> >> wrong.
>> >
>> > Adding Toke for more ideas on this.
>> 
>> Ignore the kernel version number; we backport all of BPF to RHEL,
>> basically. RHEL8.4 is up to upstream kernel 5.10, feature-wise.
>> 
>> However, we completely disable unprivileged BPF on RHEL kernels. Also,
>> there's upstream commit:
>> 08389d888287 ("bpf: Add kconfig knob for disabling unpriv bpf by default")
>> 
>> which adds a new value of '2' to the unprivileged_bpf_disable sysctl. I
>> believe this may end up being the default on Fedora as well.
>> 
>> So any design relying on unprivileged BPF is likely to break; I'd
>> suggest you look into how you can get this to work with CAP_BPF :)
>
> QEMU will never have any capabilities. Any resources that required
> privileges have to be opened by a separate privileged helper, and the
> open FD then passed across to the QEMU process. This relies on the
> capabilities checks only being performed at time of initial opening,
> and *not* on operations performed on the already open FD.

That won't work for regular map updates either, unfortunately: you still
have to perform a bpf() syscall to update an element, and that is a
privileged operation.

You may be able to get around this by using an array map type and
mmap()'ing the map contents, but I'm not sure how well that will work
across process boundaries.

If it doesn't, I only see two possibilities: populate the map
ahead-of-time and leave it in place, or keep the privileged helper
process around to perform map updates on behalf of QEMU...

-Toke




Re: [RFC PATCH 0/5] ebpf: Added ebpf helper for libvirtd.

2021-06-22 Thread Toke Høiland-Jørgensen
Jason Wang  writes:

> 在 2021/6/22 上午11:29, Yuri Benditovich 写道:
>> On Mon, Jun 21, 2021 at 12:20 PM Jason Wang  wrote:
>>>
>>> 在 2021/6/19 上午4:03, Andrew Melnichenko 写道:
 Hi Jason,
 I've checked "kernel.unprivileged_bpf_disabled=0" on Fedora,  Ubuntu,
 and Debian - no need permissions to update BPF maps.
>>>
>>> How about RHEL :) ?
>> If I'm not mistaken, the RHEL releases do not use modern kernels yet
>> (for BPF we need 5.8+).
>> So this will be (probably) relevant for RHEL 9. Please correct me if I'm 
>> wrong.
>
> Adding Toke for more ideas on this.

Ignore the kernel version number; we backport all of BPF to RHEL,
basically. RHEL8.4 is up to upstream kernel 5.10, feature-wise.

However, we completely disable unprivileged BPF on RHEL kernels. Also,
there's upstream commit:
08389d888287 ("bpf: Add kconfig knob for disabling unpriv bpf by default")

which adds a new value of '2' to the unprivileged_bpf_disable sysctl. I
believe this may end up being the default on Fedora as well.

So any design relying on unprivileged BPF is likely to break; I'd
suggest you look into how you can get this to work with CAP_BPF :)

-Toke




Re: [RFC PATCH v4 3/7] ebpf: Added eBPF RSS program.

2021-02-16 Thread Toke Høiland-Jørgensen
> From: Andrew 
> 
> RSS program and Makefile to build it.
> The bpftool used to generate '.h' file.
> The data in that file may be loaded by libbpf.
> EBPF compilation is not required for building qemu.
> You can use Makefile if you need to regenerate rss.bpf.skeleton.h.
> 
> Signed-off-by: Yuri Benditovich 
> Signed-off-by: Andrew Melnychenko 

A few comments on the BPF implementation:

> ---
>  tools/ebpf/Makefile.ebpf |  33 +++
>  tools/ebpf/rss.bpf.c | 505 +++
>  2 files changed, 538 insertions(+)
>  create mode 100755 tools/ebpf/Makefile.ebpf
>  create mode 100644 tools/ebpf/rss.bpf.c
> 
> diff --git a/tools/ebpf/Makefile.ebpf b/tools/ebpf/Makefile.ebpf
> new file mode 100755
> index 00..d32d1680b8
> --- /dev/null
> +++ b/tools/ebpf/Makefile.ebpf
> @@ -0,0 +1,33 @@
> +OBJS = rss.bpf.o
> +
> +LLC ?= llc
> +CLANG ?= clang
> +INC_FLAGS = `$(CLANG) -print-file-name=include`
> +EXTRA_CFLAGS ?= -O2 -emit-llvm -fno-stack-protector
> +
> +ifdef linuxhdrs
> +LINUXINCLUDE =  -I $(linuxhdrs)/arch/x86/include/uapi \
> +-I $(linuxhdrs)/arch/x86/include/generated/uapi \
> +-I $(linuxhdrs)/arch/x86/include/generated \
> +-I $(linuxhdrs)/include/generated/uapi \
> +-I $(linuxhdrs)/include/uapi \
> +-I $(linuxhdrs)/include \
> +-I $(linuxhdrs)/tools/lib
> +INC_FLAGS += -nostdinc -isystem
> +endif

It should be possible to set things up so you don't need a full set of
kernel headers to build stuff. What we usually do for BPF projects is to
just include the headers we need in a separate directory, clearly marked
as originating from the kernel tree. That way you don't incur a
dependency on the full kernel-headers, and users won't run into issues
where their kernel headers are too old. See an example here:

https://github.com/xdp-project/bpf-examples/tree/master/headers

> +
> +all: $(OBJS)
> +
> +.PHONY: clean
> +
> +clean:
> +   rm -f $(OBJS)
> +
> +$(OBJS):  %.o:%.c
> +   $(CLANG) $(INC_FLAGS) \
> +-D__KERNEL__ -D__ASM_SYSREG_H \
> +-I../include $(LINUXINCLUDE) \
> +$(EXTRA_CFLAGS) -c $< -o -| $(LLC) -march=bpf -filetype=obj 
> -o 
> $@
> +   bpftool gen skeleton rss.bpf.o > rss.bpf.skeleton.h
> +   cp rss.bpf.skeleton.h ../../ebpf/
> +
> diff --git a/tools/ebpf/rss.bpf.c b/tools/ebpf/rss.bpf.c
> new file mode 100644
> index 00..eb377247fc
> --- /dev/null
> +++ b/tools/ebpf/rss.bpf.c
> @@ -0,0 +1,505 @@
> +/*
> + * eBPF RSS program
> + *
> + * Developed by Daynix Computing LTD (http://www.daynix.com)
> + *
> + * Authors:
> + *  Andrew Melnychenko 
> + *  Yuri Benditovich 
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Prepare:
> + * Requires llvm, clang, bpftool, linux kernel tree
> + *
> + * Build rss.bpf.skeleton.h:
> + * make -f Makefile.ebpf clean all
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +#define INDIRECTION_TABLE_SIZE 128
> +#define HASH_CALCULATION_BUFFER_SIZE 36
> +
> +struct rss_config_t {
> +__u8 redirect;
> +__u8 populate_hash;

two-byte hole here...

> +__u32 hash_types;
> +__u16 indirections_len;
> +__u16 default_queue;
> +};
> +
> +struct toeplitz_key_data_t {
> +__u32 leftmost_32_bits;
> +__u8 next_byte[HASH_CALCULATION_BUFFER_SIZE];
> +};
> +
> +struct packet_hash_info_t {
> +__u8 is_ipv4;
> +__u8 is_ipv6;
> +__u8 is_udp;
> +__u8 is_tcp;
> +__u8 is_ipv6_ext_src;
> +__u8 is_ipv6_ext_dst;
> +
> +__u16 src_port;
> +__u16 dst_port;

...and there's going to be a hole here as well I think.

> +union {
> +struct {
> +__be32 in_src;
> +__be32 in_dst;
> +};
> +
> +struct {
> +struct in6_addr in6_src;
> +struct in6_addr in6_dst;
> +struct in6_addr in6_ext_src;
> +struct in6_addr in6_ext_dst;
> +};
> +};
> +};
> +
> +struct bpf_map_def SEC("maps")
> +tap_rss_map_configurations = {
> +.type= BPF_MAP_TYPE_ARRAY,
> +.key_size= sizeof(__u32),
> +.value_size  = sizeof(struct rss_config_t),
> +.max_entries = 1,
> +};
> +
> +struct bpf_map_def SEC("maps")
> +tap_rss_map_toeplitz_key = {
> +.type= BPF_MAP_TYPE_ARRAY,
> +.key_size= sizeof(__u32),
> +.value_size  = sizeof(struct toeplitz_key_data_t),
> +.max_entries = 1,
> +};

Which version of LLVM and libbpf are you targeting?

Libbpf 0.0.3 (which is almost two years old now) added support for
relocations of global data section, so instead of defining these
one-element maps you could just define the config structs as global
variables and use them like you would from regular C 

Re: [RFC PATCH v2 0/5] eBPF RSS support for virtio-net

2020-12-04 Thread Toke Høiland-Jørgensen
Yuri Benditovich  writes:

> On Fri, Dec 4, 2020 at 12:09 PM Toke Høiland-Jørgensen 
> wrote:
>
>> Yuri Benditovich  writes:
>>
>> > On Wed, Dec 2, 2020 at 4:18 PM Toke Høiland-Jørgensen 
>> > wrote:
>> >
>> >> Jason Wang  writes:
>> >>
>> >> > On 2020/11/19 下午7:13, Andrew Melnychenko wrote:
>> >> >> This set of patches introduces the usage of eBPF for packet steering
>> >> >> and RSS hash calculation:
>> >> >> * RSS(Receive Side Scaling) is used to distribute network packets to
>> >> >> guest virtqueues by calculating packet hash
>> >> >> * Additionally adding support for the usage of RSS with vhost
>> >> >>
>> >> >> The eBPF works on kernels 5.8+
>> >> >> On earlier kerneld it fails to load and the RSS feature is reported
>> >> >> only without vhost and implemented in 'in-qemu' software.
>> >> >>
>> >> >> Implementation notes:
>> >> >> Linux TAP TUNSETSTEERINGEBPF ioctl was used to set the eBPF program.
>> >> >> Added libbpf dependency and eBPF support.
>> >> >> The eBPF program is part of the qemu and presented as an array
>> >> >> of BPF ELF file data.
>> >> >> The compilation of eBPF is not part of QEMU build and can be done
>> >> >> using provided Makefile.ebpf(need to adjust 'linuxhdrs').
>> >> >> Added changes to virtio-net and vhost, primary eBPF RSS is used.
>> >> >> 'in-qemu' RSS used in the case of hash population and as a fallback
>> >> option.
>> >> >> For vhost, the hash population feature is not reported to the guest.
>> >> >>
>> >> >> Please also see the documentation in PATCH 5/5.
>> >> >>
>> >> >> I am sending those patches as RFC to initiate the discussions and get
>> >> >> feedback on the following points:
>> >> >> * Fallback when eBPF is not supported by the kernel
>> >> >> * Live migration to the kernel that doesn't have eBPF support
>> >> >> * Integration with current QEMU build
>> >> >> * Additional usage for eBPF for packet filtering
>> >> >>
>> >> >> Known issues:
>> >> >> * hash population not supported by eBPF RSS: 'in-qemu' RSS used
>> >> >> as a fallback, also, hash population feature is not reported to
>> guests
>> >> >> with vhost.
>> >> >> * big-endian BPF support: for now, eBPF isn't supported on
>> >> >> big-endian systems. Can be added in future if required.
>> >> >> * huge .h file with eBPF binary. The size of .h file containing
>> >> >> eBPF binary is currently ~5K lines, because the binary is built with
>> >> debug information.
>> >> >> The binary without debug/BTF info can't be loaded by libbpf.
>> >> >> We're looking for possibilities to reduce the size of the .h files.
>> >> >
>> >> >
>> >> > Adding Toke for sharing more idea from eBPF side.
>> >> >
>> >> > We had some discussion on the eBPF issues:
>> >> >
>> >> > 1) Whether or not to use libbpf. Toke strongly suggest to use libbpf
>> >> > 2) Whether or not to use BTF. Toke confirmed that if we don't access
>> any
>> >> > skb metadata, BTF is not strictly required for CO-RE. But it might
>> still
>> >> > useful for e.g debugging.
>> >> > 3) About the huge (5K lines, see patch #2 Toke). Toke confirmed that
>> we
>> >> > can strip debug symbols, but Yuri found some sections can't be
>> stripped,
>> >> > we can keep discussing here.
>> >>
>> >> I just tried simply running 'strip' on a sample trivial XDP program,
>> >> which brought its size down from ~5k to ~1k and preserved the BTF
>> >> information without me having to do anything.
>> >>
>> >
>> > With our eBPF code the numbers are slightly different:
>> > The code size without BTF: 7.5K (built without '-g')
>> > Built with '-g': 45K
>> > Stripped: 19K
>> > The difference between 7.5 and 19K still seems significant, especially
>> when
>> > we do not use any kernel structures and do not need the

Re: [RFC PATCH v2 0/5] eBPF RSS support for virtio-net

2020-12-04 Thread Toke Høiland-Jørgensen
Yuri Benditovich  writes:

> On Wed, Dec 2, 2020 at 4:18 PM Toke Høiland-Jørgensen 
> wrote:
>
>> Jason Wang  writes:
>>
>> > On 2020/11/19 下午7:13, Andrew Melnychenko wrote:
>> >> This set of patches introduces the usage of eBPF for packet steering
>> >> and RSS hash calculation:
>> >> * RSS(Receive Side Scaling) is used to distribute network packets to
>> >> guest virtqueues by calculating packet hash
>> >> * Additionally adding support for the usage of RSS with vhost
>> >>
>> >> The eBPF works on kernels 5.8+
>> >> On earlier kerneld it fails to load and the RSS feature is reported
>> >> only without vhost and implemented in 'in-qemu' software.
>> >>
>> >> Implementation notes:
>> >> Linux TAP TUNSETSTEERINGEBPF ioctl was used to set the eBPF program.
>> >> Added libbpf dependency and eBPF support.
>> >> The eBPF program is part of the qemu and presented as an array
>> >> of BPF ELF file data.
>> >> The compilation of eBPF is not part of QEMU build and can be done
>> >> using provided Makefile.ebpf(need to adjust 'linuxhdrs').
>> >> Added changes to virtio-net and vhost, primary eBPF RSS is used.
>> >> 'in-qemu' RSS used in the case of hash population and as a fallback
>> option.
>> >> For vhost, the hash population feature is not reported to the guest.
>> >>
>> >> Please also see the documentation in PATCH 5/5.
>> >>
>> >> I am sending those patches as RFC to initiate the discussions and get
>> >> feedback on the following points:
>> >> * Fallback when eBPF is not supported by the kernel
>> >> * Live migration to the kernel that doesn't have eBPF support
>> >> * Integration with current QEMU build
>> >> * Additional usage for eBPF for packet filtering
>> >>
>> >> Known issues:
>> >> * hash population not supported by eBPF RSS: 'in-qemu' RSS used
>> >> as a fallback, also, hash population feature is not reported to guests
>> >> with vhost.
>> >> * big-endian BPF support: for now, eBPF isn't supported on
>> >> big-endian systems. Can be added in future if required.
>> >> * huge .h file with eBPF binary. The size of .h file containing
>> >> eBPF binary is currently ~5K lines, because the binary is built with
>> debug information.
>> >> The binary without debug/BTF info can't be loaded by libbpf.
>> >> We're looking for possibilities to reduce the size of the .h files.
>> >
>> >
>> > Adding Toke for sharing more idea from eBPF side.
>> >
>> > We had some discussion on the eBPF issues:
>> >
>> > 1) Whether or not to use libbpf. Toke strongly suggest to use libbpf
>> > 2) Whether or not to use BTF. Toke confirmed that if we don't access any
>> > skb metadata, BTF is not strictly required for CO-RE. But it might still
>> > useful for e.g debugging.
>> > 3) About the huge (5K lines, see patch #2 Toke). Toke confirmed that we
>> > can strip debug symbols, but Yuri found some sections can't be stripped,
>> > we can keep discussing here.
>>
>> I just tried simply running 'strip' on a sample trivial XDP program,
>> which brought its size down from ~5k to ~1k and preserved the BTF
>> information without me having to do anything.
>>
>
> With our eBPF code the numbers are slightly different:
> The code size without BTF: 7.5K (built without '-g')
> Built with '-g': 45K
> Stripped: 19K
> The difference between 7.5 and 19K still seems significant, especially when
> we do not use any kernel structures and do not need these BTF sections

That does seem like a lot of BTF information. Did you confirm (with
objdump) that it's the .BTF* sections that take up these extra 12k? Do
you have some really complicated data structures in the file or
something? Got a link to the source somewhere that isn't a web mailing
list archive? :)

In any case, while I do think it smells a little of premature
optimisation, you can of course strip the BTF information until you need
it. Having it around makes debugging easier (bpftool will expand your
map structures for you when dumping maps, and that sort of thing), but
it's not really essential if you don't need CO-RE.

> This is only reason to prefer non-libbpf option for this specific eBPF

You can still use libbpf without BTF. It's using BTF without libbpf that
tends to not work so wel

Re: [RFC PATCH v2 0/5] eBPF RSS support for virtio-net

2020-12-02 Thread Toke Høiland-Jørgensen
Jason Wang  writes:

> On 2020/11/19 下午7:13, Andrew Melnychenko wrote:
>> This set of patches introduces the usage of eBPF for packet steering
>> and RSS hash calculation:
>> * RSS(Receive Side Scaling) is used to distribute network packets to
>> guest virtqueues by calculating packet hash
>> * Additionally adding support for the usage of RSS with vhost
>>
>> The eBPF works on kernels 5.8+
>> On earlier kerneld it fails to load and the RSS feature is reported
>> only without vhost and implemented in 'in-qemu' software.
>>
>> Implementation notes:
>> Linux TAP TUNSETSTEERINGEBPF ioctl was used to set the eBPF program.
>> Added libbpf dependency and eBPF support.
>> The eBPF program is part of the qemu and presented as an array
>> of BPF ELF file data.
>> The compilation of eBPF is not part of QEMU build and can be done
>> using provided Makefile.ebpf(need to adjust 'linuxhdrs').
>> Added changes to virtio-net and vhost, primary eBPF RSS is used.
>> 'in-qemu' RSS used in the case of hash population and as a fallback option.
>> For vhost, the hash population feature is not reported to the guest.
>>
>> Please also see the documentation in PATCH 5/5.
>>
>> I am sending those patches as RFC to initiate the discussions and get
>> feedback on the following points:
>> * Fallback when eBPF is not supported by the kernel
>> * Live migration to the kernel that doesn't have eBPF support
>> * Integration with current QEMU build
>> * Additional usage for eBPF for packet filtering
>>
>> Known issues:
>> * hash population not supported by eBPF RSS: 'in-qemu' RSS used
>> as a fallback, also, hash population feature is not reported to guests
>> with vhost.
>> * big-endian BPF support: for now, eBPF isn't supported on
>> big-endian systems. Can be added in future if required.
>> * huge .h file with eBPF binary. The size of .h file containing
>> eBPF binary is currently ~5K lines, because the binary is built with debug 
>> information.
>> The binary without debug/BTF info can't be loaded by libbpf.
>> We're looking for possibilities to reduce the size of the .h files.
>
>
> Adding Toke for sharing more idea from eBPF side.
>
> We had some discussion on the eBPF issues:
>
> 1) Whether or not to use libbpf. Toke strongly suggest to use libbpf
> 2) Whether or not to use BTF. Toke confirmed that if we don't access any 
> skb metadata, BTF is not strictly required for CO-RE. But it might still 
> useful for e.g debugging.
> 3) About the huge (5K lines, see patch #2 Toke). Toke confirmed that we 
> can strip debug symbols, but Yuri found some sections can't be stripped, 
> we can keep discussing here.

I just tried simply running 'strip' on a sample trivial XDP program,
which brought its size down from ~5k to ~1k and preserved the BTF
information without me having to do anything.

As a side note, though, instead of embedding the BPF program into a .h,
you could simply ship it as a .o and load it from the file system. We do
that with xdp-tools and install the bpf object files into /usr/$LIB/bpf/.

-Toke