Gaetan Rivet via dev <ovs-dev@openvswitch.org> writes: >>-----Original Message----- >>From: dev <ovs-dev-boun...@openvswitch.org >> <mailto:ovs-dev-boun...@openvswitch.org>> on behalf of Eelco >> Chaudron <echau...@redhat.com <mailto:echau...@redhat.com>> >>Date: Monday 23 January 2023 at 10:44 >>To: Gaetan Rivet <gaet...@nvidia.com <mailto:gaet...@nvidia.com>> >>Cc: ovs dev <d...@openvswitch.org <mailto:d...@openvswitch.org>>, Eli >> Britstein <el...@nvidia.com <mailto:el...@nvidia.com>>, Ilya >> Maximets <i.maxim...@ovn.org <mailto:i.maxim...@ovn.org>>, Maxime >> Coquelin <maxime.coque...@redhat.com >> <mailto:maxime.coque...@redhat.com>>, Jason Gunthorpe >> <j...@nvidia.com <mailto:j...@nvidia.com>>, Majd Dibbiny >> <m...@nvidia.com <mailto:m...@nvidia.com>>, David Marchand >> <david.march...@redhat.com <mailto:david.march...@redhat.com>> >>Subject: Re: [ovs-dev] [PATCH 1/1] daemon-unix: Support OVS-DPDK HW offloads >>for non-root user >> >>On 22 Mar 2021, at 15:21, David Marchand wrote: >> >> >>> Hello Gaƫtan, >>> >>> On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaet...@nvidia.com >>> <mailto:gaet...@nvidia.com>> wrote: >>>> Our rte_flow implementation uses ICM mappings to program our hardware, >>>> which requires super privileged access. We are looking into ways to avoid >>>> it. >>> >>> Ok, thanks for looking into it. >> >> >>Was any progress made on this? Or was the conclusion that this is the only >>way? >> >> > > Hello Eelco, > > I'll start by clarifying something: this issue is two-fold, even though there > is a single RC. > The previous email thread is about using rte_flow API with the mlx5 PMD, > while the bug you described is about lack of TX on ports with no offloads > inserted by the user. > > On the matter of the above rte_flow requirements: it was discussed, the > conclusion > was that nothing could be done with the current offload architecture. > A patch was written to improve the logs and the documentation, but I see that > it didn't > make it to upstream DPDK. I have brought it to the attention of the DPDK team > and > it will be submitted. > >>>> >>>> In the meantime, we failed to properly communicate this need in the >>>> rte_flow API. >>>> We will improve the documentation and the error path in DPDK. >>> >>> Without this capa, mlx5 rte_flow full hw offloading errors with logs like: >>> 2021-03-22T14:12:40.274Z|00001|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0: >>> rte_flow creation failed: 1 ((null)). >>> 2021-03-22T14:12:40.274Z|00002|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0: >>> Failed flow: flow create 3 ingress priority 0 group 0 transfer >>> pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is >>> 0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end >>> >>> First log is useless. >>> This is more bugfixing than enhancement. >>> Though logs do not need to tell the full story, they can point at the >>> mlx5 pmd documentation where the full explanation is. >> >> >> >> >>I was running into this issue also and spent a decent amount of time >> trying to figure out what was going on. >>I did not have HW offload enabled yet, but just the basic VF/port >> representer configuration and no error messages or packets were >> arriving. >>It Would be good to get some logging indicating the >> configuration/system was not valid. All I got was silent packet >> drops, but counters were incremented :( > > I was able to reproduce this issue. The DPDK team was adamant that it > should have resulted in an error log, > but none can be seen. The absence of log is a bug, however the behavior > itself is intended. > > To give some more details, port probe should work without SYS_RAWIO, but > start should fail. > On start, the PMD installs hardware rules, using the same underlying API as > rte_flow. It makes SYS_RAWIO > a requirement for even basic port functions. This will remain the case in the > future. > > Similarly, this issue has been brought to the attention of the DPDK team and > they > will make sure the user has an error log in that case, as well as clarifying > the mlx5 doc. > > Thanks for reporting this! >
I guess we would need something like the following (untested, but general idea) patch, but I can see having some debate about a proper solution. It should also be noted that some vfio support would also need iopl() access (at least from a quick glance at the DPDK side), but I'm not sure what use case that would enable. In this change, I don't really put much documentation around the hw-access knob - there may be additional selinux changes (for example to add support for memory_device_t access), but I don't currently have a system ready to go for testing this. There is a bit of an ABI break around daemonize_start() in here, so maybe it would be better to introduce a proper ABI version signature for daemonize_start() even though historically we never tried to have a consistent ABI in libopenvswitch, so just "thinking out loud," at the moment. --- diff --git a/NEWS b/NEWS index fe6055a270..13d2522679 100644 --- a/NEWS +++ b/NEWS @@ -4,6 +4,10 @@ Post-v3.1.0 * OVS now collects per-interface upcall statistics that can be obtained via 'ovs-appctl dpctl/show -s' or the interface's statistics column in OVSDB. Available with upstream kernel 6.2+. + - DPDK: + * ovs-vswitchd will keep the CAP_SYS_RAWIO capability when started + with the --hw-access command line option. This allows the process + extra privileges when mapping physical interconnect memory. v3.1.0 - xx xxx xxxx diff --git a/lib/daemon-unix.c b/lib/daemon-unix.c index 1a7ba427d7..8b895a48de 100644 --- a/lib/daemon-unix.c +++ b/lib/daemon-unix.c @@ -88,7 +88,8 @@ static bool switch_user = false; static uid_t uid; static gid_t gid; static char *user = NULL; -static void daemon_become_new_user__(bool access_datapath); +static void daemon_become_new_user__(bool access_datapath, + bool access_hardware_ports); static void check_already_running(void); static int lock_pidfile(FILE *, int command); @@ -443,13 +444,13 @@ monitor_daemon(pid_t daemon_pid) * daemonize_complete()) or that it failed to start up (by exiting with a * nonzero exit code). */ void -daemonize_start(bool access_datapath) +daemonize_start(bool access_datapath, bool access_hardware_ports) { assert_single_threaded(); daemonize_fd = -1; if (switch_user) { - daemon_become_new_user__(access_datapath); + daemon_become_new_user__(access_datapath, access_hardware_ports); switch_user = false; } @@ -807,7 +808,8 @@ daemon_become_new_user_unix(void) /* Linux specific implementation of daemon_become_new_user() * using libcap-ng. */ static void -daemon_become_new_user_linux(bool access_datapath OVS_UNUSED) +daemon_become_new_user_linux(bool access_datapath OVS_UNUSED, + bool access_hardware_ports OVS_UNUSED) { #if defined __linux__ && HAVE_LIBCAPNG int ret; @@ -826,7 +828,17 @@ daemon_become_new_user_linux(bool access_datapath OVS_UNUSED) if (access_datapath && !ret) { ret = capng_update(CAPNG_ADD, cap_sets, CAP_NET_ADMIN) || capng_update(CAPNG_ADD, cap_sets, CAP_NET_RAW) - || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST); + || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST) +#ifdef DPDK_NETDEV + || (access_hardware_ports && + capng_update(CAPNG_ADD, cap_sets, CAP_SYS_RAWIO)) +#else + ; + if (access_hardware_ports) { + VLOG_WARN("hw port access requested, but no userspace ioport support. Dropping."); + } +#endif + ; } } else { ret = -1; @@ -854,7 +866,7 @@ daemon_become_new_user_linux(bool access_datapath OVS_UNUSED) } static void -daemon_become_new_user__(bool access_datapath) +daemon_become_new_user__(bool access_datapath, bool access_hardware_ports) { /* If vlog file has been created, change its owner to the non-root user * as specifed by the --user option. */ @@ -862,7 +874,8 @@ daemon_become_new_user__(bool access_datapath) if (LINUX) { if (LIBCAPNG) { - daemon_become_new_user_linux(access_datapath); + daemon_become_new_user_linux(access_datapath, + access_hardware_ports); } else { VLOG_FATAL("%s: fail to downgrade user using libcap-ng. " "(libcap-ng is not configured at compile time), " @@ -877,11 +890,11 @@ daemon_become_new_user__(bool access_datapath) * However, there in case the user switch needs to be done * before daemonize_start(), the following API can be used. */ void -daemon_become_new_user(bool access_datapath) +daemon_become_new_user(bool access_datapath, bool access_hardware_ports) { assert_single_threaded(); if (switch_user) { - daemon_become_new_user__(access_datapath); + daemon_become_new_user__(access_datapath, access_hardware_ports); /* daemonize_start() should not switch user again. */ switch_user = false; } diff --git a/lib/daemon.c b/lib/daemon.c index 3249c5ab4b..1e1c019eb1 100644 --- a/lib/daemon.c +++ b/lib/daemon.c @@ -48,7 +48,7 @@ get_detach(void) void daemonize(void) { - daemonize_start(false); + daemonize_start(false, false); daemonize_complete(); } diff --git a/lib/daemon.h b/lib/daemon.h index 0941574963..42372d1463 100644 --- a/lib/daemon.h +++ b/lib/daemon.h @@ -167,10 +167,10 @@ void set_detach(void); bool get_detach(void); void daemon_save_fd(int fd); void daemonize(void); -void daemonize_start(bool access_datapath); +void daemonize_start(bool access_datapath, bool access_hardware_ports); void daemonize_complete(void); void daemon_set_new_user(const char * user_spec); -void daemon_become_new_user(bool access_datapath); +void daemon_become_new_user(bool access_datapath, bool access_hardware_ports); void daemon_usage(void); void daemon_disable_self_confinement(void); bool daemon_should_self_confine(void); diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in index 9569265fcb..e39e3c984c 100644 --- a/vswitchd/ovs-vswitchd.8.in +++ b/vswitchd/ovs-vswitchd.8.in @@ -81,6 +81,12 @@ unavailable or unsuccessful. .SS "DPDK Options" For details on initializing \fBovs\-vswitchd\fR to use DPDK ports, refer to the documentation or \fBovs\-vswitchd.conf.db\fR(5). +.SS "DPDK HW Access Options" +.IP "\fB\-\-hw\-access\fR" +Tells \fBovs\-vswitchd\fR to retain the \fBCAP_SYS_RAWIO\fR capability, +to allow userspace drivers access to raw hardware memory. This will +also allow the \fBovs\-vswitchd\fR daemon to call \fBiopl()\fR and +\fBioperm()\fR functions to set port access. .SS "Daemon Options" .ds DD \ \fBovs\-vswitchd\fR detaches only after it has connected to the \ diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c index 407bfc60eb..ff1527fd5b 100644 --- a/vswitchd/ovs-vswitchd.c +++ b/vswitchd/ovs-vswitchd.c @@ -60,6 +60,9 @@ VLOG_DEFINE_THIS_MODULE(vswitchd); * the kernel from paging any of its memory to disk. */ static bool want_mlockall; +/* --hw-access: If set, retains CAP_SYS_RAWIO privileges. */ +static bool hw_access; + static unixctl_cb_func ovs_vswitchd_exit; static char *parse_options(int argc, char *argv[], char **unixctl_path); @@ -169,6 +172,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp) OPT_DPDK, SSL_OPTION_ENUMS, OPT_DUMMY_NUMA, + OPT_HW_ACCESS, }; static const struct option long_options[] = { {"help", no_argument, NULL, 'h'}, @@ -185,6 +189,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp) {"disable-system-route", no_argument, NULL, OPT_DISABLE_SYSTEM_ROUTE}, {"dpdk", optional_argument, NULL, OPT_DPDK}, {"dummy-numa", required_argument, NULL, OPT_DUMMY_NUMA}, + {"hw-access", no_argument, NULL, OPT_HW_ACCESS}, {NULL, 0, NULL, 0}, }; char *short_options = ovs_cmdl_long_options_to_short_options(long_options); @@ -249,6 +254,10 @@ parse_options(int argc, char *argv[], char **unixctl_pathp) ovs_numa_set_dummy(optarg); break; + case OPT_HW_ACCESS: + hw_access = true; + break; + default: abort(); } --- _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev