Gaetan Rivet via dev <ovs-dev@openvswitch.org> writes:

>>-----Original Message-----
>>From: dev <ovs-dev-boun...@openvswitch.org
>> <mailto:ovs-dev-boun...@openvswitch.org>> on behalf of Eelco
>> Chaudron <echau...@redhat.com <mailto:echau...@redhat.com>>
>>Date: Monday 23 January 2023 at 10:44
>>To: Gaetan Rivet <gaet...@nvidia.com <mailto:gaet...@nvidia.com>>
>>Cc: ovs dev <d...@openvswitch.org <mailto:d...@openvswitch.org>>, Eli
>> Britstein <el...@nvidia.com <mailto:el...@nvidia.com>>, Ilya
>> Maximets <i.maxim...@ovn.org <mailto:i.maxim...@ovn.org>>, Maxime
>> Coquelin <maxime.coque...@redhat.com
>> <mailto:maxime.coque...@redhat.com>>, Jason Gunthorpe
>> <j...@nvidia.com <mailto:j...@nvidia.com>>, Majd Dibbiny
>> <m...@nvidia.com <mailto:m...@nvidia.com>>, David Marchand
>> <david.march...@redhat.com <mailto:david.march...@redhat.com>>
>>Subject: Re: [ovs-dev] [PATCH 1/1] daemon-unix: Support OVS-DPDK HW offloads 
>>for non-root user
>>
>>On 22 Mar 2021, at 15:21, David Marchand wrote:
>>
>>
>>> Hello Gaƫtan,
>>>
>>> On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaet...@nvidia.com 
>>> <mailto:gaet...@nvidia.com>> wrote:
>>>> Our rte_flow implementation uses ICM mappings to program our hardware,
>>>> which requires super privileged access. We are looking into ways to avoid 
>>>> it.
>>>
>>> Ok, thanks for looking into it.
>>
>>
>>Was any progress made on this? Or was the conclusion that this is the only 
>>way?
>>
>>
>
> Hello Eelco,
>
> I'll start by clarifying something: this issue is two-fold, even though there 
> is a single RC.
> The previous email thread is about using rte_flow API with the mlx5 PMD,
> while the bug you described is about lack of TX on ports with no offloads 
> inserted by the user.
>
> On the matter of the above rte_flow requirements: it was discussed, the 
> conclusion
> was that nothing could be done with the current offload architecture.
> A patch was written to improve the logs and the documentation, but I see that 
> it didn't
> make it to upstream DPDK. I have brought it to the attention of the DPDK team 
> and
> it will be submitted.
>
>>>>
>>>> In the meantime, we failed to properly communicate this need in the 
>>>> rte_flow API.
>>>> We will improve the documentation and the error path in DPDK.
>>>
>>> Without this capa, mlx5 rte_flow full hw offloading errors with logs like:
>>> 2021-03-22T14:12:40.274Z|00001|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>>> rte_flow creation failed: 1 ((null)).
>>> 2021-03-22T14:12:40.274Z|00002|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>>> Failed flow: flow create 3 ingress priority 0 group 0 transfer
>>> pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is
>>> 0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end
>>>
>>> First log is useless.
>>> This is more bugfixing than enhancement.
>>> Though logs do not need to tell the full story, they can point at the
>>> mlx5 pmd documentation where the full explanation is.
>>
>>
>>
>>
>>I was running into this issue also and spent a decent amount of time
>> trying to figure out what was going on.
>>I did not have HW offload enabled yet, but just the basic VF/port
>> representer configuration and no error messages or packets were
>> arriving.
>>It Would be good to get some logging indicating the
>> configuration/system was not valid. All I got was silent packet
>> drops, but counters were incremented :(
>
> I was able to reproduce this issue. The DPDK team was adamant that it
> should have resulted in an error log,
> but none can be seen. The absence of log is a bug, however the behavior 
> itself is intended.
>
> To give some more details, port probe should work without SYS_RAWIO, but 
> start should fail.
> On start, the PMD installs hardware rules, using the same underlying API as 
> rte_flow. It makes SYS_RAWIO
> a requirement for even basic port functions. This will remain the case in the 
> future.
>
> Similarly, this issue has been brought to the attention of the DPDK team and 
> they
> will make sure the user has an error log in that case, as well as clarifying 
> the mlx5 doc.
>
> Thanks for reporting this!
>

I guess we would need something like the following (untested, but
general idea) patch, but I can see having some debate about a proper
solution.  It should also be noted that some vfio support would also
need iopl() access (at least from a quick glance at the DPDK side), but
I'm not sure what use case that would enable.

In this change, I don't really put much documentation around the
hw-access knob - there may be additional selinux changes (for example to
add support for memory_device_t access), but I don't currently have a
system ready to go for testing this.

There is a bit of an ABI break around daemonize_start() in here, so
maybe it would be better to introduce a proper ABI version signature for
daemonize_start() even though historically we never tried to have a
consistent ABI in libopenvswitch, so just "thinking out loud," at the
moment.

---
diff --git a/NEWS b/NEWS
index fe6055a270..13d2522679 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@ Post-v3.1.0
      * OVS now collects per-interface upcall statistics that can be obtained
        via 'ovs-appctl dpctl/show -s' or the interface's statistics column
        in OVSDB.  Available with upstream kernel 6.2+.
+   - DPDK:
+     * ovs-vswitchd will keep the CAP_SYS_RAWIO capability when started
+       with the --hw-access command line option.  This allows the process
+       extra privileges when mapping physical interconnect memory.
 
 
 v3.1.0 - xx xxx xxxx
diff --git a/lib/daemon-unix.c b/lib/daemon-unix.c
index 1a7ba427d7..8b895a48de 100644
--- a/lib/daemon-unix.c
+++ b/lib/daemon-unix.c
@@ -88,7 +88,8 @@ static bool switch_user = false;
 static uid_t uid;
 static gid_t gid;
 static char *user = NULL;
-static void daemon_become_new_user__(bool access_datapath);
+static void daemon_become_new_user__(bool access_datapath,
+                                     bool access_hardware_ports);
 
 static void check_already_running(void);
 static int lock_pidfile(FILE *, int command);
@@ -443,13 +444,13 @@ monitor_daemon(pid_t daemon_pid)
  * daemonize_complete()) or that it failed to start up (by exiting with a
  * nonzero exit code). */
 void
-daemonize_start(bool access_datapath)
+daemonize_start(bool access_datapath, bool access_hardware_ports)
 {
     assert_single_threaded();
     daemonize_fd = -1;
 
     if (switch_user) {
-        daemon_become_new_user__(access_datapath);
+        daemon_become_new_user__(access_datapath, access_hardware_ports);
         switch_user = false;
     }
 
@@ -807,7 +808,8 @@ daemon_become_new_user_unix(void)
 /* Linux specific implementation of daemon_become_new_user()
  * using libcap-ng.   */
 static void
-daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
+daemon_become_new_user_linux(bool access_datapath OVS_UNUSED,
+                             bool access_hardware_ports OVS_UNUSED)
 {
 #if defined __linux__ &&  HAVE_LIBCAPNG
     int ret;
@@ -826,7 +828,17 @@ daemon_become_new_user_linux(bool access_datapath 
OVS_UNUSED)
             if (access_datapath && !ret) {
                 ret = capng_update(CAPNG_ADD, cap_sets, CAP_NET_ADMIN)
                       || capng_update(CAPNG_ADD, cap_sets, CAP_NET_RAW)
-                      || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST);
+                      || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST)
+#ifdef DPDK_NETDEV
+                      || (access_hardware_ports &&
+                          capng_update(CAPNG_ADD, cap_sets, CAP_SYS_RAWIO))
+#else
+                    ;
+                if (access_hardware_ports) {
+                    VLOG_WARN("hw port access requested, but no userspace 
ioport support.  Dropping.");
+                }
+#endif
+                    ;
             }
         } else {
             ret = -1;
@@ -854,7 +866,7 @@ daemon_become_new_user_linux(bool access_datapath 
OVS_UNUSED)
 }
 
 static void
-daemon_become_new_user__(bool access_datapath)
+daemon_become_new_user__(bool access_datapath, bool access_hardware_ports)
 {
     /* If vlog file has been created, change its owner to the non-root user
      * as specifed by the --user option.  */
@@ -862,7 +874,8 @@ daemon_become_new_user__(bool access_datapath)
 
     if (LINUX) {
         if (LIBCAPNG) {
-            daemon_become_new_user_linux(access_datapath);
+            daemon_become_new_user_linux(access_datapath,
+                                         access_hardware_ports);
         } else {
             VLOG_FATAL("%s: fail to downgrade user using libcap-ng. "
                        "(libcap-ng is not configured at compile time), "
@@ -877,11 +890,11 @@ daemon_become_new_user__(bool access_datapath)
  * However, there in case the user switch needs to be done
  * before daemonize_start(), the following API can be used.  */
 void
-daemon_become_new_user(bool access_datapath)
+daemon_become_new_user(bool access_datapath, bool access_hardware_ports)
 {
     assert_single_threaded();
     if (switch_user) {
-        daemon_become_new_user__(access_datapath);
+        daemon_become_new_user__(access_datapath, access_hardware_ports);
         /* daemonize_start() should not switch user again. */
         switch_user = false;
     }
diff --git a/lib/daemon.c b/lib/daemon.c
index 3249c5ab4b..1e1c019eb1 100644
--- a/lib/daemon.c
+++ b/lib/daemon.c
@@ -48,7 +48,7 @@ get_detach(void)
 void
 daemonize(void)
 {
-    daemonize_start(false);
+    daemonize_start(false, false);
     daemonize_complete();
 }
 
diff --git a/lib/daemon.h b/lib/daemon.h
index 0941574963..42372d1463 100644
--- a/lib/daemon.h
+++ b/lib/daemon.h
@@ -167,10 +167,10 @@ void set_detach(void);
 bool get_detach(void);
 void daemon_save_fd(int fd);
 void daemonize(void);
-void daemonize_start(bool access_datapath);
+void daemonize_start(bool access_datapath, bool access_hardware_ports);
 void daemonize_complete(void);
 void daemon_set_new_user(const char * user_spec);
-void daemon_become_new_user(bool access_datapath);
+void daemon_become_new_user(bool access_datapath, bool access_hardware_ports);
 void daemon_usage(void);
 void daemon_disable_self_confinement(void);
 bool daemon_should_self_confine(void);
diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
index 9569265fcb..e39e3c984c 100644
--- a/vswitchd/ovs-vswitchd.8.in
+++ b/vswitchd/ovs-vswitchd.8.in
@@ -81,6 +81,12 @@ unavailable or unsuccessful.
 .SS "DPDK Options"
 For details on initializing \fBovs\-vswitchd\fR to use DPDK ports,
 refer to the documentation or \fBovs\-vswitchd.conf.db\fR(5).
+.SS "DPDK HW Access Options"
+.IP "\fB\-\-hw\-access\fR"
+Tells \fBovs\-vswitchd\fR to retain the \fBCAP_SYS_RAWIO\fR capability,
+to allow userspace drivers access to raw hardware memory.  This will
+also allow the \fBovs\-vswitchd\fR daemon to call \fBiopl()\fR and
+\fBioperm()\fR functions to set port access.
 .SS "Daemon Options"
 .ds DD \
 \fBovs\-vswitchd\fR detaches only after it has connected to the \
diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index 407bfc60eb..ff1527fd5b 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -60,6 +60,9 @@ VLOG_DEFINE_THIS_MODULE(vswitchd);
  * the kernel from paging any of its memory to disk. */
 static bool want_mlockall;
 
+/* --hw-access: If set, retains CAP_SYS_RAWIO privileges.  */
+static bool hw_access;
+
 static unixctl_cb_func ovs_vswitchd_exit;
 
 static char *parse_options(int argc, char *argv[], char **unixctl_path);
@@ -169,6 +172,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         OPT_DPDK,
         SSL_OPTION_ENUMS,
         OPT_DUMMY_NUMA,
+        OPT_HW_ACCESS,
     };
     static const struct option long_options[] = {
         {"help",        no_argument, NULL, 'h'},
@@ -185,6 +189,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         {"disable-system-route", no_argument, NULL, OPT_DISABLE_SYSTEM_ROUTE},
         {"dpdk", optional_argument, NULL, OPT_DPDK},
         {"dummy-numa", required_argument, NULL, OPT_DUMMY_NUMA},
+        {"hw-access", no_argument, NULL, OPT_HW_ACCESS},
         {NULL, 0, NULL, 0},
     };
     char *short_options = ovs_cmdl_long_options_to_short_options(long_options);
@@ -249,6 +254,10 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
             ovs_numa_set_dummy(optarg);
             break;
 
+        case OPT_HW_ACCESS:
+            hw_access = true;
+            break;
+
         default:
             abort();
         }
---

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to