> Hi,
>
> I'm looking for the motivation for the patch and I think there are a few
> items being bundled together so it's worth separating to understand the
> use case.

Hi Kevin,

Thanks for your review. You're right, a few things are bundled, so let me
give the concrete use case:

We want a single PMD thread to be schedulable by the kernel across all
cores of its NUMA node, instead of owning one specific core. Today that
can't be expressed: a PMD is always pinned 1:1 to a core
(ovs_numa_thread_setaffinity_core(pmd->core_id) in pmd_thread_main()).

This matters on systems where the OVS cores are not isolated and are
shared with other workloads, in particular offloaded datapaths, where the
bulk of traffic is forwarded in hardware and the PMD is no longer the
fast path, so dedicating a full pinned core per PMD is wasteful.
pmd-no-pin widens each PMD's affinity to its NUMA node and lets the
scheduler place/migrate it.

>
> 1. Cores with PMD threads being isolated from running other tasks
>
> The only thing that stops Linux scheduling other tasks on cores where
> PMD threads run is having those cores isolated on the system. This is
> not an OVS setting, so there is no need to change anything OVS for this.
>
> But isn't it a concern for you of extra latency or packet loss at high
> rates that might be introduced to packets on your datapath by other
> tasks running on those cores ?

I'll address points 1 and 4 together here. Agreed, isolation is a system
setting, not an OVS one. But pmd-cpu-mask doesn't cover this case:
reconfigure_pmd_threads() creates one pinned PMD per core in the mask.
Widening the mask just spawns more pinned PMDs (more cores at 100%),
which is the opposite of what we want. There is no existing knob for the
affinity width of a single PMD; that's exactly what this patch adds.

>
> 2. PMD threads running on a single core
>
> PMD threads are run on a single core. I'm not sure what the concern is
> here. Moving them around cores a lot will impact core local OVS sw
> caches, maybe some stats etc.
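
To make the "pinned to one core" vs. "free within the NUMA node"
distinction concrete, here is a rough user-space sketch in Python. It is
not the patch and not OVS code (OVS sets affinity in C); the helper names
parse_cpulist, numa_node_cpus, pin_to_core and float_within_node are made
up for illustration, and it assumes Linux with the usual sysfs file
/sys/devices/system/node/nodeN/cpulist.

```python
import os


def parse_cpulist(text):
    """Parse a simple Linux cpulist string such as "0-3,8-11" into a set."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_node_cpus(node):
    """Read the CPUs that belong to a NUMA node from sysfs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        return parse_cpulist(f.read())


def pin_to_core(core_id):
    """Current behaviour: the calling thread may run on exactly one core."""
    os.sched_setaffinity(0, {core_id})  # 0 means "the calling thread"


def float_within_node(node):
    """Intended behaviour: any core of the node, chosen by the scheduler."""
    os.sched_setaffinity(0, numa_node_cpus(node))
```

Calling pin_to_core(2) restricts the calling thread to core 2, while
float_within_node(0) lets the scheduler place it on any core of node 0.
Both calls fail with OSError if the requested CPUs are not available to
the process, for example inside a restricted cpuset.
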
>
> If you don't isolate the cores then Linux will presumably schedule other
> items away from the core running the PMD thread that is at 100%. Yes,
> it's true that Linux won't move the PMD thread itself but I'm not sure
> that's a bad thing.

The goal isn't to move the thread for its own sake. On a shared,
non-isolated node a pinned PMD permanently claims one specific core; if
that core is transiently contended, the PMD is stuck there. Letting it
float within the node lets the scheduler move it to a less-loaded core
and lets other tasks use the node's cores more flexibly. We deliberately
keep it within the NUMA node so we don't lose NUMA locality.

>
> 3. PMD threads running at 100% when they don't need to be
>
> That is the default but also a user choice. There are pmd-sleep-*
> options that will add some sleeps to reduce the load on a core in the
> event of no or low packet rates etc. If you did want to schedule other
> tasks on the core, then they could run.

We do use pmd-sleep-*; it's complementary rather than a replacement.
pmd-sleep frees the core in time (during sleeps), but the thread is still
pinned in space. pmd-no-pin frees it in space so the active PMD can be
placed on a free core. The two combine well.

>
> 4. Which cores the PMD threads can run on
>
> That can be currently selected with pmd-cpu-mask, and there is already
> consideration for NUMA when it comes to which rxqs those PMD thread can
> poll. So if you want to say all the cores can be used to run PMD
> threads, that can currently be done, just set the mask.
>
> -
>
> Is there some use case that is not covered by the above ?
>
> How would you protect against latency/packet loss in the datapath by
> scheduling other tasks on the same core as PMD threads or that is not a
> concern for you ?

A fair concern, and exactly why this is opt-in and off by default: for a
latency-sensitive software datapath, pinning (plus isolation) remains the
right choice and we don't change that.
This is for deployments that knowingly trade a bit of jitter for CPU
consolidation/flexibility, primarily offloaded datapaths where most
packets never touch the PMD core, so sharing it has limited datapath
impact. To bound the worst cases we do not fully unpin: affinity is
constrained to the PMD's NUMA node, which keeps memory access local and
keeps TSC deltas valid.

So "check what else is on the PMD core" stays good advice for the default
case. This flag is for users who deliberately want to share those cores.

Thanks,
Salem
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
