Thanks everybody. Will post v10 after lunch. /Billy
> -----Original Message----- > From: Stokes, Ian > Sent: Monday, July 10, 2017 12:02 PM > To: Ilya Maximets <i.maxim...@samsung.com>; O Mahony, Billy > <billy.o.mah...@intel.com>; d...@openvswitch.org > Cc: db...@vmare.com > Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non- > local numa node. > > > On 10.07.2017 13:42, O Mahony, Billy wrote: > > > > > > > > >> -----Original Message----- > > >> From: Stokes, Ian > > >> Sent: Monday, July 10, 2017 10:41 AM > > >> To: Ilya Maximets <i.maxim...@samsung.com>; O Mahony, Billy > > >> <billy.o.mah...@intel.com>; d...@openvswitch.org > > >> Cc: db...@vmare.com > > >> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds > > >> on non- local numa node. > > >> > > >>> On 08.07.2017 22:09, Stokes, Ian wrote: > > >>>>> Previously if there is no available (non-isolated) pmd on the > > >>>>> numa node for a port then the port is not polled at all. This > > >>>>> can result in a non- operational system until such time as nics > > >>>>> are physically repositioned. It is preferable to operate with a > > >>>>> pmd on > > the 'wrong' > > >>>>> numa node albeit with lower performance. Local pmds are still > > >>>>> chosen > > >>> when available. > > >>>>> > > >>>>> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com> > > >>>>> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com> > > >>>>> Co-authored-by: Ilya Maximets <i.maxim...@samsung.com> > > >>>>> --- > > >>>>> v9: v8 missed some comments on v7 > > >>>>> v8: Some coding style issues; doc tweak > > >>>>> v7: Incorporate review comments on docs and implementation > > >>>>> v6: Change 'port' to 'queue' in a warning msg > > >>>>> v5: Fix warning msg; Update same in docs > > >>>>> v4: Fix a checkpatch error > > >>>>> v3: Fix warning messages not appearing when using multiqueue > > >>>>> v2: Add details of warning messages into docs > > >>>>> > > >>>>> Documentation/intro/install/dpdk.rst | 21 +++++++++++++++--- > > >>>>> lib/dpif-netdev.c | 41 > > >>>>> +++++++++++++++++++++++++++++++++--- > > >>>>> 2 files changed, 56 insertions(+), 6 deletions(-) > > >>>>> > > >>>>> diff --git a/Documentation/intro/install/dpdk.rst > > >>>>> b/Documentation/intro/install/dpdk.rst > > >>>>> index e83f852..89775d6 100644 > > >>>>> --- a/Documentation/intro/install/dpdk.rst > > >>>>> +++ b/Documentation/intro/install/dpdk.rst > > >>>>> @@ -449,7 +449,7 @@ affinitized accordingly. > > >>>>> > > >>>>> A poll mode driver (pmd) thread handles the I/O of all DPDK > > >>> interfaces > > >>>>> assigned to it. A pmd thread shall poll the ports for > > >>>>> incoming packets, > > >>>>> - switch the packets and send to tx port. pmd thread is CPU > > >>>>> bound, and needs > > >>>>> + switch the packets and send to tx port. A pmd thread is CPU > > >>>>> + bound, and needs > > >>>>> to be affinitized to isolated cores for optimum performance. > > >>>>> > > >>>>> By setting a bit in the mask, a pmd thread is created and > > >>>>> pinned to the @@ -458,8 +458,23 @@ affinitized accordingly. > > >>>>> $ ovs-vsctl set Open_vSwitch . > > >>>>> other_config:pmd-cpu-mask=0x4 > > >>>>> > > >>>>> .. note:: > > >>>>> - pmd thread on a NUMA node is only created if there is at least > > one > > >>>>> DPDK > > >>>>> - interface from that NUMA node added to OVS. > > >>>>> + A pmd thread on a NUMA node is only created if there is at > > >>>>> + least one > > >>>>> DPDK > > >>>>> + interface from that NUMA node added to OVS. A pmd thread > > >>>>> + is created > > >>>>> by > > >>>>> + default on a core of a NUMA node or when a specified > > >>>>> + pmd-cpu-mask > > >>> has > > >>>>> + indicated so. Even though a PMD thread may exist, the > > >>>>> + thread only > > >>>>> starts > > >>>>> + consuming CPU cycles if there is least one receive queue > > >>>>> + assigned > > >>> to > > >>>>> + the pmd. > > >>>>> + > > >>>>> + .. note:: > > >>>>> + On NUMA systems PCI devices are also local to a NUMA node. > > >>>>> + Unbound > > >>>>> rx > > >>>>> + queues for a PCI device will assigned to a pmd on it's > > >>>>> + local NUMA > > >>>> > > >>>> Minor point but should read 'will be assigned' > > > > > > [[BO'M]] > > > +1 > > > > > >>>>> node if a > > >>>>> + non-isolated PMD exists on that NUMA node. If not, the > > >>>>> + queue will > > >>> be > > >>>>> + assigned to a non-isolated pmd on a remote NUMA node. This > > >>>>> + will > > >>>>> result in > > >>>>> + reduced maximum throughput on that device and possibly on > > >>>>> + other > > >>>>> devices > > >>>>> + assigned to that pmd thread. In the case such, a queue > > >>>>> + assignment is > > >>>>> made a > > >>>>> + warning message will be logged: "There's no available > > >>>>> + (non-isolated) > > >>>>> pmd > > >>>> > > >>>> Above should read 'In the case where such a queue assignment is > > >>>> made, a > > >>> warning message will be logged' > > > > > > [[BO'M]] > > > Suggesting a simpler: > > > 'If such a queue assignment is made a warning message ..." > > > > > >>>>> + thread on numa node N. Queue Q on port P will be assigned > > >>>>> + to the pmd > > >>>>> on > > >>>>> + core C (numa node N'). Expect reduced performance." > > >>>>> > > >>>>> - QEMU vCPU thread Affinity > > >>>>> > > >>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index > > >>>>> 4e29085..7557f32 > > >>>>> 100644 > > >>>>> --- a/lib/dpif-netdev.c > > >>>>> +++ b/lib/dpif-netdev.c > > >>>>> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list > > >>>>> *rr, int > > >>>>> numa_id) > > >>>>> return NULL; > > >>>>> } > > >>>>> > > >>>>> +/* Returns next NUMA from rr list in round-robin fashion. > > >>>>> +Returns the first > > >>>>> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if > > >>>>> +list is empty. */ static struct rr_numa * > > >>>>> +rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa > > >>>>> +*numa) { > > >>>> > > >>>> The comment above can be tidied up a little to better clarify the > > >>> behavior of this function. > > >>>> I ended up reading the comments for hmap_next() and hmap_first() > > >>>> before > > >>> it made sense, and even then it's a bit ambiguous, it ends up > > >>> being the code that explains the comments. > > >>>> > > >>>> You could clarify the following 2 statements: > > >>>> > > >>>> (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I > > >>>> assume > > >>> you mean the function parameter 'const struct rr_numa *numa' but > > >>> it's not clear on first reading. > > >>>> > > >>>> (2) " or the last node passed" - again this makes sense only when > > >>>> you > > >>> look into the behavior of the call 'hmap_next(&rr->numas, &numa- > > >>> node)'. > > >>>> > > >>>> You could say something like: > > >>>> > > >>>> "Attempt to return the next NUMA from a numa list in a round > > >>>> robin > > >>> fashion. Return the first NUMA node if the struct rr_numa *numa > > >>> argument passed to the function is NULL or if the numa node > > >>> argument passed to hmap_next is already the last node. Return NULL > > >>> if the numa list is empty." > > >>> > > >>> I'm not sure that references to implementation is a good way to > > >>> write comments (I mean 'passed to hmap_next' part). > > >>> > > >>> How about this: > > >>> """ > > >>> /* Returns the next node in numa list following 'numa' in > > >>> round-robin fashion. > > >>> * Returns first node if 'numa' is a null pointer or the last node > > >>> in 'rr'. */ """ > > >>> > > >>> or > > >>> > > >>> """ > > >>> /* The next node in numa list following 'numa' in round-robin fashion. > > >>> * Returns: > > >>> * - 'NULL' if 'rr' is an empty numa list. > > >>> * - First node in 'rr' if 'numa' is a null pointer. > > >>> * - First node in 'rr' if 'numa' is the last node in 'rr'. > > >>> * - Otherwise, the next node in numa list following 'numa'. */ > > >>> """ > > >>> > > >>> ? > > >> > > >> I'm happy with the first option you provided above, could you > > >> append returning NULL if the list is empty then I think we're good. > > >> > > >> /* Returns the next node in numa list following 'numa' in > > >> round-robin fashion. > > >> * Returns first node if 'numa' is a null pointer or the last node > > >> in > > 'rr'. > > >> * Returns NULL if 'rr' numa list is empty. */ > > > > > > [[BO'M]] > > > Sounds good to me. Anyone object to this wording? > > > > I don't like three 'Returns' in a row but LGTM in general. > > > > Once these changes are made for the next patch version feel free to add > Tested by and acked tags for myself. > > Thanks > Ian > > >> > > >> Thanks > > >> Ian > > >>> > > >>>> > > >>>>> + struct hmap_node *node = NULL; > > >>>>> + > > >>>>> + if (numa) { > > >>>>> + node = hmap_next(&rr->numas, &numa->node); > > >>>>> + } > > >>>>> + if (!node) { > > >>>>> + node = hmap_first(&rr->numas); > > >>>>> + } > > >>>>> + > > >>>>> + return (node) ? CONTAINER_OF(node, struct rr_numa, node) : > > >>>>> + NULL; } > > >>>>> + > > >>>>> static void > > >>>>> rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list > > >>>>> *rr) { @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev > > >>>>> *dp, bool > > >>>>> pinned) > > >>>>> OVS_REQUIRES(dp->port_mutex) { > > >>>>> struct dp_netdev_port *port; > > >>>>> struct rr_numa_list rr; > > >>>>> + struct rr_numa *non_local_numa = NULL; > > >>>>> > > >>>>> rr_numa_list_populate(dp, &rr); > > >>>>> > > >>>>> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, > > >>>>> bool > > >>>>> pinned) > > >>>>> OVS_REQUIRES(dp->port_mutex) > > >>>>> } > > >>>>> } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) { > > >>>>> if (!numa) { > > >>>>> - VLOG_WARN("There's no available (non isolated) > > pmd > > >>>>> thread " > > >>>>> + /* There are no pmds on the queue's local > > >>>>> + NUMA > > >>> node. > > >>>>> + Round-robin on the NUMA nodes that do > > >>>>> + have > > >>> pmds. > > >>>>> */ > > >>>>> + non_local_numa = rr_numa_list_next(&rr, > > >>>>> non_local_numa); > > >>>>> + if (!non_local_numa) { > > >>>>> + VLOG_ERR("There is no available > > >>>>> + (non-isolated) > > >>>>> pmd " > > >>>>> + "thread for port \'%s\' queue %d. > > >>>>> + This > > >>>>> queue " > > >>>>> + "will not be polled. Is > > >>>>> + pmd-cpu-mask set > > >>>>> to " > > >>>>> + "zero? Or are all PMDs > > >>>>> + isolated to other > > >>>>> " > > >>>>> + "queues?", > > >>>>> + netdev_get_name(port- > > >>>>>> netdev), > > >>>>> + qid); > > >>>>> + continue; > > >>>>> + } > > >>>>> + q->pmd = rr_numa_get_pmd(non_local_numa); > > >>>>> + VLOG_WARN("There's no available > > >>>>> + (non-isolated) pmd > > >>>>> thread " > > >>>>> "on numa node %d. Queue %d on > > >>>>> port > > >>> \'%s\' > > >>>>> will " > > >>>>> - "not be polled.", > > >>>>> - numa_id, qid, netdev_get_name(port- > > >>>>>> netdev)); > > >>>>> + "be assigned to the pmd on core %d " > > >>>>> + "(numa node %d). Expect reduced > > >>>>> performance.", > > >>>>> + numa_id, qid, > > >>>>> + netdev_get_name(port- > > >>>>>> netdev), > > >>>>> + q->pmd->core_id, > > >>>>> + q->pmd->numa_id); > > >>>>> } else { > > >>>>> + /* Assign queue to the next (round-robin) > > >>>>> + PMD on it's > > >>>>> local > > >>>>> + NUMA node. */ > > >>>>> q->pmd = rr_numa_get_pmd(numa); > > >>>>> } > > >>>>> } > > >>>>> -- > > >>>>> 2.7.4 > > >>>> This tested fine for me, tested with multiple rxqs distributed > > >>>> and > > >>> isolated over pmds on 2 different numa nodes with varying pmd > masks. > > >>> Also passed sanity checks (clang, sparse compilation etc.). > > >>>> > > >>>> You can add the tested by tag for me but I'd like to see the > > >>>> changes for > > >>> the documentation and function comments above before acking. > > >>>> > > >>>> Tested-by: Ian Stokes <ian.sto...@intel.com> > > >>>>> > > >>>>> _______________________________________________ > > >>>>> dev mailing list > > >>>>> d...@openvswitch.org > > >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > >>>> > > >>>> > > >>>> _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev