On 17.12.2019 10:22, David Marchand wrote:
> On Fri, Dec 13, 2019 at 6:34 PM Ilya Maximets <[email protected]> wrote:
>>> @@ -94,6 +99,38 @@ args_contains(const struct svec *args, const char *value)
>>>      return false;
>>>  }
>>>
>>> +static void
>>> +construct_dpdk_lcore_option(const struct smap *ovs_other_config,
>>> +                            struct svec *args)
>>> +{
>>> +    const char *cmask = smap_get(ovs_other_config, "dpdk-lcore-mask");
>>> +    struct svec lcores = SVEC_EMPTY_INITIALIZER;
>>> +    struct ovs_numa_info_core *core;
>>> +    struct ovs_numa_dump *cores;
>>> +    int index = 0;
>>> +
>>> +    if (!cmask) {
>>> +        return;
>>> +    }
>>> +    if (args_contains(args, "-c") || args_contains(args, "-l") ||
>>> +        args_contains(args, "--lcores")) {
>>> +        VLOG_WARN("Ignoring database defined option 'dpdk-lcore-mask' "
>>> +                  "due to dpdk-extra config");
>>> +        return;
>>> +    }
>>> +
>>> +    cores = ovs_numa_dump_cores_with_cmask(cmask);
>>> +    FOR_EACH_CORE_ON_DUMP(core, cores) {
>>> +        svec_add_nocopy(&lcores, xasprintf("%d@%d", index, core->core_id));
>>
>> What's the point of re-mapping these cores?
>> Just let DPDK fail initialization and the user will adjust the cores.
>> This automatic re-mapping only complicates everything.
>
> Deal with existing users of this parameter.
> Example: https://bugzilla.redhat.com/show_bug.cgi?id=1753432
So, why not just rebuild DPDK with more than 128 cores in the config, since
you're going to update the OVS package for this issue anyway?  I bet the
kernel is built with NR_CPUS equal to something like 8192.  The "support up
to 128 threads" limitation doesn't sound feasible, as you can't actually
enforce that.

The '--lcores' option made DPDK really confusing, IMHO.  Looking at the
DPDK API and wondering how rte_lcore_to_cpu_id() should work and what it
returns if the lcore-to-cpu relation is one-to-many...  I'm lost.

BTW, looking at the cpu mask in the bug, what are these 8 cores (4 cores
from each NUMA node) from the lcore mask supposed to do?

>
>
>>
>> IMHO, the user should not configure DPDK cores at all.  So, it's the
>> user's problem if cores are configured incorrectly.
>
> If the user explicitly had set a dpdk option, ok, this is his problem.
> But here, this is an OVS option; handling this internally makes more
> sense to me.

Sounds like another reason to deprecate and remove those configuration knobs:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=89529&state=*

>
>
>>> +        index++;
>>> +    }
>>> +    svec_terminate(&lcores);
>>> +    ovs_numa_dump_destroy(cores);
>>> +    svec_add(args, "--lcores");
>>> +    svec_add_nocopy(args, svec_join(&lcores, ",", ""));
>>> +    svec_destroy(&lcores);
>>> +}
>>> +
>>>  static void
>>>  construct_dpdk_options(const struct smap *ovs_other_config, struct svec *args)
>>>  {
>>> @@ -103,7 +140,6 @@ construct_dpdk_options(const struct smap *ovs_other_config, struct svec *args)
>>>          bool default_enabled;
>>>          const char *default_value;
>>>      } opts[] = {
>>> -        {"dpdk-lcore-mask", "-c", false, NULL},
>>>          {"dpdk-hugepage-dir", "--huge-dir", false, NULL},
>>>          {"dpdk-socket-limit", "--socket-limit", false, NULL},
>>>      };
>>> @@ -224,6 +260,7 @@ construct_dpdk_args(const struct smap *ovs_other_config, struct svec *args)
>>>          svec_parse_words(args, extra_configuration);
>>>      }
>>>
>>> +    construct_dpdk_lcore_option(ovs_other_config, args);
>>>
>>>      construct_dpdk_options(ovs_other_config, args);
>>>      construct_dpdk_mutex_options(ovs_other_config, args);
>>>  }
>>> @@ -264,6 +301,58 @@ static cookie_io_functions_t dpdk_log_func = {
>>>      .write = dpdk_log_write,
>>>  };
>>>
>>> +static void
>>> +dpdk_dump_lcore(struct ds *ds, unsigned lcore)
>>> +{
>>> +    struct svec cores = SVEC_EMPTY_INITIALIZER;
>>> +    rte_cpuset_t cpuset;
>>> +    unsigned cpu;
>>> +
>>> +    cpuset = rte_lcore_cpuset(lcore);
>>> +    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
>>> +        if (!CPU_ISSET(cpu, &cpuset)) {
>>> +            continue;
>>> +        }
>>> +        svec_add_nocopy(&cores, xasprintf("%u", cpu));
>>> +    }
>>> +    svec_terminate(&cores);
>>> +    ds_put_format(ds, "lcore%u (%s) is running on core %s\n", lcore,
>>> +                  rte_eal_lcore_role(lcore) != ROLE_OFF ? "DPDK" : "OVS",
>>> +                  svec_join(&cores, ",", ""));
>>> +    svec_destroy(&cores);
>>> +}
>>> +
>>> +static void
>>> +dpdk_dump_lcores(struct unixctl_conn *conn, int argc, const char *argv[],
>>> +                 void *aux OVS_UNUSED)
>>> +{
>>> +    struct ds ds = DS_EMPTY_INITIALIZER;
>>> +    unsigned lcore;
>>> +
>>> +    ovs_mutex_lock(&lcore_bitmap_mutex);
>>> +    if (lcore_bitmap == NULL) {
>>> +        unixctl_command_reply_error(conn, "DPDK has not been initialised");
>>> +        goto out;
>>> +    }
>>> +    if (argc > 1) {
>>> +        if (!str_to_uint(argv[1], 0, &lcore) || lcore >= RTE_MAX_LCORE ||
>>> +            !bitmap_is_set(lcore_bitmap, lcore)) {
>>> +            unixctl_command_reply_error(conn, "incorrect lcoreid");
>>> +            goto out;
>>> +        }
>>> +        dpdk_dump_lcore(&ds, lcore);
>>> +    } else for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
>>
>> Not a good style to have 'else for'.
>
> Tried to be creative :-) (the additional indent is really unnecessary).
> But, if this is a blocker for this patch, ok.
>
>
>>
>> DPDK lcore iterator?
>
> We are iterating over lcores owned by DPDK and OVS.
>
> Example:
> other_config : {dpdk-init="true", dpdk-lcore-mask="0x1",
> pmd-cpu-mask="0x00008002"}
>
> # ovs-appctl dpdk/dump-lcores
> lcore0 (DPDK) is running on core 0
> lcore1 (OVS) is running on core 1
> lcore2 (OVS) is running on core 15
>
>
>>
>>> +        if (!bitmap_is_set(lcore_bitmap, lcore)) {
>>> +            continue;
>>> +        }
>>> +        dpdk_dump_lcore(&ds, lcore);
>>> +    }
>>> +    unixctl_command_reply(conn, ds_cstr(&ds));
>>> +    ds_destroy(&ds);
>>> +out:
>>> +    ovs_mutex_unlock(&lcore_bitmap_mutex);
>>> +}
>>> +
>>>  static bool
>>>  dpdk_init__(const struct smap *ovs_other_config)
>>>  {
>
>
>>> @@ -428,6 +518,24 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>          }
>>>      }
>>>
>>> +    ovs_mutex_lock(&lcore_bitmap_mutex);
>>> +    lcore_bitmap = bitmap_allocate(RTE_MAX_LCORE);
>>> +    /* Mark DPDK threads. */
>>> +    for (uint32_t lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
>>
>> Shouldn't we use the DPDK lcore iterator instead?
>
> The DPDK lcore iterator skips "service" cores.
> We would end up reusing those lcores if the user had set something exotic.

What if the user maps lcores so that they float across different cpu sets?
How should we treat that?

>
> Example:
> other_config : {dpdk-extra="-c 0x5 -s 0x4", dpdk-init="true",
> pmd-cpu-mask="0x00008002"}
> # ovs-appctl dpdk/dump-lcores
> lcore0 (DPDK) is running on core 0
> lcore1 (OVS) is running on core 1
> lcore2 (DPDK) is running on core 2
> lcore3 (OVS) is running on core 15
>
>
>>
>>> +        struct ds ds = DS_EMPTY_INITIALIZER;
>>> +
>>> +        if (rte_eal_lcore_role(lcore) == ROLE_OFF) {
>>> +            continue;
>>> +        }
>>> +        bitmap_set1(lcore_bitmap, lcore);
>>> +        dpdk_dump_lcore(&ds, lcore);
>>> +        VLOG_INFO("%s", ds_cstr(&ds));
>>> +        ds_destroy(&ds);
>>> +    }
>>> +    unixctl_command_register("dpdk/dump-lcores", "[lcore]", 0, 1,
>>> +                             dpdk_dump_lcores, NULL);
>>
>> We might need this command if we're doing automatic re-mapping
>> of lcores from the EAL arguments, but if not, I don't think we need it.
>> Simple logging on PMD start should be enough.
>
> Flavio mentioned that the logs could have been rotated.

>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index 7ebf4ef87..91e98919c 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -5470,7 +5470,7 @@ pmd_thread_main(void *f_)
>>>      /* Stores the pmd thread's 'pmd' to 'per_pmd_key'. */
>>>      ovsthread_setspecific(pmd->dp->per_pmd_key, pmd);
>>>      ovs_numa_thread_setaffinity_core(pmd->core_id);
>>> -    dpdk_set_lcore_id(pmd->core_id);
>>> +    dpdk_init_thread_context(pmd->core_id);
>>
>> Since this function will actually change the cpu affinity, which is not
>
> This is the same affinity as what OVS just applied.
> The call to pthread_setaffinity_np in dpdk should be a noop.

It should be, but we should not depend on DPDK internals.  Changing the
order will guarantee the affinity from the OVS point of view.

>
>
>> expected by OVS, it must be called before setting affinity via
>> ovs_numa_thread_setaffinity_core().  This way OVS will guarantee
>> correct affinity for its threads.
>
> ovs_numa_thread_setaffinity_core() return value is not checked.

Yes.  That is an issue we need to fix in OVS thread management.  Right
now it's not possible to undo thread creation.

> And I can't see what calling this function changes in the ovs-numa internals.
> We could even remove it and only rely on dpdk.

While it doesn't change the internal state of ovs-numa, it is the generic
OVS way to set affinity.  And we can't remove it, since this is not a
'DPDK datapath'; it is a generic userspace datapath, and the DPDK calls
here are just a bunch of workarounds that should not be there at all in
an ideal world.

To be clear, I'm not against this patch in general, I just think that
it's too heavy for a temporary solution.

Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
