On Mon, Apr 03, 2017 at 10:38:51AM +0200, Andrew Jones wrote:
> On Sat, Apr 01, 2017 at 06:25:26PM +0800, He Chen wrote:
> > Currently, QEMU does not provide a clear command to set vNUMA distance for
> > guest although we already have `-numa` command to set vNUMA nodes.
> >
> > vNUMA distance makes sense in certain scenarios.
> > But now, if we create a guest that has 4 vNUMA nodes, when we check NUMA
> > info via `numactl -H`, we will see:
> >
> > node distance:
> > node    0    1    2    3
> >   0:   10   20   20   20
> >   1:   20   10   20   20
> >   2:   20   20   10   20
> >   3:   20   20   20   10
> >
> > Guest kernel regards all local nodes as distance 10, and all remote nodes
> > as distance 20 when there is no SLIT table since QEMU doesn't build it.
> > It looks a little strange when you have seen the distance in an
> > actual physical machine that contains 4 NUMA nodes. My machine shows:
> >
> > node distance:
> > node    0    1    2    3
> >   0:   10   21   31   41
> >   1:   21   10   21   31
> >   2:   31   21   10   21
> >   3:   41   31   21   10
> >
> > To set vNUMA distance, guest should see a complete SLIT table.
> > I found QEMU provides `-acpitable` command that allows users to add
> > an ACPI table into guest, but it requires users to build the ACPI table
> > themselves first. Using `-acpitable` to add a SLIT table may not be so
> > straightforward or flexible, imagine that when the vNUMA configuration
> > changes and we need to generate another SLIT table manually. It may
> > not be friendly to users or upper software like libvirt.
> >
> > This patch is going to add SLIT table support in QEMU, and provides
> > additional option `dist` for command `-numa` to allow users to set vNUMA
> > distance by QEMU command.
> >
> > With this patch, when a user wants to create a guest that contains
> > several vNUMA nodes and also wants to set distance among those nodes,
> > the QEMU command would look like:
> >
> > ```
> > -numa node,nodeid=0,cpus=0 \
> > -numa node,nodeid=1,cpus=1 \
> > -numa node,nodeid=2,cpus=2 \
> > -numa node,nodeid=3,cpus=3 \
> > -numa dist,src=0,dst=0,val=10 \
> > -numa dist,src=0,dst=1,val=21 \
> > -numa dist,src=0,dst=2,val=31 \
> > -numa dist,src=0,dst=3,val=41 \
> > -numa dist,src=1,dst=0,val=21 \
> > -numa dist,src=1,dst=1,val=10 \
> > -numa dist,src=1,dst=2,val=21 \
> > -numa dist,src=1,dst=3,val=31 \
> > -numa dist,src=2,dst=0,val=31 \
> > -numa dist,src=2,dst=1,val=21 \
> > -numa dist,src=2,dst=2,val=10 \
> > -numa dist,src=2,dst=3,val=21 \
> > -numa dist,src=3,dst=0,val=41 \
> > -numa dist,src=3,dst=1,val=31 \
> > -numa dist,src=3,dst=2,val=21 \
> > -numa dist,src=3,dst=3,val=10 \
> > ```
> >
> > Signed-off-by: He Chen <he.c...@linux.intel.com>
> > ---
> >  hw/acpi/aml-build.c         | 26 +++++++++++++++++
> >  hw/i386/acpi-build.c        |  2 ++
> >  include/hw/acpi/aml-build.h |  1 +
> >  include/sysemu/numa.h       |  1 +
> >  include/sysemu/sysemu.h     |  4 +++
> >  numa.c                      | 70 +++++++++++++++++++++++++++++++++++++++++++++
> >  qapi-schema.json            | 28 ++++++++++++++++--
> >  qemu-options.hx             | 11 ++++++-
> >  8 files changed, 140 insertions(+), 3 deletions(-)
> >
> > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > index c6f2032..410b30e 100644
> > --- a/hw/acpi/aml-build.c
> > +++ b/hw/acpi/aml-build.c
> > @@ -24,6 +24,7 @@
> >  #include "hw/acpi/aml-build.h"
> >  #include "qemu/bswap.h"
> >  #include "qemu/bitops.h"
> > +#include "sysemu/numa.h"
> >
> >  static GArray *build_alloc_array(void)
> >  {
> > @@ -1609,3 +1610,28 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
> >      numamem->base_addr = cpu_to_le64(base);
> >      numamem->range_length = cpu_to_le64(len);
> >  }
> > +
> > +/*
> > + * ACPI spec 5.2.17 System Locality Distance Information Table
> > + * (Revision 2.0 or later)
> > + */
> > +void build_slit(GArray *table_data, BIOSLinker *linker)
> > +{
> > +    int slit_start, i, j;
> > +    slit_start = table_data->len;
> > +
> > +    acpi_data_push(table_data, sizeof(AcpiTableHeader));
> > +
> > +    build_append_int_noprefix(table_data, nb_numa_nodes, 8);
> > +    for (i = 0; i < nb_numa_nodes; i++) {
> > +        for (j = 0; j < nb_numa_nodes; j++) {
> > +            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
> > +        }
> > +    }
> > +
> > +    build_header(linker, table_data,
> > +                 (void *)(table_data->data + slit_start),
> > +                 "SLIT",
> > +                 table_data->len - slit_start, 1, NULL, NULL);
> > +}
> > +
> > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > index 2073108..12730ea 100644
> > --- a/hw/i386/acpi-build.c
> > +++ b/hw/i386/acpi-build.c
> > @@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
> >      if (pcms->numa_nodes) {
> >          acpi_add_table(table_offsets, tables_blob);
> >          build_srat(tables_blob, tables->linker, machine);
> > +        acpi_add_table(table_offsets, tables_blob);
> > +        build_slit(tables_blob, tables->linker);
> >      }
> >      if (acpi_get_mcfg(&mcfg)) {
> >          acpi_add_table(table_offsets, tables_blob);
> > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > index 00c21f1..329a0d0 100644
> > --- a/include/hw/acpi/aml-build.h
> > +++ b/include/hw/acpi/aml-build.h
> > @@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
> >  void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
> >                         uint64_t len, int node, MemoryAffinityFlags flags);
> >
> > +void build_slit(GArray *table_data, BIOSLinker *linker);
> >  #endif
> > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> > index 8f09dcf..2f7a941 100644
> > --- a/include/sysemu/numa.h
> > +++ b/include/sysemu/numa.h
> > @@ -21,6 +21,7 @@ typedef struct node_info {
> >      struct HostMemoryBackend *node_memdev;
> >      bool present;
> >      QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
> > +    uint8_t distance[MAX_NODES];
> >  } NodeInfo;
> >
> >  extern NodeInfo numa_info[MAX_NODES];
> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index 576c7ce..6999545 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -169,6 +169,10 @@ extern int mem_prealloc;
> >
> >  #define MAX_NODES 128
> >  #define NUMA_NODE_UNASSIGNED MAX_NODES
> > +#define NUMA_DISTANCE_MIN 10
> > +#define NUMA_DISTANCE_DEFAULT 20
> > +#define NUMA_DISTANCE_MAX 254
> > +#define NUMA_DISTANCE_UNREACHABLE 255
> >
> >  #define MAX_OPTION_ROMS 16
> >  typedef struct QEMUOptionRom {
> > diff --git a/numa.c b/numa.c
> > index e01cb54..421c383 100644
> > --- a/numa.c
> > +++ b/numa.c
> > @@ -212,6 +212,40 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
> >      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> >  }
> >
> > +static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
> > +{
> > +    uint16_t src = dist->src;
> > +    uint16_t dst = dist->dst;
> > +    uint8_t val = dist->val;
> > +
> > +    if (!numa_info[src].present || !numa_info[dst].present) {
> > +        error_setg(errp, "Source/Destination NUMA node is missing. "
> > +                   "Please use '-numa node' option to declare it first.");
> > +        return;
> > +    }
> > +
> > +    if (src >= MAX_NODES || dst >= MAX_NODES) {
> > +        error_setg(errp, "Max number of NUMA nodes reached: %"
> > +                   PRIu16 "", src > dst ? src : dst);
> > +        return;
> > +    }
> > +
> > +    if (val < NUMA_DISTANCE_MIN) {
> > +        error_setg(errp, "NUMA distance (%" PRIu8 ") is invalid, "
> > +                   "it should be larger than %d.",
> > +                   val, NUMA_DISTANCE_MIN);
> > +        return;
> > +    }
> > +
> > +    if (src == dst && val != NUMA_DISTANCE_MIN) {
> > +        error_setg(errp, "Local distance of node %d should be %d.",
> > +                   src, NUMA_DISTANCE_MIN);
> > +        return;
> > +    }
> > +
> > +    numa_info[src].distance[dst] = val;
> > +}
> > +
> >  static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> >  {
> >      NumaOptions *object = NULL;
> > @@ -235,6 +269,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> >          }
> >          nb_numa_nodes++;
> >          break;
> > +    case NUMA_OPTIONS_TYPE_DIST:
> > +        numa_distance_parse(&object->u.dist, opts, &err);
> > +        if (err) {
> > +            goto end;
> > +        }
> > +        break;
> >      default:
> >          abort();
> >      }
> > @@ -294,6 +334,35 @@ static void validate_numa_cpus(void)
> >      g_free(seen_cpus);
> >  }
> >
> > +static void validate_numa_distance(void)
> > +{
> > +    int src, dst;
> > +    bool have_distance = false;
> > +
> > +    for (src = 0; src < nb_numa_nodes; src++) {
> > +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> > +            if (numa_info[src].present &&
> > +                numa_info[src].distance[dst] != 0)
> > +                have_distance = true;
> > +        }
> > +    }
> > +
> > +    if (!have_distance)
> > +        return;
> > +
> > +    for (src = 0; src < nb_numa_nodes; src++) {
> > +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> > +            if (numa_info[src].present &&
> > +                numa_info[src].distance[dst] == 0) {
> > +                error_report("The distance between node %d and %d is missing, "
> > +                             "please provide the complete NUMA distance information.",
> > +                             src, dst);
> > +                exit(EXIT_FAILURE);
> > +            }
> > +        }
> > +    }
> > +}
>
> This validation is stricter than what Eduardo and I agreed was sufficient.
> This says if any distance is given, they must all be given.
> We agreed that the symmetrical shortcut was probably OK, but if any
> asymmetrical distance is given, then they must all be given. Here are a
> couple of examples:
>
> Given:
>   A -> B : 25
>   A -> C : 35
>   A -> D : 45
>   B -> C : 25
>   B -> D : 35
>   C -> D : 25
>
> The above is OK. All reverse directions are assumed symmetrical.
>
> Given:
>   A -> B : 25
>   A -> C : 35
>   A -> D : 45
>   B -> C : 25
>   B -> D : 35
>   C -> D : 25
>   D -> C : 35
>
> The above is not OK, as C -> D and D -> C are given asymmetrical
> distances, but no others are. We can no longer trust that the user meant
> the rest are symmetrical, so all must be given now.
>
> We should also ensure that when even one node pair's distance is given,
> then all unique node pairs must have a distance given.
>
> I've also attempted to describe this below as a suggestion for the
> documentation.
>
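To check my reading of the rule above, here is a minimal standalone sketch. It is not the code for the next version of the patch, only an illustration under stated assumptions: a small matrix where 0 means "not given", and the names `sketch_complete_distances`, `SKETCH_MAX_NODES` and `SKETCH_DIST_LOCAL` are made up for this example and do not exist in QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_NODES  8   /* hypothetical bound, not QEMU's MAX_NODES */
#define SKETCH_DIST_LOCAL 10  /* same value as NUMA_DISTANCE_MIN in the patch */

/*
 * dist[src][dst] == 0 means "not given on the command line".
 * Returns 0 when the input is acceptable (after mirroring missing
 * reverse directions), -1 when the rule rejects it. On failure the
 * matrix may already be partially mirrored.
 */
int sketch_complete_distances(uint8_t dist[SKETCH_MAX_NODES][SKETCH_MAX_NODES],
                              int n)
{
    bool any = false, asymmetric = false;
    int i, j;

    assert(n > 0 && n <= SKETCH_MAX_NODES);

    /* Pass 1: is anything given, and is any pair given asymmetrically? */
    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            uint8_t fwd = dist[i][j], rev = dist[j][i];
            if (fwd || rev) {
                any = true;
            }
            if (fwd && rev && fwd != rev) {
                asymmetric = true;
            }
        }
    }
    if (!any) {
        return 0; /* no pair given: the caller keeps the defaults */
    }

    /* Pass 2: enforce completeness, mirroring only in the symmetric case. */
    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            uint8_t fwd = dist[i][j], rev = dist[j][i];
            if (asymmetric) {
                if (!fwd || !rev) {
                    return -1; /* both directions are required now */
                }
            } else if (!fwd && !rev) {
                return -1;     /* this unique pair has no distance at all */
            } else {
                dist[i][j] = fwd ? fwd : rev;
                dist[j][i] = rev ? rev : fwd;
            }
        }
    }

    /* The distance from a node to itself is always the local distance. */
    for (i = 0; i < n; i++) {
        dist[i][i] = SKETCH_DIST_LOCAL;
    }
    return 0;
}
```

With A..D mapped to nodes 0..3, the first example above is accepted and D -> A is filled in as 45, while the second example is rejected because only C/D is given in both directions.
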
Thanks for the clear explanation. I will cook a better patch soon.

> > +
> >  void parse_numa_opts(MachineClass *mc)
> >  {
> >      int i;
> > @@ -390,6 +459,7 @@ void parse_numa_opts(MachineClass *mc)
> >          }
> >
> >          validate_numa_cpus();
> > +        validate_numa_distance();
> >      } else {
> >          numa_set_mem_node_id(0, ram_size, 0);
> >      }
> > diff --git a/qapi-schema.json b/qapi-schema.json
> > index 32b4a4b..b432e13 100644
> > --- a/qapi-schema.json
> > +++ b/qapi-schema.json
> > @@ -5644,10 +5644,14 @@
> >  ##
> >  # @NumaOptionsType:
> >  #
> > +# @node: NUMA nodes configuration
> > +#
> > +# @dist: NUMA distance configuration
> > +#
> >  # Since: 2.1
> >  ##
> >  { 'enum': 'NumaOptionsType',
> > -  'data': [ 'node' ] }
> > +  'data': [ 'node', 'dist' ] }
> >
> >  ##
> >  # @NumaOptions:
> > @@ -5660,7 +5664,8 @@
> >    'base': { 'type': 'NumaOptionsType' },
> >    'discriminator': 'type',
> >    'data': {
> > -    'node': 'NumaNodeOptions' }}
> > +    'node': 'NumaNodeOptions',
> > +    'dist': 'NumaDistOptions' }}
> >
> >  ##
> >  # @NumaNodeOptions:
> > @@ -5689,6 +5694,25 @@
> >    '*memdev': 'str' }}
> >
> >  ##
> > +# @NumaDistOptions:
> > +#
> > +# Set the distance between 2 NUMA nodes.
> > +#
> > +# @src: source NUMA node.
> > +#
> > +# @dst: destination NUMA node.
> > +#
> > +# @val: NUMA distance from source node to destination node.
> > +#
> > +# Since: 2.10
> > +##
> > +{ 'struct': 'NumaDistOptions',
> > +  'data': {
> > +    'src': 'uint16',
> > +    'dst': 'uint16',
> > +    'val': 'uint8' }}
> > +
> > +##
> >  # @HostMemPolicy:
> >  #
> >  # Host memory policy types
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 8dd8ee3..ce1a8ad 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -139,12 +139,15 @@ ETEXI
> >
> >  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
> >      "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> > -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n", QEMU_ARCH_ALL)
> > +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> > +    "-numa dist,src=source,dst=destination,val=distance\n", QEMU_ARCH_ALL)
> >  STEXI
> >  @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> >  @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> > +@itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
> >  @findex -numa
> >  Define a NUMA node and assign RAM and VCPUs to it.
> > +Set the NUMA distance from a source node to a destination node.
> >
> >  @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
> >  @samp{cpus} option represent a contiguous range of CPU indexes
> > @@ -167,6 +170,12 @@ split equally between them.
> >  @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
> >  if one node uses @samp{memdev}, all of them have to use it.
> >
> > +@var{source} and @var{destination} are NUMA node IDs.
> > +@var{distance} is the NUMA distance from @var{source} to @var{destination}.
> > +The distance from node A to node B may be different from the distance from
> > +node B to node A as the distance can to be asymmetrical. If a node is
> > +unreachable, set 255 as distance.
>
> The distance from a node to itself is always 10.
> If no distance values are given for node pairs, then the default distance
> of 20 is used for each pair. If any pair of nodes is given a distance,
> then all pairs must be given distances. Although, when distances are only
> given in one direction for each pair of nodes, then the distances in the
> opposite directions are assumed to be the same. If, however, an
> asymmetrical pair of distances is given for even one node pair, then all
> node pairs must be provided distance values for both directions, even when
> they are symmetrical. When a node is unreachable from another node, set
> the pair's distance to 255.
>

Thanks for taking the time to help me refine the documentation here. I
really appreciate it!

> > +
> >  Note that the -@option{numa} option doesn't allocate any of the
> >  specified resources, it just assigns existing resources to NUMA
> >  nodes. This means that one still has to use the @option{-m},
> > --
> > 2.7.4
> >
> >
>
> Thanks,
> drew
>
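For the default case in the suggested wording (no `-numa dist` given at all), the table contents could be sketched as follows. This is only an illustration and assumes the payload layout that build_slit() in the patch produces after the ACPI header: an 8-byte little-endian locality count followed by one byte per entry in row-major order. The function name `sketch_default_slit_payload` and the `SKETCH_*` macros are hypothetical and not part of QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DIST_LOCAL   10  /* same value as NUMA_DISTANCE_MIN in the patch */
#define SKETCH_DIST_DEFAULT 20  /* same value as NUMA_DISTANCE_DEFAULT in the patch */

/*
 * Writes the SLIT payload (everything after the ACPI table header) for
 * n nodes when no distance was given: local entries are 10, all other
 * entries are 20. Returns the number of bytes written, or 0 when the
 * buffer is too small.
 */
size_t sketch_default_slit_payload(uint8_t *buf, size_t buf_len, unsigned n)
{
    size_t need = 8 + (size_t)n * n;
    size_t off = 0;
    unsigned i, j;

    assert(buf != NULL);
    if (buf_len < need) {
        return 0;
    }

    /* Number of system localities, 8 bytes, least significant byte first. */
    for (i = 0; i < 8; i++) {
        buf[off++] = (uint8_t)(((uint64_t)n >> (8 * i)) & 0xff);
    }

    /* Entry[i][j], one byte each, row by row. */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            buf[off++] = (i == j) ? SKETCH_DIST_LOCAL : SKETCH_DIST_DEFAULT;
        }
    }
    return off;
}
```

For 4 nodes this yields 24 bytes, which matches the all-10/20 matrix that `numactl -H` shows in the guest today.
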