On 16.06.2014 [17:11:40 -0300], Eduardo Habkost wrote: > On Mon, Jun 16, 2014 at 11:49:46AM -0700, Nishanth Aravamudan wrote: > > On 16.06.2014 [13:15:00 -0300], Eduardo Habkost wrote: > > > On Mon, Jun 16, 2014 at 05:53:53PM +1000, Alexey Kardashevskiy wrote: > > > > Currently NUMA nodes must go consequently and QEMU ignores nodes > > > > with @nodeid bigger than the number of NUMA nodes in the command line. > > > > > > Why would somebody need a NUMA node with nodeid bigger than the number > > > of NUMA nodes? NUMA node IDs must be in the [0, numa_nodes-1] range. > > > > That is not how the code works currently. > > > > vl.c::numa_add() > > ... > > if (get_param_value(option, 128, "nodeid", optarg) == 0) { > > nodenr = nb_numa_nodes; > > } else { > > if (parse_uint_full(option, &nodenr, 10) < 0) { > > fprintf(stderr, "qemu: Invalid NUMA nodeid: %s\n", option); > > exit(1); > > } > > } > > ... > > if (get_param_value(option, 128, "mem", optarg) == 0) { > > node_mem[nodenr] = 0; > > } else { > > int64_t sval; > > sval = strtosz(option, &endptr); > > if (sval < 0 || *endptr) { > > fprintf(stderr, "qemu: invalid numa mem size: %s\n", > > optarg); > > exit(1); > > } > > node_mem[nodenr] = sval; > > } > > ... > > nb_numa_nodes++; > > ... > > > > So if a user passes nodeid= to the NUMA node definition, that entry in > > node_mem is set to the appropriate value, but nb_numa_nodes, which is > > used to bound the iteration of that array is not bumped appropriately. > > So we end up looking at arbitrary indices in the node_mem array, which > > are often 0.
<snip, replying to this part in your follow-up> > > Note also that means that we can't generically differentiate here > > between a user-defined memoryless node and one that happens to be 0 > > because the particular nodeid was not specified on the command-line. > > That's because all nodeids must be specified in the command-line. > Accepting omitted nodes is a bug which should be fixed. Fair enough, but that's certainly not enforced. In fact, the way the code is written right now (in hw/ppc/spapr.c and hw/i386/pc.c, e.g) it seems like hte architecture code assumes nodeids are sequential (0 to nb_numa_nodes), but the generic code puts values in node_mem based upon the nodeid specified on the command-line. That's the bug Alexey's 7/7 patch is attempting to fix, I think. > > Alexey, how do you differentiate between these two cases after your > > patches? In patch 3, I see you check (and skip in the loop) explicitly > > if !node_mem[nodeid], but I'm not sure how that check can differentiate > > between the statically 0 (from main's intialization loop) and when a > > user says a node's memory is 0. Probably something obvious I'm missing > > (it is Monday after all)... > > > > > > This prevents us from creating memory-less nodes which is possible > > > > situation in POWERPC under pHyp or Sapphire. > > > > > > Why? If I recall correctly, nodes without any CPUs or any memory are > > > already allowed. > > > > They are allowed, but it seems like the code throughout qemu (where it's > > relevant) assumes that NUMA nodes are sequential and continuous, but > > that's not required (nor is it enforced on the command-line). > > It is not being enforced, but you will get weird bugs if you don't do > it. What I suggest is that we start enforcing it instead of magically > crating new nodes when a wrong (too high) ID is provided. > > > > > > How exactly would this patch help you? How do you expect the > > > command-line to look like for your use case? > > > > Alexey has replied with that, it looks like. > > Where? I don't see a reply. > > > > > > > This makes nb_numa_nodes a total number of nodes or the biggest node > > > > number + 1 whichever is greater. > > > > > > > > Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> > > > > --- > > > > vl.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/vl.c b/vl.c > > > > index ac0e3d7..f1b75cb 100644 > > > > --- a/vl.c > > > > +++ b/vl.c > > > > @@ -1356,7 +1356,7 @@ static void numa_add(const char *optarg) > > > > if (get_param_value(option, 128, "cpus", optarg) != 0) { > > > > numa_node_parse_cpus(nodenr, option); > > > > } > > > > - nb_numa_nodes++; > > > > + nb_numa_nodes = MAX(nb_numa_nodes + 1, nodenr + 1); > > > > > > I would instead suggest that if any node in the [0, max_node_id] range > > > is not present on the command-line, QEMU should instead reject the > > > command-line. > > > > We already check two things: > > > > Too many nodes (meaning we've filled the array): > > if (nb_numa_nodes >= MAX_NODES) { > > fprintf(stderr, "qemu: too many NUMA nodes\n"); > > exit(1); > > } > > > > Node ID itself is out of range (due to the use of an array): > > if (nodenr >= MAX_NODES) { > > fprintf(stderr, "qemu: invalid NUMA nodeid: %llu\n", nodenr); > > exit(1); > > } > > > > But that doesn't prevent one from having *sparse* NUMA node IDs. And, as > > far as I can tell, this is allowed by the spec, but isn't properly > > supported by qemu. > > Wait, is the node ID visible to the guest at all? I believe it is a > QEMU-internal thing, just to allow the NUMA nodes to be ordered in the > command-line. I would even claim that the parameter is useless and > shouldn't have been introduced in the first place. nodeid's show up in the topology to Linux, at least on ppc. Thanks, Nish